Purpose: deepen this repo with a practical, operator-first arXiv map across all major categories.
How to use this map:
- Start with the "Top 3 must-read" papers in each category.
- Use "Expansion" to go deeper once you have a baseline.
- Re-check benchmark claims against current leaderboards before publishing hard numbers.
Top 3 must-read (and why):
- ReAct (2022) - Canonical think+act loop for tool-using agents.
- Reflexion (2023) - Strong pattern for self-critique and iterative improvement.
- Toolformer (2023) - Foundation for model-native tool-use behavior.
Expansion:
Top 3 must-read (and why):
- SWE-bench (2023) - Core real-repo bug-fix benchmark.
- SWE-bench Multimodal (2024) - Extends coding eval to visual software artifacts.
- SWE-bench Goes Live! (2025) - Live benchmark operations and methodology insights.
Expansion:
Top 3 must-read (and why):
- MCPMark (2025) - Current high-signal MCP stress test (real tasks, hard ceilings).
- τ-bench (2024) - Reliability framing with pass^k in interactive tool workflows.
- Towards a Science of AI Agent Reliability (2026) - Reliability science framing beyond headline accuracy.
Expansion:
Top 3 must-read (and why):
- WebArena (2023) - Realistic browser-task benchmark baseline.
- VisualWebArena (2024) - Adds visual grounding pressure to web-agent evals.
- OSWorld (2024) - Open-ended desktop environment with strong realism.
Expansion:
Top 3 must-read (and why):
- RAG (2020) - Retrieval architecture baseline for external memory.
- MemGPT (2023) - Memory tiering and virtual-context perspective.
- Self-RAG (2023) - Retrieval with self-critique control loop.
Expansion:
- Lost in the Middle (2023)
- RETRO (2021)
- kNN-LM (2020)
- Evaluating Very Long-Term Conversational Memory of LLM Agents (2024)
- Mem-Gallery (2026)
Top 3 must-read (and why):
- Chain-of-Thought (2022) - Core reasoning prompt primitive.
- Self-Consistency (2022) - Robustness boost for reasoning via sampling.
- Tree of Thoughts (2023) - Deliberate search over candidate thoughts.
Expansion:
Top 3 must-read (and why):
- Universal and Transferable Adversarial Attacks on Aligned Language Models (2023) - Baseline offensive pressure test.
- StruQ (2024) - Structured prompt/query defense pattern.
- SecAlign (2024) - Alignment-based defense for prompt injection.
Expansion:
Top 3 must-read (and why):
- Whisper / Robust Speech Recognition via Large-Scale Weak Supervision (2022) - Reliable speech baseline for voice agents.
- Visual Instruction Tuning (LLaVA, 2023) - Core multimodal instruction-following pattern.
- Mem-Gallery (2026) - Long-horizon multimodal conversational memory benchmark.
Top 3 must-read (and why):
- HELM (2022) - Holistic evaluation methodology.
- MT-Bench / Chatbot Arena LLM-as-a-Judge analysis (2023) - Judge-model strengths/limits in practice.
- Systematic Evaluation of LLM-as-a-Judge (2024) - Template effects and judge reliability caveats.
Top 3 must-read (and why):
- FinRL Library (2020) - Practical DRL trading baseline.
- FinRL Framework (2021) - End-to-end automation framing for quant agents.
- FinRL-Podracer (2021) - Throughput/scalability for production-ish experiments.
Expansion:
Top 3 must-read (and why):
- Deployment of a Blockchain-Based Self-Sovereign Identity (2018) - SSI implementation grounding.
- Design Patterns for Blockchain-based Self-Sovereign Identity (2020) - Reusable SSI architecture patterns.
- Decentralized Finance (2023) - Broad DeFi systems framing.
Expansion:
- Is DeFi Actually Decentralized? (2022)
- Smart-LLaMA (2024)
- Generative LLM Usage in Smart-Contract Vulnerability Detection (2025)
Note: this watchlist is intentionally short and high-signal. Re-verify each paper's quality before permanent inclusion, once it has settled out of preprint churn.
- Towards a Science of AI Agent Reliability (2026-02)
- The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol (2026-02)
- Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents (2026-01)
- Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents (2026-01)
- OmniCode: A Benchmark for Evaluating Software Engineering Agents (2026-02)
- Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation (2026-02)
- MCP-in-SoS: Risk assessment framework for open-source MCP servers (2026-03)
- Compatibility at a Cost: Systematic Discovery and Exploitation of MCP Clause-Compliance Vulnerabilities (2026-03)
Run this checklist in the first week of each month:
- Pull candidates
  - Query arXiv for each category, using that category's keywords and a date filter covering the last 45 days.
  - Keep a raw scratch list (20-40 papers total).
- Apply inclusion gate
  - Keep papers that pass at least 3 of these 5 criteria:
    - Clear method/benchmark contribution
    - Reproducibility artifacts (code/data/eval details)
    - Strong operator relevance (how to build, test, secure, deploy)
    - Non-trivial novelty versus existing map
    - Cross-category leverage (useful outside one niche)
- Curate final set
  - Promote 1-3 papers per category per month, and no more.
  - Move low-signal or superseded papers to an archive list.
- Update repo docs
  - Update this map.
  - Update README only if a paper changes the practical narrative (for example, reliability ceilings or a new benchmark standard).
- Log rationale
  - For each promoted paper, add a one-line "why it matters" to the commit message or PR body.
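
The pull, gate, and curate steps above could be sketched roughly as follows. This is a minimal illustration, not part of the repo: the function names, the `Candidate` type, and the short criterion labels are hypothetical, and the `submittedDate:[... TO ...]` range filter reflects the arXiv API user manual as best understood here, so check it against the current manual before relying on it. The sketch only builds the query URL; fetching and parsing the Atom response is left out. Judging each criterion remains a human call, and the code only counts the results.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint

# Hypothetical short labels for the five inclusion criteria in the checklist.
CRITERIA = frozenset({
    "contribution", "reproducibility", "operator_relevance",
    "novelty", "cross_category",
})


def build_query_url(keywords, now=None, window_days=45, max_results=40):
    """Build an arXiv API URL for recent papers matching any keyword."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=window_days)
    fmt = "%Y%m%d%H%M"  # assumed date format: YYYYMMDDHHMM in GMT
    terms = " OR ".join(f'all:"{kw}"' for kw in keywords)
    window = f"submittedDate:[{start.strftime(fmt)} TO {now.strftime(fmt)}]"
    params = {
        "search_query": f"({terms}) AND {window}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


@dataclass(frozen=True)
class Candidate:
    title: str
    category: str
    passed: frozenset  # criteria a human reviewer judged as met


def passes_gate(candidate, threshold=3):
    """Inclusion gate: at least `threshold` of the five criteria are met."""
    return len(candidate.passed & CRITERIA) >= threshold


def curate(candidates, cap=3):
    """Split candidates into (promoted, not_promoted).

    Assumption: within a category, papers meeting more criteria are
    promoted first, up to `cap` per category.
    """
    promoted, not_promoted, counts = [], [], {}
    ranked = sorted(candidates, key=lambda c: -len(c.passed & CRITERIA))
    for c in ranked:
        if passes_gate(c) and counts.get(c.category, 0) < cap:
            promoted.append(c)
            counts[c.category] = counts.get(c.category, 0) + 1
        else:
            not_promoted.append(c)
    return promoted, not_promoted
```

Papers in `not_promoted` still need a manual decision: some belong in the archive list, while others may simply wait for next month's refresh.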
Suggested commit format:
- docs(research): monthly arxiv refresh YYYY-MM
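
As a small convenience, the commit subject and the per-paper rationale lines could be assembled with a helper like the one below. The function name and the body layout (one `- title: why` line per paper) are assumptions for illustration; only the subject format comes from this document.

```python
def refresh_commit_message(year, month, rationale):
    """Build the monthly refresh commit message.

    `rationale` maps each promoted paper's title to its one-line
    "why it matters" note; the body layout is an assumed convention.
    """
    subject = f"docs(research): monthly arxiv refresh {year:04d}-{month:02d}"
    body = "\n".join(f"- {title}: {why}" for title, why in rationale.items())
    return f"{subject}\n\n{body}" if body else subject


if __name__ == "__main__":
    print(refresh_commit_message(
        2026, 3,
        {"MCPMark": "Current high-signal MCP stress test."},
    ))
```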