ArXiv Deep Research Map for Agent Cortex

Purpose: deepen this repo with a practical, operator-first arXiv map across all major categories.

How to use this map:

Start with the "Top 3 must-read" papers in each category.
Use "Expansion" to go deeper once you have a baseline.
Re-check benchmark claims against current leaderboards before publishing hard numbers.

1) Agent Frameworks and Reasoning Loops

Top 3 must-read (and why):

ReAct (2022) - Canonical think+act loop for tool-using agents.
Reflexion (2023) - Strong pattern for self-critique and iterative improvement.
Toolformer (2023) - Foundation for model-native tool-use behavior.

Expansion:

2) Coding Agents

Top 3 must-read (and why):

SWE-bench (2023) - Core real-repo bug-fix benchmark.
SWE-bench Multimodal (2024) - Extends coding eval to visual software artifacts.
SWE-bench Goes Live! (2025) - Live benchmark operations and methodology insights.

Expansion:

3) MCP, Tool Use, and Agent Reliability

Top 3 must-read (and why):

MCPMark (2025) - Current high-signal MCP stress test (real tasks, hard ceilings).
τ-bench (2024) - Reliability framing with pass^k in interactive tool workflows.
Towards a Science of AI Agent Reliability (2026) - Reliability science framing beyond headline accuracy.
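
The pass^k metric cited for τ-bench asks how often an agent solves the same task on all of k independent attempts, which penalizes flaky behavior that a single-run success rate hides. A minimal sketch of a combinatorial estimator, assuming each task was run n times with c successes. The function names are illustrative and not taken from the τ-bench code, so check the paper for the exact definition before reporting numbers:

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials, drawn without replacement
    from the recorded runs of one task, all succeeded."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # comb(c, k) is 0 when k > c, so a task with too few successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
```

For a task with 8 trials and 6 successes, `pass_hat_k(8, 6, 1)` is 0.75 while `pass_hat_k(8, 6, 4)` drops to 15/70 (about 0.21), which is the kind of gap the reliability framing is meant to expose.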

Expansion:

4) Web and Computer-Use Agents

Top 3 must-read (and why):

WebArena (2023) - Realistic browser-task benchmark baseline.
VisualWebArena (2024) - Adds visual grounding pressure to web-agent evals.
OSWorld (2024) - Open-ended desktop environment with strong realism.

Expansion:

OS-Harm (2025)

5) Context Engineering and Memory

Top 3 must-read (and why):

RAG (2020) - Retrieval architecture baseline for external memory.
MemGPT (2023) - Memory tiering and virtual-context perspective.
Self-RAG (2023) - Retrieval with self-critique control loop.

Expansion:

6) Prompt and Programmatic Prompt Engineering

Top 3 must-read (and why):

Chain-of-Thought (2022) - Core reasoning prompt primitive.
Self-Consistency (2022) - More robust reasoning by sampling multiple reasoning paths and majority-voting the final answer.
Tree of Thoughts (2023) - Deliberate search over candidate thoughts.

Expansion:

DSPy (2023)

7) Security and Robustness

Top 3 must-read (and why):

Universal and Transferable Adversarial Attacks on Aligned LMs (2023) - Baseline offensive pressure test.
StruQ (2024) - Structured prompt/query defense pattern.
SecAlign (2024) - Alignment-based defense for prompt injection.

Expansion:

8) Voice and Multimodal Agents

Top 3 must-read (and why):

Whisper / Robust Speech Recognition via Large-Scale Weak Supervision (2022) - Reliable speech baseline for voice agents.
Visual Instruction Tuning (LLaVA, 2023) - Core multimodal instruction-following pattern.
Mem-Gallery (2026) - Long-horizon multimodal conversational memory benchmark.

9) Evaluation Science and LLM-as-a-Judge

Top 3 must-read (and why):

HELM (2022) - Holistic evaluation methodology.
MT-Bench / Chatbot Arena LLM-as-a-Judge analysis (2023) - Judge-model strengths/limits in practice.
Systematic Evaluation of LLM-as-a-Judge (2024) - Template effects and judge reliability caveats.

10) Quant and Trading Agents

Top 3 must-read (and why):

FinRL Library (2020) - Practical DRL trading baseline.
FinRL Framework (2021) - End-to-end automation framing for quant agents.
FinRL-Podracer (2021) - Throughput/scalability for production-ish experiments.

Expansion:

11) Blockchain Identity, Payments, and DeFi-Adjacent Research

Top 3 must-read (and why):

Deployment of a Blockchain-Based Self-Sovereign Identity (2018) - SSI implementation grounding.
Design Patterns for Blockchain-based Self-Sovereign Identity (2020) - Reusable SSI architecture patterns.
Decentralized Finance (2023) - Broad DeFi systems framing.

Expansion:

Recent ArXiv Watchlist (last ~90 days, as of 2026-03-12)

Note: this watchlist is intentionally short and high-signal; re-verify each paper's quality before final inclusion, once it has settled past early preprint revisions.

Monthly Refresh Workflow (template)

Use this in the first week of each month:

Pull candidates

Query arXiv for each category using that category's keywords and a date filter covering the last 45 days.
Keep a raw scratch list (20-40 papers total).
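
The pull step can be sketched as a small URL builder for the public arXiv API. This is an illustration under assumptions: the keyword list, the `build_query_url` name, and the default of 40 results are hypothetical, and the `submittedDate:[... TO ...]` range syntax should be confirmed against the current arXiv API manual before relying on it. The function only builds the URL and makes no network call:

```python
from datetime import date, timedelta
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(keywords: list[str], today: date,
                    window_days: int = 45, max_results: int = 40) -> str:
    """Build an arXiv API query URL for papers submitted in the last window_days."""
    start = today - timedelta(days=window_days)
    # Assumed date-range format: YYYYMMDDHHMM bounds in GMT.
    date_range = f"submittedDate:[{start:%Y%m%d}0000 TO {today:%Y%m%d}2359]"
    # Match any of the category keywords in any field.
    terms = " OR ".join(f'all:"{kw}"' for kw in keywords)
    params = {
        "search_query": f"({terms}) AND {date_range}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(["tool use", "agent reliability"], date(2026, 3, 12))
```

Running one query per category and concatenating the results gives the raw scratch list; deduplicate by arXiv ID before applying the inclusion gate.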

Apply inclusion gate

Keep papers that pass at least 3 of 5:
- Clear method/benchmark contribution
- Reproducibility artifacts (code/data/eval details)
- Strong operator relevance (how to build, test, secure, deploy)
- Non-trivial novelty versus existing map
- Cross-category leverage (useful outside one niche)
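
The 3-of-5 gate above can be recorded as a simple checklist per paper. A minimal sketch, assuming each criterion is judged as a yes/no by the reviewer; the `GateChecks` class and its field names are illustrative, not part of any existing tooling in this repo:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateChecks:
    """One reviewer's yes/no judgment on each inclusion criterion."""
    clear_contribution: bool          # clear method/benchmark contribution
    reproducibility_artifacts: bool   # code/data/eval details available
    operator_relevance: bool          # helps build, test, secure, or deploy
    novelty_vs_map: bool              # non-trivial novelty versus existing map
    cross_category_leverage: bool     # useful outside one niche


def passes_gate(checks: GateChecks, threshold: int = 3) -> bool:
    """Return True when at least `threshold` criteria are satisfied."""
    score = sum(getattr(checks, f.name) for f in fields(checks))
    return score >= threshold


candidate = GateChecks(
    clear_contribution=True,
    reproducibility_artifacts=True,
    operator_relevance=True,
    novelty_vs_map=False,
    cross_category_leverage=False,
)
keep = passes_gate(candidate)
```

Keeping the individual booleans, rather than only the score, makes it easier to explain later why a paper was promoted or archived.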

Curate final set

Promote at most 1-3 papers per category per month.
Move low-signal or superseded papers to an archive list.

Update repo docs

Update this map.
Update README only if a paper changes the practical narrative (for example, reliability ceilings, new benchmark standard).

Log rationale

For each promoted paper, add a one-line "why it matters" note in the commit message or PR body.

Suggested commit format:

docs(research): monthly arxiv refresh YYYY-MM
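
If the rationale lines go in the commit message, the subject and body can be generated together. A small sketch, assuming the rationales are kept as a title-to-reason mapping; the helper name and the example reason text are hypothetical:

```python
from datetime import date


def refresh_commit_message(month: date, rationales: dict[str, str]) -> str:
    """Build the suggested subject line plus one 'why it matters' line per paper."""
    subject = f"docs(research): monthly arxiv refresh {month:%Y-%m}"
    body = "\n".join(f"- {title}: {why}" for title, why in rationales.items())
    # Git convention: a blank line separates the subject from the body.
    return f"{subject}\n\n{body}" if body else subject


message = refresh_commit_message(
    date(2026, 3, 1),
    {"Example Paper": "placeholder reason, replace with the real rationale"},
)
```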
