📋 Today's Table of Contents
  1. OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
  2. Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Age...
  3. Agentic workflow enables the recovery of critical materials from complex feedstocks via se...
  4. Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
  5. Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
Paper 1 of 5
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
cs.AI cs.CL 📅 2026-03-16
👥 Authors
Yuwen Du、Rui Ye、Shuo Tang、Xinyu Zhu、Yijun Lu、Yuzhu Cai、Siheng Chen
🏫 Affiliations
  • Shanghai Jiao Tong University
📝 Abstract (original text)

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, therefore promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% v.s. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% v.s. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

🔭 Background & Motivation
In the era of information explosion, seeking accurate, real-time, and reliable information from the vast expanse of the internet has become a fundamental pillar of modern decision-making (Marchionini, 1995; Given et al., 2023). Consequently, the ability to perform deep search has emerged as a non-negotiable competency for frontier Large Language Model (LLM) agents (OpenAI, 2025a). The past year has witnessed a rapid rise in the development of search agents. As recently as April 10, 2025, even the most advanced LLMs, such as OpenAI's o1 (OpenAI, 2024), struggled to surpass a score of 10 on the representative BrowseComp (Wei et al., 2025) benchmark. Yet, by March 2026, the landscape has shifted dramatically, with over ten agentic LLMs now exceeding the 50-point threshold (OpenAI, 2025b; Team et al., 2026a; Zeng et al., 2026), signaling a new era of autonomous web intelligence.

However, despite this rapid progress, the training of high-performance search agents has remained a "closed-door game" played almost exclusively by well-funded corporate entities (OpenAI, 2026; Team et al., 2026a). The most capable search agents are currently dominated by proprietary models from giants such as Google and OpenAI. While prominent labs including Kimi and Minimax have contributed open-weights models, they have remained silent regarding their training data. Even within the research community, existing works either open-source the model without data (Li et al., 2025b), provide only a fraction of data (Li et al., 2025c), or fail to achieve competitive performance (Lu et al., 2025). This persistent lack of complete high-quality training data has stifled the growth of the open-source community for nearly a year.

To bridge this gap, we, a purely academic team, introduce OpenSeeker, the first fully open-source search agent that achieves frontier-level performance in web search tasks. OpenSeeker is not merely an open-weights model; it is a comprehensive democratization of the search agent……
💡 Key Contributions
  • To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity.
⚙️ Method Details
3.1 Overview & Problem Formulation

Our primary objective is to synthesize a high-fidelity dataset D = {(q, y, τ*)} comprising complex queries q, ground-truth answers y, and optimal tool-use trajectories τ*. This dataset aims to empower an agent πθ to master long-horizon tool invocation for deep search tasks. We model the web as a directed graph G = (V, E), where V denotes web pages and E denotes hyperlinks. The synthesis challenge is to derive pairs (q, y) from G such that solving q necessitates a trajectory τ = [a1, o1, ..., aT, oT] of length T ≫ 1, where at are search actions and ot are observations.

We argue that to effectively train deep search agents, one must address two pivotal challenges: (1) High-difficulty QA: only sufficiently complex queries compel the system to engage in a rigorous multi-turn interaction cycle of "Reasoning → Tool Call → Tool Response". This process is essential to generate long-horizon trajectories characterized by explicit decision points and extended tool-invocation chains. (2) High-quality trajectories: the synthesis of solution paths must rely on stable and reproducible methods to ensure that the distilled training signals represent "correct and generalizable" strategies rather than accidental successes derived from stochastic sampling.

To address these, we propose a fact-grounded scalable controllable QA synthesis framework and a denoised trajectory synthesis method. The QA synthesis framework operates on the premise of reverse-engineering the reasoning graph: we first identify a latent inference path within G and then construct a question q that structurally mandates traversing this path. Complementarily, our trajectory synthesis method utilizes dynamic context denoising to generate clear reasoning and precise tool calls. By subsequently training on raw trajectories, we enable the agent to intrinsically learn to denoise and extract relevant information from noisy tool responses.

3.2 Fact-Grounded Scalable Controllable QA Synthesis

We engineer a pipeline to construct question-answer pairs (q, y) directly from the web graph G, as shown in Figure 2. By leveraging intrinsic connectivity, we transform static hyperlinks into dynamic reasoning paths, ensuring factual grounding and controllable complexity. This scalable framework operates in two distinct phases: Generative Construction to synthesize candidate pairs, and Dual-Criteria Verification to rigorously filter for difficulty and solvability.

3.2.1 Generative……
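To make the reverse-engineering idea concrete, here is a minimal, illustrative sketch of the two ingredients named above: topological expansion over a web graph, then entity obfuscation. The toy graph, function names, and the obfuscation rule are assumptions made for illustration, not the paper's implementation.

```python
import random

# Toy web graph: page -> (salient entity on the page, outgoing hyperlinks).
# The graph contents and rules below are illustrative assumptions only.
WEB_GRAPH = {
    "page/amundsen": ("Roald Amundsen", ["page/south_pole"]),
    "page/south_pole": ("South Pole", ["page/antarctic_treaty"]),
    "page/antarctic_treaty": ("Antarctic Treaty", []),
}

def sample_reasoning_path(graph, start, hops):
    """Topological expansion: follow hyperlinks to obtain a multi-hop path."""
    path, node = [start], start
    for _ in range(hops):
        links = graph[node][1]
        if not links:
            break
        node = random.choice(links)
        path.append(node)
    return path

def obfuscate(graph, start_page):
    """Entity obfuscation (toy rule): describe the head entity only indirectly,
    via a page it links to, so a single lookup cannot resolve the question."""
    neighbor = graph[start_page][1][0]
    neighbor_entity = graph[neighbor][0]
    return f"the entity whose page links to the page about {neighbor_entity}"

def synthesize_qa(graph, start_page, hops=2):
    """Fact-grounded QA pair: the answer is read off the final page of the path."""
    path = sample_reasoning_path(graph, start_page, hops)
    question = (f"Starting from {obfuscate(graph, start_page)}, follow "
                f"{len(path) - 1} hyperlink hops: which entity does the final page describe?")
    return {"q": question, "y": graph[path[-1]][0], "path": path}

print(synthesize_qa(WEB_GRAPH, "page/amundsen"))
```

In this toy setting, the path length controls complexity and the answer stays grounded in an actual page, mirroring the "controllable coverage and complexity" goal described above.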
🔬 Experiments & Results
📊 Key Experimental Data
  • BrowseComp: 29.5% vs. 15.3% for DeepDive, the second-best fully open-source agent
  • BrowseComp-ZH: 48.4% vs. 46.7% for Tongyi DeepResearch
  • Trained with simple SFT on only 11.7k synthesized samples in a single training run
4.1 Experimental Setup

Implementation. We develop OpenSeeker, a deep search agent initialized from Qwen3-30B-A3B-Thinking-2507 (Team, 2025), featuring 30B total parameters with 3B activated during prediction. The maximum tool call limit is set to 200, with any trajectory exceeding this threshold being forcibly terminated. The context window size is set to 256k. Each training sample comprises a user question q and a sequence of……
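As a rough illustration of the rollout budget described above (a hard cap of 200 tool calls with forced termination), the loop might look like the sketch below; the `agent` and `tools` interfaces are hypothetical placeholders, not OpenSeeker's actual code.

```python
# Hedged sketch of the trajectory rollout budget; interfaces are placeholders.
MAX_TOOL_CALLS = 200  # trajectories beyond this budget are forcibly terminated

def run_episode(agent, tools, question):
    trajectory = []  # alternating (action, observation) pairs
    for _ in range(MAX_TOOL_CALLS):
        action = agent.act(question, trajectory)               # reasoning + next action
        if action["type"] == "final_answer":
            return trajectory, action["answer"]
        observation = tools[action["tool"]](**action["args"])  # execute the tool call
        trajectory.append((action, observation))
    return trajectory, None  # budget exhausted: forced termination
```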
Paper 2 of 5
Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents
cs.SE cs.AI 📅 2026-03-16
👥 Authors
Ivan Stetsenko
🏫 Affiliations
  • Independent Researcher
📝 Abstract (original text)

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.

🔭 Background & Motivation
1.1 The Decision Shadow

Every commit in a software project is the visible output of an invisible process. A developer (or AI agent) encounters a problem, considers several possible approaches, evaluates the tradeoffs, selects one, and implements it. The commit captures exactly one artifact from this process: the final diff. Everything else—the problem definition, the alternatives considered, the reasons for rejection, the constraints that shaped the decision, the confidence level, the known weaknesses—evaporates. I call this lost context the Decision Shadow: the unrecorded reasoning behind why the code looks the way it does at any given point. Over time, Decision Shadows accumulate. Each one is individually small. Collectively, they produce what the industry calls "legacy code"—code that functions but whose structural rationale is lost. Peng and Wang [1] describe this phenomenon as "tacit knowledge" that "often lives in developer experience or informal artifacts rather than in code" and observe that "current [AI] assistants cannot reliably retrieve or reconstruct this knowledge on demand."

1.2 The Commit Message as It Exists Today

The Conventional Commits specification [5], the closest approximation to an industry standard, encodes a message as type(scope): short description. This tells you what happened (fix(auth): handle expired token refresh) but is nearly useless for understanding why the code evolved the way it did. The problem is structural, not motivational. The format was designed for a world where humans were the only consumers of commit history, the diff was the primary artifact, and deep context lived in people's heads—transmitted orally through standups, pull request reviews, and hallway conversations. All three assumptions are collapsing.

1.3 Why This Matters Now

Two shifts make the Decision Shadow problem urgent. Shift 1: AI agents are now primary code consumers. Tools such as Claude Code [6], GitHub Copilot [……
💡 Key Contributions
  • This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata.
⚙️ Method Details
📌 Method details could not be extracted beyond the abstract; see the original paper for the full protocol specification (trailer format, decision-record fields, and the CLI query interface).
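As a rough sketch of how an agent could discover such decision records with nothing but git, the snippet below pulls trailers out of recent commits via `git log`'s `%(trailers)` pretty-format placeholder. The trailer keys shown (Constraint, Rejected-Alternative) are illustrative assumptions, not necessarily the protocol's canonical key names.

```python
# Illustrative sketch: reading Lore-style decision trailers straight from git
# history. Trailer key names below are assumed examples, not the protocol's spec.
import subprocess

def read_decision_records(repo=".", limit=50):
    """Return [{sha, subject, trailers}] for the most recent commits."""
    fmt = "%H%x1f%s%x1f%(trailers:only,unfold)%x1e"  # field / record separators
    out = subprocess.run(
        ["git", "-C", repo, "log", f"-{limit}", f"--format={fmt}"],
        capture_output=True, text=True, check=True,
    ).stdout
    records = []
    for chunk in out.split("\x1e"):
        chunk = chunk.strip("\n")
        if not chunk:
            continue
        sha, subject, raw_trailers = chunk.split("\x1f")
        trailers = {}
        for line in raw_trailers.splitlines():
            if ": " in line:
                key, value = line.split(": ", 1)
                trailers.setdefault(key, []).append(value)
        records.append({"sha": sha, "subject": subject, "trailers": trailers})
    return records

if __name__ == "__main__":
    # Surface commits that recorded rejected alternatives (assumed key name).
    for rec in read_decision_records():
        for alt in rec["trailers"].get("Rejected-Alternative", []):
            print(rec["sha"][:8], rec["subject"], "| rejected:", alt)
```

A commit following this convention might end with trailer lines such as `Constraint: must not add runtime dependencies` and `Rejected-Alternative: ADR files in docs/` (again, illustrative keys); because trailers are a native git feature, any agent able to run shell commands can read them without extra infrastructure.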
🔬 Experiments & Results

📌 See the original paper's experiments section for detailed data

🎯 Concrete Examples from the Paper
📌 Excerpt / Case Study

…it represents significant infrastructure investment that most projects cannot undertake. Lore identifies the same problem but proposes a radically lighter-weight treatment: enriching an artifact every project already has (commit messages) using a mechanism git already supports (trailers), queryable through a tool every agent already knows how to use (a CLI).

2.3 Git Context Controller

The Git Context Controller (GCC) [3] proposes a structured context management framework for AI agents, using git-inspired operations (COMMIT, BRANCH, MERGE, CONTEXT) over a .GCC/ directory to organize agent memory……

⚠️ Limitations & Future Directions

Every codebase has lore. Most of it is lost. The protocol proposed here is a first step toward changing that.

Disclosure of AI-Assisted Tools: The author used AI-assisted tools during the research and writing process for this paper. Specifically, Claude (Anthropic, claude-opus-4-6) was used as an interactive research collaborator for brainstorming the core thesis, iterating on protocol design decisions, conducting structured literature review, and drafting and refining text. Google NotebookLM……

Paper 3 of 5
Agentic workflow enables the recovery of critical materials from complex feedstocks via selective precipitation
cs.AI📅 2026-03-16
👥 Authors
Andrew Ritchhart、Sarah I. Allec、Pravalika Butreddy、Krista Kulesa、Qingpu Wang、Dan Thien Nguyen、Maxim Ziatdinov、Elias Nakouzi
🏫 Affiliations

⚠️ Affiliation information could not be extracted from the PDF; please consult the original paper

📝 Abstract (original text)

We present a multi-agentic workflow for critical materials recovery that deploys a series of AI agents and automated instruments to recover critical materials from produced water and magnet leachates. This approach achieves selective precipitation from real-world feedstocks using simple chemicals, accelerating the development of efficient, adaptable, and scalable separations to a timeline of days, rather than months and years.

🔭 Background & Motivation
📌 Background details could not be extracted beyond the abstract; please consult the original paper.
💡 Key Contributions
  • We present a multi-agentic workflow for critical materials recovery that deploys a series of AI agents and automated instruments to recover critical materials from produced water and magnet leachates.
  • This approach achieves selective precipitation from real-world feedstocks using simple chemicals, accelerating the development of efficient, adaptable, and scalable separations to a timeline of days, rather than months and years.
⚙️ Method Details
📌 Method details could not be extracted beyond the abstract; see the original paper for the agent roles and the automated selective-precipitation workflow.
🔬 Experiments & Results

📌 See the original paper's experiments section for detailed data

Paper 4 of 5
Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
🏆 ICLR 2026 cs.AI 📅 2026-03-16
👥 Authors
Penny Chong、Harshavardhan Abichandani、Jiyuan Shen、Atin Ghosh、Min Pyae Moe、Yifan Mai、Daniel Dahlmeier
🏫 Affiliations
  • Stanford University
📝 Abstract (original text)

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role nor expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals-such as tool signatures, and responses-as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance with peaks of 8-10% on our proposed metrics after incorporating the identified error remedies into the agent's design.

🔭 Background & Motivation
Large Language Model (LLM) agents (Liu et al., 2023; Jang et al., 2025; Koh et al., 2024) are increasingly being adopted for many real-world tasks in various domains due to their potential to fully automate mundane workflows and enhance productivity. However, evaluation of agents remains a challenge today due to the heterogeneous domains the agents operate in. As every domain comes with its own goals, creating a scalable unified evaluation framework that reliably assesses agent performance across diverse tasks is non-trivial. Existing works (Qian et al., 2024; Lu et al., 2024; Barres et al., 2025; Chang et al., 2024) each propose their own evaluation methods, e.g., checking database states, tool signatures, or exact matches, which differ in scope and assumptions, making unification challenging. Moreover, since agent behavior is heavily influenced by the conversation trajectory with the user, current assessment methods that overlook the user's role in the interaction may fail to comprehensively capture the agent's performance.

Given that agents are non-deterministic and it is difficult to craft reference conversations, a common practice is to dynamically simulate the user responses in the conversation loop with the agent (Yao et al., 2024). This has become standard for agent evaluation because static user setups, where user messages are predetermined, do not work: the agent's responses to earlier predetermined user inputs may diverge from the reference conversation for which the static messages were curated. However, most works employing dynamic conversation have limitations because they do not systematically separate user persona from task instructions, thus failing to account for the impact of……

Code and dataset: https://github.com/SAP-samples/agent-quality-inspect
💡 Key Contributions
  • To address this, we introduce the TED framework (Talk, Evaluate, Diagnose).
  • (2) Evaluate: We adapt existing datasets by representing subgoals-such as tool signatures, and responses-as natural language grading notes, evaluated automatically with LLM-as-a-judge.
  • We propose new metrics that capture both turn efficiency and intermediate progress of the agent complementing the user-aware setup.
  • (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement.
  • We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels.
  • We also demonstrate potential gains in agent performance with peaks of 8-10% on our proposed metrics after incorporating the identified error remedies into the agent's design.
⚙️ Method Details

⚙️ Main Steps:

  1. Prior evaluation pipelines often stop at metric reporting; to address these shortcomings, we propose the TED framework (Talk, Evaluate, Diagnose).
  2. In the Talking stage, we decouple user personas from task instructions and introduce a user-aware agent evaluation framework based on reusable, generic persona templates, enabling diverse and systematic creation of test scenarios (a hedged sketch follows this list).
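As a loose illustration of decoupling personas from task instructions, the templates below show how an expert and a non-expert simulated user could be composed with the same task; the wording and fields are assumptions for illustration, not the paper's actual templates.

```python
# Illustrative sketch only: reusable persona templates composed with a task
# instruction for the user simulator. Field names and wording are assumptions.
EXPERT_PERSONA = (
    "You are simulating an expert user in the {domain} domain. Use precise "
    "terminology, state requirements completely, and correct the agent's mistakes."
)
NON_EXPERT_PERSONA = (
    "You are simulating a non-expert user in the {domain} domain. Describe your "
    "goal vaguely, omit details until asked, and use imprecise terminology."
)

def build_user_simulator_prompt(persona_template: str, domain: str, task: str) -> str:
    """Compose a generic, reusable persona with a task-specific instruction."""
    return persona_template.format(domain=domain) + "\nYour goal for this conversation: " + task

print(build_user_simulator_prompt(
    NON_EXPERT_PERSONA, "mobile devices", "Get the agent to enable Wifi."
))
```

Keeping the persona generic and the task separate is what lets the same template be reused across datasets and expertise levels.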
🔬 Experiments & Results
📊 Key Experimental Data
  • Peak gains of 8-10% on the proposed metrics after incorporating the identified error remedies into the agent's design
🎯 Concrete Examples from the Paper
📌 Excerpt / Case Study

…Agent should enable Wifi. More examples are in Appendix A.12.

3.2.1 LLM-as-a-Judge and MaxProgressRate@k

LLM-as-a-judge. We extend beyond the multi-step agent-environment setting and exact-match metric (Chang et al., 2024) by evaluating agents in a multi-turn user-agent setup, where grading notes serve as subgoals to assess both intermediate and final states, tool calls, as well as the agent's output responses. Let D = {(i, Gi) | i ∈ I} be the test dataset, where i ∈ I is a task instruction, Gi = {gi,1, gi,2, ..., gi,ni} is the set of grading notes associated with task instruction i, and |Gi……
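A hedged sketch of the grading-note evaluation is given below. The judge prompt and `call_llm` are placeholders, and the MaxProgressRate@k computation reflects one plausible reading of the name (best fraction of satisfied grading notes over k runs of the same task); the excerpt above is truncated before the formal definition, so treat this as an assumption rather than the paper's metric.

```python
# Assumed sketch of grading-note judging; the prompt, call_llm, and the metric
# reading are illustrative, not the paper's exact implementation.
from typing import Callable, List

JUDGE_PROMPT = (
    "Conversation transcript:\n{transcript}\n\n"
    "Grading note (subgoal): {note}\n"
    "Answer strictly YES or NO: did the agent satisfy this grading note?"
)

def progress_rate(transcript: str, notes: List[str],
                  call_llm: Callable[[str], str]) -> float:
    """Fraction of grading notes the LLM judge marks as satisfied in one run."""
    if not notes:
        return 1.0
    hits = sum(
        call_llm(JUDGE_PROMPT.format(transcript=transcript, note=n))
        .strip().upper().startswith("YES")
        for n in notes
    )
    return hits / len(notes)

def max_progress_rate_at_k(transcripts_k_runs: List[str], notes: List[str],
                           call_llm: Callable[[str], str]) -> float:
    """Assumed reading of MaxProgressRate@k: best progress across k runs of a task."""
    return max(progress_rate(t, notes, call_llm) for t in transcripts_k_runs)
```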

Paper 5 of 5
Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
cs.AI 📅 2026-03-16
👥 Authors
Zidane Wright、Jason Tsay、Anupama Murthi、Osher Elhadad、Diego Del Rio、Saurabh Goyal、Kiran Kate、Jim Laredo、Koren Lazar、Vinod Muthusamy et al. (11 authors in total)
🏫 Affiliations
  • IBM Research
  • Correspondence: jason.tsay@ibm.com
📝 Abstract (original text)

As AI agents move from demos into enterprise deployments, their failure modes become consequential: a misinterpreted tool argument can corrupt production data, a silent reasoning error can go undetected until damage is done, and outputs that violate organizational policy can create legal or compliance risk. Yet, most agent frameworks leave builders to handle these failure modes ad hoc, resulting in brittle, one-off safeguards that are hard to reuse or maintain. We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle. Across the agent lifecycle, we identify opportunities to intervene and improve, namely, post-user-request, pre-LLM prompt conditioning, post-LLM output processing, pre-tool validation, post-tool result checking, and pre-response assembly. ALTK provides modular middleware that detects, repairs, and mitigates common failure modes. It offers consistent interfaces that fit naturally into existing pipelines. It is compatible with low-code and no-code tools such as the ContextForge MCP Gateway and Langflow. Finally, it significantly reduces the effort of building reliable, production-grade agents.

🔭 Background & Motivation
The agentic paradigm has accelerated rapidly as developers build increasingly capable LLM-powered agents that can reason, call tools, and produce structured outputs. Yet, these systems remain fundamentally brittle: as complexity grows, so do issues like hallucinated tool calls, silent failures, inconsistent outputs, and reasoning errors that break workflows. To address these challenges, we introduce ALTK, an open-source, framework-agnostic package that improves agent reliability, predictability, and production readiness. The Agent Lifecycle Toolkit (ALTK) can integrate into any agent pipeline and add deterministic safeguards and recovery mechanisms that elevate agents from "cool demos" to dependable, enterprise-grade systems.

Early agents often rely on a simple loop of repeated LLM tool calls, useful for prototypes but insufficient for enterprise reliability. Production agents need additional logic to ensure robustness, especially in domains like sales where a single misinterpreted field can trigger incorrect APIs and distort downstream forecasts. Agent orchestration frameworks such as LangChain [4], LangGraph [10], and CrewAI [16] offer building blocks such as tools, memory, and popular agent architectures. However, they expect developers to write custom handling of tool-call errors or checks for policy conformance.

ALTK is a modular toolkit that comes with pre-built, hardened, drop-in components to strengthen reasoning, tool execution, and output validation in agents. Rather than enforcing a particular agent framework (such as LangChain, LangGraph, or AutoGPT), its framework-agnostic design allows teams to introduce targeted reliability improvements without re-architecting their agents. ALTK currently includes 10 components, each addressing a distinct failure mode in the agent lifecycle, as summarized in Figure 1. For example, the lifecycle stage of "Pre-Tool" indicates a step in the agent's execution when the LLM has generated a tool call but the tool call is yet t……
💡 Key Contributions
  • We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle.
⚙️ Method Details

⚙️ Main Steps:

  1. defining the input to the component,
  2. instantiating and configuring the component, and…… (a minimal usage sketch follows below)
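To ground the usage pattern above, here is a minimal, hypothetical pre-tool validation component following the same steps (define the input, instantiate and configure the component) plus an invocation; the class and method names are illustrative, not ALTK's actual API.

```python
# Illustrative sketch of a middleware-style pre-tool component; names are
# placeholders, not ALTK's real interfaces.
from dataclasses import dataclass

@dataclass
class ToolCall:                       # 1. define the input to the component
    name: str
    args: dict

class PreToolValidator:               # 2. instantiate and configure the component
    def __init__(self, tool_schemas: dict):
        self.tool_schemas = tool_schemas

    def process(self, call: ToolCall) -> ToolCall:
        """Reject or repair a tool call before it is executed."""
        schema = self.tool_schemas.get(call.name)
        if schema is None:
            raise ValueError(f"hallucinated tool: {call.name}")
        missing = [k for k in schema["required"] if k not in call.args]
        if missing:
            raise ValueError(f"missing required arguments: {missing}")
        return call

# Invocation: run the component between LLM output and tool execution.
validator = PreToolValidator({"search": {"required": ["query"]}})
checked = validator.process(ToolCall("search", {"query": "ALTK middleware"}))
```

Because the component only consumes and returns a tool-call object, it can be dropped into the "Pre-Tool" stage of an existing pipeline without re-architecting the agent.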
🔬 Experiments & Results
📊 Key Experimental Data
📌 Benchmark figures (e.g., on LiveAPIBench) could not be reliably extracted; see the original paper's experiments section for detailed data