LLM Agent 论文日报 · 2026年03月18日

第 1 篇 / 共 5 篇

RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

📄 arXiv:2603.16411v1 📥 下载 PDF

cs.CLcs.CL📅 2026-03-17

👥 作者

Abhishek Kumar、Aashraya Sachdeva

🏫 机构单位

decode time has been a primary research focus.
which is often unavailable when using production ASR systems
cross-lingual evidence . However, all these methods face
a fundamental evidence limitation that if an entity is deleted
largely been limited to conversational or dialogue tasks ,

📝 论文摘要（原文）

Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are used using different strategies, namely, 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. The LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.

🔭 研究背景与动机

💡 核心贡献

▶ To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent.

⚙️ 方法详解

🔬 实验与结果

📊 关键实验数据

46%
22 percent
33.2%
20.8%
94.51%
Retrieval: 1

第 2 篇 / 共 5 篇

VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

📄 arXiv:2603.16289v1 📥 下载 PDF

cs.CVcs.AIcs.CV📅 2026-03-17

👥 作者

Zhengbo Zhang、Jinbo Su、Zhaowen Zhou、Changtao Miao、Yuhan Hong、Qimeng Wu、Yumeng Liu、Feier Wu、Yihe Tian、Yuhao Liang 等（共 17 位作者）

🏫 机构单位

But existing benchmarks suffer from two limitations insufficient evaluation of vi-
only achieves an accuracy of . , while the proprietary Deep Research model, o-
deep-research only achieves an accuracy of . . The code and data can be accessed
a plethora of high-quality works has emerged in the deep research domain , , , . How-
ever, existing deep research benchmarks predominantly focus on the textual modality, thereby

📝 论文摘要（原文）

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

🔭 研究背景与动机

Driven by the rapid advancements in large language models (LLMs) and agent technologies, a plethora of high-quality works has emerged in the deep research domain [1, 2, 3, 4]. How- ever, existing deep research benchmarks predominantly focus on the textual modality, thereby neglecting the multimodal demands inherent in real-world retrieval scenarios. Concurrently, the evolution of multimodal large language models (MLLMs) has inspired a series of works on Multimodal Browsing Agents [5, 6]. Nevertheless, existing multimodal benchmarks still exhibit significant limitations, as shown in Figure 1. Specifically, most current benchmarks (e.g., MMSearch [7] and BrowseComp-VL [8]) merely test models’ ability to invoke tools for solving text-image queries. These tasks typically introduce *Equal contribution. †Work done during internship in Ant Group. ‡Corresponding to smxiang@nlpr.ia.ac.cn, xxxx@antgroup.com arXiv:2603.16289v1 [cs.CV] 17 Mar 2026 Issue1: Visual queries can be replaced by image search tools, and MLLM can be replaced by LLM. Issue2: The search process lacks multimodal content, and the task degenerates into a plain text deep search. VisBrowse-Bench Agentic Framework Question: What was this building before it renovated to be a museum? Ground truth: Beer factory MLLM The original question asked about a building, which is now a museum. First, I need to determine which museum is shown in the picture. image_search Whole image search, without reasoning ...Webpage title: Sapporo Beer Museum – Wikipedia URL: … Question: In what year did the person to Einstein's right rear (from Einstein's perspective) receive their PhD? Ground truth: 1926 MLLM I need to locate Einstein and identify who is to his right rear. crop_image image_search Reasoning with image Identify the person is Paul Dirac Degenerate into a plain text deep search When did Paul Dirac receive his PhD? text_sear ch Questi……

💡 核心贡献

▶ To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench.
▶ We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process.

⚙️ 方法详解

⚙️ 主要步骤：

44.4 45.8 35.7 39.1 25.9 50.0 25.0 + IS 40.2 (+1.
44.4 45.8 42.9 43.5 48.1 29.2 18.8 Gemini-3.0-Flash Direct Answer 32.5 29.6 33.3 28.6 34.8 37.0 33.3 31.2 + TS 37.9 (+5.
40.7 50.0 28.6 43.5 29.6 45.8 25.0 + IS 39.1 (+1.
40.7 45.8 32.1 43.5 37.0 45.8 25.0 Gemini-2.5-Pro Direct Answer 19.5 25.9 4.2 28.6 26.1 14.8 12.5 25.0 + TS 20.7 (+1.
40.7 4.2 25.0 26.1 18.5 16.7 6.2 + IS 26.6 (+5.

Overall Category Meida Life Art Geography Techology Sport Finance Closed-source Model Gemini-3.0-Pro Direct Answer 23.7 25.9 16.7 28.6 30.4 25.9 16.7 18.8 + TS 38.5 (+14.8) 44.4 45.8 35.7 39.1 25.9 50.0 25.0 + IS 40.2 (+1.7) 44.4 45.8 42.9 43.5 48.1 29.2 18.8 Gemini-3.0-Flash Direct Answer 32.5 29.6 33.3 28.6 34.8 37.0 33.3 31.2 + TS 37.9 (+5.4) 40.7 50.0 28.6 43.5 29.6 45.8 25.0 + IS 39.1 (+1.2) 40.7 45.8 32.1 43.5 37.0 45.8 25.0 Gemini-2.5-Pro Direct Answer 19.5 25.9 4.2 28.6 26.1 14.8 12.5 25.0 + TS 20.7 (+1.2) 40.7 4.2 25.0 26.1 18.5 16.7 6.2 + IS 26.6 (+5.9) 37.0 25.0 32.1 34.8 22.2 25.0 0.0 Gemini-2.5-Flash Direct Answer 9.5 14.8 4.2 0.0 13.0 22.2 8.3 0.0 + TS 17.2 (+7.7) 40.7 16.7 10.7 21.7 7.4 12.5 6.2 + IS 20.7 (+3.5) 37.0 12.5 14.3 26.1 29.6 16.7 0.0 GPT-5.2 Direct Answer 14.8 7.4 25.0 21.4 13.0 18.5 12.5 0.0 + TS 26.0 (+11.2) 33.3 41.7 17.9 26.1 25.9 25.0 6.2 + IS 28.4 (+2.4) 37.0 29.2 25.0 30.4 37.0 25.0 6.2 GPT-5.1 Direct Answer 13.0 14.8 20.8 21.4 17.4 11.1 0.0 0.0 + TS 16.6 (+3.6) 25.9 29.2 14.3 21.7 14.8 0.0 6.2 + IS 23.1 (+6.5) 22.2 41.7 14.3 30.4 22.2 12.5 18.8 Claude-4.6-Opus Direct Answer 27.2 37.0 25.0 28.6 26.1 29.6 20.8 18.8 + TS 42.6 (+15.4) 48.1 45.8 35.7 56.5 48.1 33.3 25.0 + IS 47.6 (+5.0) 53.6 50.0 35.7 56.5 59.3 45.8 25.0 Claude-4.6-Sonnet Direct Answer 13.0 18.5 12.5 14.3 26.1 11.1 4.2 0.0 + TS 23.7 (+10.7) 29.6 20.8 25.0 39.1 14.8 20.8 12.5 + IS 18.3 (-5.4) 25.9 20.8 3.6 30.4 18.5 20.8 6.2 Kimi-K2.5 Direct Answer 21.3 18.5 12.5 17.9 26.1 25.9 29.2 18.8 + TS 21.9 (+0.6) 29.6 16.7 14.3 34.8 29.6 16.7 6.2 + IS 41.4 (+19.5) 63.0 50.0 25.0 39.1 51.9 33.3 18.8 Qwen3-VL-Plus Direct Answer 17.8 18.5 20.8 17.9 8.7 25.9 20.8 6.2 + TS 27.8 (+10.0) 33.3 29.2 21.4 30.4 37.0 20.8 18.8 + IS 32.5 (+4.7) 37.0 50.0 25.0 39.1 40.7 16.7 12.5 Open-source Model Qwen3-VL-235B-A22B Direct Answer 10.1 11.1 20.8 3.6 8.7 14.8 8.3 0.0 + TS 11.2 (+1.0) 11.1 12.5 0.0 17.4 22.2 12.5 0.0 + IS 14.2 (+3.0) 25.9 25.0 3.6 26.1 11.1 4.2 0.0 Deep Research Model o3-Deep-Research Direct Answer 41.1 55.6 41.7 21.4 52.2 48.1 37.5 25.0 4.2. Results and Analysis The main results on VisBrowse-Bench are shown in Table 2, and we make the analysis as follow: Inherent challenge. Under the direct answer method, all models perform poorly, demonstrat- ing the challenging nature of our benchmark in complex real-world search tasks, where the model’s parameterized knowledge is insufficient to handle complex queries requiring dynami- cally acquired evidence. 9 Kimi-K2.5 55.7% 5.7%……

🔬 实验与结果

📊 关键实验数据

47.6%
41.1%
30%

4.1. Experimental Setups Evaluated Models. We evaluate both closed-source MLLMs, open-source MLLMs and Deep Research Models on VisBrowse-Bench. Closed-source models include Gemini family: Gemini- 3.0-Pro, Gemini-3.0-Flash, Gemini-2.5-Pro and Gemini-2.5-Flash [21]; GPT family: GPT-5.2 and GPT-5.1 [22]; Claude family: Claude-4.6-Opus and Claude-4.6-Sonnet [23]; Kimi-K2.5 [24] and Qwen3-VL-Plus [25]. Open-source model includes Qwen3-VL-235B-A22B-Instruct [25]. Deep Research model includes o3-Deep-Research [26]. Implement Details. To quantify the impact of tool usage on performance, we evaluate each model under three progressively enhanced tool-use methods: • Direct Answer: Models answer the question relying on internal parametric knowledge, without external tool access. • + Text Search (+ TS): Models can only use text_search and webpage_visit tools to support evidence acquisition. • + Image Search (+ IS): Models can use all the tools in the Section 3.5 framework to collect visual and textual evidence. Evaluation Metrics. We use accuracy (%) as the metric to evaluate the model’s performance on VisBrowse-Bench. Through the regular expression matching method, we extract the model’s final answer. We use GPT-5.1 as the judge model and employ the LLM-as-Judge method to compare the model’s answer with the ground truth and determine its correctness. Details of the judgment model’s prompt are in the Appendix 5.1. 8 Table 2: Main results of the VisBrowse-Bench across different categories. The evaluation metric is accuracy (%). Each model is evaluated using three methods: Direct answer, + TS and + IS. The green numbers represent performance improvement compared to the previous method, and red numbers represent performance degradation. The bold numbers represent the best accuracy overall or in each category, and underlined numbers represent the second-best accuracy. Model

🎯 论文中的具体示例

📌 原文摘录 / Case Study

category in VisBrowse-Bench are shown in Figure 3. To ensure high-quality and data diversity, the question and answer pairs in each category were obtained through collaborative work by two experts in the field. Each question is an image–text bundle: a textual query paired with a small set of images containing initial entity information. In this section, we will outline the details of VisBrowse-Bench. The entire pipeline includes the development of data design criteria, data collection, and a strict validation process. 4 Sport Question: The athlete in the red shirt in the picture has taken a ph

第 3 篇 / 共 5 篇

Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

📄 arXiv:2603.16264v1 📥 下载 PDF

cs.AIcs.AI📅 2026-03-17

👥 作者

Chunjiang Mu、Ya Zeng、Qiaosheng Zhang、Kun Shao、Chen Chu、Hao Guo、Danyang Jia、Zhen Wang、Shuyue Hu

🏫 机构单位

School of Cybersecurity, Northwestern Polytechnical University
Shanghai Artificial Intelligence Laboratory
Huawei Noah s Ark Lab
School of Statistics and Mathematics, Yunnan University of Finance and Economics
QiYuan Lab

📝 论文摘要（原文）

Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders-mismatches in the depth of ToM reasoning between agents-can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner's likely ToM order and leverages this estimation to predict the partner's action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.

🔭 研究背景与动机

Multi-agent coordination involves the precise alignment of actions among multiple agents to enable effective joint be- havior, and is widely applied in areas such as autonomous driving (Zhang et al. 2024b), swarm robotics (Kegeleirs and Birattari 2025), and distributed control (Ge et al. 2025). A key challenge in this area is zero-shot coordination, where agents need to coordinate with previously unseen partners without prior joint training or communication (Hu et al. 2020). Large language models (LLMs) have been widely used to construct zero-shot coordination agents, as they pos- sess strong decision-making and generalization capabilities, *This work was done during his internship at Shanghai Artifi- cial Intelligence Laboratory. †Corresponding author. Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. and can be deployed without task-specific training (Agashe et al. 2023; Zhang et al. 2024a; Liu et al. 2024). Effective coordination with unseen partners requires the ability to model and anticipate their behavior. Recent re- search has incorporated explicit Theory of Mind (ToM) into the architecture of LLM-based agents, enabling them to model others by reasoning about others’ beliefs, desires, and intentions (Li et al. 2023a; Agashe et al. 2023). ToM-based workflow designs and prompting techniques have demon- strated clear effectiveness and become an important compo- nent in the design of agents in multi-agent problems. More generally, since other agents may also possess ToM capabil- ities, it is necessary to equip LLM-based agents with higher- order ToM to reason about others’ reasoning (e.g., “I be- lieve that you believe...”) (de Weerd, Verbrugge, and Ver- heij 2014; Wellman 2018). However, it has been found that higher ToM orders do not necessarily improve performance, in either cooperative or competitive multi-agent tasks (Li et al. 2023a; Shao et al. 2024; Zhang et al. 2025). Prior work empirically……

💡 核心贡献

▶ Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM.
▶ Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks.
▶ However, we find that misaligned ToM orders-mismatches in the depth of ToM reasoning between agents-can lead to insufficient or excessive reasoning about others, thereby impairing their coordination.

⚙️ 方法详解

⚙️ 主要步骤：

is the discount factor. Denote the policies of two agents by π1 : S →∆(A
The joint policy of two agents is denoted as π = (π1, π
Although LLM-based agents do not require a reward function for training, we still use the discounted expected return (i.e., the value function) to define the rationality of LLM-based agents. At each
that maximize their joint value function: a∗= arg max a∈A1×A2 Qπ(s, a),

3.1 Problem Formulation We consider a fully cooperative decision-making problem involving two agents within a given Markovian environ- ment, where the actions of agents require coordination to achieve an optimal outcome. The environment can be for- malized as a tuple M = ⟨S, A1, A2, T , R, γ⟩. S is the shared state space. A1 and A2 are the action spaces of the two agents. T : S ×A1×A2 →∆(S) is the transition func- tion. R : S × A1 × A2 →R is a shared reward function that ensures that both agents receive the same reward. γ ∈[0, 1) is the discount factor. Denote the policies of two agents by π1 : S →∆(A1) and π2 : S →∆(A2). The joint policy of two agents is denoted as π = (π1, π2). Although LLM-based agents do not require a reward function for training, we still use the discounted expected return (i.e., the value function) to define the rationality of LLM-based agents. At each step, two rational LLM-based agents aim to select the joint actions a∗:= (a∗ 1, a∗ 2) that maximize their joint value function: a∗= arg max a∈A1×A2 Qπ(s, a), (1) where Qπ(s, a) = Eπ " ∞ X t=0 γt R(st, at) s0 = s, a0 = a # . (2) In some states, there are multiple optimal joint actions sat- isfying a∗,1 = a∗,2 = · · · = max Qπ(s, a). Two agents must coordinate to agree on the same optimal joint action. However, it can be particularly challenging in the absence of communication or prior agreement (Boutilier 1999). 3.2. ToM Modeling The order of ToM is the depth of recursive reasoning an agent uses to model its partner’s behavior. For convenience, we refer to an agent with k-th order ToM as a ToM-k agent. We now define the decision-making process of an agent i with different orders of ToM reasoning as follows. without loss of generality, we assume i = 2. ToM-0 agent. A ToM-0 agent treats its partner as part of the environment state. Its decision depends solely on the en- vironment state: π(0) i (s) := arg max a∈Ai Qπ(s, a). (3) ToM-1 agent. A ToM-1 agent assumes that its partner j is a ToM-0 agent. Its first-order belief b(1) i is partner j’s pre- dicted action apred j : b(1) i := apred j , where apred j = π(0) j (s). (4) The agent then selects the action that best coordinates with apred j : π(1) i (s, b(1) i ) := arg max a∈Ai Qπ(s, apred j , a). (5) ToM-2 agent. Similar to a ToM-1 agent, a ToM-2 agent first infers the partner’s action apred j as its second-order be- lief. Differently, the ToM-2 agent thinks that its partner j is a ToM-1 agent and agent j thinks agent i is a ToM-0 agent: b……

🔬 实验与结果

📌 请参阅原文实验章节获取详细数据

🎯 论文中的具体示例

📌 原文摘录 / Case Study

consider a fully cooperative decision-making problem involving two agents within a given Markovian environ- ment, where the actions of agents require coordination to achieve an optimal outcome.

⚠️ 局限性与未来方向

Frontiers in Robotics and AI, 12: 1607978. Kleiman-Weiner, M.; Ho, M. K.; Austerweil, J. L.; Littman, M. L.; and Tenenbaum, J. B. 2016. Coordinate to cooper- ate or compete: abstract goals and joint intentions in social interaction. In Proceedings of the annual meeting of the cog- nitive science society, volume 38. Li, H.; Chong, Y.; Stepputtis, S.; Campbell, J. P.; Hughes, D.; Lewis, C.; and Sycara, K. 2023a. Theory of Mind for Multi-Agent Collaboration via Large Language Models. In Proceedings

第 4 篇 / 共 5 篇

CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

📄 arXiv:2603.16215v1 📥 下载 PDF

cs.MAcs.AIcs.MA📅 2026-03-17

👥 作者

Gengxin Sun、Ruihao Yu、Liangyi Yin、Yunqi Yang、Bin Zhang、Zhiwei Xu

🏫 机构单位

CoMAI A Collaborative Multi-Agent Framework for Robust and
Shandong University
Institute of Automation, Chinese
based on large language models (LLMs), which exhibit limited con-
trollability and are susceptible to vulnerabilities such as prompt

📝 论文摘要（原文）

Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.

🔭 研究背景与动机

In the context of intensifying global competition for talent, re- cruitment and interviewing have become critical mechanisms for educational institutions and enterprises to identify high-caliber ∗Both authors contributed equally to this research. †Corresponding author. ‡Corresponding author. Question Generation ? Security Scoring Summarization CoMAI … Figure 1: Overview of CoMAI, a collaborative multi-agent interview framework that orchestrates specialized agents through a centralized controller. candidates. Despite their widespread use, traditional manual inter- views suffer from inherent limitations that undermine both rigor and fairness. They rely heavily on interviewers’ subjective judg- ments, which are prone to personal biases and emotional influ- ences, thereby compromising the consistency and impartiality of outcomes. Conducting interviews on a large scale also entails sub- stantial labor and time costs, limiting efficiency and scalability. In addition, candidates’ performance is often influenced by exter- nal conditions and contingent factors, introducing randomness and instability into evaluation results. The lack of transparency in the process further makes it difficult for candidates to understand the evaluation criteria and weakens comparability across differ- ent cohorts. Moreover, traditional interviews are unable to adapt dynamically to candidates’ individual characteristics or real-time performance, thereby lacking adaptability and personalized support. Consequently, conventional interview formats frequently fall short of meeting the multifaceted requirements of elite talent assessment. Driven by the rapid advancement of artificial intelligence and large language models (LLMs) [19], AI-based interviewing systems have been introduced to meet the increasing demand for talent evaluation [29, 37, 41]. These systems reduce operational costs and arXiv:2603.16215v1 [cs.MA] 17 Mar 2026 provide standardized interview experiences for large numbers of candidat……

💡 核心贡献

▶ This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios.

⚙️ 方法详解

⚙️ 主要步骤：

Monolithic architectures are poorly suited for concurrent usage and are vul- nerable to cascading failures when a single module malfunctions;
Rigid structures constrain adaptability across diverse interview scenarios, leading to weak generalization;
We propose CoMAI, a scalable and resilient multi-agent ar- chitecture, to improve fault tolerance and maintain stable performance under concurrent usage.
A layered security strategy is incorporated to defend against adversarial manipulations such as prompt injection, ensuring robustness in sensitive assessment scenarios.

🔬 实验与结果

📊 关键实验数据

90.47%
100%
44.23%
retrieval: 3
Retrieval: 12

🎯 论文中的具体示例

📌 原文摘录 / Case Study

CoMAI organizes the interview process through specialized agents responsible for question generation, security monitoring, scoring, and summarization, all coordinated by a centralized finite-state controller (CFSC) [40]. This design departs from monolithic single-agent architectures and ensures both methodological rigor and practical applicability in high-stakes evaluation contexts. Significantly, the framework operates without requiring additional training or fine-tuning and can be readily adapted to diverse underlying models. The main contributions of this work are as follows: (1) We propose

⚠️ 局限性与未来方向

["Weak explanation of terminology"], "suggestions": ["Practice concise communication"] } Summary Agent Schema { "final_grade": "A", "final_decision": "accept", "overall_score": 9, "summary": "Candidate shows strong potential ...", "strengths": ["Analytical thinking", " Communication"], "weaknesses": ["Limited collaboration evidence"], "recommendations": { "for_candidate": "Improve collaboration skills", "for_program": "Provide mentorship in teamwork" }, "confidence_level": "high", "detailed_anal

第 5 篇 / 共 5 篇

Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation

📄 arXiv:2603.16169v1 📥 下载 PDF

cs.IRcs.AIcs.CL📅 2026-03-17

👥 作者

Surya Vardhan Yalavarthi

🏫 机构单位

College of Engineering and Applied Science
University of Cincinnati
the original implementation relies on proprietary components including the Google Search API
and closed model weights, limiting reproducibility. In this work, we present a fully open-source
including domain transfer limitations on science questions. All code and results are available at

📝 论文摘要（原文）

Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct. We evaluate on PopQA and ARC-Challenge, demonstrating that our open-source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG's T5-based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at https://github.com/suryayalavarthi/crag-reproduction.

🔭 研究背景与动机

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks, yet they remain susceptible to hallucinations — generating factually incorrect content with apparent confidence [7]. Retrieval-Augmented Generation (RAG) [2] addresses this limitation by grounding generation in externally retrieved documents. However, RAG assumes that retrieved documents are relevant, which is frequently not the case in practice. Corrective Retrieval Augmented Generation (CRAG) [1] addresses this assumption by introducing a lightweight retrieval evaluator that assesses document quality and triggers one of three corrective actions: Correct, Incorrect, or Ambiguous. When retrieval quality is low, CRAG falls back to web search to obtain better context. This corrective mechanism has been shown to significantly improve generation accuracy across multiple benchmarks. Despite CRAG’s strong results, reproducing the original system is difficult. The original implementation relies on the Google Search API (a paid commercial service), proprietary LLaMA-2 fine-tuned weights, and deprecated OpenAI API calls. These barriers prevent researchers from building on this work without significant resources. In this paper, we make three contributions: 1. We present a fully open-source reproduction of CRAG, replacing all proprietary components with free alternatives: the Wikipedia API for web search and Phi-3-mini-4k-instruct as the generator. 1 arXiv:2603.16169v1 [cs.IR] 17 Mar 2026 2. We evaluate our reproduction on two datasets — PopQA [3] and ARC-Challenge [4] — demonstrating that open-source components achieve comparable performance. 3. We contribute the first explainability analysis of CRAG’s T5-based retrieval evaluator using SHAP [6], identifying that the evaluator relies primarily on named entity alignment and exhibits systematic domain transfer failure on science questions.

💡 核心贡献

▶ In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct.

⚙️ 方法详解

⚙️ 主要步骤：

54.9 53.7 Table 2: Accuracy (%) on PopQA and ARC-Challenge. Our reproduction uses Phi-3-mini as the generator. Original results are from [1] using LLaMA-2-hf-7b. The large ARC gap between CRAG
The % column shows what fraction of questions triggered each action; Accuracy (%) shows correctness within that action subset. The Correct action achieves 78.1% accuracy, a 26.7 percentage point improvement over vanilla RAG

PopQA ARC-Challenge Vanilla RAG 51.4 84.8 CRAG (ours) 54.4 85.2 CRAG + Wikipedia (ours) 53.2 — CRAG (original, LLaMA-2) 54.9 53.7 Table 2: Accuracy (%) on PopQA and ARC-Challenge. Our reproduction uses Phi-3-mini as the generator. Original results are from [1] using LLaMA-2-hf-7b. The large ARC gap between CRAG (85.2%) and vanilla RAG (84.8%) is consistent with Phi-3-mini’s strong parametric science knowledge. The similar PopQA results across generators (54.4% vs. 54.9%) suggest CRAG’s correction mechanism is the primary driver of performance rather than generator-specific capabilities. Our open-source CRAG reproduction achieves 54.4% on PopQA, closely matching the original system’s 54.9% despite using a different generator. On ARC-Challenge, our system achieves 85.2% versus 84.8% for vanilla RAG, a modest but consistent improvement. 5.4 Action Distribution Analysis Table 3 shows accuracy broken down by triggered action on PopQA. 5 Action Count % Accuracy (%) Correct 754 54.4 78.1 Ambiguous 379 27.4 19.3 Incorrect 252 18.2 36.1 Table 3: Action distribution and per-action accuracy on PopQA (n=1,385). The % column shows what fraction of questions triggered each action; Accuracy (%) shows correctness within that action subset. The Correct action achieves 78.1% accuracy, a 26.7 percentage point improvement over vanilla RAG (51.4%), demonstrating that the T5 evaluator effectively identifies high-quality retrievals. The Ambiguous action achieves only 19.3% without web search, confirming that web search is essential for this action. Adding Wikipedia search improves Ambiguous accuracy to 23.0%, a 4.7 percentage point gain. On ARC-Challenge, the T5 evaluator classifies 88.3% of questions as Ambiguous, reflecting its training distribution bias toward biographical entity questions rather than science questions. Despite this, CRAG maintains competitive performance due to Phi-3-mini’s strong parametric knowledge of science. 6 6 Explainability Analysis To understand what the T5 retrieval evaluator has learned, we apply SHAP to analyze token-level attributions across all three action types. We select 9 representative samples (3 per action type) and compute SHAP values using the PartitionExplainer with a text masker. We note that these findings are based on qualitative case studies; future work should validate patterns across larger samples with aggregate statistics. 6.1 SHAP Results Figure 1 shows token attributions for all 9 samples. Three consistent patterns emerge acr……

🔬 实验与结果

📊 关键实验数据

82.3%
54.4%
85.2%
78.1%
51.4%
PopQA: 3

5.1 Datasets and Evaluation PopQA [3] is an open-domain question answering dataset of 14,267 entity-centric questions. Following the original paper, we evaluate on the long-tail subset of 1,385 questions whose Wikipedia page views are below 100 per month. We use string match accuracy as the evaluation metric, checking whether any gold answer alias appears in the model prediction. ARC-Challenge [4] is a multiple-choice science question benchmark containing 1,172 test questions. We evaluate accuracy by checking whether the correct answer text or answer key letter appears in the model prediction. 5.2 Baselines We compare our CRAG reproduction against a vanilla RAG baseline that uses the top-1 retrieved document directly without any scoring or correction. For PopQA, retrieved documents come from the Contriever retrieval results provided by the original CRAG repository. For ARC-Challenge, we retrieve Wikipedia documents using our multi-stage Wikipedia search pipeline. 5.3 Main Results Table 2 presents our main results on both datasets.

🎯 论文中的具体示例

📌 原文摘录 / Case Study

is Henry Feilden’s occupation?” paired with his Wikipedia article, the token Henry contributes +0.150 and occupation contributes +0.109. This demonstrates that the evaluator learned to match named entities between the question and document. Entity mismatch is the strongest Incorrect signal. When a document is irrelevant, the named entity token from the question becomes a strong negative driver. For the same question paired with an irrelevant document about mitochondria, Henry contributes −0.280 and biology terms (mitochondria, cell) contribute negatively. The evaluator detects irrelevance thro

⚠️ 局限性与未来方向

This work has several limitations that should be considered when interpreting our results. Single-run evaluation. All results are based on a single experimental run without confidence intervals. The 0.5% gap between our PopQA result (54.4%) and the original (54.9%) is within typical run-to-run variance and should not be interpreted as a meaningful difference. SHAP sample size. Our explainability analysis is based on 9 qualitative case studies. While the patterns are consistent across samples, la

🤖 LLM Agent 论文日报