Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple ASR hypotheses as evidence, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are combined using different strategies, namely 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, RECOVER achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. LLM-Select achieves the best overall entity-correction performance while maintaining overall WER.
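As an illustration of one combination strategy, a toy ROVER-style majority vote over hypotheses might look like the sketch below. Real ROVER first aligns hypotheses into a word transition network via dynamic programming; this sketch assumes the hypotheses are already position-aligned, and all names are illustrative, not the paper's API.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority-vote each word slot across ASR hypotheses.

    Assumes hypotheses are already position-aligned; shorter ones
    are padded with empty strings that are dropped from the output.
    """
    width = max(len(h) for h in hypotheses)
    padded = [h + [""] * (width - len(h)) for h in hypotheses]
    voted = []
    for slot in zip(*padded):
        word, _ = Counter(slot).most_common(1)[0]
        if word:
            voted.append(word)
    return voted

hyps = [
    "please call doctor smith".split(),
    "please call doctor smyth".split(),
    "please fall doctor smith".split(),
]
voted = rover_vote(hyps)  # ['please', 'call', 'doctor', 'smith']
```

Entity-Aware Select and LLM-Select would instead pick a whole hypothesis (or splice spans) using entity evidence or an LLM judge rather than slot-wise voting.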
The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and neglect of the native visual information of web pages in reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. The data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that effectively drives the browsing agent to actively collect and reason over visual information during search. We comprehensively evaluate both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model, o3-deep-research, achieves only 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench
⚙️ Main steps:
category in VisBrowse-Bench are shown in Figure 3. To ensure high quality and data diversity, the question–answer pairs in each category were obtained through collaborative work by two domain experts. Each question is an image–text bundle: a textual query paired with a small set of images containing initial entity information. In this section, we outline the details of VisBrowse-Bench. The pipeline comprises the development of data design criteria, data collection, and a strict validation process. [Figure example — Sport Question: "The athlete in the red shirt in the picture has taken a ph…"]
Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multi-agent collaborative tasks. However, we find that misaligned ToM orders, i.e., mismatches in the depth of ToM reasoning between agents, can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which aligns its ToM order with its partner's. Based on prior interactions, the agent estimates the partner's likely ToM order and leverages this estimation to predict the partner's action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks, and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.
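One way to read "estimates the partner's likely ToM order from prior interactions" is as a Bayesian update over candidate orders: each order predicts a partner action, and observing the actual action reweights the belief. The sketch below is a minimal illustration under that assumption; the function names, likelihood value, and three-action setting are placeholders, not the paper's method.

```python
def update_order_belief(belief, observed, predictions, lik=0.8):
    """Bayesian update over the partner's candidate ToM orders.

    belief:      {order: prior probability}
    predictions: {order: action the partner would take if reasoning
                  at that order}
    lik:         probability of the predicted action under the true
                  order (remaining mass split over 2 other actions).
    """
    posterior = {}
    for order, p in belief.items():
        match = predictions[order] == observed
        posterior[order] = p * (lik if match else (1 - lik) / 2)
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

belief = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
predictions = {0: "stay", 1: "left", 2: "right"}
belief = update_order_belief(belief, observed="left", predictions=predictions)
estimated_order = max(belief, key=belief.get)  # order 1 now most likely
```

The agent would then reason at the order one level above the estimate to best-respond to the partner's predicted behavior.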
⚙️ Main steps:
📌 See the original paper's experiments section for detailed data
We consider a fully cooperative decision-making problem involving two agents within a given Markovian environment, where the agents' actions must be coordinated to achieve an optimal outcome.
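A one-shot coordination matrix game makes the setting concrete: both agents receive the same joint payoff, so the optimum requires matching on the same equilibrium. The payoff values below are illustrative, not taken from the paper.

```python
# Fully cooperative matrix game: both agents share the joint payoff.
payoff = {
    ("A", "A"): 4,  # high-reward equilibrium
    ("A", "B"): 0,  # miscoordination
    ("B", "A"): 0,  # miscoordination
    ("B", "B"): 3,  # lower-reward equilibrium
}
best_joint = max(payoff, key=payoff.get)
# If agent 1 expects the partner to play "A" but agent 2 expects "B"
# (e.g. due to mismatched ToM reasoning depth), the realized joint
# action ("B", "A") earns 0 despite both agents acting "rationally".
```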
Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.
⚙️ Main steps:
CoMAI organizes the interview process through specialized agents responsible for question generation, security monitoring, scoring, and summarization, all coordinated by a centralized finite-state controller (CFSC) [40]. This design departs from monolithic single-agent architectures and ensures both methodological rigor and practical applicability in high-stakes evaluation contexts. Notably, the framework operates without requiring additional training or fine-tuning and can be readily adapted to diverse underlying models. The main contributions of this work are as follows: (1) We propose
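The CFSC pattern can be sketched as a small dispatch loop: the controller, not the agents, owns the interview state and decides which specialist runs next. This is a minimal illustration assuming a fixed question → security → scoring cycle ending in a summary; the stage names and transition table are hypothetical, not CoMAI's actual states.

```python
from enum import Enum, auto

class Stage(Enum):
    QUESTION = auto()   # question-generation agent
    SECURITY = auto()   # prompt-injection / security agent
    SCORING = auto()    # rubric-based scoring agent
    SUMMARY = auto()    # summarization agent
    DONE = auto()

# Fixed transitions for the per-round cycle; SCORING branches.
TRANSITIONS = {Stage.QUESTION: Stage.SECURITY,
               Stage.SECURITY: Stage.SCORING}

def run_interview(n_rounds, agents):
    """Drive the interview: dispatch to the agent for each stage."""
    state, rounds, trace = Stage.QUESTION, 0, []
    while state is not Stage.DONE:
        trace.append(state.name)
        agents[state]()  # the centralized controller invokes the agent
        if state is Stage.SCORING:
            rounds += 1
            state = Stage.SUMMARY if rounds >= n_rounds else Stage.QUESTION
        elif state is Stage.SUMMARY:
            state = Stage.DONE
        else:
            state = TRANSITIONS[state]
    return trace

agents = {s: (lambda: None) for s in Stage}  # stub agents
trace = run_interview(n_rounds=2, agents=agents)
```

Because transitions are explicit, the security agent is guaranteed to inspect every exchange before it reaches the scorer, which is how the layered defense against prompt injection is enforced structurally rather than by convention.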
["Weak explanation of terminology"], "suggestions": ["Practice concise communication"] } Summary Agent Schema { "final_grade": "A", "final_decision": "accept", "overall_score": 9, "summary": "Candidate shows strong potential ...", "strengths": ["Analytical thinking", " Communication"], "weaknesses": ["Limited collaboration evidence"], "recommendations": { "for_candidate": "Improve collaboration skills", "for_program": "Provide mentorship in teamwork" }, "confidence_level": "high", "detailed_anal
Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct. We evaluate on PopQA and ARC-Challenge, demonstrating that our open-source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG's T5-based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at https://github.com/suryayalavarthi/crag-reproduction.
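CRAG's corrective logic routes each query based on the retrieval evaluator's relevance score into one of three actions: Correct (refine the retrieved documents), Incorrect (discard them and fall back to external search — Wikipedia in this reproduction), or Ambiguous (combine both). A minimal sketch of that routing, with placeholder thresholds rather than the paper's tuned values:

```python
def corrective_action(score, upper=0.6, lower=-0.9):
    """Map the evaluator's relevance score to a CRAG action.

    Thresholds are illustrative placeholders, not tuned values.
    """
    if score > upper:
        return "correct"    # keep and refine retrieved documents
    if score < lower:
        return "incorrect"  # discard; fall back to external search
    return "ambiguous"      # merge refined docs with searched knowledge
```

The generator (Phi-3-mini-4k-instruct here, LLaMA-2 originally) then conditions on whichever knowledge the chosen action produces.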
⚙️ Main steps:
is Henry Feilden’s occupation?” paired with his Wikipedia article, the token Henry contributes +0.150 and occupation contributes +0.109. This demonstrates that the evaluator learned to match named entities between the question and document. Entity mismatch is the strongest Incorrect signal. When a document is irrelevant, the named entity token from the question becomes a strong negative driver. For the same question paired with an irrelevant document about mitochondria, Henry contributes −0.280 and biology terms (mitochondria, cell) contribute negatively. The evaluator detects irrelevance thro
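The intuition behind these SHAP attributions can be reproduced with a crude leave-one-out occlusion proxy: a token's contribution is the score change when it is removed. The sketch below uses a dummy entity-overlap scorer standing in for the T5 evaluator; it is illustrative only and not the SHAP method or the paper's pipeline.

```python
def occlusion_attributions(tokens, score_fn):
    """Leave-one-out attribution: contribution of each token is the
    drop in the evaluator's score when that token is removed. A crude
    proxy for the additive attributions SHAP estimates."""
    base = score_fn(tokens)
    return {t: base - score_fn([u for u in tokens if u != t])
            for t in tokens}

# Dummy evaluator rewarding named-entity overlap with the document.
doc_entities = {"henry", "feilden"}
score = lambda toks: sum(t in doc_entities for t in toks) / 4
attr = occlusion_attributions(
    ["what", "is", "henry", "feilden", "occupation"], score)
# Entity tokens ("henry", "feilden") get positive attribution;
# function words get ~0, mirroring the entity-alignment finding.
```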
This work has several limitations that should be considered when interpreting our results. Single-run evaluation. All results are based on a single experimental run without confidence intervals. The 0.5% gap between our PopQA result (54.4%) and the original (54.9%) is within typical run-to-run variance and should not be interpreted as a meaningful difference. SHAP sample size. Our explainability analysis is based on 9 qualitative case studies. While the patterns are consistent across samples, la