📋 Today's Contents
  1. When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution
  2. Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
  3. Bootstrapping Coding Agents: The Specification Is the Program
  4. WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
  5. Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Paper 1 of 5
When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution
cs.AI · cs.CL 📅 2026-03-18
👥 Authors
Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao
🏫 Affiliations
  • University of Southern California
  • University of Toronto
  • University of Illinois Chicago
  • Independent Researcher
📝 Abstract (original)

When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.

🔭 Background & Motivation
The adoption of autonomous agents is increasing rapidly; industry forecasts indicate that 40% of enterprise applications will feature task-specific AI agents by 2026 (Gartner, 2025; Zou et al., 2025). Despite this growth, recent evaluations of multi-agent frameworks report failure rates between 41% and 87% on complex tasks (Cemri et al., 2025; Miao et al., 2025). This operational opacity creates an accountability gap when systems produce incorrect or harmful content.
💡 Key Contributions
  • We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction.
⚙️ Method Details

⚙️ Main Steps:

  1. Let $L_j \in \mathbb{R}^{|V|}$ denote the logits produced by the base language model at token position $j$. For an active agent $a_k \in \mathcal{A}$ identified in the execution trace, text is generated by applying a keyed distributional modulation operator $W$ conditioned on the agent identity: $\tilde{L}_j = W(a_k, L_j)$.
  2. Here $W$ induces a statistically detectable bias determined by $a_k$. Specifically, the agent identity $a_k$ induces two distinct keys: $k_p = h_p(a_k)$ and $k_\pi^{(j)} = h_\pi(a_k, y_{j-n+1:j-1})$.
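The keyed modulation $W$ can be pictured as a green-list-style logit bias, as in LLM watermarking. The sketch below is illustrative, not the paper's operator: `agent_key` stands in for the keyed hashes $h_p$/$h_\pi$, and the parameters `gamma` (green-list fraction) and `delta` (bias strength) are assumptions.

```python
import hashlib
import random

def agent_key(agent_id: str, context: tuple) -> int:
    """Derive a per-agent, context-dependent seed (illustrative stand-in
    for the paper's keyed hash of (agent identity, recent tokens))."""
    payload = agent_id + "|" + "|".join(map(str, context))
    return int.from_bytes(hashlib.sha256(payload.encode()).digest()[:8], "big")

def bias_logits(logits: list, agent_id: str, context: tuple,
                gamma: float = 0.5, delta: float = 2.0) -> list:
    """Apply a W(a_k, L_j)-style modulation: add delta to a keyed
    'green' subset of the vocabulary, leaving the rest untouched."""
    rng = random.Random(agent_key(agent_id, context))
    vocab = list(range(len(logits)))
    rng.shuffle(vocab)
    green = set(vocab[: int(gamma * len(vocab))])
    return [l + delta if i in green else l for i, l in enumerate(logits)]
```

Because the green subset is a deterministic function of the secret key and the local context, a detector holding the key can re-derive it at every position and test for the bias, while the text looks unmarked otherwise.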
2.1 Problem Formulation. Given an interaction log $L = (y_1, y_2, \ldots, y_T)$, our goal is to assign each token $y_t$ to an agent $a \in \mathcal{A}$. We define this as finding an attribution function $\hat{g}(t)$ that partitions the log into segments.

Attribution via Sequential Scoring. We assume a scoring function $f(t, a)$ that evaluates the alignment between the context at time $t$ and agent $a$. The attribution $\hat{g}(t)$ remains assigned to the active agent $a_k$ until a transition to a new agent $a' \in \mathcal{A} \setminus \{a_k\}$ is identified by a detection operator $D$:

$$D(f, a_k, a', t, H_t) > \tau. \quad (1)$$

Here, $H_t$ incorporates the temporal history of scores, and $\tau$ is a sensitivity threshold. This formulation treats the interaction log as a sequence of discrete turns, where boundaries are triggered by detecting shifts in the relative competitive scores between agents $a_k$ and $a'$.

Robustness to Metadata Loss. To evaluate the restorative utility of our method, we consider an adversarial obfuscator $\mathcal{A}_{\mathrm{obf}}$ that simulates metadata-independent scenarios by stripping agent identifiers and segment boundaries. Let $\Phi(G, L)$ be the performance of a downstream model $G$ (e.g., error attribution) given a full-metadata log $L$.
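A greedy sketch of the detector: per-token, per-agent scores (however obtained) are smoothed over a sliding window, and a handover is declared when a challenger's smoothed score beats the active agent's by the threshold τ. The function name and the windowed-mean smoothing are illustrative assumptions, not the paper's exact operator.

```python
def detect_handovers(scores, tau=1.0, window=3):
    """scores: list of dicts {agent: score}, one per token position.
    Returns (per-token agent assignment, handover boundary positions)."""
    active = max(scores[0], key=scores[0].get)   # start with the best-scoring agent
    assignment, boundaries = [active], []
    for t in range(1, len(scores)):
        lo = max(0, t - window + 1)
        # sliding-window mean score per agent (a simple H_t)
        smoothed = {a: sum(s[a] for s in scores[lo:t + 1]) / (t - lo + 1)
                    for a in scores[t]}
        challenger = max((a for a in smoothed if a != active),
                         key=smoothed.get)
        if smoothed[challenger] - smoothed[active] > tau:
            boundaries.append(t)                  # detection operator fires
            active = challenger
        assignment.append(active)
    return assignment, boundaries
```

The smoothing makes single noisy tokens insufficient to trigger a boundary; only a sustained shift in the competitive margin flips the active agent.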
Our objective is to ensure attribution consistency:

$$\min_{\hat{g}} \left\| \Phi(G, \mathrm{Rec}(\mathcal{A}_{\mathrm{obf}}(L), \hat{g})) - \Phi(G, L) \right\| \quad (2)$$

where $\mathrm{Rec}$ is the reconstruction function powered by our estimated attribution $\hat{g}$. This formulation characterizes our method as a robust backbone that maintains the diagnostic power of downstream tasks even under severe information obfuscation.

Structural Validation. To validate the logical consistency of the recovered attribution, we construct an estimated interaction topology $\hat{G} = (\mathcal{A}, \hat{E})$. An e…

[Figure 2: Overview. Generation: each agent's secret key biases the base LM logits through the modulation operator, leaving a detectable signal in the plain output text that is unrecoverable by simple means. Recovery: each token position is scored per agent using the secret keys; sliding-window smoothing and competitive change-point detection yield segment boundaries; a bitmask handoff adjacency matrix (maintained during execution, combined by bitwise OR) is decoded to recover the interaction topology.]
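Figure 2 mentions a bitmask handoff adjacency matrix, maintained during execution and combined by bitwise OR, that is later decoded for topology recovery. One plausible reading, as an illustrative sketch (the function names and bit layout are assumptions, not the paper's encoding):

```python
def encode_handoffs(path, num_agents):
    """Encode an ordered execution path (e.g. ['A', 'B', 'C']) as a bitmask
    adjacency matrix: bit (i*K + j) is set iff a handoff i -> j occurred.
    Masks from parallel branches could be merged with bitwise OR."""
    idx = {a: i for i, a in enumerate(sorted(set(path)))}
    mask = 0
    for src, dst in zip(path, path[1:]):
        mask |= 1 << (idx[src] * num_agents + idx[dst])
    return mask, idx

def decode_edges(mask, idx, num_agents):
    """Recover the set of directed handoff edges from the bitmask."""
    rev = {i: a for a, i in idx.items()}
    return {(rev[i], rev[j])
            for i in range(num_agents) for j in range(num_agents)
            if mask >> (i * num_agents + j) & 1}
```

A K-agent system needs only K² bits, so the whole interaction graph fits in a single integer that is cheap to maintain while executing.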
🔬 Experiments & Results
📊 Key Experimental Data
  • 94%
  • 23.81%
Table 3: Datasets used in different stages of the experiments.
  • MAMA topology (llama3.1_num484_nopii.csv, MAMA repository) — topology-based multi-agent experiments
  • Who & When (algorithm-generated and hand-crafted agents, Failure Attribution repository) — failure attribution under metadata corruption

LLM Baseline Prompt (User Prompt). User prompt template used in the LLM baseline for speaker-range assignment; instance-specific fields are filled dynamically at inference time. Excerpt (line-wrap markers removed):

There are K={k} speakers. There are N={num_units} units, indexed from 0 to {num_units - 1}. There are T={total_tokens} tokens in total. Each listed unit is a contiguous chunk of {unit_tokens} tokens, except the final chunk of a message, which may be shorter.
Use speaker ids 0..K-1 in order of first appearance in the conversation. Each [l, r] is a token range in [start, end) format.
Every range boundary must align exactly to a listed unit boundary: every start and every end value must be chosen from the provided unit boundary values. Do not invent token boundaries inside a unit. Think of the task as assigning each unit to exactly one speaker, then merging consecutive units with the same speaker. No unit may belong to more than one speaker. No unit may be left unassigned.
The top-level object must contain exactly one key: "ranges". Inside "ranges", there must be exactly K keys: "0", "1", ..., "K-1". No speaker key may appear outside the "ranges" object.
Ranges for each speaker must be sorted and non-overlapping. For each speaker, merge all adjacent or touching ranges: if one range ends at x and the next range starts at x, they must be merged into a single range. Use the minimum number of ranges possible for each speaker.
Across all speakers, ranges must exactly cover all tokens from 0 to T with no gaps and no overlaps. Every speaker from 0 to K-1 definitely appears in this trace; therefore, no speaker may have an empty range list. Any output where a speaker has [] is invalid. Any output where one speaker covers all tokens is invalid. Before answering, verify that every speaker has at least one non-empty range with positive length.
If you are uncertain, still output your best guess in the required JSON format. An answer like 'Okay, let me think' is invalid. Any text before '{' or after '}' is invalid.

Example output: {"ranges": {"0": [[0, 64]], "1": [[64, 128], [192, 256]], "2": [[128, 192]], "3": [[256, 320]]}}
🎯 Concrete Examples from the Paper
📌 Excerpt / Case Study

consider an adversarial obfuscator $\mathcal{A}_{\mathrm{obf}}$ that simulates metadata-independent scenarios by stripping agent identifiers and segment boundaries.

Paper 2 of 5
Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
cs.CR · cs.AI 📅 2026-03-18
👥 Authors
Saikat Maiti
🏫 Affiliations
📝 Abstract (original)

Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, cross-agent propagation of unsafe practices, and indirect prompt injection through external resources [7]. In healthcare environments processing Protected Health Information, every such vulnerability becomes a potential HIPAA violation. This paper presents a security architecture deployed for nine autonomous AI agents in production at a healthcare technology company. We develop a six-domain threat model for agentic AI in healthcare covering credential exposure, execution capability abuse, network egress exfiltration, prompt integrity failures, database access risks, and fleet configuration drift. We implement four-layer defense in depth: (1) kernel level workload isolation using gVisor on Kubernetes, (2) credential proxy sidecars preventing agent containers from accessing raw secrets, (3) network egress policies restricting each agent to allowlisted destinations, and (4) a prompt integrity framework with structured metadata envelopes and untrusted content labeling. We report results from 90 days of deployment including four HIGH severity findings discovered and remediated by an automated security audit agent, progressive fleet hardening across three VM image generations, and defense coverage mapped to all eleven attack patterns from recent literature. All configurations, audit tooling, and the prompt integrity framework are released as open source.

🔭 Background & Motivation
💡 Key Contributions
  • This paper presents a security architecture deployed for nine autonomous AI agents in production at a healthcare technology company.
⚙️ Method Details

⚙️ Main Steps:

  1. 6.2 Generation 2: openclaw-hardened (February 16, 202…)
  2. 6.3 Generation 3: openclaw-hardened-v2 (March 9, 202…)
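Layer (4) of the abstract's defense-in-depth pairs structured metadata envelopes with untrusted content labeling. As a minimal illustrative sketch (not the paper's released framework; `wrap_untrusted`, `verify`, and the envelope fields are assumptions), a trust label can be bound to external content with an HMAC so that stripping or rewriting the label is detectable downstream:

```python
import hashlib
import hmac
import json

def wrap_untrusted(content: str, source: str, key: bytes) -> dict:
    """Wrap external content in a metadata envelope labeled 'untrusted'
    and bind the label to the content with an HMAC tag."""
    envelope = {"trust": "untrusted", "source": source, "content": content}
    payload = json.dumps(envelope, sort_keys=True).encode()
    envelope["tag"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return envelope

def verify(envelope: dict, key: bytes) -> bool:
    """Check that neither the trust label nor the content was altered."""
    body = {k: v for k, v in envelope.items() if k != "tag"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(envelope.get("tag", ""), expected)
```

The key stays with the orchestration layer (e.g., a sidecar), so an injected instruction inside the content cannot forge itself a "trusted" label.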
🔬 Experiments & Results

📌 See the experiments section of the original paper for detailed data

Paper 3 of 5
Bootstrapping Coding Agents: The Specification Is the Program
cs.SE · cs.LG 📅 2026-03-18
👥 Authors
Martin Monperrus
🏫 Affiliations
  • KTH Royal Institute of Technology, Stockholm, Sweden
📝 Abstract (original)

A coding agent can bootstrap itself. Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch. This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp. The result carries a practical implication: the specification, not the implementation, is the stable artifact of record. Improving an agent means improving its specification; the implementation is, in principle, regenerable at any time.

🔭 Background & Motivation
Coding agents are programs that accept natural-language task descriptions and produce or modify source code. Teams now deploy them to write tests, perform refactoring, and implement features from natural-language task descriptions and requirements [5, 6]. Compiler writers discovered decades ago that a new language implementation passes a meaningful milestone when it can compile itself. This property, called self-hosting, is not merely a curiosity: it validates that the implementation is expressive enough to describe its own behavior. A self-hosting compiler is also a fixed point of the compilation process: it reproduces itself under its own translation. The same milestone has now been reached for AI coding agents. Given only a natural-language specification, a coding agent can implement itself. This article describes the experiment, draws the analogy to classical results in computer science, and examines the implications for software engineering practice.

2 The Bootstrapping Experiment
The experiment consists of three steps, each building on the previous one. All artifacts are publicly available in the meta-circular repository [8].

2.1 Step 1: Specification
A large language model (LLM) was prompted to write a specification for a coding agent: a program that receives a task description in natural language and produces or modifies source code to perform the task. The resulting document defines the agent's interface, its expected behavior, and the constraints it must respect. The specification is 926 words long and is available at the repository listed in the references [8]. It covers three areas: the agent's interface (command-line arguments, environment variables, and API interaction), its behavioral constraints (how it handles multi-turn conversations, tool use, and error conditions), and the tool loop (the cycle of receiving a task, calling …
💡 Key Contributions
  • A coding agent can bootstrap itself.
  • Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch.
  • This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp.
⚙️ Method Details

⚙️ Main Steps:

  1. Self-Implementation. The newly generated agent was given the same specification and asked to implement it again, using the same prompt as …
  2. It succeeded. The agent implemented itself:
     $ python agent.py "implement the spec in a single python file"
     The output is a new agent.py that satisfies the same specification. Both the first-generation and second-generation programs were verified manually against the specification (Fig. …).
  3. … to the coding-agent bootstrap
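The tool loop the specification describes (receive a task, let the model call tools or finish, feed results back) can be sketched as follows. This is a hypothetical minimal loop, not the repository's agent.py; the `llm` and `tools` interfaces are assumptions for illustration.

```python
def run_agent(task: str, llm, tools: dict, max_turns: int = 8):
    """Minimal tool loop: on each turn the model either requests a named
    tool call or returns a final answer; tool results are fed back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # llm returns {"tool": name, "args": {...}} or {"final": text}
        action = llm(messages)
        if "final" in action:
            return action["final"]
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    return None  # turn budget exhausted without a final answer
```

The point of the bootstrap claim is that a loop of roughly this shape, plus interface and error-handling constraints, is small enough to describe completely in 926 words, and hence to regenerate from the specification alone.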
🔬 Experiments & Results

📌 See the experiments section of the original paper for detailed data

⚠️ Limitations & Future Directions

The bootstrap experiment demonstrates a property; it does not resolve all questions about the technique's generality. Complexity scaling. The 926-word specification is simple. Whether the technique scales to specifications of 10,000 or 100,000 words is an open question. The Attractor case (34,900 words) provides evidence that longer specifications remain tractable, but verification difficulty grows: the test suite must cover a larger behavioral surface, and the specification itself may harbor …

Paper 4 of 5
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
🏆 ICLR 2026 · cs.CR · cs.AI 📅 2026-03-18
👥 Authors
Nathan Zhao
🏫 Affiliations
  • Stanford University
📝 Abstract (original)

Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: extended PII taxonomy including transaction-level identifiers that enable reidentification, anticipatory detection for partially-filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. We train WebRedact to demonstrate practical utility, more than doubling text-extraction baseline accuracy (0.753 vs 0.357 mAP@50) at real-time CPU latency (20ms). We release the dataset and model to support privacy-preserving computer use research.

🔭 Background & Motivation
Computer use agents—language models that operate graphical user interfaces through vision and action—represent a significant capability advance toward general-purpose AI assistants. Unlike traditional web automation that operates on structured HTML or APIs, vision-based systems observe rendered web pages as images and produce mouse and keyboard actions to accomplish user goals. Recent systems such as Claude Computer Use Anthropic (2024) and Gemini 2.5 Comanici et al. (2025) demonstrate purely vision-driven agents that can book flights, complete checkout flows, navigate e-commerce sites, and manage user accounts across arbitrary websites without access to DOM structure. As these systems scale from research prototypes to production deployments serving millions of sessions, their visual-first architecture creates fundamental privacy problems: every screenshot observation contains rendered PII, and standard cloud-hosted inference exposes sensitive user data during routine operation.

The privacy challenges span both training and inference. Training data collected from real websites inevitably contains PII that models memorize and leak Lukas et al. (2023); Nasr et al. (2023), while cloud-hosted inference routinely exposes user screenshots. Existing mitigations are insufficient: sandboxed benchmarks Zhou et al. (2023); Xie et al. (2024) use fabricated data that does not transfer to real sessions, crowdsourced datasets Deng et al. (2023); Lù et al. (2024) lack real-time authenticated content, agentic pipelines Wang et al. (2025d) lack visual PII detection, and federated approaches Wang et al. (2025c;b) still leak information through gradient updates. Critically, no public benchmark exists for visual PII detection in web interfaces. Text-based PII systems Microsoft (2024b); ai4Privacy (2023); Selvam & Ghosh (2025) operate on extracted strings, missing rendered content where sensitivity derives from visual context rather than surrounding words. Document-focused datasets Bula…
💡 Key Contributions
  • We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: extended PII taxonomy including transaction-level identifiers that enable reidentification, anticipatory detection for partially-filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction.
⚙️ Method Details
Table 1: Main results on TestCross-Company.

  Method                     mAP@50   Latency
  OCR + Presidio             0.183    1.3 s
  LayoutLMv3 + GPT-4o-mini   0.357    2.9 s
  WebRedact (ours)           0.753    20 ms
  WebRedact-Large (ours)     0.842    312 ms

… real-time CPU inference, and WebRedact-Large at 1280×1280 resolution for higher accuracy when near-real-time constraints can be relaxed.

3.3 Results. Table 1 presents main results on TestCross-Company, evaluating generalization to entirely new visual styles without seeing Amazon's design system during training. We evaluate text-based baselines only on full-filled images, as these approaches cannot identify empty or partially-filled input fields. Even on this favorable subset, both WebRedact variants substantially outperform text-based methods: WebRedact achieves 0.753 mAP@50, more than double the best text-based baseline (LayoutLMv3 at 0.357), while WebRedact-Large reaches 0.842 mAP@50. Detailed failure-mode analysis for OCR+LLM systems appears in Appendix D.1. WebRedact processes images in ∼20 ms on mid-range consumer CPUs (Intel i5 / AMD Ryzen 5), meeting real-time constraints for 30 fps redaction, while WebRedact-Large requires ∼312 ms (∼3 fps). Both models use OpenVINO for CPU inference. Text-based methods are substantially slower: Tesseract Smith (2007) OCR extraction alone (453 ms, excluding classification) is slower than WebRedact-Large's full detection pipeline.

3.4 Dataset Ablations. We conduct ablations to validate WebPII's design choices; full tables appear in Appendix D.

Split strategies. We compare three split strategies (Table 13). TestCross-Page achieves 0.797 mAP@50, indicating models learn layout-invariant features within a company's design system. TestCross-Company degrades to 0.753 when generalizing to Amazon's distinct visual style. TestCross-Type shows the largest degradation (0.728), revealing that page-type-specific patterns transfer less effectively than company-specific design conventions.

Fill-state diversity. Training on full screenshots alone achieves 0.771 mAP@50 (Table 14). Adding empty screenshots without partials degrades performance to 0.758, while full+partial achieves 0.797 (+0.026). Combining all three states achieves 0.825 mAP@50, demonstrating that partial fills provide essential intermediate visual grounding.

Progressive fill density. Increasing partial-fill stages per layout from 1 to 5 improves performance from 0.758 to 0.802 mAP@50 (Table 15), with the strongest gains on partial-fill test images (0.774 to 0.835). Each additional stage exposes the model to new intermediate form states, p…
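The mAP@50 metric reported throughout matches predicted boxes to ground-truth boxes at IoU ≥ 0.5. A simplified single-threshold sketch (precision only, without confidence-ranked averaging over recall, so not the full mAP computation; function names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_at_50(preds, gts):
    """Fraction of predicted boxes that match a distinct ground-truth box
    at IoU >= 0.5 (greedy one-to-one matching)."""
    matched, used = 0, set()
    for p in preds:
        best = max(((iou(p, g), j) for j, g in enumerate(gts) if j not in used),
                   default=(0.0, -1))
        if best[0] >= 0.5:
            matched += 1
            used.add(best[1])
    return matched / len(preds) if preds else 0.0
```

The full benchmark additionally ranks predictions by confidence and averages precision over recall levels and classes, but the IoU-0.5 matching rule above is the core of the "@50" criterion.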
🔬 Experiments & Results
📊 Key Experimental Data
  • 78.7%
3.1 Experimental Setup

3.1.1 Data Splits. The diversity of WebPII enables evaluation of generalization at different levels. We render each of the 408 unique layouts with 25 data-injection variants (different PII values, addresses, and product information), and for layouts with input fields, generate the fill states described in Section 2.4: full, partial, and empty. We evaluate three split strategies: TestCross-Page holds out 20% of layouts randomly (82 layouts, 298 fill states), testing whether models learn layout-invariant features within a company's design system. TestCross-Company holds out all Amazon layouts (56 layouts, 152 fill states) while training on 11 other companies (352 layouts, 1,416 fill states), evaluating generalization to entirely new visual styles and brand identities. TestCross-Type holds out all receipt pages (20 layouts, 50 fill states) while training on 18 other page types (388 layouts, 1,518 fill states), measuring whether detection strategies transfer across functionally different page categories with distinct UI patterns. For all splits, we ensure no data leakage: the specific PII values, addresses, and product information in test images never appear in training.

3.2 Baseline Methods

3.2.1 Text-Based Methods. We evaluate text-based baselines using a two-stage pipeline: (1) OCR extraction to obtain text spans with bounding boxes, (2) classification to identify sensitive content. This approach mirrors existing PII detection systems that operate on extracted text rather than raw pixels. We compare three approaches: Presidio Microsoft (2024b), an NER-based system using pattern matching and named entity recognition; GPT-4o-mini for LLM-based classification; and document understanding models LayoutLMv3 Huang et al. (2022) and Donut Kim et al. (2021) that encode visual layout alongside text. All document understanding pipelines use GPT-4o-mini for final classification to ensure fair comparison. We evaluate multiple OCR engines (Tesseract Smith (2007), EasyOCR JaidedAI (2024), and PaddleOCR Cui et al. (2025)), reporting the highest-accuracy configuration (LayoutLMv3) and the fastest configuration (Tesseract + Presidio). Note that text-based approaches can only evaluate text content; product images are excluded, as OCR provides no signal for visual elements. Appendix D.1 presents detailed ablations across all OCR engines, language models, and classification approaches.

3.2.2 WebRedact. We train object detection models on WebPII as WebRedact, targeti…
🎯 Concrete Examples from the Paper
📌 Excerpt / Case Study

2 The WebPII Dataset
E-commerce interfaces present PII challenges distinct from documents or scene text. While an email address in a scanned form appears as static pixels, the same email in a web UI may be rendered through JavaScript, styled with CSS, and wrapped in interactive elements. Moreover, web forms require anticipatory detection—identifying sensitive fields before users finish typing, as privacy interventions should trigger during entry rather than after completion. Beyond traditional PII, these interfaces expose extended identifiers—order IDs, tracking numbers, delivery …

Paper 5 of 5
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
cs.CV · cs.AI 📅 2026-03-18
👥 Authors
Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu
🏫 Affiliations
  • Institute of Automation, Chinese Academy of Sciences
  • School of Future Technology, University of Chinese Academy of Sciences
📝 Abstract (original)

Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

🔭 Background & Motivation
Long-form video understanding (LVU) is becoming increasingly important for a wide range of real-world applications, such as sports commentary, intelligent surveillance, and film analysis [1, 2]. Effective LVU requires robust mul…

(Work done during an internship at Kuaishou Technology. *Equal contribution. †Corresponding author: Mengyi Liu, liumengyi@kuaishou.com.)

[Figure 1: (a) a multi-round baseline pipeline (question → planning → localization → perception → reflection → LLM answer, constrained by context length); (b) Symphony's agents: a Planning Agent (planning subtasks and producing the answer), a Reflection Agent (judging whether answers are credible and commenting on sub-solutions), a Subtitle Agent (analyzing subtitles), a Visual Perception Agent (perceiving visual information), and a Grounding Agent (grounding the problem).]
💡 Key Contributions
  • In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations.
⚙️ Method Details
We propose Symphony, a multi-agent system composed of functionally specialized agents, as illustrated in Fig. 1(b). In Section 3.1, we provide an overview of the system. Section 3.2 details the collaboration mechanism among the agents, and Section 3.3 introduces our novel grounding agent.

3.1. Overview
Cognitive psychology traditionally decomposes human cognitive abilities into core dimensions: perception, attention, reasoning, language, and decision-making [40]. Building upon this framework, we propose a capability-dimension decoupled paradigm for LVU task decomposition, implemented through a MAS. In our architecture, the planning and reflection agents jointly manage reasoning and decision-making; the grounding agent simulates the function of attention by highlighting key video segments; the subtitle agent analyzes textual subtitles to fulfill the language-processing component; and the visual perception agent performs perceptual tasks. In contrast to modality-based partitioning, which incurs high interaction costs from tight inter-module dependencies, our approach minimizes inter-agent coupling, significantly reducing the cost of information integration. This strategic allocation of cognitive load across specialized modules effectively mitigates capacity overload in monolithic architectures, enhancing accuracy and scalability in complex LVU tasks.

Specifically, the Planning Agent acts as the central coordinator, responsible for global task planning, multi-agent scheduling, information integration, and ultimately generating the answer. To efficiently and comprehensively identify question-relevant segments and potential clues within the video, the Grounding Agent selects either a VLM-based relevance scoring tool or a CLIP-based retrieval tool depending on the analysis of question complexity.

Algorithm 1: Reflection-enhanced Dynamic Collaboration
  Input: question Q, max attempts M
  Initialize: trajectory τ ← ∅, state S_t ← {τ, Q},
              A = {G (Grounding), V (Visual Perception), S (Subtitle)}, m, n ← 0
  while m < M:
      while n < M:
          a_t ← PlanningAgent(S_t)
          if a_t = TERMINATE: break
          o_t ← execute a_t using an agent ∈ A
          collect observation o_t and update τ ← τ ∪ {(a_t, o_t)}
          n ← n + 1
      C, Valid ← ReflectionAgent(S_t)
      if Valid: break
      S ← S ∪ {C}, m ← m + 1
  A ← PlanningAgent.answer(S)
  return A

The Subtitle Agent processes video subtitles and performs semantic analysis to enable capabilities such as entity recognition, sentiment analysis, and topic modeling. The Visual P…
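Algorithm 1's nested loop can be sketched in Python with the agents stubbed out as callables. The state encoding and signatures below are illustrative assumptions, not the released code: the point is the control flow (inner planning/execution loop until TERMINATE, outer reflection loop feeding comments back).

```python
def symphony_answer(question, planner, executors, reflector, max_attempts=3):
    """Reflection-enhanced dynamic collaboration (sketch of Algorithm 1).
    planner(state) -> ('G'|'V'|'S', subtask), 'TERMINATE', or a final answer;
    executors: agent name -> callable; reflector(state) -> (comment, valid)."""
    trajectory, comments = [], []
    for _ in range(max_attempts):            # outer loop: m < M
        for _ in range(max_attempts):        # inner loop: n < M
            state = {"q": question, "traj": trajectory, "notes": comments}
            action = planner(state)
            if action == "TERMINATE":
                break
            agent_name, payload = action     # e.g. ("G", subtask) for Grounding
            trajectory.append((action, executors[agent_name](payload)))
        comment, valid = reflector({"q": question, "traj": trajectory})
        if valid:                            # reflection accepts the trajectory
            break
        comments.append(comment)             # S <- S ∪ {C}: retry with feedback
    return planner({"q": question, "traj": trajectory,
                    "notes": comments, "final": True})
```

Bounding both loops by M keeps the token and tool-call budget finite even when reflection repeatedly rejects the trajectory.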
🔬 Experiments & Results
📊 Key Experimental Data
  • 5.0%
  • 6.5%
  • 2.2%
  • retrieval: 18
4.1. Dataset
We comprehensively evaluated the performance of Symphony and other state-of-the-art (SOTA) methods on four representative LVU datasets. LVBench [5] features videos with an average duration of 68 minutes, emphasizing six core capability dimensions: Temporal Grounding (TG), Summarization (Sum), Reasoning (Rea), Entity Recognition (ER), Event Understanding (EU), and Key Information Retrieval (KIR). LongVideoBench [50] comprises 3,763 videos along with their subtitles and introduces referential reasoning tasks to evaluate fine-grained information retrieval and cross-fragment logical reasoning. MLVU [51] encompasses diverse video types and is designed with nine varied tasks, including reasoning, captioning, recognition, and summarization. Video-MME [52] establishes a multimodal evaluation framework spanning six broad domains, rigorously assessing spatio-temporal composite reasoning capabilities. For our experiments, we exclusively utilized the "long" duration subset.

4.2. Implementation Details
Our planning and reflection agents leverage DeepSeek R1 [53] as the reasoning model, while the subtitle agent employs DeepSeek V3 [54]. The visual perception agent and grounding agent utilize Doubao Seed 1.6 VL [46] as the VLM. Input sequences are constrained to a maximum of 40 frames, with resolutions capped at 720p. For the VLM-based scoring tool, we set the duration T = 60 and sample 30 frames from each segment. For MLVU and LVBench without subtitles, we used Whisper-large-v3 [55] to extract subtitles. We set the number of scheduling rounds for agents and the maximum number of tool calls within each agent to 15. For the reflection agent, the maximum number of scheduling rounds was set to 3.

Baselines. We comprehensively evaluated Symphony against diverse SOTA methods in LVU. The baselines include VLMs, agent-based frameworks, long-context-based LongVILA [9], RAG-based VideoRAG [16], and token-compression-based AdaRETAKE [34]. Unless specified, all results are sourced from published literature. For evaluations of Seed 1.6 VL, videos were uniformly sampled at 256 frames. For a fair comparison with DVD [4], we use the same reasoning model and vision model as ours.

[Figure: per-benchmark scores on LVBench, VideoMME, LongVideoBench, and MLVU for Qwen2.5VL-72B, Seed 1.6 VL, DVD+Qwen2.5VL-72B, and DVD+…]