📋 今日目录
  1. Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions
  2. RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
  3. Agent Control Protocol: Admission Control for Agent Actions
  4. ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents
  5. Memento-Skills: Let Agents Design Agents
第 1 篇 / 共 5 篇
Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions
cs.AI 📅 2026-03-19
👥 作者
Xuemian Wu、Shizhe Zhao、Zhongqiang Ren
🏫 机构单位
  • Shanghai Jiao Tong University
📝 论文摘要(原文)

Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees. Based on CBS-AA, we also develop conflict resolution techniques to improve the scalability of CBS-AA further. Our test results show that our method can reduce the number of branches by up to 90%.

🔭 研究背景与动机
Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. The environment is often represented by a graph, where vertices represent the locations that the agent can reach, and edges represent actions that transit the agent between two locations. MAPF is NP-hard to solve optimally [19], and a variety of MAPF planners were developed, ranging from optimal planners [14, 16], bounded sub-optimal planners [2, 7], to unbounded sub-optimal planners [4, 9]. A common underlying assumption in these planners is that each action of an agent, either waiting in place or moving to an adjacent vertex, takes the same duration, i.e., a time unit, and the actions of all agents are synchronized, i.e., the action of each agent starts at the same discrete time step. This assumption limits the application of MAPF planners, especially when the agent speeds are different or an agent has to vary its speed when going through different edges (Fig. 1). To bypass this synchronous action assumption, MAPF variants such as Continuous-Time MAPF [1], MAPF with Asynchronous Actions (MAPF-AA) [12], and MAPF_R [17] were proposed. The major idea in those variants is that the actions of agents can take different amounts of time, and as a result, the agents may not start and end each of their actions at the same discrete time steps.
（Figure 1 图注：A motivating example of MAPF-AA where the yellow car moves fast and the green truck moves slowly in continuous time. The circled numbers show the time points: e.g., in (a), the truck moves from B1 to B2 during the time range [0.0, 2.3]. This work considers the agent to occupy both ends of an edge when the agent goes through it. As a result, a constraint (as shown in (d)) at B2 with time range [0.0, 4.6] is imposed on the yellow car to avoid collision as shown in (c).）
Among the exact algorithms that can find optimal solutions, Continuous-time C……
💡 核心贡献
  • This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees.
⚙️ 方法详解
This section proposes Conflict-Based Search with Asynchronous Actions (CBS-AA), which finds an optimal solution for MAPF-AA. We first modify CCBS to effectively resolve conflicts and call this modified method Constraint on Single Action (CSA). Then, we use DO to propagate constraints and resolve conflicts efficiently, which we call Constraint on Multiple Actions (CMA).

Overview. CBS-AA (Alg. 1) is similar to CBS with three processes modified: LowLevelPlan, DetectConflict and GenerateConstraints. LowLevelPlan adapts Safe Interval Path Planning (SIPP) [11] to handle continuous time: the numbers associated with safe intervals and constraints are all positive real numbers, and safe intervals are cut into continuous time intervals according to the constraints, rather than into a set of discrete time steps. DetectConflict detects conflicts in the continuous time range: when the time intervals during which two agents occupy the same vertex overlap, a conflict is returned. In GenerateConstraints, we propose two different conflict resolution methods for MAPF-AA, CSA and CMA, as detailed later.

Algorithm 1: CBS-AA
INPUT: G = (V, E); OUTPUT: a conflict-free joint path π in G.
 1: Ω_c ← ∅; (π, g) ← LowLevelPlan(Ω_c)
 2: Add P_root = (π, g, Ω_c) to OPEN
 3: while OPEN ≠ ∅ do
 4:   P = (π, g, Ω_c) ← OPEN.pop()
 5:   cft ← DetectConflict(π)
 6:   if cft = NULL then return π
 7:   Ω ← GenerateConstraints(cft)
 8:   for all ω_i ∈ Ω do
 9:     Ω' = Ω_c ∪ {ω_i}
10:     (π', g') ← LowLevelPlan(Ω')
11:     Add P' = (π', g', Ω') to OPEN
12:   end for
13: end while
14: return failure

（Figure 3 图注：Three Conflict Types. (a) IN-IN; (b) OUT-IN; (c) WAIT-IN.）

4.1 Conflict Detection and Classification
For a vertex v, there are three types of actions:

  IN(v)   = {A_i | v_t(A_i) = v}
  OUT(v)  = {A_i | v_f(A_i) = v}
  WAIT(v) = {A_i | v_f(A_i) = v_t(A_i) = v}    (1)

If agent i wants to go through v, it must perform three actions A_I^i ∈ IN, A_W^i ∈ WAIT and A_O^i ∈ OUT at v in sequence. Let τ(A^i, v) denote the time interval during which the transition of i occupies vertex v based on Def. 1.
If agent i performs A_I^i at time t, then τ(A_I^i, v) = (t, t + τ(A_I^i)], τ(A_W^i, v) = [t + τ(A_I^i), t + τ(A_I^i) + τ(A_W^i)], and τ(A_O^i, v) = [t + τ(A_I^i) + τ(A_W^i), t + τ(A_I^i) + τ(A_W^i) + τ(A_O^i)]. If i does not need to wait at v, the duration τ(A_W^i) is 0 and the wait action A_W^i occupies v only at the single time point t + τ(A_I^i). Two agents i and j are in conflict if there is a v such that τ(A^i, v) ∩ τ(A^j, v) ≠ ∅. Let ⟨A^i, A^j, v⟩ denote a duration conflict between agents i and j that both occupy the same vertex v during their actions A^i, A^j. F……
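上述 IN/WAIT/OUT 占用区间与 duration conflict 的判定可以用如下草图示意（纯属示意性实现：函数名与区间表示均为本文假设，并非论文的开源代码）：

```python
from typing import NamedTuple

class Occupancy(NamedTuple):
    lo: float          # start of occupancy of vertex v
    hi: float          # end of occupancy of vertex v
    lo_open: bool      # IN-intervals are open at the left end: (t, t+dur]

def occupancy_intervals(t, dur_in, dur_wait, dur_out):
    """Occupancy of a vertex v by one agent that enters it at time t.

    Per the text above: IN occupies (t, t+dI], WAIT occupies
    [t+dI, t+dI+dW], OUT occupies [t+dI+dW, t+dI+dW+dO].
    """
    a = t + dur_in
    b = a + dur_wait
    c = b + dur_out
    return [Occupancy(t, a, True), Occupancy(a, b, False), Occupancy(b, c, False)]

def total_occupancy(ivs):
    # The three actions are performed in sequence, so their union is one interval.
    return (ivs[0].lo, ivs[-1].hi, ivs[0].lo_open)

def in_conflict(occ_i, occ_j):
    """Duration conflict <A_i, A_j, v>: the two occupancy intervals overlap."""
    (lo1, hi1, open1), (lo2, hi2, open2) = occ_i, occ_j
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    if lo > hi:
        return False
    if lo == hi:
        # Touching at a single point counts only if neither side excludes it.
        if (open1 and lo == lo1) or (open2 and lo == lo2):
            return False
    return True
```

例如卡车以 2.3 的时长进入、再以 2.3 的时长离开 B2 时，合并后的占用区间为 (0.0, 4.6]，与 Fig. 1(d) 中施加给黄车的约束窗口 [0.0, 4.6] 基本一致（端点开闭的细节以论文 Def. 1 为准）。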
🔬 实验与结果

📌 请参阅原文实验章节获取详细数据

🎯 论文中的具体示例
📌 原文摘录 / Case Study

a_w^i = (B, B, 2.0) means waiting at vertex B for a duration of 2.0, and CCBS adds a constraint ⟨i, a_w^i, [t^i, t_u^i)⟩ to prohibit agent i from performing a_w^i at any time t ∈ [t^i, t_u^i). But agent i can still perform the wait actions a_w1^i = (B, B, 2.01), a_w2^i = (B, B, 2.001), a_w3^i = (B, B, 2.0001), ... at time t ∈ [t^i, t_u^i). So CCBS may not terminate when there is a wait action. In the open-sourced implementation, CCBS makes a change when transferring constraints about wait actions to the low-level solver: for the previous constraint ⟨i, a_w^i, [t^i, t_u^i)⟩, CSIPP divides the safe interval of B into two parts: [0, t^i) and [t_u^i, ∞)
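摘录中 CSIPP 对 wait 约束的处理——把顶点的安全区间按禁止窗口 [t^i, t_u^i) 切开——可以用如下草图示意（示意性实现，并非 CCBS 开源代码本身）：

```python
import math

def split_safe_intervals(safe_intervals, t_l, t_u):
    """Split a vertex's safe intervals by a forbidden window [t_l, t_u).

    Mirrors the CSIPP step quoted above: the safe interval [0, inf) of B,
    constrained by [t_l, t_u), becomes the two parts [0, t_l) and [t_u, inf).
    Intervals are (lo, hi) pairs; hi may be math.inf.
    """
    result = []
    for lo, hi in safe_intervals:
        left = (lo, min(hi, t_l))    # part strictly before the window opens
        right = (max(lo, t_u), hi)   # part at or after the window closes
        for piece in (left, right):
            if piece[0] < piece[1]:  # drop empty pieces
                result.append(piece)
    return result
```

例如 `split_safe_intervals([(0.0, math.inf)], 3.0, 5.0)` 得到 `[(0.0, 3.0), (5.0, inf)]`，即正文所述的两段安全区间。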

⚠️ 局限性与未来方向

This paper focuses on MAPF-AA, and develops a new exact algorithm CBS-AA for MAPF-AA with solution optimality guarantees, based on the popular CBS framework. CBS-AA introduces new conflict resolution techniques for agents with asynchronous actions, which improves the runtime efficiency of the algorithm. Experimental results demonstrate the advantages of our new approaches in different settings against several baseline methods. For future work, one can also consider speed and uncertainty in……

第 2 篇 / 共 5 篇
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
cs.AI cs.CL cs.LG 📅 2026-03-19
👥 作者
Xiao Feng、Bo Han、Zhanke Zhou、Jiaqi Fan、Jiangchao Yao、Ka Ho Li、Dahai Yu、Michael Kwok-Po Ng
🏫 机构单位
  • TMLR Group, Hong Kong Baptist University
  • TCL Corporate Research (HK) Co., Ltd
  • Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
  • Department of Mathematics, Hong Kong Baptist University
📝 论文摘要(原文)

Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.

🔭 研究背景与动机
Large Language Models (LLMs) have demonstrated strong reasoning capabilities, making them compelling foundations for autonomous agents that solve real-world tasks by interacting with external environments, including computer control (Gou et al., 2025), GUI operation (Qin et al., 2025), and robotic manipulation (Liu et al., 2023). In this setting, agentic reinforcement learning (RL) plays a central role in strengthening both capability and reliability by optimizing the expected strategy under environment-provided rewards. Such agent-environment interaction unfolds over multiple turns, producing long-horizon reasoning trajectories that pass through many intermediate states. However, optimization is often hindered by the sparse-reward structure of agentic environments: most provide no state-wise feedback during execution, yielding only a terminal evaluation upon task termination with completion, failure, or truncation. As a result, agentic RL is driven by a coarse, trajectory-level signal rather than fine-grained, state-level guidance, weakening credit assignment and leading to insufficient training. Prior work seeks to recover state-wise (process) rewards but typically relies on training separate reward models with human-annotated data (Lightman et al., 2023; Wang et al., 2025a). This dependence incurs substantial data and computational costs, limiting optimization efficiency and scalability. These limitations motivate our central question: How can we objectively estimate process rewards for intermediate states in agentic tasks without training reward models? (arXiv:2603.18859v1 [cs.AI], 19 Mar 2026)
（图示残留：推理轨迹 → 状态图的构建示例，ALFWorld 任务 "examine the book with the desklamp within 3 steps"，图中标注各轨迹终点为 Task Complete / Task Failed。）
💡 核心贡献
  • To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks.
⚙️ 方法详解
主结果（ALFWorld 各子任务 / WebShop / Sokoban，数值为成功率或得分）：

| 模型 | 方法 | Pick | Look | Clean | Heat | Cool | Pick2 | All | WebShop Score | WebShop Succ. | Sokoban Score | Sokoban Succ. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | Base (Prompting) | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0 | 4.1 | 23.1 | 5.2 | - | - |
| | RLOO | 56.2 | 46.7 | 62.5 | 50.0 | 43.5 | 27.8 | 49.2 | 80.9 | 54.7 | - | - |
| | GRPO | 62.9 | 53.3 | 50.0 | 40.0 | 45.8 | 38.1 | 50.0 | 73.7 | 42.2 | - | - |
| | GiGPO | 59.4 | 46.7 | 62.5 | 44.4 | 43.5 | 56.3 | 53.1 | 75.4 | 55.5 | - | - |
| | RewardFlow | 77.4 | 64.3 | 84.0 | 62.5 | 86.4 | 25.0 | 68.8 | 78.3 | 60.9 | - | - |
| Qwen2.5-(VL)-3B-Instruct | Base (Prompting) | 36.4 | 42.9 | 9.1 | 7.1 | 5.3 | 4.5 | 16.4 | 8.0 | 1.6 | 0.5 | 14.1 |
| | RLOO | 78.1 | 46.7 | 75.0 | 31.3 | 43.5 | 33.3 | 55.5 | 75.2 | 59.4 | 1.0 | 22.7 |
| | GRPO | 79.8 | 57.1 | 75.8 | 21.4 | 31.6 | 36.4 | 56.2 | 69.5 | 53.9 | 1.3 | 26.6 |
| | GiGPO | 82.1 | 50.0 | 76.9 | 53.3 | 60.9 | 50.0 | 64.8 | 80 | 59.4 | 1.2 | 21.9 |
| | RewardFlow | 94.6 | 70.0 | 90.0 | 36.4 | 70.0 | 70.0 | 78.9 | 81.8 | 60.9 | 2.2 | 49.2 |
| Qwen2.5-(VL)-7B-Instruct | Base (Prompting) | 33.3 | 13.3 | 10.7 | 0 | 4.3 | 0 | 10.9 | 26.4 | 7.8 | 0.9 | 18.8 |
| | RLOO | 90.0 | 85.7 | 88.0 | 50.0 | 85.7 | 37.5 | 75.0 | 84.2 | 72.7 | 1.0 | 21.9 |
| | GRPO | 92.3 | 53.3 | 90.5 | 77.8 | 69.6 | 47.6 | 75.0 | 70.3 | 84.8 | 1.0 | 23.4 |
| | GiGPO | 84.6 | 80.0 | 96.0 | 71.4 | 73.9 | 84.0 | 82.8 | 88.4 | 72.7 | 1.4 | 34.4 |
| | RewardFlow | 100 | 78.6 | 100 | 81.3 | 81.8 | 85.0 | 89.8 | 84.4 | 73.4 | 3.0 | 62.4 |

Table 2 图注：Performance of RewardFlow on DeepResearch benchmarks. Following the training and evaluation setup of GiGPO (Feng et al., 2025b), RewardFlow is trained on NarrativeQA and HotpotQA. Results are reported as average accuracy rate (%). Across evaluation Question Answering benchmarks, † indicates in-distribution datasets while * indicates out-of-distribution datasets.
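摘要所述「由轨迹构建状态图 + 拓扑感知的奖励传播」可以用一个极简草图示意（这只是按摘要含义给出的一种假设性实现，具体传播规则并非论文原式）：

```python
from collections import defaultdict

def propagate_rewards(edges, terminal_reward, n_iters=50, gamma=1.0):
    """Sketch of topology-aware reward propagation on a state graph.

    edges: list of (u, v) transitions pooled from many trajectories
           (shared states merge, which creates the graph topology).
    terminal_reward: dict mapping terminal states to their outcome,
           e.g. 1.0 for task complete and 0.0 for failure.
    Returns state -> estimated state-level reward, computed by repeatedly
    averaging successor values (an empirical contribution-to-success score).
    """
    succ = defaultdict(list)
    states = set(terminal_reward)
    for u, v in edges:
        succ[u].append(v)
        states.update((u, v))
    r = {s: terminal_reward.get(s, 0.0) for s in states}
    for _ in range(n_iters):
        nxt = dict(r)
        for s in states:
            if s in terminal_reward or not succ[s]:
                continue  # terminal values stay fixed as boundary conditions
            nxt[s] = gamma * sum(r[t] for t in succ[s]) / len(succ[s])
        r = nxt
    return r
```

当两条轨迹共享中间状态时，成功/失败的终端信号会沿图结构回传，使每个中间状态获得一个不依赖额外奖励模型的 state-level reward（例如同时通向成功与失败分支的分叉点得到折中的分数）。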
🔬 实验与结果
📊 关键实验数据
  • 12.3%
  • 89.8%
  • 62.4%
  • 34.4%
  • 60.9%
第 3 篇 / 共 5 篇
Agent Control Protocol: Admission Control for Agent Actions
cs.CR cs.AI 📅 2026-03-19
👥 作者
Marcelo Fernandez
🏫 机构单位
📝 论文摘要(原文)

Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments. ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously. ACP defines the mechanisms of cryptographic identity, capability-based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control. ACP operates as an additional layer on top of RBAC and Zero Trust, without replacing them. The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5). It includes a Go reference implementation of 22 packages covering all L1-L4 capabilities, 51 signed conformance test vectors (Ed25519 + SHA-256), and an OpenAPI 3.1.0 specification for all HTTP endpoints. It defines more than 62 verifiable requirements, 12 prohibited behaviors, and the mechanisms for interoperability between institutions. Specification and implementation: https://github.com/chelof100/acp-framework-en

🔭 研究背景与动机
💡 核心贡献
  • Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments.
  • ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously.
  • ACP defines the mechanisms of cryptographic identity, capability-based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control.
⚙️ 方法详解

⚙️ 主要步骤:

  1. 3.5 Verifiable Chained Delegation (ACP-DCMA-1.…)
  2. 3.6 Execution Token (ACP-EXEC-1.…)
  3. 3.7 Audit Ledger (ACP-LEDGER-1.…)
  4. 4.1 Institutional Trust Anchor (ACP-ITA-1.…)
  5. 4.2 Mutual Recognition (ACP-ITA-1.…)
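其中 Audit Ledger（ACP-LEDGER）要求的不可变审计，通常可以用 SHA-256 哈希链实现：每条记录链接前一条的摘要，任何篡改都会使其后所有链接失效。下面是一个通用草图（字段与函数名均为示意，并非 ACP 规范的实际线格式）：

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(ledger, record):
    """Append an audit record, chaining it to the previous entry's hash."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    body = json.dumps({"prev": prev, "record": record}, sort_keys=True)
    entry = {"prev": prev, "record": record,
             "hash": hashlib.sha256(body.encode()).hexdigest()}
    ledger.append(entry)
    return entry

def verify_ledger(ledger):
    """Recompute every link; any mutation breaks the chain downstream."""
    prev = GENESIS
    for e in ledger:
        body = json.dumps({"prev": prev, "record": e["record"]}, sort_keys=True)
        if e["prev"] != prev or e["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = e["hash"]
    return True
```

这种 append-only 结构只说明「防篡改可验证」这一性质；ACP 规范中的具体字段、签名（Ed25519）与一致性要求以原文 36 份技术文档为准。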
🔬 实验与结果

📌 请参阅原文实验章节获取详细数据

第 4 篇 / 共 5 篇
ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents
cs.AI 📅 2026-03-19
👥 作者
Hao Zhang、Mingjie Liu、Shaokun Zhang、Songyang Han、Jian Hu、Zhenghui Jin、Yuchi Zhang、Shizhe Diao、Ximing Lu、Binfeng Xu 等(共 13 位作者)
🏫 机构单位
  • NVIDIA
📝 论文摘要(原文)

Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent , a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.

🔭 研究背景与动机
Recent advances in reinforcement learning from verifiable rewards (RLVR) for large language models (LLMs) are increasingly shifting from single-turn to multi-turn agentic tasks (Cao et al., 2025a; Gao et al., 2025; Guo et al., 2025; Hu et al., 2025; Luo et al., 2025a). Unlike single-turn tasks, multi-turn agentic tasks typically involve interacting with external environments, such as code repositories (Jimenez et al., 2023), web browsers (Zhou et al., 2023), or even full computer operating systems (Xie et al., 2024) via iterative tool use. As a result, they often produce trajectories that span dozens of turns and tens of thousands of tokens. Training such agents with RL requires repeatedly rolling out policies in these environments and using the resulting trajectories for optimization. As task scale and complexity grow, rollout generation becomes a major bottleneck due to the heterogeneous environments and non-instantaneous feedback inherent in agentic tasks. For example, a single rollout in software engineering tasks often involves many sequential environment interactions, each of which may incur highly variable latency depending on the execution result or environment response. In response, a number of agentic RL training frameworks have recently emerged (Cao et al., 2025b; Jiang et al., 2025; Liu et al., 2025b; Luo et al., 2025c; Sheng et al., 2025; Tan et al., 2025; Xi et al., 2026). A counterintuitive design in existing frameworks is the tight coupling of agentic rollout with the RL training stack, with the agent lifecycle handled within the trainer. Coupling two modules with fundamentally different responsibilities leads to two major limitations. 1. Conflicting system requirements: Rollout and policy training have fundamentally different resource and operational characteristics. Rollout is I/O-intensive, involving sandbox creation, long-lived tool sessions, and asynchronous coordination across hundreds of concurrent instances. Training, by contrast, is GPU-intensive, centered on forward and backward passes and gradient synchronization.……
💡 核心贡献
  • Under the rollout-as-a-service philosophy, we present ProRL Agent , a scalable infrastructure that serves the full agentic rollout lifecycle through an API service.
⚙️ 方法详解

📌 请参阅原文方法章节获取详细内容（此处原为框架对比表格的残片）
🔬 实验与结果
We next present the experimental results of ProRL Agent across different tasks. We also perform in-depth investigations to provide a better understanding of our infrastructure.

4.1 Experimental Setup. Unless otherwise specified, we adopt DAPO (Yu et al., 2025) as the default RL algorithm, which filters out instances that are either too easy (resolved ratio 100%) or too hard (resolved ratio 0%). We use a batch size of 32, a mini-batch size of 8, and generate 8 rollouts per instance. Rollouts with errors are excluded from gradient computation. The KL coefficient is set to 1 × 10⁻⁴ and the learning rate to 1 × 10⁻⁶. All RL training is performed on 32 NVIDIA H100 GPUs.

Table 2: Comparison of performance on SWE-Bench Verified across models of different scales. We report the reproduced performance and, where available, the reported results from prior work.

| Size | Model | Reproduced | Reported |
|---|---|---|---|
| 4B | Qwen3-4B-Instruct-2507 | 14.8 | – |
| 4B | ProRL Agent-4B (RL) | 21.2 | – |
| 8B | Qwen3-8B | 9.6 | – |
| 8B | SkyRL-Agent-8B-v0 | – | 9.4 |
| 8B | ProRL Agent-8B (RL) | 18.0 | – |
| 14B | Qwen3-14B | 15.4 | – |
| 14B | SkyRL-Agent-14B-v0 | – | 21.6 |
| 14B | ProRL Agent-14B (RL) | 23.6 | – |

4.2 Main Results on Software Engineering. We primarily evaluate ProRL Agent on software engineering tasks. Specifically, we train Qwen3-4B-Instruct-2507, Qwen3-8B, and Qwen3-14B on the 293-instance subset of SWE-Gym used in SkyRL-v0 (Cao et al., 2025a). For the thinking models, Qwen3-8B and Qwen3-14B, we enable thinking mode during training. The results are reported in Table 2. As shown in Table 2, ProRL Agent consistently improves performance across all model sizes. Compared with SkyRL-v0 (Cao et al., 2025a), the gains are particularly notable for the 8B model, where ProRL Agent achieves nearly a 2× improvement on SWE-Bench Verified. These results suggest that our infrastructure provides a more effective and stable foundation for RL training on software engineering agents.

4.3 Generality Across Agent Domains. Beyond software engineering agents, we further demonstrate the generality of ProRL Agent by conducting RL training in other domains. STEM Agent. We further train a STEM agent designed to solve complex question-answering tasks across science, technology, engineering, and mathematics. Its primary tool is web search, which enables retrieval of external knowledge for open-domain reasoning. In addition, the agent is equipped with the Bash and IPython tools provided by our infrastructure, allowing it to write and execute code for numerical computation and symb……
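4.1 节描述的 DAPO 式实例过滤（剔除 resolved ratio 为 100% 或 0% 的实例，并把出错的 rollout 排除在梯度计算之外）可以示意如下（字段名为本文假设）：

```python
def filter_rollout_groups(groups):
    """Keep only instances with a non-degenerate resolved ratio.

    groups: dict mapping instance id -> list of rollout results, each a
            dict {"ok": bool (no infra error), "resolved": bool}.
    Errored rollouts are dropped first (excluded from gradients); an
    instance survives only if 0 < resolved ratio < 1 among valid rollouts,
    so the advantage signal within the group is non-trivial.
    """
    kept = {}
    for iid, rollouts in groups.items():
        valid = [r for r in rollouts if r["ok"]]
        if not valid:
            continue
        ratio = sum(r["resolved"] for r in valid) / len(valid)
        if 0.0 < ratio < 1.0:   # all-solved (too easy) / all-failed (too hard) filtered
            kept[iid] = valid
    return kept
```

在论文设定下，每个实例生成 8 个 rollout，过滤后剩余的组才进入 mini-batch 参与梯度更新。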
第 5 篇 / 共 5 篇
Memento-Skills: Let Agents Design Agents
cs.AI cs.CL cs.LG 📅 2026-03-19
👥 作者
Huichi Zhou、Siyuan Guo、Anjie Liu、Zhongwei Yu、Ziqin Gong、Bowen Zhao、Zhixun Chen、Menglong Zhang、Yihang Chen、Jinsong Li 等(共 17 位作者)
🏫 机构单位
📝 论文摘要(原文)

We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an "agent-designing agent": it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 (wang2025memento2). In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.

🔭 研究背景与动机
💡 核心贡献
  • We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an "agent-designing agent": it autonomously constructs, adapts, and improves task-specific agents through experience.
  • Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 (wang2025memento2).
⚙️ 方法详解

⚙️ 主要步骤:

  1. π_µ(a | s, M_t) = Σ_{c ∈ M_t} µ(c | s, M_t) · p_LLM(a | s, c)
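上式把整体策略写成「skill router µ 选技能、冻结的 LLM 在该技能条件下出动作」的混合分布，可以逐项计算如下（router_prob 与 llm_prob 为占位函数，仅作示意）：

```python
def mixture_policy_prob(action, state, memory, router_prob, llm_prob):
    """pi_mu(a | s, M) = sum over c in M of mu(c | s, M) * p_LLM(a | s, c).

    router_prob(c, state, memory) -> probability that the router picks skill c;
    llm_prob(action, state, c)    -> the frozen LLM's action probability
                                     conditioned on skill c.
    Both callables stand in for the real router and LLM.
    """
    return sum(router_prob(c, state, memory) * llm_prob(action, state, c)
               for c in memory)
```

例如对两个技能的均匀 router，混合概率就是两个条件概率的平均；学习只改变 µ 与技能库 M，而不改变 p_LLM 的参数。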
🔬 实验与结果
📊 关键实验数据
  • 26.2%
  • 10%
  • 66.0%
  • 38.7%
  • 30.8%
🎯 论文中的具体示例
📌 原文摘录 / Case Study

（Figure 2 图注：The three paradigms of LLM adaptation. Pre-training and fine-tuning update the model parameters θ and require large data and compute budgets. Deployment-time learning (this work) keeps θ frozen and instead accumulates experience in an external skill memory M, enabling continual adaptation from live interactions at zero retraining cost.）
（Figure 3 图注：Overview of the Read-Write Reflective Learning loop. 循环为：state s_t →（READ）c_t ∼ µ(·|s_t, M_t) → LLM act a_t ∼ p_LLM(·|s_t, c_t) → environment feedback r_t, s_{t+1} →（WRITE）M_{t+1} ← Write(M_t, s_t, a_t, r_t) → 进入下一步。）Given……
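Figure 3 的 Read-Write 循环可以用如下草图示意（env/read/write 的接口均为本文假设的极简形式，并非论文代码；要点在于 LLM 参数始终冻结，只有外部技能记忆 M 演化）：

```python
def read_write_loop(env, llm_act, read, write, memory, max_steps=20):
    """Deployment-time learning loop sketched from Fig. 3 (names illustrative).

    read(state, memory)   -> skill c   (READ: the router selects a skill)
    llm_act(state, skill) -> action    (frozen LLM acts under the stateful prompt)
    write(memory, state, action, reward) -> memory  (WRITE: reflective update)
    The LLM parameters are never updated; only `memory` evolves.
    """
    state = env.reset()
    for _ in range(max_steps):
        skill = read(state, memory)                    # READ phase
        action = llm_act(state, skill)                 # act with the chosen skill
        state, reward, done = env.step(action)         # environment feedback
        memory = write(memory, state, action, reward)  # WRITE phase
        if done:
            break
    return memory
```

一次交互的负反馈（低 reward）会在 WRITE 阶段转化为新的或修订后的技能条目，下一次 READ 即可选到更好的技能——这正是「零重训成本的持续学习」的含义。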