Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where all agents' actions start at the same time and each takes one time unit, which limits the use of MAPF planners in practice. To lift this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS was recently shown to be incomplete due to the uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and solves MAPF-AA with completeness and solution optimality guarantees. Building on CBS-AA, we also develop conflict resolution techniques that further improve its scalability. Our test results show that our method reduces the number of branches by up to 90%.
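To make the asynchronous setting concrete, here is a minimal sketch of conflict detection when actions have real-valued start times and durations. The `Action` representation and the vertex-conflict rule are illustrative assumptions, not the paper's exact collision model:

```python
# Sketch: timed vertex-conflict detection for MAPF with asynchronous
# actions. Each action occupies a vertex over a half-open real-valued
# time interval [start, end); two actions conflict if they occupy the
# same vertex during overlapping intervals.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    agent: int
    vertex: str      # vertex occupied by this action
    start: float     # real-valued start time
    end: float       # real-valued end time (start + duration)

def vertex_conflict(a: Action, b: Action) -> bool:
    """True iff the two actions occupy the same vertex at some common time."""
    return (a.vertex == b.vertex
            and a.start < b.end
            and b.start < a.end)

# Agent 0 waits at B during [1.0, 3.0); agent 1 arrives at B at t = 2.5.
w = Action(agent=0, vertex="B", start=1.0, end=3.0)
m = Action(agent=1, vertex="B", start=2.5, end=3.5)
print(vertex_conflict(w, m))  # True: the intervals overlap
```

Because start times and durations are continuous, such conflicts cannot be detected by comparing discrete timesteps, which is what makes the asynchronous setting harder than classical MAPF.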
A wait action w = (B, B, 2.0) means waiting at vertex B for a duration of 2.0, and CCBS adds a constraint ⟨i, a_w^i, [t_i, t_i^u)⟩ to prohibit agent i from performing a_w^i at any time t ∈ [t_i, t_i^u). But agent i can still perform the wait actions a_w1^i = (B, B, 2.01), a_w2^i = (B, B, 2.001), a_w3^i = (B, B, 2.0001), ... at times t ∈ [t_i, t_i^u). So CCBS may not terminate when wait actions are involved. The open-sourced implementation therefore handles wait-action constraints differently when passing them to the low-level solver: for the constraint ⟨i, a_w^i, [t_i, t_i^u)⟩ above, CSIPP splits the safe interval of B into two parts, [0, t_i) and [t_i^u, ∞).
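The safe-interval split described above can be sketched as follows. The list-of-tuples interval representation is an illustrative assumption; the point is that removing [t_i, t_i^u) from a vertex's safe intervals lets the low-level search reason over interval endpoints instead of enumerating individual continuous wait durations:

```python
# Sketch: subtracting a constrained window [t_i, t_u) from a vertex's
# half-open safe intervals, as in the CSIPP handling described above.
INF = float("inf")

def split_safe_intervals(intervals, t_i, t_u):
    """Remove [t_i, t_u) from a list of half-open safe intervals [lo, hi)."""
    out = []
    for lo, hi in intervals:
        if hi <= t_i or lo >= t_u:      # no overlap: keep unchanged
            out.append((lo, hi))
            continue
        if lo < t_i:                    # surviving part before the constraint
            out.append((lo, t_i))
        if hi > t_u:                    # surviving part after the constraint
            out.append((t_u, hi))
    return out

# Constraint [2.0, 5.0) on vertex B, whose initial safe interval is [0, inf):
print(split_safe_intervals([(0.0, INF)], 2.0, 5.0))  # [(0.0, 2.0), (5.0, inf)]
```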
This paper focuses on MAPF-AA and develops a new exact algorithm, CBS-AA, for MAPF-AA with solution optimality guarantees, based on the popular CBS framework. CBS-AA introduces new conflict resolution techniques for agents with asynchronous actions, which improve the runtime efficiency of the algorithm. Experimental results demonstrate the advantages of our new approaches in different settings against several baseline methods. For future work, one can also consider speed and uncertainty in
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
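The general idea of propagating terminal rewards backwards through a state graph can be illustrated with a toy sketch. The specific update rule below (discounted max over successors, iterated to a fixed point) is an assumption chosen for simplicity, not RewardFlow's actual propagation mechanism:

```python
# Sketch: dense state-level credit via backward propagation over a state
# graph built from reasoning trajectories. Terminal states carry the
# sparse success/failure reward; intermediate states inherit discounted
# credit through the graph's edges.
from collections import defaultdict

def propagate_rewards(edges, terminal_reward, gamma=0.9, iters=50):
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    value = dict(terminal_reward)  # seed terminal states with sparse rewards
    for _ in range(iters):        # iterate the backup until it stabilizes
        for u, vs in succ.items():
            best = max(value.get(v, 0.0) for v in vs)
            value[u] = max(value.get(u, 0.0), gamma * best)
    return value

# Two trajectories share state "s1"; only the branch through t_good succeeds.
edges = [("s0", "s1"), ("s1", "t_good"), ("s1", "t_bad")]
v = propagate_rewards(edges, {"t_good": 1.0, "t_bad": 0.0})
print(round(v["s1"], 2), round(v["s0"], 2))  # 0.9 0.81
```

The point of the sketch: "s1" receives nonzero credit even though one of its continuations fails, giving the policy a dense, state-level learning signal where the terminal reward alone would be sparse.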
Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments. ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously. ACP defines the mechanisms of cryptographic identity, capability-based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control. ACP operates as an additional layer on top of RBAC and Zero Trust, without replacing them. The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5). It includes a Go reference implementation of 22 packages covering all L1-L4 capabilities, 51 signed conformance test vectors (Ed25519 + SHA-256), and an OpenAPI 3.1.0 specification for all HTTP endpoints. It defines more than 62 verifiable requirements, 12 prohibited behaviors, and the mechanisms for interoperability between institutions. Specification and implementation: https://github.com/chelof100/acp-framework-en
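One of the mechanisms ACP names, immutable auditing, can be sketched with a SHA-256 hash chain, where each log entry commits to its predecessor's digest so any retroactive edit is detectable. The record fields and JSON canonicalization below are illustrative assumptions; ACP's actual formats are defined in the specification:

```python
# Sketch: a hash-chained audit log. Each entry's digest covers both its
# record and the previous entry's digest, so tampering with any earlier
# entry invalidates every digest after it.
import hashlib
import json

GENESIS = "0" * 64  # sentinel digest for the first entry

def append_entry(log, record):
    prev = log[-1]["digest"] if log else GENESIS
    body = json.dumps({"prev": prev, "record": record}, sort_keys=True)
    log.append({"prev": prev, "record": record,
                "digest": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log):
    prev = GENESIS
    for e in log:
        body = json.dumps({"prev": prev, "record": e["record"]}, sort_keys=True)
        if e["prev"] != prev or hashlib.sha256(body.encode()).hexdigest() != e["digest"]:
            return False
        prev = e["digest"]
    return True

log = []
append_entry(log, {"agent": "a1", "action": "read:ledger", "allowed": True})
append_entry(log, {"agent": "a1", "action": "write:ledger", "allowed": False})
print(verify_chain(log))            # True
log[0]["record"]["allowed"] = False
print(verify_chain(log))            # False: tampering detected
```

In ACP's setting the chain would additionally be signed (the spec's test vectors use Ed25519), so integrity and authorship are verifiable together.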
Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.
We introduce \emph{Memento-Skills}, a generalist, continually-learnable LLM agent system that functions as an \emph{agent-designing agent}: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with \emph{stateful prompts}, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emph{Read--Write Reflective Learning} mechanism introduced in \emph{Memento~2}~\cite{wang2025memento2}. In the \emph{read} phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emph{write} phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables \emph{continual learning without updating LLM parameters}, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to \emph{design agents end-to-end} for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emph{General AI Assistants} benchmark and \emph{Humanity's Last Exam} demonstrate sustained gains, achieving 26.2\% and 116.2\% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.
Figure 2: The three paradigms of LLM adaptation. Pre-training and fine-tuning update the model parameters θ and require large data and compute budgets. Deployment-time learning (this work) keeps θ frozen and instead accumulates experience in an external skill memory M, enabling continual adaptation from live interactions at zero retraining cost.

Figure 3: Overview of the Read–Write Reflective Learning loop. Given state s_t (a new ticket), the READ phase selects context c_t ∼ µ(·|s_t, M_t); the LLM acts with a_t ∼ p_LLM(·|s_t, c_t); the environment returns feedback r_t and s_{t+1}; the WRITE phase then updates the skill memory, M_{t+1} ← Write(M_t, s_t, a_t, r_t).
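The Read–Write loop of Figure 3 can be sketched in a few lines, with the skill router, LLM, and environment replaced by stand-in logic; the matching rule and score update below are illustrative assumptions, not Memento's actual mechanisms:

```python
# Sketch of one step of the Read-Write Reflective Learning loop:
# READ selects a skill from memory given the state, the agent acts,
# and WRITE folds the environment feedback back into the skill memory.
def read(state, memory):
    # READ phase (c_t ~ mu(.|s_t, M_t)): pick the skill whose name
    # appears in the state, breaking ties by stored score (stand-in router).
    return max(memory, key=lambda k: memory[k]["score"] if k in state else -1)

def write(memory, skill, reward):
    # WRITE phase (M_{t+1} <- Write(M_t, s_t, a_t, r_t)):
    # reinforce or penalise the skill used this step.
    memory[skill]["score"] += reward
    return memory

memory = {"search": {"score": 0.0}, "terminal": {"score": 0.0}}
state = "new ticket: search docs"
skill = read(state, memory)            # READ: router selects "search"
reward = 1.0                           # environment feedback r_t
memory = write(memory, skill, reward)  # WRITE: memory evolves, theta stays frozen
print(skill, memory["search"]["score"])
```

All adaptation happens in the external `memory` dictionary, which is the essential property of the deployment-time paradigm in Figure 2: the model parameters are never touched.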