Recursive Agent Optimization: training a model to call itself
Ciprian · 8 min read
What the paper studies
Recursive Agent Optimization is a reinforcement learning method for training language model agents that delegate sub-tasks to new instantiations of themselves. The question: can a model be taught to call itself, decide what to hand off, and combine the answers, in a way that beats a single agent given the same compute and the same context budget? RAO is the training procedure; recursive agents are what the procedure produces.
Methodology
RAO trains agents with reinforcement learning to operate recursively at inference time. An agent receives a task, decides whether to solve it directly or split it into sub-tasks, spawns child agents (new instantiations of the same model) for each sub-task, and aggregates their returns. The training procedure teaches three policies inside one model: when to delegate, what to send the child, and how to use what comes back. Comparisons in the paper are against single-agent baselines on the same underlying model.
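A minimal sketch of that inference-time loop, with a stubbed model call. `call_model`, the decision format, and the depth cap are illustrative assumptions, not the paper's interface.

```python
# Minimal sketch of the recursive loop described above. call_model() is a
# stand-in for the underlying LLM; the decision format, depth cap, and
# aggregation are assumptions, not RAO's API.
from dataclasses import dataclass

@dataclass
class Decision:
    delegate: bool         # when to delegate
    subtasks: list[str]    # what to send each child

def call_model(prompt: str) -> Decision:
    # Placeholder: a real system would parse the model's output here.
    return Decision(delegate=len(prompt) > 200,
                    subtasks=[prompt[:100], prompt[100:]])

def solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
    decision = call_model(task)
    if depth >= max_depth or not decision.delegate:
        return f"answer({task[:30]}...)"   # solve directly
    # Spawn a child (a fresh instantiation of the same model) per sub-task,
    # then aggregate the returns (the third learned policy).
    child_answers = [solve(t, depth + 1, max_depth) for t in decision.subtasks]
    return " | ".join(child_answers)

print(solve("x" * 300))
```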
Full specifications, benchmark names, dataset sizes, reward design details, and numeric performance deltas appear in the full PDF.
Findings
First, recursive agents trained with RAO process inputs longer than the underlying model’s context window. A parent agent hands chunks of the input to children, each child runs in its own fresh context, and the parent aggregates. This is inference-time scaling: compute and context grow with the recursion tree, not with the model.
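A sketch of the chunking pattern, assuming a fixed token budget and a stand-in `summarize` child; both are illustrative, not from the paper.

```python
# Sketch: processing an input longer than the context window by handing
# fixed-size chunks to children, each with a fresh context.
# CONTEXT_BUDGET and summarize() are illustrative assumptions.
CONTEXT_BUDGET = 4_000  # hypothetical per-agent budget, in characters here

def summarize(chunk: str) -> str:
    return chunk[:50]  # stand-in for a child agent's return

def map_reduce(document: str, chunk_size: int = CONTEXT_BUDGET) -> str:
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # Each child reads one chunk; the parent only ever sees the returns,
    # so total input scales with the tree, not with the window.
    returns = [summarize(c) for c in chunks]
    return "\n".join(returns)
```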
Second, the trained agents generalize to tasks harder than those seen in training. A hard instance becomes a recursion of easier instances. The training signal selects for splits the model can actually solve at depth.
Third, training efficiency improves. Decomposition has a structural advantage: shorter sub-tasks generate reward signals more frequently per rollout. Instead of learning from a single success-or-failure at the end, the agent gets feedback on each sub-task completion. This denser reward stream means more gradient updates per sample, which makes RL on long-horizon tasks tractable.
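Back-of-envelope arithmetic for the claim, with illustrative numbers that are not from the paper:

```python
# Reward events per rollout, flat vs recursive. Numbers are illustrative.
flat_rollout_steps = 200          # one long trajectory, one terminal reward
rewards_flat = 1

subtasks = 8                      # the same work split into 8 sub-tasks
rewards_recursive = subtasks + 1  # one per sub-task completion, plus the root

print(rewards_flat / flat_rollout_steps)       # 0.005 rewards per step
print(rewards_recursive / flat_rollout_steps)  # 0.045 rewards per step
```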
Fourth, wall-clock time decreases versus single-agent systems on the same problem. Children run in parallel; the parent waits on the slowest branch, not the sum of branches.
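A sketch of the parallelism claim using `asyncio`; the branch durations are illustrative.

```python
# Children run concurrently, so the parent's latency is the max of the
# branches, not the sum. Durations are illustrative.
import asyncio
import time

async def child(seconds: float) -> str:
    await asyncio.sleep(seconds)   # stand-in for a child agent's work
    return f"done in {seconds}s"

async def parent() -> None:
    start = time.monotonic()
    # Three branches of 1s, 2s, 3s: run sequentially this would take ~6s.
    results = await asyncio.gather(child(1), child(2), child(3))
    print(results, f"elapsed ~{time.monotonic() - start:.1f}s")  # ~3s

asyncio.run(parent())
```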
Fifth, RAO teaches communication, not only decomposition. The model learns what to put in the child’s prompt and how to interpret the child’s return. When humans hand-wire recursive agents, the communication usually fails. The parent gives the child too little context, or too much, or the wrong thing, and the child confidently returns garbage.
Limitations
No numeric deltas, task names, or model sizes appear in the abstract. The training relies on RL, which has the usual difficulties: reward design, sample efficiency, and exploration on long-horizon tasks. Credit assignment becomes harder with recursion: a wrong decomposition at depth one breaks every sub-tree below. Fixing it requires gradients to flow back through the parent’s split decision, across multiple levels. Wall-clock gains assume the host can run children concurrently, which is a deployment constraint, not a model property. Cost and total token spend are not addressed in the abstract: parallel children burn more tokens than one agent, even if elapsed time drops. Generalization may not hold on tasks where decomposition is not natural: tightly coupled reasoning, problems with global constraints, or tasks where each sub-decision depends on the others.
Real-world application
For teams building code-generation and code-review systems on top of LLMs, the paper’s claims map to specific engineering moves.
Context-window scaling. If your code-review agent fails on large pull requests because the diff plus the relevant files exceed the context window, the standard fix is map-reduce over hunks written as Python glue. RAO suggests training the agent to do the splitting and the reading itself. The benefit: the model learns which hunks to group, not the human. Action: log the cases your hand-written splitter gets wrong (reviews that miss cross-file bugs, reviews that flag spurious diffs because they lost context) and use those as training data for a recursive variant.
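A hypothetical shape for that failure log; every field name here is an assumption, chosen so the records can later become training data for a recursive variant.

```python
# Hypothetical failure-log record for a hand-written diff splitter.
import json

record = {
    "pr_id": "example-123",               # placeholder identifier
    "failure": "missed_cross_file_bug",   # or "spurious_flag_lost_context"
    "hunks_grouped_by_splitter": [["a.py:10-40"], ["b.py:5-25"]],
    "grouping_a_correct_review_needed": [["a.py:10-40", "b.py:5-25"]],
}
print(json.dumps(record, indent=2))
```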
Hard-task generalization. Codebases contain rare bugs that look nothing like the training distribution. A recursive agent that can split a 2000-line refactor into a tree of 50-line edits is more likely to land each leaf, even when the root task is harder than anything in training. Action: stop benchmarking your code agent only on whole-task success. Add a metric for sub-task success at depth, and a metric for whether the parent’s split was sensible. Both are training signal.
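A sketch of the depth metric, assuming a simple nested-dict tree format (an assumption, not a standard):

```python
# Score a recursion tree by sub-task success at each depth, in addition
# to whole-task success. Node format: {"success": 0 or 1, "children": [...]}.
from collections import defaultdict

def depth_success(node: dict, depth: int = 0, acc=None) -> dict:
    acc = acc if acc is not None else defaultdict(lambda: [0, 0])
    ok, total = acc[depth]
    acc[depth] = [ok + node["success"], total + 1]
    for child in node.get("children", []):
        depth_success(child, depth + 1, acc)
    return {d: ok / total for d, (ok, total) in acc.items()}

tree = {"success": 1, "children": [
    {"success": 1, "children": []},
    {"success": 0, "children": []},
]}
print(depth_success(tree))  # {0: 1.0, 1: 0.5}
```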
Wall-clock budget. A reviewer that fans out across files and merges findings finishes in the time of the slowest file, not the sum. For interactive tools (IDE assistants, PR-review bots that block merges), that can be the difference between a five-minute response and a 30-second one. Action: instrument your agent's call graph. Measure what fraction of work is on the critical path versus parallelizable. If most steps are sequential, your agent will not benefit from recursion regardless of how it is trained.
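A sketch of that measurement, assuming a trace format with per-node durations and children:

```python
# Fraction of an agent's call graph on the critical path. Node format
# ({"duration": seconds, "children": [...]}) is an assumption.
def total_work(node: dict) -> float:
    return node["duration"] + sum(total_work(c) for c in node.get("children", []))

def critical_path(node: dict) -> float:
    children = node.get("children", [])
    return node["duration"] + (max(map(critical_path, children)) if children else 0.0)

trace = {"duration": 2.0, "children": [
    {"duration": 5.0, "children": []},
    {"duration": 3.0, "children": []},
]}
# If this ratio is close to 1, the work is mostly sequential and recursion
# will not buy wall-clock time.
print(critical_path(trace) / total_work(trace))  # 0.7
```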
Training efficiency. Shorter sub-tasks produce denser reward: the agent gets feedback on each sub-task, not just the root outcome. For teams fine-tuning code agents with RL (increasingly common as SWE-bench-style suites mature), train the recursive policy directly rather than layering recursion on top of a flat policy later. Action: when you collect rollouts, log the recursion structure, not only the final answer. Sub-task-level reward produces tighter credit assignment than root-level reward: you can identify which splits worked, which merges went wrong, and update those decisions specifically.
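A sketch of tree-structured rollout logging; the schema and field names are assumptions, not the paper's.

```python
# Log rollouts with their recursion structure so reward can be attributed
# to individual splits and merges.
import json
import uuid

def log_step(parent_id, kind, payload, reward=None):
    return {"id": str(uuid.uuid4()), "parent": parent_id,
            "kind": kind, "payload": payload, "reward": reward}

root = log_step(None, "split", {"subtasks": 2})
leaf_a = log_step(root["id"], "subtask", {"task": "edit a.py"}, reward=1.0)
leaf_b = log_step(root["id"], "subtask", {"task": "edit b.py"}, reward=0.0)
merge = log_step(root["id"], "merge",
                 {"inputs": [leaf_a["id"], leaf_b["id"]]}, reward=0.0)
# Sub-task-level rewards let training update the split and merge decisions
# specifically, rather than learning from a single root-level outcome.
print(json.dumps([root, leaf_a, leaf_b, merge], indent=2))
```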
Communication learning. RAO learns what to communicate between parent and child, not just whether to split. The RL training optimizes the parent's prompt construction. The parent learns: include function signatures and call sites for this sub-task, exclude unrelated files; include imports and types, exclude build logs; include test coverage for this change, not the entire repository. Hand-written templates are static; they can't adapt to sub-task type, code style, or depth. When humans hand-wire recursive agents, the communication fails: the parent over-specifies (too much context, confusing the child) or under-specifies (missing context). Action: if you maintain prompt templates now, treat them as initialization. Log failures where the instruction was the bottleneck: a child that returned garbage because its context was insufficient, or because it was buried in irrelevant information. Build preference pairs: instruction formats that led to correct outputs versus ones that led to errors. Feed these to RL training so the objective optimizes communication directly.
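A sketch of such preference pairs; the structure and field names are assumptions.

```python
# Turn logged instruction failures into preference pairs over
# parent-to-child prompt formats, usable by preference-based RL training.
def make_pair(subtask: str, good_prompt: str, bad_prompt: str) -> dict:
    return {"subtask": subtask,
            "chosen": good_prompt,    # instruction that led to a correct return
            "rejected": bad_prompt}   # instruction that led to an error

pair = make_pair(
    "review change to parse_config()",
    "Signature, call sites, and types for parse_config; nothing else.",
    "Full repository dump including build logs and unrelated files.",
)
print(pair)
```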
For platform teams running these agents in production, three operational consequences follow. First, spawning child agents quickly becomes the bottleneck. If each child waits for a fresh model instance to load, recursive agents time out. This typically means maintaining a warm pool (model instances running, ready to receive requests) or batching child requests (queuing them to reuse a shared pool of instances). Second, observability must track trees, not lines. A flat log of LLM calls is illegible when the agent is recursive; each recursive task can span hundreds of calls across the tree. Solution: hierarchical tracing where each parent spawns children with explicit parent-child links, and the parent's split decision is a recorded event. Third, cost reporting must attribute tokens to the full tree. Billing only on root tasks under-counts spend: each parallel child adds tokens. Finance needs to see the tree structure, not just the root outcome.
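A sketch of the tracing shape, written as generic span plumbing rather than any specific tracing library's API:

```python
# Hierarchical tracing: every child span carries an explicit parent link,
# and the split decision is itself a recorded event.
import itertools

_ids = itertools.count(1)
SPANS: list[dict] = []

def span(parent_id, name, attrs=None) -> int:
    s = {"id": next(_ids), "parent": parent_id, "name": name,
         "attrs": attrs or {}}
    SPANS.append(s)
    return s["id"]

root = span(None, "review_pr")
span(root, "split_decision", {"subtasks": 2})   # the split is a recorded event
for f in ("a.py", "b.py"):
    span(root, "child_review", {"file": f, "tokens": 0})
# Cost attribution can now roll token counts up the tree via parent links.
print(SPANS)
```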
For evaluation, single-task success rate is no longer sufficient. A recursive agent has three failure modes: over-splitting (decomposing a task that should be solved directly), under-splitting (trying to handle more than the child’s context allows), and bad merges (correct sub-answers, wrong combination). Each needs its own metric. These metrics don’t appear in the paper, but anyone deploying RAO-style agents will invent them.
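A sketch of counting the three failure modes separately, assuming an eval harness that labels episodes:

```python
# Per-failure-mode rates for a recursive agent. Labels are assumptions
# about what an eval harness would record.
from collections import Counter

episodes = [
    {"failure": None},
    {"failure": "over_split"},    # decomposed a task it should have solved flat
    {"failure": "under_split"},   # a sub-task exceeded the child's context
    {"failure": "bad_merge"},     # correct sub-answers, wrong combination
]
counts = Counter(e["failure"] for e in episodes if e["failure"])
rates = {k: v / len(episodes) for k, v in counts.items()}
print(rates)  # {'over_split': 0.25, 'under_split': 0.25, 'bad_merge': 0.25}
```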
For tool design, recursion biases toward small, composable tools. A recursive agent that can call any tool at any depth needs tools that behave the same regardless of who calls them and at what depth. Stateful tools (long-lived shells, build sessions, mutable workspaces) are awkward in this setting because concurrent children might modify the same resource. Stateless tools (read file, search, type-check a snippet) compose without locking. If your code agent depends on a shared workspace, decide whether children get scoped sub-workspaces or share the parent’s, and document the contract.
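A sketch of the scoped-sub-workspace option, purely illustrative; a real system would more likely use overlay filesystems or git worktrees.

```python
# Give each child its own copy of the parent workspace so concurrent
# children cannot clobber a shared resource; the parent merges explicitly.
import shutil
import tempfile
from pathlib import Path

def scoped_workspace(parent_ws: Path) -> Path:
    child_ws = Path(tempfile.mkdtemp(prefix="child-ws-"))
    shutil.copytree(parent_ws, child_ws, dirs_exist_ok=True)
    return child_ws  # child edits here, never in parent_ws
```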
For safety and review of agent output, recursion compounds error. If a leaf is wrong with probability p and the root depends on n independent leaves, root success falls to roughly (1 − p)^n, so it drops quickly as the tree widens and deepens. The countermeasure is per-level verification: a child returns an answer plus a check the parent can run cheaply (a unit test, a type-check, a regex assertion, a file hash). What to communicate includes verifiable returns, not only natural-language summaries. A child's confidence is not a substitute for a check.
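A sketch of a verifiable return; the return shape is an assumption.

```python
# A child return that pairs an answer with a cheap check the parent
# can run before accepting it.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableReturn:
    answer: str
    check: Callable[[], bool]   # unit test, type-check, regex, hash...

def child_edit() -> VerifiableReturn:
    patched = "def add(a, b):\n    return a + b\n"
    # The check re-executes cheaply at the parent; confidence is not a check.
    return VerifiableReturn(answer=patched,
                            check=lambda: "return a + b" in patched)

ret = child_edit()
assert ret.check(), "reject the child's answer: retry or re-split"
```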
Code-agent vendors have avoided this question: when should an agent recurse? The answer cannot be always (recursion costs tokens) and cannot be never (flat agents exceed the context window on real-world repositories). RAO gives a learned answer. For a vendor without RAO, a reasonable engineering proxy is task length: if the input or expected output exceeds half the context window, recurse; otherwise run flat. Measure on your own data; the right cutoff depends on the model, the task, and the cost of a child.
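The proxy as code, with the window size and cutoff as parameters to tune on your own data; the token counts are inputs you would estimate upstream.

```python
# Engineering proxy from above: recurse when the task exceeds half
# the context window. Window and cutoff are assumptions to tune.
def should_recurse(input_tokens: int, expected_output_tokens: int,
                   context_window: int = 8_192, cutoff: float = 0.5) -> bool:
    return (input_tokens + expected_output_tokens) > cutoff * context_window

print(should_recurse(3_000, 500))    # False: run flat
print(should_recurse(6_000, 1_000))  # True: recurse
```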
For a code-generation-and-review product specifically, the practical near-term moves are: add tree-aware tracing now, even if your current agent is flat; benchmark your hand-written splitter against the cases it loses on; and reserve training-data budget for sub-task examples, so you have data ready when you adopt recursive agents.
References
- Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig. Recursive Agent Optimization. arXiv:2605.06639. https://arxiv.org/abs/2605.06639