Google Research tried to answer the question of how to design agent systems for optimal performance by running a controlled evaluation of 180 agent configurations. From this, the team derived what they call the “first quantitative scaling principles for AI agent systems”, showing that multi-agent coordination does not reliably improve results and can even reduce performance.
The research challenges several widely held beliefs, according to its authors:
Practitioners often rely on heuristics, such as the assumption that “more agents are better”, believing that adding specialized agents will consistently improve results.
Instead, they argue that the benefits hold only for certain classes of tasks, as adding more agents often leads to a performance ceiling and, in some cases, can even hurt performance.
The study evaluates five architectures: single-agent, independent multi-agent, orchestrated, peer-to-peer, and hybrid systems. It finds that parallelizable tasks, where work can be divided into independent chunks, benefit greatly from multi-agent coordination. For example:
On parallelizable tasks like financial reasoning […] centralized coordination improved performance by 80.9% over a single agent.
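To picture what centralized coordination over a parallelizable task looks like, here is a minimal sketch, with hypothetical names and no relation to the study's actual implementation: an orchestrator splits the task into independent chunks, fans them out to worker agents concurrently, and gathers the results in one place.

```python
import asyncio

async def worker_agent(name: str, subtask: str) -> str:
    # Hypothetical worker; in a real system this would call an LLM.
    await asyncio.sleep(0)  # stand-in for model latency
    return f"{name}: result for {subtask!r}"

async def orchestrate(chunks: list[str]) -> list[str]:
    # Central orchestrator: fan independent chunks out in parallel,
    # then collect all results for aggregation.
    jobs = [worker_agent(f"agent-{i}", chunk)
            for i, chunk in enumerate(chunks)]
    return await asyncio.gather(*jobs)

# A parallelizable task: each chunk can be solved independently.
results = asyncio.run(orchestrate(
    ["analyze Q1 revenue", "analyze Q2 revenue", "analyze Q3 revenue"]))
print(results)
```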
On the other hand, sequential reasoning tasks, like planning in PlanCraft, tend to suffer when multiple agents are introduced:
every multi-agent variant we tested degraded performance by 39-70%. In these scenarios, the overhead of communication fragmented the reasoning process, leaving insufficient “cognitive budget” for the actual task.
The research also highlights a tool-use bottleneck: as tasks require heavier use of tools such as APIs, web actions, and other external resources, coordination costs rise. These costs can outweigh the benefits of multi-agent systems and become a key factor in deciding whether to adopt a multi-agent architecture.
Another notable finding is that independent agents can amplify errors up to ~17× when mistakes propagate unchecked. In contrast, centralized coordination limits error propagation to roughly 4.4×, since the orchestrator validates and manages outputs before passing them along.
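The validation step is the mechanism that caps the amplification. A minimal sketch of the idea, assuming a pipeline of agents and a caller-supplied validator (both hypothetical, not the paper's code): only outputs that pass validation flow downstream, so a bad intermediate result is retried rather than compounded by later agents.

```python
from typing import Callable

def run_validated_pipeline(agents: list[Callable[[str], str]],
                           validate: Callable[[str], bool],
                           task: str,
                           max_retries: int = 2) -> str:
    # Orchestrator-style loop: each agent's output must pass validation
    # before it becomes the next agent's input, so an unchecked error
    # cannot propagate (and amplify) through the rest of the chain.
    current = task
    for agent in agents:
        for _ in range(max_retries + 1):
            candidate = agent(current)
            if validate(candidate):
                current = candidate  # only validated output flows on
                break
        else:
            raise RuntimeError("output failed validation; halting early "
                               "instead of propagating a bad result")
    return current
```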
As a final note, the researchers also developed a predictive model to choose the right architecture:
Instead of guessing whether to use a swarm of agents or a single powerful model, developers can now look at the properties of their task, specifically its sequential dependencies and tool density, to make principled engineering decisions.
The model correctly identifies the best approach for about 87% of unseen task configurations, with a coefficient of determination (R²) of 0.513.
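The paper's fitted model is not reproduced here; purely as an illustration of the kind of decision rule it encodes, a sketch over the two task properties the authors name might look like this (the thresholds are invented for the example, not the study's fitted values):

```python
def choose_architecture(sequential_dependency: float,
                        tool_density: float) -> str:
    # Illustrative only: thresholds are made up, not taken from
    # the study's predictive model.
    if sequential_dependency > 0.6:
        # Heavily sequential tasks: coordination overhead fragments
        # reasoning, so a single agent tends to win.
        return "single-agent"
    if tool_density > 0.5:
        # Tool-heavy tasks: coordination costs can outweigh the
        # gains of splitting the work.
        return "single-agent"
    # Parallelizable, tool-light tasks benefit from orchestration.
    return "orchestrated multi-agent"

print(choose_architecture(sequential_dependency=0.2, tool_density=0.1))
```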
Reacting to Google’s research on Hacker News, zkmon argues that the study lacks strong grounding and provides no clear rationale for why certain architectures yield the observed differences. Similarly, gopalv notes that while single-agent systems are likely not resilient to errors, introducing a coordinator is not necessarily the solution:
We found the orchestrator is not the core component, but a specialized evaluator for each action to match the result, goal and methods at the end of execution to report back to the orchestrator on goal adherence.
kioku points out that an 8% improvement may not be enough to justify the added complexity and cost of introducing a coordination layer.
