Researchers from Stanford, Princeton, and Cornell have developed a new benchmark to better evaluate the coding abilities of large language models (LLMs). Called CodeClash, the benchmark pits LLMs against each other in multi-round tournaments to assess their capacity to achieve competitive, high-level objectives beyond narrowly defined, task-specific problems.
Evaluating coding LLMs on well-specified tasks, such as fixing a bug, implementing an algorithm, or writing a test, is not sufficient to assess their ability to tackle real-world software development challenges, the researchers argue.
Rather than isolated maintenance tasks, developers are driven by high-level goals such as improving user retention, increasing revenue, or reducing costs. Pursuing such goals requires fundamentally different capabilities: engineers must recursively decompose them into actionable steps, prioritize those steps, and make strategic decisions about which solutions to pursue.
To bring LLM evaluation closer to real-world, goal-oriented software engineering, the researchers designed CodeClash to mirror the iterative nature of the development cycle, where changes are proposed, deployed, and refined based on feedback before moving to the next step. In CodeClash, LLMs compete to build the best codebase capable of achieving a high-level objective:
multiple LM systems compete to build the best codebase for achieving a high-level objective over the course of a multi-round tournament. These codebases implement solutions that compete in a code arena, such as BattleSnake (grid-based survival), Poker (no-limit Texas Hold’em), and RoboCode (tank combat).
Each round consists of two phases: an edit phase, in which the LLMs modify their codebases, and a competition phase, in which the codebases are evaluated against one another in the code arena. The arena determines winners based on objectives such as score maximization, resource acquisition, or survival.
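To make the round structure concrete, the sketch below shows one way such a tournament loop could be organized. It is an illustration only: the names (Agent, Arena, run_tournament) and the stubbed behavior are assumptions, not the benchmark's actual API.

```python
# Illustrative sketch of a CodeClash-style round loop. All names here
# (Agent, Arena, run_tournament) are assumptions for illustration,
# not the benchmark's actual API.
import random


class Agent:
    """Stand-in for an LM agent that edits its own codebase each round."""

    def __init__(self, name):
        self.name = name

    def edit(self, codebase):
        # A real agent would inspect logs and modify source files here;
        # this stub just records that an edit happened.
        codebase.append(f"{self.name} edit")


class Arena:
    """Stand-in for a code arena that scores competing codebases."""

    def run(self, codebases):
        # A real arena (BattleSnake, poker, RoboCode) would execute the
        # bots; this stub awards random points for illustration.
        return {name: random.randint(0, 10) for name in codebases}


def run_tournament(agents, arena, num_rounds=5):
    codebases = {agent.name: [] for agent in agents}  # per-agent codebase
    scores = {agent.name: 0 for agent in agents}

    for _ in range(num_rounds):
        # Edit phase: each agent revises its own codebase.
        for agent in agents:
            agent.edit(codebases[agent.name])

        # Competition phase: the arena pits the codebases against each
        # other and awards points (e.g. survival, chips won, battle score).
        for name, points in arena.run(codebases).items():
            scores[name] += points

    return scores


print(run_tournament([Agent("A"), Agent("B")], Arena()))
```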
From the outset, the LM agents receive only a brief description of the setting. While information such as arena mechanics, example bots, and recommended strategies is available in the starter codebase, the models must take the initiative to discover it on their own.
At the end of each round, competition logs are added to a logbase that the LLMs can mine for insights before the next round, with the goal of improving their codebases both in absolute terms and relative to their opponents’.
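The logbase can be pictured as an append-only store of per-round arena output that an agent consults before its next edit. The sketch below is hypothetical; the class and method names are not taken from the paper.

```python
# Hypothetical sketch of the "logbase": an append-only store of arena
# output that an agent can consult before the next edit phase.
# Class and method names are assumptions, not the paper's API.
class Logbase:
    def __init__(self):
        self._rounds = []  # one entry of raw competition logs per round

    def append(self, round_idx, logs):
        self._rounds.append({"round": round_idx, "logs": logs})

    def recent(self, n=1):
        # Agents typically mine the latest rounds for insights into their
        # own failures and their opponents' strategies.
        return self._rounds[-n:]


logbase = Logbase()
logbase.append(0, "round 0: bot crashed into wall on turn 12")
logbase.append(1, "round 1: survived 80 turns, lost on points")
print(logbase.recent(1))  # context an agent might read before editing
```

In the tournament loop sketched earlier, the logs returned by the arena after each competition phase would be appended to such a store and passed as context to the agent's next edit.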
Using this approach, the research team ran 1,680 tournaments involving eight LLMs, including Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro, Qwen3-Coder, Grok Code Fast, and others. No single model consistently outperformed the rest across all arenas, although models from Anthropic and OpenAI showed a slight overall advantage. These trends held for both one-on-one and multi-agent competitions, albeit with greater volatility in the latter case: for example, winners of six-player tournaments captured only 28.6% of total points, versus 78.0% in one-on-one matches.
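Point capture here is simply the winner's share of all points awarded in a tournament; the one-liner below makes the contrast concrete. The percentages are the paper's reported averages, not outputs of this code.

```python
# "Point capture": the winner's share of all points awarded in a tournament.
def point_capture(winner_points, total_points):
    return winner_points / total_points

# A dominant head-to-head winner keeps most of the points, while a
# six-player winner's share is diluted by the other competitors:
print(point_capture(78.0, 100.0))   # ~0.78, one-on-one average
print(point_capture(28.6, 100.0))   # ~0.286, six-player average
```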
The researchers also evaluated the models’ ability to analyze codebases generated by other LLMs. In this setting, GPT-5 proved to be the strongest model overall, outperforming Claude Sonnet 4.5. However, the analysis suggests that inspecting opponents’ code does not automatically translate into a competitive advantage.
Although the study is compelling, the researchers acknowledge that the current arenas are smaller than typical real-world systems. Consequently, future research will aim to handle larger codebases and support multiple competitive objectives.
