Google DeepMind’s AlphaGeometry2 (AG2) AI model solved 84% of the geometry problems from the last 25 years of International Math Olympiads (IMO), outperforming the average human gold-medalist performance.
AlphaGeometry2 is a new iteration of DeepMind’s earlier geometry AI, AlphaGeometry (AG1), which could only solve 54% of the IMO problems. Both models operate by using a domain-specific formal language to describe the problems and a symbolic deductive engine to generate proofs. The new model’s improvements include a more powerful LLM based on Gemini, which translates the natural language form of the problem into formal language. AG2 solved 42 of the 50 IMO geometry problems from the years 2000 to 2024, while the average gold medalist solves about 41. Flagship commercial reasoning LLMs, such as OpenAI’s o1 and Gemini Thinking, cannot solve any of the problems. According to DeepMind,
Despite achieving an impressive 84% solve rate on all 2000-2024 IMO geometry problems, there is still room for improvement…AG2 has not solved all IMO and IMO [short list] problems. We hypothesize that breaking problems into subproblems and applying reinforcement learning approaches could close this gap. Finally, in this paper we reported progress on building a fully automated geometry problem solving system, which takes input in natural language and outputs a solution reliably without any hallucinations. Despite good initial results, we think the auto-formalization can be further improved with more formalization examples and supervised fine-tuning.
AG2, like AG1, solves geometry problems by stating them in a formal language which consists of predicates: for example, acompute a b c d means “Find the angle between AB and CD.” AG2’s predicates can cover 88% of the IMO problems; the model will not attempt to solve the other problems.
But first, the problems written in natural language must be expressed in this formal language. To do this, DeepMind uses a Gemini LLM with few-shot prompting: the prompts contain “several dozens” of examples of problem translation. This approach is “very consistent and makes almost no mistakes” on the easier problems.
Once the problems are specified as formal predicates, they are solved using a symbolic engine called Deductive Database Arithmetic Reasoning (DDAR). If the engine fails to find a proof, AG2 uses a language model and tree search algorithm to generate auxiliary constructions, then it re-runs the DDAR engine; this loop is repeated until a proof is found.
Writing on X, Berkeley CS PhD student Yuxi Liu said,
AlphaGeometry2 is pretty cool, but clearly not bitter-lessoned. It has a very 1950s auto theorem proving feel, with handcrafted representation language, logical inference engine, etc…They are just doing autoformalization (succeeding 30/39) and proposing auxiliary constructions during tree search. Many of them require just a single auxiliary construction! Though there are cursed examples that required 12.
Oxford University ML researcher Simon Frieder also wrote on X:
AlphaGeometry2 was published, 2.5 months since we released Newclid without much fanfare (in true scientist style! :D) and two months after TongGeometry. It seems no code was provided for AG2. So now we have two closed systems, AlphaGeometry2 and TongGeometry that we cannot compare. Newclid…is fully open-source, fixed many AlphaGeometry bugs and slightly improved it in terms of performance – and we also have GeoGebra support for better input.
Although the AG2 code has not been released, the code for AG1 is available on GitHub.