Google DeepMind Introduces QuestBench to Evaluate LLMs in Solving Logic and Math Problems

By News Room | Published 22 April 2025

Google DeepMind’s QuestBench benchmark evaluates whether LLMs can pinpoint the single crucial question needed to solve logic, planning, or math problems. The DeepMind team recently published a paper on QuestBench, a set of underspecified reasoning tasks that are solvable by asking at most one question.

Large language models (LLMs) are increasingly being applied to reasoning tasks such as math, logic, planning, and coding. These applications largely assume that tasks are well defined, with all necessary information provided. But in real-world applications, queries to LLMs are often underspecified and can only be solved by acquiring the missing information. Users may omit crucial details in math problems, and robots in factories might operate in environments with partial observability. In such cases, LLMs need the ability to proactively gather missing information by asking clarifying questions.

The DeepMind team’s work investigates whether LLMs can identify and acquire the missing information needed to solve reasoning tasks by generating accurate clarifying questions for underspecified problems. The goal is to rigorously evaluate an LLM’s ability to identify the minimal necessary question to ask, and to quantify each problem along several axes of difficulty.

They formalize this information-gathering problem as an underspecified Constraint Satisfaction Problem (CSP): a set of variables whose values must satisfy a number of constraints. The key idea is that many reasoning tasks can be modeled as determining the value of a target variable given a set of variables and constraints. A problem is underspecified if and only if the value of the target variable cannot be inferred from the given information. This formalization distinguishes underspecification from semantic ambiguity: semantic ambiguity arises when multiple valid interpretations exist but each yields a solvable answer, whereas underspecification means the problem is unsolvable without additional information. QuestBench focuses on underspecification, where the user has not provided enough information for the language model to fulfill the request. This can happen because users may not know what information the model lacks, or what information is necessary to complete the task.
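To make the CSP framing concrete, here is a minimal, illustrative Python sketch (not DeepMind’s code or exact formalism): a toy algebra problem in which the target c = a + b cannot be inferred because b is unknown, so asking for b is the one sufficient question.

```python
# Illustrative sketch only: a toy version of the underspecified-CSP idea
# described above, not DeepMind's actual QuestBench formalism or code.

# Variables and constraints for a tiny algebra problem (GSME-Q style):
#   c = a + b, with a = 2 given and b unknown; the target is c.
known = {"a": 2}                                               # facts supplied by the "user"
constraints = [("c", lambda v: v["a"] + v["b"], {"a", "b"})]   # c depends on a and b
target = "c"

def inferable(known, constraints, target):
    """Return True if the target's value follows from the known values alone."""
    values = dict(known)
    changed = True
    while changed:
        changed = False
        for var, fn, deps in constraints:
            if var not in values and deps <= values.keys():
                values[var] = fn(values)
                changed = True
    return target in values

print(inferable(known, constraints, target))               # False -> underspecified
print(inferable({**known, "b": 3}, constraints, target))   # True  -> asking for b suffices
```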

The team evaluated LLMs’ ability to address underspecification in structured reasoning tasks with a clearly defined ground truth. For each task, the model needs to ask exactly one question, which allows reliable evaluation of its information-gathering capability; the team also measured the accuracy of a breadth-first search up to depth “n” as a reference. There are four task categories: 1) Logic-Q, logical reasoning tasks with one missing proposition; 2) Planning-Q, planning problems defined in the Planning Domain Definition Language (PDDL) with partially observed initial states; 3) GSM-Q, human-annotated grade-school math problems with one missing variable assignment; and 4) GSME-Q, the same human-annotated word problems translated into equations.
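Purely as an illustration of the “one sufficient question” idea, and again not the paper’s implementation, the following sketch brute-forces a breadth-first search over candidate questions to find the smallest set of unknowns whose answers would make the target inferable. It reuses the inferable() helper and the toy problem from the previous sketch.

```python
# Illustrative sketch only: brute-force, breadth-first identification of the
# single missing fact (the "1-sufficient" question). Assumes the inferable()
# helper and the toy known/constraints/target defined in the sketch above.
from itertools import combinations

def sufficient_questions(known, constraints, target, unknowns, max_depth=1):
    """Return the smallest sets of unknown variables whose values would make
    the target inferable, searching breadth-first up to max_depth questions."""
    for depth in range(1, max_depth + 1):
        hits = []
        for subset in combinations(unknowns, depth):
            # Pretend the user answered these questions; any concrete value
            # works for this structural check, so a placeholder of 0 is used.
            answered = {**known, **{v: 0 for v in subset}}
            if inferable(answered, constraints, target):
                hits.append(subset)
        if hits:
            return hits  # found at minimal depth; stop (breadth-first)
    return []

print(sufficient_questions(known, constraints, target, unknowns=["b"]))  # [('b',)]
```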

The QuestBench datasets are constructed as 1-sufficient CSPs, i.e., problems solvable by asking exactly one question, in the logical reasoning (Logic-Q), planning (Planning-Q), and math (GSM-Q/GSME-Q) domains. Each problem instance consists of a user request, a full set of candidate questions, and the subset of correct questions. The evaluation checks whether models can pick out a correct question from the candidates.
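The snippet below sketches a hypothetical instance layout and a simple accuracy check; the field names and format are made up for illustration and may not match the actual QuestBench data schema in the GitHub repository.

```python
# Hypothetical instance layout and scoring, for illustration only; the actual
# QuestBench data format may differ.
instance = {
    "request": "Ann has 2 apples. How many apples do Ann and Bob have together?",
    "choices": [
        "How many apples does Bob have?",
        "How many apples does Ann have?",
        "What colour are the apples?",
    ],
    "correct": {"How many apples does Bob have?"},
}

def score(instances, ask):
    """Fraction of instances where the model's chosen question is a correct one."""
    return sum(ask(i["request"], i["choices"]) in i["correct"] for i in instances) / len(instances)

# Example: a trivial "model" that always asks the first listed question.
print(score([instance], lambda req, choices: choices[0]))  # 1.0 on this toy instance
```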

The QuestBench evaluation covered several state-of-the-art (SOTA) LLMs, including GPT-4o, GPT-4-o1 Preview, Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 2.0 Flash Thinking Experimental, and open-source Gemma models. The benchmark was run in zero-shot (ZS), chain-of-thought (CoT), and four-shot (4S) settings.
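As a rough illustration of how these three settings differ, the sketch below builds a prompt for a single instance; the exact prompts, few-shot exemplars, and model APIs used in the study are not reproduced here.

```python
# Rough illustration of the three prompting settings (ZS, CoT, 4S); the exact
# wording and exemplars used in the paper are not shown here.
def build_prompt(request, choices, setting="zs", exemplars=None):
    header = (
        f"Problem: {request}\nCandidate questions:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices))
    )
    if setting == "zs":    # zero-shot: ask directly for the missing question
        return header + "\nWhich single question must be asked to solve the problem?"
    if setting == "cot":   # chain-of-thought: request step-by-step reasoning first
        return header + "\nThink step by step about what information is missing, then pick one question."
    if setting == "4s":    # four-shot: prepend worked examples before the task
        shots = "\n\n".join(exemplars or [])
        return shots + "\n\n" + header + "\nWhich single question must be asked to solve the problem?"
    raise ValueError(f"unknown setting: {setting}")
```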

The team also conducted studies to assess LLMs’ ability to reason when sufficient information is present and to detect whether a problem is underspecified. They found that these abilities correlate with identifying the right question to ask in the benchmark, though to varying degrees across domains. SOTA and near-SOTA LLMs are relatively good at identifying missing information in simple algebra problems, but struggle with more complex tasks involving logic and planning.

In terms of specific findings, language models showed strong performance on the GSM-Q and GSME-Q domains, with over 80% accuracy. This may be because these domains involve fewer variables and constraints and require shallower search depth than the other two domains. However, none of the tested models performed beyond 50% on the Logic-Q and Planning-Q domains, and neither chain-of-thought prompting nor few-shot examples yielded significant gains across models in either domain. To investigate these discrepancies, the team analyzed the correlation between model accuracy and QuestBench’s axes of difficulty, finding different trends between domains: LLMs are more sensitive to search depth in Logic-Q than in Planning-Q, suggesting that models may be using strategies similar to backward search when solving Logic-Q, but not when solving Planning-Q.

LLM evaluation benchmarks are important for understanding a specific model’s strengths and limitations, for guiding fine-tuning, and as a reference for deciding which model to use for a given use case. Several LLM evaluation frameworks are available for assessing language model performance against different criteria.

For more information on this study, see the project website, the research paper PDF, and the GitHub project, which is available under the Apache 2.0 license and provides code to generate QuestBench data and evaluate LLMs on it. To run the evaluation locally, the steps include setting up a Conda environment, downloading the datasets, and running the evaluation scripts.

 
