Google DeepMind’s QuestBench benchmark evaluates whether LLMs can pinpoint the single, crucial question needed to solve logic, planning, or math problems. The DeepMind team recently published an article on QuestBench, a set of underspecified reasoning tasks that become solvable by asking at most one question.
Large language models (LLMs) are increasingly applied to reasoning tasks such as math, logic, planning, and coding. These applications largely assume that tasks are well defined, with all necessary information provided. But in real-world applications, queries to LLMs are often underspecified and only solvable by acquiring the missing information. Users may omit crucial details in math problems, and robots in factories may operate in environments with partial observability. In such cases, LLMs need the ability to proactively gather missing information by asking clarifying questions.
The DeepMind team’s work investigates whether LLMs can identify and acquire the missing information necessary to solve reasoning tasks by generating accurate clarifying questions for underspecified problems. The goal is to rigorously evaluate an LLM’s ability to identify the minimal necessary question to ask, and to quantify each problem along several axes of difficulty.
They formalize this information-gathering problem as an underspecified Constraint Satisfaction Problem (CSP): a problem defined by a set of variables whose values must satisfy a number of constraints. The key idea is that many reasoning tasks can be modeled as determining the value of a target variable given a set of variables and constraints. A problem is underspecified if and only if the value of the target variable cannot be inferred from the given information. This formalization helps pinpoint the difference between semantic ambiguity, where multiple valid interpretations exist but each yields a solvable answer, and underspecification, where a problem is unsolvable without additional information. QuestBench focuses on underspecification, where the user has not provided enough information for the language model to fulfill the request. This situation can arise because users may not know what information the model lacks, or what information is necessary to complete the task.
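As a rough illustration of this CSP framing (a minimal sketch, not DeepMind’s code; the variable names, domains, and constraints are assumptions made up for this example), the snippet below checks whether a target variable is uniquely determined by a set of constraints, and shows how a single clarifying answer can turn an underspecified problem into a solvable one:

```python
# Minimal sketch of an underspecified CSP: variables with finite domains,
# constraints over them, and a designated target variable.
from itertools import product

def solutions(domains, constraints):
    """Enumerate all assignments consistent with every constraint."""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            yield assignment

def target_values(domains, constraints, target):
    """All values the target variable can take across consistent assignments."""
    return {s[target] for s in solutions(domains, constraints)}

# Example: x = y + z, with y known (y = 2) but z unknown.
domains = {"x": range(10), "y": range(10), "z": range(10)}
constraints = [lambda a: a["x"] == a["y"] + a["z"], lambda a: a["y"] == 2]

print(target_values(domains, constraints, "x"))  # many possible values -> underspecified
# One clarifying answer (say, z = 3) pins the target down, so one question suffices.
print(target_values(domains, constraints + [lambda a: a["z"] == 3], "x"))  # {5}
```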
The team evaluated LLMs’ ability to address underspecification in structured reasoning tasks with a clearly defined ground truth. For each task, the model needs to ask exactly one question, which allows reliable evaluation of LLMs’ information-gathering capabilities. They also evaluated the accuracy of breadth-first search up to a depth n. There are four task categories: 1) Logic-Q, logical reasoning tasks with one missing proposition; 2) Planning-Q, planning problems defined in the Planning Domain Definition Language (PDDL) with partially observed initial states; 3) GSM-Q, human-annotated grade-school math problems with one missing variable assignment; and 4) GSME-Q, the same human-annotated word problems translated into equations.
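The breadth-first search mentioned above can be sketched by reusing target_values() from the previous snippet. This is a simplified illustration rather than the paper’s implementation: it treats a question set as sufficient only if the target is pinned down for every possible answer, and depth n corresponds to asking about n variables, so a 1-sufficient problem is resolved at depth 1:

```python
# Simplified breadth-first search over candidate question sets (reuses
# target_values() from the previous sketch). A question set is "sufficient"
# here if, whatever the answers turn out to be, the target is pinned down.
from itertools import combinations, product

def sufficient_question_sets(domains, constraints, target, askable, max_depth):
    for depth in range(1, max_depth + 1):
        found = []
        for asked in combinations(askable, depth):
            determined = all(
                len(target_values(
                    domains,
                    constraints + [lambda a, fixed=dict(zip(asked, answers)):
                                   all(a[k] == v for k, v in fixed.items())],
                    target)) <= 1
                for answers in product(*(domains[v] for v in asked))
            )
            if determined:
                found.append(asked)
        if found:
            return depth, found  # shallowest depth at which some question set suffices
    return None, []

# For the x = y + z example, asking about z alone is enough:
print(sufficient_question_sets(domains, constraints, "x", ["y", "z"], max_depth=2))
# -> (1, [('z',)])
```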
The QuestBench datasets consist of 1-sufficient CSPs constructed in the logical reasoning (Logic-Q), planning (Planning-Q), and math (GSM-Q/GSME-Q) domains. Each problem instance is composed of a user request, the full set of question choices, and the subset of correct questions. The evaluation checks whether a model can pick out a correct question from the question choices.
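In code, that check amounts to scoring whether the model’s chosen question falls inside the annotated correct subset. A rough scoring sketch follows; the field names and the ask_model callback are placeholders, not the benchmark’s actual schema or harness:

```python
# Rough scoring sketch; field names and ask_model() are placeholders.
def evaluate(instances, ask_model):
    correct = 0
    for inst in instances:
        # inst["request"]: the underspecified problem statement
        # inst["choices"]: the full set of candidate clarifying questions
        # inst["correct"]: the subset of choices that resolve the problem
        chosen = ask_model(inst["request"], inst["choices"])  # model returns one choice
        correct += chosen in inst["correct"]
    return correct / len(instances)
```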
QuestBench’s evaluation covered several state-of-the-art (SOTA) LLMs, including GPT-4o, o1-preview, Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 2.0 Flash Thinking Experimental, and the open-source Gemma models. The models were tested in zero-shot (ZS), chain-of-thought (CoT), and four-shot (4S) settings.
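The three settings differ mainly in how the prompt is assembled, roughly along the following lines; the wording and exemplar handling here are illustrative, not the paper’s actual templates:

```python
# Illustrative prompt assembly for the three evaluation settings.
def build_prompt(request, choices, setting="zs", exemplars=None):
    question_list = "\n".join(f"({i}) {q}" for i, q in enumerate(choices))
    core = (f"The following problem is missing information.\n{request}\n"
            f"Which single question would you ask to be able to solve it?\n{question_list}\n")
    if setting == "zs":    # zero-shot: just the task
        return core + "Answer with the number of the best question."
    if setting == "cot":   # chain-of-thought: ask for reasoning first
        return core + "Think step by step, then give the number of the best question."
    if setting == "4s":    # four-shot: prepend four solved examples
        shots = "\n\n".join(exemplars[:4]) if exemplars else ""
        return shots + "\n\n" + core + "Answer with the number of the best question."
    raise ValueError(f"unknown setting: {setting}")
```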
The team also conducted studies to assess LLMs’ ability to reason in the presence of sufficient information and to detect whether a problem is underspecified. They found that these abilities correlate with identifying the right question to ask in the benchmark, but to varying degrees across domains. SOTA and near-SOTA LLMs are relatively good at identifying missing information in simple algebra problems, but struggle with more complex tasks involving logic and planning.
In terms of specific conclusions, language models demonstrated strong performance on the GSM-Q and GSME-Q domains, with over 80% accuracy. This could be because these domains have fewer variables and constraints and require a shallower search depth than the other two domains. By contrast, all of the models tested struggled to exceed 50% accuracy on the Logic-Q and Planning-Q domains, and neither chain-of-thought prompting nor few-shot examples produced significant gains across models in either domain. To investigate these discrepancies, the team also analyzed the correlation between model accuracy and QuestBench’s axes of difficulty, finding differing trends between domains: LLMs are more sensitive to search depth in Logic-Q than in Planning-Q, suggesting that models may be using strategies similar to backwards search when solving Logic-Q, but not when solving Planning-Q.
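That kind of difficulty-axis analysis is straightforward to reproduce on one’s own evaluation results. A minimal sketch, assuming the harness records per-instance correctness alongside each difficulty axis (the field names are placeholders):

```python
# Correlate per-instance correctness (0/1) with a difficulty axis
# such as the required search depth.
import numpy as np

def accuracy_vs_axis(results, axis="search_depth"):
    correct = np.array([r["correct"] for r in results], dtype=float)
    axis_values = np.array([r[axis] for r in results], dtype=float)
    return np.corrcoef(correct, axis_values)[0, 1]  # Pearson correlation
```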
LLM evaluation benchmarks are important for understanding a specific model’s strengths and limitations, for guiding the fine-tuning process, and as a reference for deciding which model to use for a specific use case. Several LLM evaluation frameworks are available for assessing language model performance against different criteria.
For more information on this study, check out the website, the research paper PDF, and the GitHub project, which is available under the Apache 2.0 license and contains code to generate QuestBench data and evaluate LLMs on it. To run the evaluation locally, the steps include setting up a Conda environment, downloading the datasets, and running the evaluations.