Hallucination rate
Accurate output is a general goal for any LLM, but measuring how accurate the results really are is difficult. One approach is to ask the LLM to summarize a document and have another model evaluate how well the summary matches the original. This may not catch every subtle inaccuracy, but it can flag serious deviations, i.e. hallucinations.
Some researchers have also created complex test sets with curated answers. Those LLMs that deliver the expected outputs receive the highest scores. Common benchmarks for measuring the hallucination rate include:
Toxicity and bias score
Things get a little more difficult when a metric needs to be developed to capture toxic or biased outputs. Here too, the underlying definitions can vary greatly. However, that hasn’t stopped some specialists from developing appropriate tools.
These are able to identify some of the most obvious warning signs when it comes to unwelcome results. The better known solutions in this area include:
Tool selection accuracy
The more complex and “agentic” AI models become, the more likely they are to access various other tools, for example via the Model Context Protocol (MCP). Whether or not MCP is involved, it makes sense to track how accurately the model selects tools, for example how often it calls the most suitable tool for a specific task. This metric can be obtained, for example, via the Berkeley Function Calling Leaderboard (BFCL).
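To make the idea concrete, here is a minimal sketch of how such a figure could be computed from logged tool calls. The function name, the tool names and the labels are hypothetical and are not taken from BFCL, which uses its own evaluation harness.

```python
from typing import Sequence


def tool_selection_accuracy(expected: Sequence[str], chosen: Sequence[str]) -> float:
    """Fraction of tasks where the model called the tool labeled as most suitable."""
    if len(expected) != len(chosen):
        raise ValueError("expected and chosen must have the same length")
    if not expected:
        return 0.0
    hits = sum(e == c for e, c in zip(expected, chosen))
    return hits / len(expected)


# Hypothetical labels and logged model choices for four tasks.
expected_tools = ["calculator", "web_search", "calendar", "web_search"]
chosen_tools = ["calculator", "web_search", "web_search", "web_search"]

accuracy = tool_selection_accuracy(expected_tools, chosen_tools)
print(accuracy)  # 0.75
```

A real evaluation would usually also check the arguments passed to the tool, not just its name.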
Prompt sensitivity
This value indicates the extent to which even minor changes to the wording of the prompt cause the AI model to produce different results. It is similar to a derivative in calculus, but is usually estimated experimentally from a collection of test prompts. There are a number of different approaches to measuring this metric, each with a different focus.
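One simple way to estimate this experimentally is to count how often answers to paraphrased prompts disagree. The sketch below assumes a callable that maps a prompt to an answer. `toy_model` is a hypothetical stand-in for a real LLM call, and exact string comparison is a deliberately crude notion of "different".

```python
from itertools import combinations
from typing import Callable, Sequence


def prompt_sensitivity(model: Callable[[str], str], paraphrases: Sequence[str]) -> float:
    """Share of paraphrase pairs whose answers differ (0.0 means fully stable)."""
    answers = [model(p).strip().lower() for p in paraphrases]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: it changes its phrasing when asked politely.
    if "please" in prompt.lower():
        return "The capital is Paris."
    return "Paris"


paraphrases = [
    "What is the capital of France?",
    "Name the capital of France.",
    "Please tell me the capital of France.",
]

score = prompt_sensitivity(toy_model, paraphrases)
print(round(score, 3))  # 0.667
```

In practice the string comparison would likely be replaced by a semantic comparison, so that harmless rephrasings do not count as different answers.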
Some test sets are based on minor rewordings of the query that are semantically identical. Others combine different ways of formulating the problem – for example with the help of examples. Some of the better-known approaches in this area include:
Semantic similarity and conciseness
Some metrics evaluate model output by comparing it to a set of gold standard answers. To do this, the answers are often fed into a vector embedding model and compared with a retrieval augmented generation (RAG) database. In this way, you can record not only how concise the output is, but also to what extent this can be influenced by changing parameters such as the “temperature”. A common tool for measuring the semantic similarity and conciseness of LLMs is BERTScore.
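The comparison of embeddings typically comes down to a similarity measure such as cosine similarity. The following sketch is not BERTScore itself, which matches token-level embeddings. It only illustrates the basic idea with hand-written toy vectors and adds a naive word-count ratio as a hypothetical conciseness indicator.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def length_ratio(output: str, reference: str) -> float:
    """Word count of the output relative to the reference; above 1.0 means wordier."""
    return len(output.split()) / max(len(reference.split()), 1)


# Toy vectors standing in for embeddings of a model answer and a gold answer.
answer_vec = [3.0, 4.0]
gold_vec = [3.0, 4.0]
print(cosine_similarity(answer_vec, gold_vec))
print(length_ratio("the cat sat on the mat", "cat on mat"))  # 2.0
```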
Grounding score
For AI systems that combine an LLM with a vector-based search tool for RAG, the effectiveness of this combination is measured using a benchmark such as the grounding score. If this is used, the AI model is provided with additional data from the vector search.
The benchmark then measures how closely the model sticks to this additional information, i.e. what proportion of the output is derived from the source documents rather than from the training data. This works, for example, with:
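Independently of any particular tool, the underlying proportion can be approximated very roughly by lexical overlap. The sketch below is a crude proxy under that assumption: it counts how many output tokens also appear in the retrieved sources. The function names are hypothetical, and real grounding benchmarks typically check claims rather than individual words.

```python
import re
from typing import Iterable, List


def _tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grounding_proxy(output: str, sources: Iterable[str]) -> float:
    """Fraction of output tokens that also occur in the retrieved source texts."""
    source_vocab = set()
    for source in sources:
        source_vocab.update(_tokens(source))
    out_tokens = _tokens(output)
    if not out_tokens:
        return 0.0
    return sum(t in source_vocab for t in out_tokens) / len(out_tokens)


sources = ["The warranty covers parts for two years."]
grounded = grounding_proxy("The warranty covers parts for two years.", sources)
ungrounded = grounding_proxy("Shipping is free worldwide.", sources)
print(grounded, ungrounded)  # 1.0 0.0
```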
Model variability
Most LLMs include some degree of randomness, which can be controlled using the temperature parameter. Model variability is a measure of how much the AI’s responses change from session to session. Some applications, such as chatbots, require a certain degree of variability because randomness brings “life” to the answers. In other cases, such as law or medicine, too much variability in answers can undermine user trust.
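A simple way to put a number on this is to send the same prompt several times and check how often the answer deviates from the most common one. This is only one possible definition, chosen here for illustration. The sample responses are made up.

```python
from collections import Counter
from typing import Sequence


def variability(responses: Sequence[str]) -> float:
    """1 minus the share of the most common response; 0.0 means every run agreed."""
    if not responses:
        return 0.0
    normalized = [r.strip().lower() for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return 1.0 - most_common_count / len(normalized)


# Hypothetical answers from five runs of the same prompt.
runs = ["Yes.", "yes.", "Yes.", "No.", "Yes."]
run_variability = variability(runs)
print(round(run_variability, 2))  # 0.2
```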
Format compliance rate
Some use cases require AI models to generate data in strict formats such as JSON or CSV, for example when the output is to be fed into a pipeline for further processing. The format compliance rate tests a range of common formats and measures how often the LLM returns correctly formatted data. Agentic AI systems that combine multiple models and tools in particular rely on LLMs that achieve good results on this benchmark.
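For JSON, a basic version of this check can be sketched with Python's standard `json` module. The sample outputs are invented, and the check covers syntax only. A production pipeline would probably also validate the result against a schema.

```python
import json
from typing import Sequence


def json_compliance_rate(outputs: Sequence[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue  # not parseable, so it does not count as compliant
        valid += 1
    return valid / len(outputs)


# Hypothetical model outputs: two valid, two typical failure modes.
samples = [
    '{"name": "Ada", "age": 36}',
    "[1, 2, 3]",
    "{'name': 'Ada'}",                   # single quotes are not valid JSON
    'Sure! Here is the JSON: {"a": 1}',  # chatty preamble breaks parsing
]

rate = json_compliance_rate(samples)
print(rate)  # 0.5
```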
Instruction following
Some prompts contain very specific instructions, and compliance with them can be measured empirically. An example would be a prompt that instructs the AI to write exactly 300 words or a poem in rhyming couplets. Instruction-following tests rely on a collection of example prompts and are designed to produce outputs that are easy to measure. Specific examples of this are:
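The word-count instruction mentioned above is a good example of a check that can be automated in a few lines. The sketch below is a hypothetical checker, not part of any named benchmark. It splits on whitespace, which is only one of several reasonable ways to count words.

```python
from typing import Sequence


def follows_word_count(text: str, target: int) -> bool:
    """True if the text contains exactly `target` whitespace-separated words."""
    return len(text.split()) == target


def instruction_following_rate(outputs: Sequence[str], target: int) -> float:
    """Fraction of outputs that satisfy the exact word-count instruction."""
    if not outputs:
        return 0.0
    return sum(follows_word_count(o, target) for o in outputs) / len(outputs)


# Hypothetical outputs for the instruction "answer in exactly three words".
outputs = ["Paris is lovely", "Paris", "Very nice city"]
word_rate = instruction_following_rate(outputs, target=3)
print(round(word_rate, 3))  # 0.667
```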
Plan stability
Agentic models operate based on a plan. Some are intelligent enough to adjust or discard this plan as their work progresses. How often the plan is adjusted can be measured by plan stability. A low value could mean that the agent plans poorly – or is simply flexible. Maybe both.
Self-correction value
Some AI agents are also able to delve deeper into the subject matter and recognize their own mistakes. The self-correction metric measures how often the model makes a mistake and then detects it – either on its own or after being prompted to do so.
Jailbreak resistance
Users are always trying to come up with new, smart ways to trick AI models into leaving their guardrails behind and discussing topics they shouldn’t be discussing. In the past, some LLMs have been fooled by being told that their output is part of a work of fiction. Newer models now have more sophisticated defense mechanisms. To determine how well an LLM resists such attempts at deception, the following benchmarks are recommended:
Prompt injection vulnerability
As is well known, untrustworthy data from additional sources or skills can contain malicious instructions intended to compromise the AI model. How vulnerable a model is to such targeted prompt injection attacks can be measured using appropriate benchmarks that operate on the basis of known attack vectors. For example:
Copyright infringement score
Some LLMs tend to reproduce the data from their training corpus in such a way that the output comes close to plagiarism. This can be a bigger problem if the training material has not been licensed properly (or carefully enough). The Copyright Infringement Score measures how often an AI model may reproduce its training material a little too literally. Tools that can reveal such issues include:
RULER
The Needle-in-a-Haystack (NIAH) benchmark measures how well an AI model can extract specific information from the overall context. The RULER benchmark extends this approach and provides the ability to vary the type and number of “needles,” the size of the “haystack,” and the complexity of the task.
GSM8K
The developers of GSM8K wanted to evaluate LLMs’ ability to solve multi-step mathematical problems. To do this, they created a data set with 8,500 tasks. The focus here is explicitly on solving mathematics tasks, but this benchmark also measures the ability to build reasoning chains.
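Scoring such tasks usually means comparing the final number in the model's answer with the reference. The sketch below relies on the fact that GSM8K reference solutions end with a `####` marker followed by the final answer. Taking the last number in the model output as its prediction is a heuristic assumed here, and the sample problem is made up rather than taken from the data set.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Extract the value after the '####' marker at the end of a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(model_output: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the output as the answer."""
    matches = _NUMBER.findall(model_output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    """Compare predicted and reference answers numerically."""
    predicted = predicted_answer(model_output)
    if predicted is None:
        return False
    try:
        return float(predicted) == float(reference_answer(solution))
    except ValueError:
        return False


# Made-up example in the GSM8K solution format.
solution = "Tom has 3 boxes with 4 apples each. 3 * 4 = 12\n#### 12"
print(is_correct("3 boxes times 4 apples gives 12 apples.", solution))  # True
print(is_correct("The answer is 14.", solution))  # False
```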
GPQA
The Graduate-Level Google-Proof Q&A benchmark (GPQA) consists of hundreds of complex questions of the kind that students in master’s programs usually deal with, especially in the natural sciences. The term “Google-proof” means that these questions cannot simply be answered with a search engine. To make the benchmark even more demanding, the researchers behind GPQA focused primarily on questions that laypeople often answer incorrectly.
MMLU-Pro
The MMLU-Pro benchmark is built on the Massive Multitask Language Understanding dataset. It is designed to test the understanding of a model across a broad scientific spectrum. This benchmark includes more than 12,000 questions – including from the areas of biology, chemistry, economics and law.
MBPP
The MBPP (Mostly Basic Python Problems) benchmark was developed at Google to evaluate how well AI models solve programming tasks. Each problem consists of a statement, a reference solution and several similar test cases. The number of correct answers to these questions is a good measure of how well or poorly a model will solve simpler Python programming problems.
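How a candidate solution is scored against test cases can be sketched as follows, assuming the test cases are plain `assert` statements. The `add` task and the helper name are hypothetical. Note that `exec` runs arbitrary code, so model-generated solutions should only ever be executed inside a sandbox.

```python
from typing import Sequence


def passes_all_tests(candidate_code: str, test_cases: Sequence[str]) -> bool:
    """Run the candidate in a fresh namespace and execute every assert against it."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines the candidate function
        for test in test_cases:
            exec(test, namespace)        # raises AssertionError on a wrong result
    except Exception:
        return False
    return True


# Hypothetical task: implement add(a, b).
tests = [
    "assert add(1, 2) == 3",
    "assert add(-1, 1) == 0",
    "assert add(0, 0) == 0",
]
good_solution = "def add(a, b):\n    return a + b\n"
bad_solution = "def add(a, b):\n    return a - b\n"

print(passes_all_tests(good_solution, tests))  # True
print(passes_all_tests(bad_solution, tests))   # False
```

The benchmark score would then be the share of problems for which the generated solution passes all of its tests.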
SWE-bench
SWE-bench also determines how well an LLM fulfills coding tasks. This benchmark was created based on issues and corresponding pull requests from a number of Python projects. Due to some limitations, the data set has now been expanded. This is manifested in three extended benchmarks:
LMSYS Chatbot Arena
LMSYS Chatbot Arena is a dynamic system that does not rely on a fixed set of test prompts. Instead, this benchmark platform feeds different AI models the identical prompt and then lets humans choose the best results. These direct comparisons result in an Elo-like rating.
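For readers unfamiliar with Elo, the classic update rule after a single head-to-head comparison looks like the sketch below. This is the textbook formula with the conventional scale of 400 and an assumed K-factor of 32. The platform's actual rating method may differ, which is why the rating is described as only "Elo-like".

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return new ratings; score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a human voter prefers model A's answer.
print(update_ratings(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

Because the winner gains exactly what the loser gives up, the total number of rating points in the system stays constant.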
Price
Every real estate agent knows: the three most important metrics in an ad are price, price – and price. This plays a somewhat minor role in the evaluation of AI systems – but it can decide whether the project is ultimately profitable or not. If the cost per inference is a bit too high, the volume won’t make up for it.
A cheaper AI model is probably not a good idea if it delivers hallucinatory answers. Sometimes it can make sense to invest a little more and use a model that provides answers with the right “flair”. (fm)
This article was originally published by our sister publication Infoworld.com.
