Google Stax is a framework designed to replace subjective evaluations of AI models with an objective, data-driven, and repeatable process for measuring model output quality. Google says this will allow AI developers to tailor the evaluation process to their specific use cases rather than relying on generic benchmarks.
According to Google, evaluation is key to selecting the right model for a given solution by comparing quality, latency, and cost. It is also essential for assessing how effective prompt engineering and fine-tuning efforts actually are in improving results. Another area where repeatable benchmarks are valuable is agent orchestration, where they help ensure that agents and other components work reliably together.
Stax provides data and tools to build benchmarks that combine human judgement and automated evaluators. Developers can import production-ready datasets or create their own, either by uploading existing data or by using LLMs to generate synthetic datasets. Likewise, Stax includes a suite of default evaluators for common metrics such as verbosity and summarization, while allowing the creation of custom evaluators for more specific or fine-grained criteria.
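To illustrate the kind of data involved, the following sketch shows what a couple of dataset rows could look like; the field names and contents are invented for illustration and Stax's actual import schema may differ.

```python
# Hypothetical evaluation dataset rows: pairs of inputs and expected outputs,
# plus optional metadata describing the task. Field names are illustrative only.
dataset = [
    {
        "input": "Summarize: The quarterly report shows revenue grew 12% while costs stayed flat...",
        "expected_output": "Revenue grew 12% quarter over quarter while costs remained flat.",
        "metadata": {"task": "summarization", "source": "finance_docs"},
    },
    {
        "input": "Summarize: Support tickets dropped sharply after the new onboarding flow launched...",
        "expected_output": "Ticket volume fell following the onboarding redesign.",
        "metadata": {"task": "summarization", "source": "support_logs"},
    },
]
```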
A custom evaluator can be created in a few steps, beginning with selecting the base LLM that will act as a judge. The judge is given a prompt instructing it how to evaluate the tested model's output. The prompt must contain definitions of the categories the judge will use for grading, each associated with a numerical score between 0.0 and 1.0. Additionally, it must include instructions on the preferred response format and may use variables to refer to the {{output}}, {{input}}, {{history}}, {{expected_output}}, and {{metadata.key}} values, as in the sketch below.
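As an illustration, here is what such a judge prompt might look like, held in a Python string; the category names, scores, and response format are invented examples rather than anything prescribed by Stax, while the {{...}} placeholders correspond to the variables listed above.

```python
# Hypothetical judge prompt for a custom evaluator. Categories and scores are
# invented for illustration; the {{...}} placeholders are substituted at
# evaluation time with the tested model's input, output, and reference data.
JUDGE_PROMPT = """
You are grading a model response for conciseness.

Categories:
- concise (score 1.0): the response answers the question with no redundant content.
- acceptable (score 0.5): the response is correct but contains some repetition or filler.
- verbose (score 0.0): the response buries the answer in unnecessary detail.

User input: {{input}}
Conversation history: {{history}}
Model response: {{output}}
Reference answer: {{expected_output}}

Respond with a single line of JSON:
{"category": "<name>", "score": <number>, "rationale": "<one sentence>"}
"""
```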
To ensure the evaluator's reliability, it should be calibrated against trusted human ratings using a classical supervised learning approach. The evaluator prompt can then be refined iteratively to improve the consistency between its ratings and the trusted human ratings.
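Independently of whatever calibration tooling Stax itself provides, a minimal way to check agreement is to score a set of examples with both the LLM judge and trusted human raters and compare the results, for instance:

```python
# Minimal calibration check (not part of Stax): compare judge scores with
# trusted human ratings and iterate on the judge prompt until agreement is acceptable.
def mean_absolute_error(judge_scores: list[float], human_scores: list[float]) -> float:
    """Average absolute gap between judge and human scores."""
    return sum(abs(j - h) for j, h in zip(judge_scores, human_scores)) / len(judge_scores)

def agreement_rate(judge_scores: list[float], human_scores: list[float], tolerance: float = 0.25) -> float:
    """Fraction of examples where the judge lands within `tolerance` of the human score."""
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)

human = [1.0, 0.5, 0.0, 1.0, 0.5]   # trusted human ratings
judge = [1.0, 0.5, 0.5, 1.0, 0.0]   # scores returned by the LLM judge

print(f"MAE: {mean_absolute_error(judge, human):.2f}")
print(f"Agreement within 0.25: {agreement_rate(judge, human):.0%}")
```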
Google Stax is not the only solution available for AI model evaluation. Its competitors include OpenAI Evals, DeepEval, MLflow LLM Evaluate, and many others, each differing significantly in approach and capabilities.
Currently, Stax supports benchmarking for a growing list of model providers, including OpenAI, Anthropic, Mistral, Grok, DeepSeek, and Google itself. In addition, it can be used with custom model endpoints. It is free to use while in beta, but Google says it may introduce a pricing model after that.
A final note on data privacy: Google states that it will neither claim ownership of user data, including prompts, custom datasets, and evaluators, nor use it to train its language models. However, users should be aware that when models from other providers are evaluated, those providers' own data policies will also apply.