World of Software

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

News Room
Published 29 September 2025. Last updated: 2025/09/29 at 6:34 PM

Google Stax is a framework designed to replace subjective evaluations of AI models with an objective, data-driven, and repeatable process for measuring model output quality. Google says this will allow AI developers to tailor the evaluation process to their specific use cases rather than relying on generic benchmarks.

According to Google, evaluation is key to selecting the right model for a given solution by comparing quality, latency, and cost. It is also essential for assessing how effective prompt engineering and fine-tuning efforts actually are in improving results. Another area where repeatable benchmarks are valuable is agent orchestration, where they help ensure that agents and other components work reliably together.
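To make the quality, latency, and cost comparison concrete, here is a minimal sketch of how per-model results might be aggregated once an evaluation run has produced scores. It does not use any Stax API: `RunResult`, `summarize`, the model names, and all numbers are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One evaluated response (hypothetical record layout)."""
    model: str
    score: float      # evaluator score between 0.0 and 1.0
    latency_s: float  # response time in seconds
    cost_usd: float   # cost of the call in US dollars


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Group results by model and compute mean score, mean latency, and total cost."""
    by_model: dict[str, list[RunResult]] = {}
    for result in results:
        by_model.setdefault(result.model, []).append(result)
    return {
        model: {
            "mean_score": mean(r.score for r in runs),
            "mean_latency_s": mean(r.latency_s for r in runs),
            "total_cost_usd": sum(r.cost_usd for r in runs),
        }
        for model, runs in by_model.items()
    }


# Made-up results for two placeholder models.
results = [
    RunResult("model-a", score=1.0, latency_s=1.0, cost_usd=0.25),
    RunResult("model-a", score=0.5, latency_s=2.0, cost_usd=0.25),
    RunResult("model-b", score=0.5, latency_s=0.5, cost_usd=0.125),
    RunResult("model-b", score=0.5, latency_s=0.5, cost_usd=0.125),
]
summary = summarize(results)
for model, stats in summary.items():
    print(model, stats)
```

With a table like this, the trade-off becomes explicit: in the invented data, the second model is faster and cheaper but scores lower on average.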

Stax provides data and tools to build benchmarks that combine human judgement and automated evaluators. Developers can import production-ready datasets or create their own, either by uploading existing data or by using LLMs to generate synthetic datasets. Likewise, Stax includes a suite of default evaluators for common metrics such as verbosity and summarization, while allowing the creation of custom evaluators for more specific or fine-grained criteria.

A custom evaluator can be created in a few steps, beginning with selecting the base LLM that will act as a judge. The judge is given a prompt that tells it how to evaluate the tested model’s output. The prompt must contain definitions of the categories the judge will use for grading, each associated with a numerical score between 0.0 and 1.0. It must also include instructions on the preferred response format, and it may use variables to refer to the {{output}}, {{input}}, {{history}}, {{expected_output}}, and {{metadata.key}}. To ensure the evaluator’s reliability, it should be calibrated against trusted human ratings using a classical supervised learning approach. The evaluator prompt can then be refined iteratively to improve the consistency between its ratings and those of the trusted human raters.
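As a rough illustration of these steps, the sketch below fills a judge prompt template and measures how often the judge's categories match human ratings. It is not Stax's API: `JUDGE_PROMPT`, `CATEGORY_SCORES`, `render_prompt`, and `agreement_rate` are hypothetical names, the categories and labels are invented, and only the `{{...}}` variable syntax is taken from the description above.

```python
import re

# Hypothetical grading categories, each mapped to a score between 0.0 and 1.0.
CATEGORY_SCORES = {"correct": 1.0, "partially_correct": 0.5, "incorrect": 0.0}

# Hypothetical judge prompt using the {{variable}} placeholders.
JUDGE_PROMPT = """You are grading a model response.
Question: {{input}}
Reference answer: {{expected_output}}
Model response: {{output}}
Domain: {{metadata.domain}}
Reply with exactly one category: correct, partially_correct, or incorrect."""

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{name}} or {{metadata.key}} placeholder with a value from record."""
    def lookup(match: re.Match) -> str:
        value = record
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the record lacks the field
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge chose the same category as the human rater."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


record = {
    "input": "What is 2 + 2?",
    "output": "4",
    "expected_output": "4",
    "metadata": {"domain": "arithmetic"},
}
prompt = render_prompt(JUDGE_PROMPT, record)

# In practice the judge labels would come from the judge LLM; these are made up.
judge = ["correct", "incorrect", "partially_correct", "correct"]
human = ["correct", "incorrect", "correct", "correct"]
rate = agreement_rate(judge, human)
judge_scores = [CATEGORY_SCORES[label] for label in judge]
print(f"agreement with human ratings: {rate:.2f}")
```

A low agreement rate would suggest revising the category definitions in the prompt and measuring again, which is one simple way to read the iterative refinement described above.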

Google Stax is not the only solution available for AI model evaluation. Its competitors include OpenAI Evals, DeepEval, MLflow LLM Evaluate, and many others, each differing significantly in approach and capabilities.

Currently, Stax supports benchmarking for a growing list of model providers, including OpenAI, Anthropic, Mistral, Grok, DeepSeek, and Google itself. In addition, it can be used with custom model endpoints. It is free to use while in beta, but Google says it may introduce a pricing model after that.

A final note on data privacy: Google states that it will neither own user data, including prompts, custom datasets, or evaluators, nor use it to train its language models. However, users should be aware that when using other providers, those providers’ data policies will also apply.
