Copyright © All Rights Reserved. World of Software.

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

News Room
Published 29 September 2025 (last updated 2025/09/29 at 6:34 PM)

Google Stax is a framework designed to replace subjective evaluations of AI models with an objective, data-driven, and repeatable process for measuring model output quality. Google says this will allow AI developers to tailor the evaluation process to their specific use cases rather than relying on generic benchmarks.

According to Google, evaluation is key to selecting the right model for a given solution by comparing quality, latency, and cost. It is also essential for measuring whether prompt engineering and fine-tuning actually improve results. Another area where repeatable benchmarks are valuable is agent orchestration, where they help ensure that agents and other components work reliably together.
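
To make the quality, latency, and cost trade-off concrete, here is a minimal Python sketch of choosing among candidate models once their evaluation results are available. It does not call Stax: the `ModelResult` fields, the budget parameters, and the model names and numbers are hypothetical placeholders, not Stax output or real measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelResult:
    """Aggregated evaluation results for one candidate model (hypothetical schema)."""
    name: str
    quality: float            # mean evaluator score in [0.0, 1.0]
    p95_latency_ms: float     # 95th-percentile response latency in milliseconds
    cost_per_1k_calls: float  # cost of 1,000 calls, in any consistent currency


def select_model(
    results: Sequence[ModelResult],
    max_latency_ms: float,
    max_cost_per_1k: float,
) -> Optional[ModelResult]:
    """Return the highest-quality model within the latency and cost budgets.

    Ties on quality go to the cheaper model; returns None if nothing fits.
    """
    eligible = [
        r for r in results
        if r.p95_latency_ms <= max_latency_ms and r.cost_per_1k_calls <= max_cost_per_1k
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.quality, -r.cost_per_1k_calls))


# Placeholder numbers for illustration only, not real measurements.
candidates = [
    ModelResult("model-a", quality=0.91, p95_latency_ms=1800, cost_per_1k_calls=4.0),
    ModelResult("model-b", quality=0.88, p95_latency_ms=600, cost_per_1k_calls=1.5),
    ModelResult("model-c", quality=0.80, p95_latency_ms=400, cost_per_1k_calls=0.5),
]
best = select_model(candidates, max_latency_ms=1000, max_cost_per_1k=2.0)
print(best.name if best else "no model fits")  # prints "model-b"
```

In this example the highest-scoring candidate is ruled out by the latency budget, so the best model that fits both budgets wins. A weighted combined score is an alternative if a single ranking metric is preferred.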

Stax provides data and tools to build benchmarks that combine human judgement and automated evaluators. Developers can import production-ready datasets or create their own, either by uploading existing data or by using LLMs to generate synthetic datasets. Likewise, Stax includes a suite of default evaluators for common metrics such as verbosity and summarization, while allowing the creation of custom evaluators for more specific or fine-grained criteria.
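
The general idea of running a suite of evaluators over a dataset can be sketched in Python as below. This is not the Stax API: `EvalCase`, `verbosity_evaluator`, `exact_match_evaluator`, and `run_suite` are hypothetical names, and the word-count verbosity heuristic is a simple stand-in, not a description of how Stax's default verbosity evaluator works.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One dataset row: the prompt, the model's output, and an optional reference."""
    input: str
    output: str
    expected_output: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


# An evaluator maps one case to a score in [0.0, 1.0].
Evaluator = Callable[[EvalCase], float]


def verbosity_evaluator(max_words: int) -> Evaluator:
    """Score 1.0 within the word budget, falling linearly to 0.0 at twice the budget."""
    def score(case: EvalCase) -> float:
        words = len(case.output.split())
        if words <= max_words:
            return 1.0
        return max(0.0, 1.0 - (words - max_words) / max_words)
    return score


def exact_match_evaluator(case: EvalCase) -> float:
    """Score 1.0 if output and reference match, ignoring case and surrounding whitespace."""
    return float(case.output.strip().lower() == case.expected_output.strip().lower())


def run_suite(dataset: List[EvalCase], evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """Apply every evaluator to every case and return the mean score per evaluator."""
    return {name: mean(evaluate(case) for case in dataset) for name, evaluate in evaluators.items()}


dataset = [
    EvalCase(input="Capital of France?", output="Paris", expected_output="Paris"),
    EvalCase(input="Capital of Japan?", output="The capital of Japan is Tokyo.", expected_output="Tokyo"),
]
scores = run_suite(dataset, {"verbosity": verbosity_evaluator(max_words=5),
                             "exact_match": exact_match_evaluator})
print(scores)  # {'verbosity': 0.9, 'exact_match': 0.5}
```

Keeping each evaluator as an independent function makes it easy to mix cheap heuristic checks with LLM-based or human-rated scores in the same report.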

A custom evaluator can be created in a few steps, beginning with selecting the base LLM that will act as a judge. The judge receives a prompt that instructs it how to evaluate the tested model’s output. The prompt must define the categories the judge will use for grading, each associated with a numerical score between 0.0 and 1.0. It must also specify the preferred response format, and it may use the variables {{output}}, {{input}}, {{history}}, {{expected_output}}, and {{metadata.key}} to reference the corresponding values. To ensure the evaluator’s reliability, it should be calibrated against trusted human ratings using a classical supervised learning approach. The evaluator prompt can then be refined iteratively to improve consistency between its ratings and those of the trusted human raters.
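
The custom evaluator steps can be approximated in plain Python as a sketch, assuming a JSON response format and a three-category rubric chosen for illustration. `RUBRIC`, `JUDGE_PROMPT`, `render_prompt`, `parse_grade`, and `agreement` are hypothetical names; the call to the judge LLM is omitted, and the agreement metrics are one simple way to compare judge scores with human ratings, not Stax's documented calibration procedure.

```python
import json
import re
from statistics import mean
from typing import Any, Dict, List, Mapping

# Hypothetical rubric: each grading category maps to a score in [0.0, 1.0].
RUBRIC: Dict[str, float] = {"GOOD": 1.0, "PARTIAL": 0.5, "BAD": 0.0}

# Hypothetical judge prompt using the variables the article lists.
JUDGE_PROMPT = """You are grading another model's answer.
Question: {{input}}
Conversation so far: {{history}}
Reference answer: {{expected_output}}
Model answer: {{output}}
Domain: {{metadata.domain}}
Grade the answer GOOD (fully correct), PARTIAL (partly correct), or BAD (incorrect).
Respond only with JSON of the form {"grade": "GOOD"}."""

_VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, case: Mapping[str, Any]) -> str:
    """Replace {{name}} and {{metadata.key}} placeholders with values from the test case."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name.startswith("metadata."):
            return str(case.get("metadata", {}).get(name[len("metadata."):], ""))
        return str(case.get(name, ""))  # missing variables render as empty strings
    return _VARIABLE.sub(lookup, template)


def parse_grade(judge_response: str) -> float:
    """Map the judge's JSON verdict to its rubric score; reject unknown categories."""
    grade = json.loads(judge_response)["grade"]
    if grade not in RUBRIC:
        raise ValueError(f"unexpected grade: {grade!r}")
    return RUBRIC[grade]


def agreement(judge_scores: List[float], human_scores: List[float]) -> Dict[str, float]:
    """Compare judge scores with trusted human scores for the same cases."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(judge_scores, human_scores))
    return {
        "exact_agreement": mean(float(j == h) for j, h in pairs),
        "mean_abs_error": mean(abs(j - h) for j, h in pairs),
    }


case = {"input": "What is 2 + 2?", "output": "4", "expected_output": "4",
        "metadata": {"domain": "arithmetic"}}
prompt = render_prompt(JUDGE_PROMPT, case)  # send this to the judge LLM (not shown)
score = parse_grade('{"grade": "GOOD"}')    # 1.0
report = agreement([1.0, 0.5, 0.0], [1.0, 1.0, 0.0])
```

Recomputing `agreement` on the same human-labeled cases after each revision of the judge prompt gives a rough signal of whether the revision moves the judge closer to the human raters.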

Google Stax is not the only solution available for AI model evaluation. Its competitors include OpenAI Evals, DeepEval, MLflow LLM Evaluate, and many others, each differing significantly in approach and capabilities.

Currently, Stax supports benchmarking for a growing list of model providers, including OpenAI, Anthropic, Mistral, Grok, DeepSeek, and Google itself. In addition, it can be used with custom model endpoints. It is free to use while in beta, but Google says it may introduce a pricing model once the beta ends.

A final note on data privacy: Google states that it will neither own user data, including prompts, custom datasets, or evaluators, nor use it to train its language models. However, users should be aware that when using other providers, those providers’ data policies will also apply.
