Google has introduced LLM-Evalkit, an open-source framework built on Vertex AI SDKs and designed to make prompt engineering for large language models less chaotic and more measurable. The lightweight tool aims to replace scattered documents and guess-based iteration with a unified, data-driven workflow.
As Michael Santoro put it, anyone who has worked with LLMs knows the pain: teams experiment in one console, save prompts elsewhere, and measure results inconsistently. LLM-Evalkit pulls these efforts into a single, coherent environment — a place where prompts can be created, tested, versioned, and compared side by side. By keeping a shared record of changes, teams can finally track what’s improving performance instead of relying on memory or spreadsheets.
The kit’s philosophy is straightforward: stop guessing, start measuring. Instead of asking which prompt “feels” better, users define a specific task, assemble a representative dataset, and evaluate outputs using objective metrics. The framework makes each improvement quantifiable, turning intuition into evidence.
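In practice, that workflow amounts to running every prompt variant against the same dataset and scoring the outputs. The sketch below is not LLM-Evalkit's own API; it is a minimal illustration of the "define a task, assemble a dataset, measure" loop using the Vertex AI SDK directly, with a hypothetical project, a tiny hand-made sentiment dataset, and a simple exact-match metric standing in for whatever metric a team would actually choose.

```python
# Illustrative only: compare two prompt variants on one task with an
# objective metric, instead of judging which prompt "feels" better.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project
model = GenerativeModel("gemini-1.5-flash")

# A small representative dataset for one specific task: sentiment labeling.
dataset = [
    {"text": "The update made everything faster.", "label": "positive"},
    {"text": "The app crashes on launch.", "label": "negative"},
]

# Two prompt variants evaluated on exactly the same data.
prompts = {
    "v1": "Classify the sentiment as positive or negative: {text}",
    "v2": "You are a precise annotator. Reply with exactly one word, "
          "'positive' or 'negative', for this review: {text}",
}

# Score each variant with a simple objective metric (exact match).
for name, template in prompts.items():
    correct = 0
    for row in dataset:
        response = model.generate_content(template.format(text=row["text"]))
        if row["label"] in response.text.strip().lower():
            correct += 1
    print(f"{name}: accuracy = {correct / len(dataset):.2f}")
```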
This approach integrates seamlessly with existing Google Cloud workflows. Built on Vertex AI SDKs and connected to Google’s evaluation tools, LLM-Evalkit establishes a structured feedback loop between experimentation and performance tracking. Teams can run tests, compare outputs, and maintain a single source of truth for all prompt iterations — without juggling multiple environments.
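LLM-Evalkit wires this tracking up for you, but the underlying idea can be sketched with Vertex AI Experiments from the same SDK family: record one run per prompt iteration so results remain comparable over time. The experiment name, run name, and metric value below are illustrative assumptions, not output from the framework itself.

```python
# Illustrative only: log each prompt iteration as an experiment run so the
# team keeps a single, queryable record of what changed and how it scored.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                 # hypothetical project id
    location="us-central1",
    experiment="sentiment-prompt-evals",  # one experiment per prompt task
)

aiplatform.start_run(run="prompt-v2")
aiplatform.log_params({"prompt_version": "v2", "model": "gemini-1.5-flash"})
aiplatform.log_metrics({"accuracy": 0.92})  # score produced by the evaluation step
aiplatform.end_run()
```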
At the same time, Google designed the framework to be inclusive. With its no-code interface, LLM-Evalkit makes prompt engineering accessible to a wider range of professionals — from developers and data scientists to product managers and UX writers. By reducing technical barriers, it encourages faster iteration and closer collaboration between technical and non-technical team members, turning prompt design into a truly cross-disciplinary effort.
Santoro shared his enthusiasm on LinkedIn:
Excited to announce a new open-source framework I’ve been working on — LLM-Evalkit! It’s designed to streamline the prompt engineering process for teams working with LLMs on Google Cloud.
The announcement drew attention from practitioners in the field. One user commented on LinkedIn:
This looks very good, Michael. Lack of a centralised system to track prompts over time — especially with model upgrades — is a problem we are facing. Excited to try this.
LLM-Evalkit is available now as an open-source project on GitHub, integrated with Vertex AI and accompanied by tutorials in the Google Cloud Console. New users can take advantage of Google’s $300 trial credit to explore it.
With LLM-Evalkit, Google wants to turn prompt engineering from an improvised craft into a repeatable, transparent process — one that grows smarter with every iteration.