Many companies are currently assembling teams to evaluate AI. These newly created positions, functions, and roles are set to become an indispensable safety net for organizations introducing AI tools. As more and more AI pilot projects move into full production, the new teams are expected to evaluate AI results more rigorously.
The rapid rise of AI agents has seen AI evaluation teams begin to take shape in recent months, reports Yasmeen Ahmad, managing director of product management, data and AI cloud at Google Cloud. Companies now observing the behavior of AI agents in practice are realizing that evaluation is not a one-time step but an ongoing practice.
AI evaluation: more than a nice-to-have
At Google, AI evaluation teams are embedded in the agent development groups, so both functions operate side by side. “As the agent developers work, the evaluation takes place in parallel, creating a shorter iteration cycle,” says Ahmad.
“Other companies have begun to set up AI evaluation workgroups within their larger AI and IT departments,” adds Maksim Hodar, CIO of software company Innowise.
In some cases, companies build the evaluation team from existing data architects, security officers, and compliance officers rather than hiring new staff from scratch. These groups occupy a hybrid position between engineering and ethical business practice. “It’s safe to say that AI evaluation teams are currently moving from a nice-to-have to a necessity,” says Hodar.
Automate responsibility?
Hodar has also observed that more and more companies are moving away from isolated AI implementations and placing greater emphasis on the “safety net”. Although a number of new tools, for observability and governance for example, focus on preventing AI errors, technology alone is not a complete solution. According to Hodar, people are still needed to decide whether an AI tool aligns with company values and with regulations such as the GDPR. “Technology provides information, but the evaluation team still ultimately gives the green light because accountability cannot be automated.”
Human evaluation teams need data from observability tools, but the technology itself cannot provide the context that AI models and agents need in order to correct faulty results, says Google expert Ahmad. AI agents have become very good at passing output checks in test environments, but evaluation teams are needed to track their results in real-world situations. “Agentic applications may pass the initial unit test for a specific scenario, but agentic systems are non-deterministic decision makers and therefore behave unpredictably,” says Ahmad. It is impossible to test every behavior they might exhibit in the real world.
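To illustrate the point about non-determinism, here is a minimal sketch of what repeated-run evaluation could look like in Python. The agent object, its run() method, and the rubric checks are hypothetical placeholders for this example, not any vendor’s API.

```python
import statistics

def evaluate_scenario(agent, scenario, checks, runs=20):
    """Run one scenario repeatedly and report the pass rate.

    A single passing run says little about a non-deterministic agent;
    repeated runs expose the variance that a one-off unit test hides.
    """
    results = []
    for _ in range(runs):
        output = agent.run(scenario["input"])  # hypothetical agent API
        results.append(all(check(output) for check in checks))
    return {
        "scenario": scenario["name"],
        "runs": runs,
        "pass_rate": statistics.mean(results),
    }

# Example usage with a placeholder agent and a trivial content check:
# report = evaluate_scenario(
#     my_agent,
#     {"name": "refund_request", "input": "Please refund order #123"},
#     checks=[lambda out: "refund" in out.lower()],
# )
```

A pass rate across many runs, rather than a single green check, is the kind of signal an evaluation team can then investigate further.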
Understand the context of AI errors
An observability tool provides data on token and tool usage, as well as on tool failures and reasoning errors, but human evaluators are still required: they fix many of the problems and supply the context behind the reasoning errors agents commonly make.
“When our internal assessment teams spend time with our AI agents, a large part of what they do is explore why the reasoning logic failed in some places,” explains Ahmad. The solution is usually to provide the right context at the right levels in the agent so that it can draw better conclusions.
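As a rough illustration, the sketch below shows the kind of trace record an observability layer might hand to a human evaluator. The field names and the summarize_trace helper are assumptions made for this example, not a specific product’s schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentStepTrace:
    """One step of an agent run, as an observability layer might record it."""
    step: int
    tool_name: Optional[str]       # which tool the agent called, if any
    tool_error: Optional[str]      # error message if the tool call failed
    prompt_tokens: int
    completion_tokens: int
    reasoning_note: str = ""       # free-text note left for the human evaluator

def summarize_trace(steps: list[AgentStepTrace]) -> dict:
    """Aggregate the raw numbers; judging why the reasoning failed stays with humans."""
    return {
        "total_tokens": sum(s.prompt_tokens + s.completion_tokens for s in steps),
        "tool_calls": sum(1 for s in steps if s.tool_name),
        "tool_failures": sum(1 for s in steps if s.tool_error),
    }
```

The aggregation is mechanical; deciding what context the agent was missing at a given step remains the evaluators’ job.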
Testing in a complex environment
“A good evaluation team also addresses several other aspects, including governance, cultural readiness, alignment with company workflows, and measurable business impact of AI tools,” adds Noe Ramos, vice president of AI operations at Agiloft, a contract lifecycle management provider. Technology alone cannot solve all these problems.
“The biggest hurdle isn’t technical – it’s human,” Ramos says. You can buy powerful tools and still have problems if people don’t trust them, don’t understand them, or don’t see how they fit into their work.
Like Hodar and Ahmad, Ramos sees growing demand for AI evaluation teams, although these roles are emerging as a skill set rather than as formalized job titles.
Sometimes less is more
“AI evaluation is ultimately not just about security, but about ensuring that AI provides clarity and confidence to act rather than more noise,” argues Ramos. Her company frames it internally this way: “We use AI to promote clarity and action – not to overwhelm teams with more dashboards.” Her team includes a head of AI operations, an AI agent engineer, and a head of GPT and AI systems. The aim is to integrate evaluation into Agiloft’s AI operating model.
As organizations mature in their use of AI, the leap to disciplined use of these tools requires a structured evaluation function. “In my experience, one of the biggest risks is that AI initiatives are driven by the loudest voices rather than by real operational priorities,” says Ramos. Instead, AI development should amplify the best-grounded ideas to maximize AI’s impact in the enterprise.
According to Ramos, in most organizations the evaluation role or function must sit at the interface between IT, security, data management, and operational stakeholders. Those responsible for AI evaluation also need a deep understanding of how the company works. “One of the reasons AI evaluation fails is that companies don’t really understand their own workflows,” says Ramos. AI can only be evaluated intelligently when workflows are mapped, bottlenecks are identified, and priorities are aligned. (ajf/jd)
