To test agentic AI, apply agents liberally – News

By News Room | Published 13 October 2025 | Last updated 4:11 AM, 13 October 2025

Agentic artificial intelligence is the new belle of the software ball. C-level executives want their companies to use AI agents to move faster, which drives vendors to deliver AI agent-driven software, and every software delivery team is looking for ways to add agentic capabilities and automation to its development platform.

Some pundits speculate that, by coding in parallel with copilots, developers could increase their code output tenfold. But how good is that output, and does AI-generated code push test coverage requirements beyond the reach of humans?

Despite quality concerns and developer misgivings, there’s simply too much potential value in AI development and testing tools that can do work quickly and semi-autonomously to put the toothpaste back in the tube. We’ll eventually have to test AI agents with AI agents. 

It’s no wonder a recent survey found that two-thirds of companies are either already using or planning to use multiple AI agents to test software, and that 72% believe agentic AI could test software autonomously by 2027.

Where do you start with agent-based testing?

Newer companies have the advantage of working with AI from the start, seemingly inheriting less technical debt from hand-rolled applications and tests. Startup teams can move much faster, but they may not have enough implementation experience to know where to look for errors.

Bringing AI testing agents onto the team can help, but once they are tasked with finding bugs, they may generate far more test feedback than expected. Developers then find themselves trying to separate genuine errors from false positives, which definitely cools the vibe in vibe coding.

“The only purpose of adopting agents is productivity, and the unlock for that is verifiability,” said David Colwell, vice president of artificial intelligence at Tricentis, maker of an agentic AI-driven testing platform. “The best AI agent is not the one that can do the work the fastest. The best AI agent is the one that can prove that the work was done correctly the fastest.”

In a sense, established enterprises with long-running DevOps tool chains do have one advantage over nimbler startups: being able to roll existing requirements, documentation, customer journeys, architectural diagrams, procedures, test plans, test cases and even robotic process automation bots into a corpus of AI contextual knowledge, which can provide foundational skills for informing a swarm of specialized test agents.

“When you prompt AI to write a test, one agent will understand the user’s natural language commands, and another will start to execute against that plan and write actions into the test, while another agent understands what changed in the application and how the test should be healed,” said Andrew Doughty, founder and chief executive of SpotQA, creator of Virtuoso QA. “And then if there is a failure, an agent can look into the history of that test object, and then triage it automatically and send it over to developers to investigate.”
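The pipeline Doughty describes can be sketched as a chain of specialized agents, each handling one stage: planning, authoring, healing and triage. In this minimal sketch the agent functions are hypothetical stand-ins (in a real platform each would wrap an LLM or a tuned model); only the hand-off pattern is the point.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    plan: str
    steps: list = field(default_factory=list)
    status: str = "draft"

def planner_agent(natural_language: str) -> TestCase:
    """Turn a natural-language command into a test plan."""
    return TestCase(plan=f"verify: {natural_language}")

def author_agent(case: TestCase) -> TestCase:
    """Write concrete actions against the plan."""
    case.steps = ["open app", "perform action", "assert outcome"]
    return case

def healer_agent(case: TestCase, app_changed: bool) -> TestCase:
    """Repair the test when the application under test changes."""
    if app_changed:
        case.steps = [s.replace("open app", "open app (new route)") for s in case.steps]
        case.status = "healed"
    return case

def triage_agent(case: TestCase, passed: bool) -> str:
    """On failure, route the test and its history to developers."""
    return "ok" if passed else f"sent to developers: {case.plan}"

case = healer_agent(author_agent(planner_agent("checkout applies discount")),
                    app_changed=True)
print(triage_agent(case, passed=False))
# -> sent to developers: verify: checkout applies discount
```

Each stage only needs the artifact produced by the previous one, which is what lets the stages be backed by different, smaller models.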

Wrangling agentic test assets

While the encyclopedic knowledge and uncannily human dialogue of the latest LLMs like ChatGPT and Gemini are impressive, most of their massive data sets are not related to software testing skills at all. Besides, using enough GenAI tokens to automate testing against a high-traffic enterprise application will really hoover up a tools and infrastructure budget. That’s why leaner test agents are such a perfect fit.

“We’ve found that customers don’t need large model-based AIs to do very specific testing tasks. You really want smaller models that have been tuned and trained to do specific tasks, with fine-grained context about the system under test to deliver consistent, meaningful results,” said Matt Young, president of Functionize Inc.

Test management platforms have been around for years, coordinating the use of test automation tool chains and executing test suites according to requirements. Since most AI agents and large language models can be invoked through application programming interfaces (and now via Model Context Protocol servers), it stands to reason they could be orchestrated alongside conventional testing tools.
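Because the agent is just another endpoint, it can be slotted into an existing test runner like any other tool. A minimal sketch of the idea, with an assumed (placeholder) agent URL and an injected transport so the network call can be stubbed out:

```python
import json

# Placeholder endpoint: any test-authoring agent exposed over HTTP
# or an MCP server could sit behind this URL.
AGENT_URL = "https://agents.example.com/v1/generate-test"

def request_test_from_agent(requirement: str, transport) -> dict:
    """Ask a remote agent for a test suite for one requirement.
    `transport` performs the actual call, so a real HTTP client or
    an offline stub can be swapped in by the orchestrator."""
    payload = json.dumps({"url": AGENT_URL, "requirement": requirement})
    return json.loads(transport(payload))

def stub_transport(payload: str) -> str:
    """Stand-in for the network hop, so the sketch runs offline."""
    req = json.loads(payload)
    return json.dumps({"requirement": req["requirement"],
                       "tests": [f"assert {req['requirement']} holds"]})

suite = request_test_from_agent("cart total includes tax", stub_transport)
print(suite["tests"])
# -> ['assert cart total includes tax holds']
```

The injected transport is the same seam a test management platform would use to govern, log and rate-limit agent calls alongside conventional tools.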

“Specialized agents for test planning, design, execution, reporting, and maintenance are still assets that need to be governed, especially in highly regulated industries,” said Alex Martins, vice president of strategy at Katalon Inc. “Give an AI agent a high-level requirement without enough details, and the resulting tests won’t be useful. We compare test cases back to requirements, often using another agent to check the work, then see if they arrive at the same conclusion. We then flag the cases that don’t match for humans to look at.”
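The cross-checking pattern Martins describes (one agent generates, another independently re-derives the requirement, and disagreements go to humans) can be illustrated with toy agents. Here the "judge" simply parses the test name; in practice both roles would be LLM calls, and only the flag-on-disagreement structure is the point.

```python
def authoring_agent(requirement: str) -> dict:
    """Toy generator: derives a test from a requirement."""
    return {"requirement": requirement,
            "test": f"test_{requirement.replace(' ', '_')}"}

def judge_agent(test_case: dict) -> str:
    """Independently infer which requirement a test actually covers."""
    return test_case["test"].removeprefix("test_").replace("_", " ")

def flag_mismatches(cases: list) -> list:
    """Cases where the two agents disagree are routed to human review."""
    return [c["test"] for c in cases if judge_agent(c) != c["requirement"]]

cases = [
    authoring_agent("login rejects bad password"),           # agents agree
    {"requirement": "cart persists items",                   # drifted test:
     "test": "test_cart_clears_items"},                      # agents disagree
]
print(flag_mismatches(cases))
# -> ['test_cart_clears_items']
```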

Overcoming hallucinations with real-world feedback

We’ve all heard about AI chatbots going off the rails and responding to customer requests with completely made-up answers that can be either hilarious or a huge liability for the company that uses them. AI agents are even less mature, like teenagers who know everything, except what they don’t know.

“Your agent needs to capture a feedback loop of real-world data from staging and production, a ‘digital twin’ so the AI isn’t arguing with itself,” said Ken Ahrens, CEO of Speedscale LLC. The firm recently released a free utility called Proxymock, which agents can use as a tool to snapshot realistic environments from deployed software, in order to replay functional and regression tests.
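The record-and-replay idea behind such a "digital twin" can be sketched generically (this is not Proxymock's actual API): a recording pass snapshots real responses from staging, and the replay pass serves those snapshots back so the agent's tests run against deterministic ground truth instead of the agent's own guesses.

```python
class SnapshotProxy:
    """Record mode when a real backend is attached; replay mode otherwise."""
    def __init__(self, backend=None):
        self.backend = backend        # real service, present only when recording
        self.snapshots = {}

    def call(self, request: str) -> str:
        if self.backend is not None:              # record: pass through and save
            response = self.backend(request)
            self.snapshots[request] = response
            return response
        return self.snapshots[request]            # replay: deterministic answer

def live_service(request: str) -> str:            # stand-in for a staging endpoint
    return f"200 OK for {request}"

recorder = SnapshotProxy(backend=live_service)
recorder.call("GET /orders/42")                   # record from "staging"

replayer = SnapshotProxy()                        # no backend: replay only
replayer.snapshots = recorder.snapshots           # hand snapshots to the test agent
print(replayer.call("GET /orders/42"))
# -> 200 OK for GET /orders/42
```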

Whether AI agents are used for coding or testing, they aim to please. If coding and integration agents aren’t given enough context to deliver a valid solution, they’ll often invent a plausible-looking piece of code that won’t work in the target environment. If you prompt a testing agent to find bugs without clear requirements, it will spit back some false positives, even when looking at perfectly constructed software.

“AI tests often hallucinate steps, skip critical edge cases, or get stuck in loops,” said Yunhao Jiao, CEO of TestSprite. “In coding agents, we frequently see mismatches between what the requirements specify and what the agent delivers — the ‘looks right, but fails on details’ problem. Some agents will even ‘game’ the system: for example, one developer shared that when they told the AI a feature didn’t work, it simply deleted the feature to satisfy the request.”

Getting past nondeterministic repeatability

A major concern for testing AI-driven software with agents is repeatability. When nondeterministic AI agents interact with different team users as well as underlying technology and peer agents, perceived errors become almost impossible to replicate.

“Repeatability involves creating the same state — and using observability, you need to collect all the data, which will allow you to go back in time to when the error condition occurred, including screen elements, logs, and AI actions,” said Prince Kohli, CEO of Sauce Labs Inc. “You might even ask the agent to ‘Tell me why you arrived at this conclusion.’ While they’ll never be perfect, you can get much closer to the truth.”
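One concrete way to "go back in time" is to capture both the randomness and the action trail of a nondeterministic agent. This toy sketch stands in for that idea: recording the seed plus every action makes the run exactly reproducible, which is the precondition for replicating a perceived error.

```python
import random

def run_agent(seed, record=None):
    """Simulated nondeterministic agent: same seed -> same action sequence.
    `record` collects an observability trail of every action taken."""
    rng = random.Random(seed)
    actions = []
    for step in range(3):
        action = rng.choice(["click", "scroll", "type"])
        actions.append(f"{step}:{action}")
        if record is not None:
            record.append(f"{step}:{action}")
    return actions

trail = []
first = run_agent(seed=1234, record=trail)   # original (flaky-looking) run
replayed = run_agent(seed=1234)              # replay from captured seed
print(first == replayed)
# -> True
```

Real platforms capture far more than a seed, as Kohli notes: screen elements, logs and the AI's stated reasoning, all keyed to the moment the error occurred.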

The Sauce Labs platform initiates AI test authoring agents at each pull request or production crash to provide release managers, developers and QA engineers with behavior-based test suites that simulate multiple user scenarios across different device endpoints and browsers.

Can AI be the judge of quality?

Testing agents can read code, take actions and make an abstract representation of an application, which never quite matches the human tester’s experience using the app. The difference between the two represents a gap in test coverage, which will still put a human in the testing loop.

“In our end-to-end testing platform, we’re using and consuming an application, and we’re also taking in the specifications and user stories. From that knowledge base it creates tests that can be run by agents,” said Fitz Nowlan, vice president of AI and architecture at SmartBear Software. “We still require the human to decide if the representation is accurate or not, and to confirm the AI is on the right track. This is uplifting for both software developers and testers.”

Armed with co-pilots, developers are checking in code at an unprecedented rate. This is where agents can step in to help teams test applications at the same velocity, to ensure each rapid release is still aligned with customer requirements.

“Maybe agentic AI is an opportunity to not just repeat what we did with code generation, but perhaps to finally do test driven development right, like we’ve been talking about for the last 20 years,” said Itamar Friedman, CEO of Qodo. “TDD requires you to be rigorous about requirements, and with AI generated code, sometimes you don’t even know the intent of the code base. Multiple agents can review code and provide context against requirements within the developer’s IDE.”

Testing agentic AI at scale

Whether agents are talking to users or other agents, calling an API or referencing an MCP server, they still rely on TCP/IP. The performance of the internet at large is part of the ground truth of testing agentic performance.

“Some of our customers have AI agents continually running on user’s devices, and we’re testing the performance of that endpoint interface as events occur — for instance, if an open router service or a CDN in a certain region has downtime, that’s an issue,” said Matt Izzo, chief product officer at Catchpoint Systems Inc. “Other customers want to test the consistency and response times for certain prompts from locations around the world.”
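Testing response-time consistency across regions reduces to comparing per-region timings against a baseline and flagging outliers. A toy version, with canned timings standing in for real network probes:

```python
import statistics

def flag_slow_regions(timings_ms, tolerance=2.0):
    """Flag regions whose latency exceeds `tolerance` times the median,
    e.g. a CDN or routing issue confined to one geography."""
    median = statistics.median(timings_ms.values())
    return [region for region, t in timings_ms.items() if t > tolerance * median]

timings = {"us-east": 120.0, "eu-west": 135.0, "ap-south": 610.0}
print(flag_slow_regions(timings))
# -> ['ap-south']
```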

The Intellyx take

As the market bubble of infinitely power- and resource-consumptive LLMs reaches its breaking point and pops, we’ll continue to find teams turning toward leaner, more specialized agents to deliver and test application functionality.

Advanced companies should devote time to building a responsible trust framework for testing agents, with employee and agent feedback and quality guardrails for managing the behavior of a fleet of AI assets and agents in their extended environment.

Still, no matter how intricate and airtight the governance of AI usage within development and testing organizations seems to be, our agentic co-workers can’t catch everything. We’ll still need humans to test.

Jason English is principal analyst and chief marketing officer at Intellyx. He wrote this article for News. At the time of writing, SmartBear and Tricentis are former Intellyx customers, and the author is an advisor to Speedscale. No other companies mentioned are Intellyx customers. ©2025 Intellyx B.V.

Image: News/Reve
