Everybody is building an AI agent. They work beautifully because they are powered by models like GPT and Claude Opus, which are trained on massive amounts of data. But a key problem that is rarely discussed is the testability of these agents. You don't want your AI agent to hallucinate in production, drop a production table, or get stuck in an infinite retry loop.
The traditional software testing playbook of unit tests, integration tests, and end-to-end tests with known inputs and expected outputs fundamentally breaks down for agentic AI systems. When the system's behavior is non-deterministic, when the state space is effectively infinite, or when the correct output is subjective, you need a different approach.
At DocuSign and across the industry, teams are building agents that orchestrate complex workflows: analyzing documents, making decisions, calling APIs, handling errors, and adapting to user feedback. These systems fail in ways that predetermined test cases cannot catch.
Why Traditional Testing Fails for Agents
Let's take a traditional API endpoint as an example: given input X, it returns output Y. You write a test asserting exactly that. The test is deterministic and repeatable, and it either passes or fails. This approach works because the mapping from inputs to outputs is known and stable.
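For example, a test for a hypothetical pure function (the name apply_discount is made up for illustration) is fully specified by its assertions:

def test_apply_discount():
    # Deterministic: the same inputs always map to the same output.
    assert apply_discount(price_cents=1000, percent=10) == 900
    assert apply_discount(price_cents=1000, percent=0) == 1000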
Now consider an agent that is tasked with “schedule a meeting with engineering team next week”. The agent would:
- Interpret “next week” based on current date
- Identify who is on the "engineering team"
- Check availability on calendars
- Handle conflicts
- Send email invites
- Confirm completion and return success
At every step, the agent makes a decision based on context and available tools. The exact sequence of API calls is not predetermined; it emerges from the agent's reasoning. Two runs on the same input might take completely different paths and still both be correct.
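In pseudocode, the heart of such an agent is a loop in which the model, not the developer, picks the next action. This is a minimal sketch; llm.choose_action and the tool registry are hypothetical names:

def run_agent(goal, tools, llm, max_steps=20):
    # The model decides the next step from context, so the exact
    # sequence of tool calls is not known ahead of time.
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = llm.choose_action(history, tools)  # hypothetical LLM call
        if action.name == "finish":
            return action.result
        observation = tools[action.name](**action.arguments)
        history.append({"action": action, "observation": observation})
    raise RuntimeError("agent exceeded its step budget")  # guard against infinite loops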
A test case that asserts "the agent calls get_calendar exactly twice, then calls create_event with these parameters" would not work. Your agent might check calendars once, or several times if there are conflicts. It might send invitations in a batch or individually. These are implementation details, not correctness criteria.
Traditional testing assumes a deterministic state machine; agents are non-deterministic. The gap is inevitable.
So how do you test something you can’t fully specify? The answer is to stop testing execution paths and start testing properties, invariants, and outcomes. This requires thinking in three distinct layers, each with its own testing strategy.
The Three Layers of Agent Testing
Effective agent testing requires thinking in layers.
Layer 1: Tool Correctness
The agent's tools are essentially the functions it can call, and they must work deterministically. If get_calendar(user_id) sometimes returns wrong data, the agent will make wrong decisions no matter how good its reasoning is.
Testing the tools is traditional unit testing:
def test_get_calendar():
    calendar = get_calendar("alice@example.com")
    assert calendar.owner == "alice@example.com"
    assert all(isinstance(e, Event) for e in calendar.events)
    assert all(e.start < e.end for e in calendar.events)
This layer is necessary but insufficient. Perfect tools do not guarantee correct agent behavior. They just eliminate one source of failure.
Layer 2: Reasoning Verification
The agent's reasoning, i.e. how it decides which tools to call and in what order, cannot be tested with assertions about tool sequences. But invariants and properties can be tested regardless of the path taken.
Invariant testing checks the properties that must always hold.
For the meeting scheduler example above, the agent must not:
- double-book
- schedule in the past
- create conflicting meetings
- send multiple invitations per participant
These invariants don't care whether the agent calls get_calendar(user_id) serially or in parallel, but they do care about things like double-booking someone:
def test_no_double_booking_invariant(agent, test_scenario):
    result = agent.schedule_meeting(test_scenario)
    for participant in result.participants:
        calendar = get_calendar(participant)
        events_at_time = [e for e in calendar.events
                          if overlaps(e, result.scheduled_time)]
        assert len(events_at_time) == 1, (
            f"{participant} double-booked at {result.scheduled_time}"
        )
Property testing generates random valid inputs and verifies that the expected properties hold for every one of them:
from datetime import datetime
from hypothesis import given, strategies as st

@given(
    participants=st.lists(st.emails(), min_size=2, max_size=10),
    duration=st.integers(min_value=15, max_value=120),
    timeframe=st.datetimes()
)
def test_meeting_scheduling_properties(agent, participants, duration, timeframe):
    result = agent.schedule_meeting(
        participants=participants,
        duration=duration,
        preferred_time=timeframe
    )
    # Properties that must hold regardless of how the agent got there
    assert result.duration == duration
    assert set(result.participants) == set(participants)
    assert result.scheduled_time >= datetime.now()
    assert len(result.invitations_sent) == len(participants)
You are testing the outcome and not the path. The agent can reason however it wants, as long as the result satisfies the properties.
Layer 3: Behavior Evaluation
Agent behavior is subjective and non-deterministic. "Did the agent write a helpful email?" depends on context.
For these questions, you need evaluation rather than testing: tests assert definite correctness, while evaluations score quality.
LLM-as-judge
def evaluate_email_quality(agent_output, context):
    prompt = f"""
    Evaluate this email on a scale of 1-10 for:
    - Professionalism
    - Clarity
    - Completeness
    - Tone appropriateness

    Context: {context}
    Email: {agent_output}

    Return JSON with scores and brief justifications.
    """
    evaluator = LLM(model="gpt-4")
    scores = evaluator.generate(prompt)
    return scores
This is controversial but increasingly common. The key requirement is that the evaluator model be at least as capable as the model being evaluated: you don't want GPT-3.5 judging GPT-4. The trade-off is swapping subjective human judgement for an LLM's, in exchange for automation and consistency.
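One way to make judge scores actionable is to aggregate them over a fixed evaluation set and fail the build when average quality drops below a threshold. This is a sketch; agent.run is hypothetical, and it assumes the judge's JSON parses to a flat {criterion: score} mapping:

import json

def evaluate_email_suite(agent, evaluation_cases, min_average=7.0):
    # Score a fixed set of cases with the LLM judge and gate on the average.
    case_averages = []
    for case in evaluation_cases:
        output = agent.run(case["input"])  # hypothetical agent API
        scores = json.loads(evaluate_email_quality(output, case["context"]))
        case_averages.append(sum(scores.values()) / len(scores))
    overall = sum(case_averages) / len(case_averages)
    assert overall >= min_average, f"Quality regressed: {overall:.2f} < {min_average}"

Run this suite on every release so the judge acts as a regression gate rather than a one-off score.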
Human in the loop
For critical workflows, human evaluation is still necessary. Instead of asking reviewers for an open-ended judgement, give them specific criteria:
evaluation_rubric = {
    "correctness": "Did the agent accomplish the stated goal?",
    "efficiency": "Did it use the minimum necessary API calls?",
    "error_handling": "Did it gracefully handle failures?",
    "user_experience": "Would a user be satisfied with this interaction?"
}

def human_eval(agent_trace, rubric):
    scores = {}
    for criterion, description in rubric.items():
        score = input(f"{description}\nScore (1-5): ")
        scores[criterion] = int(score)
    return scores
You can use this to track scores over time and see whether the system is regressing or improving.
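A minimal sketch of that tracking compares the latest human-eval average against a rolling baseline; the in-memory score_history list stands in for whatever storage you actually use:

import statistics

def check_for_regression(score_history, current_scores, tolerance=0.5):
    # score_history: past per-run averages; current_scores: the dict returned by human_eval.
    current_avg = statistics.mean(current_scores.values())
    baseline = statistics.mean(score_history[-10:]) if score_history else current_avg
    score_history.append(current_avg)
    return {
        "current": current_avg,
        "baseline": baseline,
        "regressing": current_avg < baseline - tolerance,
    }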
Agent Testing is Upside Down
Traditional software testing follows a pyramid: many unit tests, fewer integration tests, few end-to-end tests. For agents, it's the reverse. Most of the testing effort goes to end-to-end scenarios, because that's where emergent failures happen. An agent might call each tool correctly and pass every unit test, yet get stuck in a loop calling the same tool repeatedly and fail the end-to-end test.
So the testing distribution might look like: 20% tool unit tests, 30% invariant and property tests, and the rest end-to-end scenarios. Those scenarios should cover happy paths, partial failures, ambiguous inputs, and adversarial inputs.
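One way to organize that bulk of end-to-end work is a parameterized scenario table, where each entry pins the situation and the acceptable outcome class rather than the execution path. The fixture names and outcome_class attribute below are hypothetical:

import pytest

SCENARIOS = [
    # (name, scenario fixture to load, acceptable outcome class)
    ("happy path", "all_participants_free", "meeting_scheduled"),
    ("partial failure", "calendar_api_times_out", "clear_error_surfaced"),
    ("ambiguous input", "two_teams_named_engineering", "clarification_requested"),
    ("adversarial input", "prompt_injection_in_invite", "request_refused"),
]

@pytest.mark.parametrize("name, scenario, expected", SCENARIOS)
def test_end_to_end_scenarios(agent, load_scenario, name, scenario, expected):
    result = agent.run(load_scenario(scenario))  # hypothetical fixtures
    assert result.outcome_class == expected, f"{name}: got {result.outcome_class}"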
Observability is Testing in Production
It is impossible to test every possible scenario before deployment, so production monitoring becomes your continuous testing ground.
This can be achieved with:
- Structured logging: Log each agent action with enough context to reconstruct its reasoning.
- Metric tracking: Define metrics that indicate agent health. For example, success rate, tool call efficiency, error rate, retry rate, etc.
- Anomaly detection: Flag sessions with excessive tool calls, repetitive patterns that suggest loops, or high error rates; a sketch of such a check follows below. Anomalous sessions go into a review queue where humans label them as bugs, edge cases, or expected behavior.
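A loop and anomaly check of this kind can start as a few heuristics over the structured session log. The session shape and thresholds below are illustrative:

from collections import Counter

def flag_anomalous_session(session, max_tool_calls=25, max_repeats=5, max_error_rate=0.3):
    # Heuristic flags over one session's structured log; tune thresholds to your traffic.
    tool_calls = [a for a in session.actions if a.type == "tool_call"]
    errors = [a for a in session.actions if a.type == "error"]
    repeats = Counter((c.tool, str(c.arguments)) for c in tool_calls)

    flags = []
    if len(tool_calls) > max_tool_calls:
        flags.append("excessive_tool_calls")
    if repeats and max(repeats.values()) > max_repeats:
        flags.append("possible_loop")  # same call with the same arguments, over and over
    if session.actions and len(errors) / len(session.actions) > max_error_rate:
        flags.append("high_error_rate")
    return flags  # any flag sends the session to the human review queue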
The Uncomfortable Truth About Coverage
Code coverage is meaningless for agents. The agent itself has minimal code; most of the behavior comes from the LLM's weights, which you don't control. Your tests could execute every line of code and still cover almost none of the agent's possible behaviors.
What you can still measure is scenario coverage (distinct scenarios tested), tool coverage (percentage of tools exercised), and state coverage (system states reached). But the state space is so large that you will never cover all of it.
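Tool coverage, for instance, is easy to compute from test or production traces (the trace shape here is hypothetical):

def tool_coverage(registered_tools, traces):
    # Fraction of registered tools exercised at least once across the given traces.
    called = {call.tool for trace in traces for call in trace.tool_calls}
    uncovered = set(registered_tools) - called
    return len(set(registered_tools) & called) / len(registered_tools), uncovered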
Testing agents resembles testing distributed systems more than testing libraries. You cannot prove correctness, but you can increase confidence through comprehensive scenarios, invariant testing, fault injection, monitoring, and rapid iteration on observed failures.
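Fault injection, for example, can be as simple as wrapping a tool so it fails intermittently and then re-running the invariant tests to confirm the agent degrades gracefully. A sketch:

import random

def flaky(tool, failure_rate=0.2, seed=None):
    # Wrap a tool so it intermittently raises, simulating timeouts or flaky APIs.
    rng = random.Random(seed)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError(f"injected failure in {tool.__name__}")
        return tool(*args, **kwargs)
    return wrapper

# e.g. agent.tools["get_calendar"] = flaky(get_calendar, failure_rate=0.3, seed=42)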
Conclusion: Testing Under Uncertainty
Traditional software testing assumes that you can specify correct behavior in advance. However, agent testing assumes you cannot.
You cannot write enough tests to prove that an agent works. You can only build confidence through the layers described above: deterministic tool tests, invariants and properties, behavioral evaluation, and production observability. That means shipping agents that might fail in ways you haven't anticipated, and accepting that your test suite will never give you complete coverage. This is uncomfortable but also liberating, because you stop trying to enumerate every possible input and output.
The teams that succeed with agents won't be the ones with the most test cases, but the ones that operate comfortably under uncertainty, learn from failures, and iterate quickly on what they observe in production. The traditional QA testing strategy is fundamentally wrong here, because applying deterministic thinking to non-deterministic systems is counter-productive.
