Key Takeaways
- Traditional tests rely on enumerated examples (test cases) written by the programmer – such test cases miss a whole class of bugs, the ones their authors are unaware of – the unknown unknowns.
- Reliance on pre-determined examples leads to accidental quality – while example-based tests focus on reproducibility of known bugs, generative tests focus on discoverability of unknown bugs.
- Generative tests discover bugs by programmatically generating inputs instead of relying on the programmer to list examples, allowing for larger coverage and better detection of unforeseen bugs.
- Generative tests shrink failing inputs to a minimal shape, providing crucial feedback that helps the programmer with root cause analysis and the eventual bug fix.
- Generative testing provides a better mental model for how we assess the quality of our software. It encourages us to think more deeply about the fundamental properties of our system, instead of handpicking examples.
Automated tests are the cornerstone of modern software development. They ensure that every time we build new functionalities, we do not break existing features our users rely on.
Traditionally, we tackle this with example-based tests. We list specific scenarios (or test cases) that verify the expected behaviour. In a banking application, we might write a test to assert that transferring $100 to a friend’s bank account changes their balance from $180 to $280.
However, example-based tests have a critical flaw. The quality of our software depends on the examples in our test suites. This leaves out a class of scenarios that the authors of the test did not envision – the “unknown unknowns”.
Generative testing is a more robust method of testing software. It shifts our focus from enumerating examples to verifying the fundamental invariant properties of our system.
Invariants: The Unchanging Properties of Systems
While traditional example-based tests rely on the exhaustiveness of the test cases, generative (also called property-based) tests start with defining the most important properties of a system that must always hold true. These properties are also called invariants.
Every system has invariants. Following are some invariants in different systems:
- In an API endpoint, we never want to send stack traces in the response
- In a banking application, an account transfer must not change the total amount in the bank
- In a meetings application, one person cannot be in two meetings at the same time
- In a sorting algorithm, every element in the sorted array should be less than or equal to the next
Once these properties are defined, generative tests try to break the property with randomized inputs. The goal is to ensure that the invariants of the system are not violated for a wide variety of inputs. Essentially, it is a three-step process (see the sketch after this list):
- Given a property (aka invariant)
- Generate varying inputs
- To find the smallest input for which the property does not hold
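As a concrete illustration, the sorting invariant from the earlier list translates almost directly into such a test. The sketch below uses jqwik (the library used throughout this article); the int[] inputs and the choice of the JDK's Arrays.sort as the code under test are assumptions made purely for illustration:

import java.util.Arrays;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;

import static org.assertj.core.api.Assertions.assertThat;

class SortingProperties {

    @Property
    void everyElementIsLessThanOrEqualToTheNext(@ForAll int[] values) {
        // 1. Given a property: sorted output must be ordered
        // 2. jqwik generates varying int[] inputs via @ForAll
        int[] sorted = values.clone();
        Arrays.sort(sorted);

        // 3. The engine searches for (and shrinks) any input that violates the property
        for (int i = 1; i < sorted.length; i++) {
            assertThat(sorted[i - 1]).isLessThanOrEqualTo(sorted[i]);
        }
    }
}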
As opposed to traditional test cases, inputs that trigger a bug are not written in the test – they are found by the test engine. That is crucial because finding counterexamples to our own code is neither an easy nor an accurate process. Some bugs simply hide in plain sight – even in basic arithmetic operations like addition.
Bugs Hide In Plain Sight – Even In Basic Arithmetic
Suppose we have a method that we use for adding amounts of money. The implementation is quite trivial:
public static double add(double x, double y) {
    return x + y;
}
Normally, a test would involve a list of cases like the following:
@Test
void exampleBasedAdditionTest() {
    assertEquals(1.5, add(1.5, 0));
    assertEquals(2.1, add(1.1, 1));
    assertEquals(5.2, add(2.1, 3.1));
    assertEquals(0, add(-1, 1));
    assertEquals(-101.8, add(-100, -1.8));
    assertEquals(-101, add(-1, -100));
    assertEquals(230.5, add(130.5, 100));
    assertEquals(231, add(100.6, 130.4));
}
All of these cases pass and there is no evidence of a bug. Crucially though, such absence of evidence is not evidence of absence of bugs. What we verified here is that addition works correctly only for the parameters we passed to it – i.e., we just showed that the add method works for the eight examples we handpicked for the test.
Compare this to a generative test in which we start by defining the fundamental properties of addition:
- Adding zero is a no-op (identity): a + 0 = a
- Inverse of addition gives back the identity: a + (-a) = 0
- Addition is commutative: a + b = b + a
- Addition is associative: a + (b + c) = (a + b) + c
Using the library Jqwik, we can write a generative test for addition:
@Property
void propertyBasedAdditionTest(
        @ForAll
        double a,
        @ForAll
        double b
) {
    assertEquals(a, add(a, 0), "Additive Identity");
    assertEquals(0, add(a, -a), "Additive Inverse");
    assertEquals(add(a, b), add(b, a), "Commutativity");
    assertEquals(add(1, add(a, b)), add(add(1, a), b), "Associativity");
}
In contrast to the first test, we did not enumerate example inputs – the system did it for us through the “@ForAll” annotation – which literally translates to “for all values of a double a”. Using different combinations of inputs, the property-based test scans the problem space and finds examples for which the defining properties of addition do not hold.
In fact, this test fails. After 151 inputs, the generative test finds examples where the property of associativity of addition does not hold:
|-------------------jqwik-------------------
tries = 151 | # of calls to property
checks = 151 | # of not rejected calls
generation = RANDOMIZED | parameters are randomly generated
after-failure = RANDOM_SEED | use a new random seed
when-fixed-seed = ALLOW | fixing the random seed is allowed
edge-cases#mode = MIXIN | edge cases are mixed in
edge-cases#total = 49 | # of all combined edge cases
edge-cases#tried = 8 | # of edge cases tried in current run
seed = 5966123421694918588 | random seed to reproduce generated values
Shrunk Sample (40 steps)
------------------------
a: 0.02
b: 0.11
Original Sample
---------------
a: -1.2356852729401996E16
b: 149.5
Original Error
--------------
org.opentest4j.AssertionFailedError:
expected: <-1.2356852729401844E16> but was: <-1.2356852729401846E16>
The root cause of this bug is in how floating point numbers are represented in binary. Some numbers like 0.11 are not exactly representable in binary form and thus are rounded off. The property of associativity fails depending on where the rounding happens, i.e., where the parentheses lie:
- 1 + (0.02 + 0.11) = 1.13
- (1 + 0.02) + 0.11 = 1.1300000000000001
This is a good reason not to use floating point representation for monetary data. It is unlikely that a traditional test suite would include this example, unless the author was already aware of the issues with double arithmetic. But if they were aware of the issue, they probably would have already addressed it.
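A common remedy is to represent monetary amounts with BigDecimal instead of double. The following sketch (not part of the article's test suite) replays the shrunk sample with exact decimal arithmetic, for which both groupings produce the same result:

import java.math.BigDecimal;

class ExactMoneyAddition {

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("0.02");
        BigDecimal b = new BigDecimal("0.11");
        BigDecimal one = BigDecimal.ONE;

        // With exact decimal arithmetic, associativity holds for these values
        System.out.println(one.add(a.add(b))); // prints 1.13
        System.out.println(one.add(a).add(b)); // prints 1.13
    }
}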
Handpicked Test Cases Lead To Accidental Quality
Generative tests found the above bug without prior knowledge of its existence. While example-based tests focus on asserting certain test cases, generative tests explore the problem space uniformly, uncovering bugs that are non-obvious. It is not hard to imagine that when such bugs, however unintuitive, are shipped to users, they hurt the reputation of our product.
With example-based tests, we verify that our code is correct for a specific predefined set of input conditions. This is what we call accidental quality – the quality of our software is predicated on the selection of its test cases. As authors of the source code or as dedicated test engineers, we only write tests for the scenarios that we can think of. The reliance on a person’s knowledge highlights the fallacy of example-based testing – it does not explore the space of undiscovered bugs, or the unknown unknowns.
Example-based tests rely on handpicked test cases, giving rise to accidental quality. Undiscovered bugs are shipped to production because the author of the test cases did not know about them.
Searching and Not Just Testing for Bugs
Instead of testing the source code against a list of predefined test cases, generative testing searches the problem space methodically. It starts its exploration by generating numerous random inputs that become more and more varied over time.
Unlike traditional tests, which use a fixed list of examples, generative tests explore the problem space by creating and evolving a wide variety of random inputs – essentially searching for hidden bugs.
Once it finds an input for which a property does not hold, generative testing shrinks the random inputs to provide the smallest input for which the property still does not hold. This is key feedback that helps the programmer identify the root cause.
Consider the original and shrunk samples for which the property of associativity of addition did not hold – the original sample was much harder to understand than the shrunk sample:
- Original sample: -1.2356852729401996E16, 149.5
- Shrunk sample: 0.02, 0.11
Generative Testing In The Real World: A Meetings API
So far we have looked at the concept of generative testing. Let us now apply it to a real application designed to schedule meetings. It is a typical microservice whose API lets users:
- Create meetings
- Invite others to meetings
- Accept or reject meeting invitations
The application aims to ensure that a person is not asked to be in two meetings at the same time. In the image below, if there is an existing meeting in yellow, then none of the red meetings can be scheduled. The source code for the application and its tests are available here.
A person cannot be in two meetings at the same time. Our example application allows a new meeting to be created only if it does not overlap any existing meeting. All of the meetings in red are disallowed because they overlap with the existing meeting.
Fixing Bugs Is Easier Than Finding Bugs
Even with a simple system like this, the user can get unexpected errors like a NullPointerException. As shown in the image below, a sample response containing an error provides no explanation as to why this error occurs. Therefore the user has no way to fix the problem.
Even simple applications have multiple fault lines – in this case, the user forgot to pass a duration parameter but the response was a 5xx error that did not explain what was missing. Instead, the “internal server error” message seems to suggest that the application had an internal issue.
While a NullPointerException in an API response is easy to fix, finding every problem like this is non-trivial. In OSDI 2014, in the paper titled “Simple Testing Can Prevent Most Critical Failures”, the authors listed two important findings:
- A majority of production failures could be reproduced by tests
- Most failures require a sequence of user actions to manifest themselves
The authors studied industry-standard systems like HBase, Cassandra and Redis and identified that such preventable bugs actually do ship to production. The findings raise two important questions:
- We know tests prevent bugs, but how do we know if we have covered all cases?
- If multiple user actions are required, how many possible combinations do we have to test to ensure correctness?
For both of these questions, the limitation stems from the programmer’s inability to enumerate all possible test cases – it becomes a combinatorial search problem. Generative tests solve these two problems in the following ways:
- Identifying properties or invariants of systems that must not be violated (for example, we should never return a stack trace to the user)
- Generating arbitrarily complex combinations of inputs to the system to search for inputs that violate these properties
In the following sections, we look at how generative tests can help build more robust microservices. Like a typical microservice, in the Quick Meetings sample application there are three layers: presentation, business logic and the database layer. We now apply generative testing to each of these layers.
Searching for Bugs in the Presentation Layer
In the presentation layer we want to ensure that the application never returns what the user does not expect or understand.
This branch of the repository introduces a generative test for the following properties:
- The server never returns a 5xx error
- The response is always valid JSON
We would like these two properties to hold for all five endpoints in our application:
There are five simple endpoints in the demo application that serve the creation of users, meetings and invites, and the accepting or rejecting of received invitations.
The test below does this by generating different combinations of HTTP methods, URL paths, headers and bodies and then validating the two properties above:
@Property
void responsesAreAlwaysValidJson(
        @ForAll("methods")
        String method,
        @ForAll("paths")
        String path,
        @ForAll("contentTypes")
        String contentTypeHeader,
        @ForAll("contentTypes")
        String acceptHeader,
        @ForAll("bodies")
        String body
) {
    var response = httpRequest(method, path, acceptHeader, contentTypeHeader, body);
    var responseBody = response.getBody();
    int status = response.getStatusCode().value();
    var failureMsg = "Status: %s, Body: %s".formatted(status, responseBody);
    assertThat(responseBody).withFailMessage(failureMsg).isNotBlank();
    assertThatValidJson(responseBody, failureMsg);
    assertThat(status).withFailMessage(failureMsg).isLessThan(500);
}
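The httpRequest and assertThatValidJson helpers live in the project repository. As an indication of the intent, a JSON-validity check along these lines (a sketch, not the project's exact code) can parse the body with Jackson and fail the property on a parse error:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

import static org.assertj.core.api.Assertions.fail;

static void assertThatValidJson(String responseBody, String failureMsg) {
    try {
        // readTree throws for anything that is not well-formed JSON
        new ObjectMapper().readTree(responseBody);
    } catch (JsonProcessingException e) {
        fail(failureMsg, e);
    }
}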
Bug 1: Invalid JSON Response
The above test fails with a JSONParseException because the server is returning an HTML response:
Original Error
--------------
com.fasterxml.jackson.core.JsonParseException:
Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (String)"<html><body><h1>Whitelabel Error Page</h1><p>This application has no explicit mapping for /error, so you are seeing this as a fallback.</p><div id='created'>Sat May 24 14:11:11 CEST 2025</div><div>There was an unexpected error (type=Method Not Allowed, status=405).</div></body></html> ; line: 1, column: 1]
Following is the shrunk counterexample that the generative test finds, for which the server returns HTML:
Shrunk Sample (1 steps)
-----------------------
method: "GET"
path: "/user"
contentTypeHeader: "text/html"
acceptHeader: "text/html"
body:
"{
"userId": 1,
"name": "A",
"duration": {
"from": {
"date": "2025-06-09",
"time": "12:40"
},
"to": {
"date": "2025-06-09",
"time": "12:40"
}
},
"timezone": "Asia/Kolkata"
}
"
The root cause for this failure is easy to miss. The request contains “text/html” as the accept-header. In Spring, if we write an endpoint in a controller, it will always return the data that our code produces. However, when a controller is not defined for a specific path/method, Spring generates an error response in accordance with the accept-header, which in this case is HTML, violating our property that an API endpoint always returns valid JSON.
Although in this article we focus on the detection of bugs rather than their resolution, this particular bug can be fixed by adding a servlet configuration to override the request’s accept-header.
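One way to do this in Spring MVC (a sketch of the general approach, not necessarily the exact configuration used in the repository) is to configure content negotiation so that the incoming accept-header is ignored and JSON is the default response type:

import org.springframework.context.annotation.Configuration;
import org.springframework.http.MediaType;
import org.springframework.web.servlet.config.annotation.ContentNegotiationConfigurer;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

// Hypothetical configuration class: respond with JSON regardless of the Accept header
@Configuration
class JsonOnlyContentNegotiationConfig implements WebMvcConfigurer {

    @Override
    public void configureContentNegotiation(ContentNegotiationConfigurer configurer) {
        configurer
            .ignoreAcceptHeader(true)                        // do not honour "Accept: text/html"
            .defaultContentType(MediaType.APPLICATION_JSON); // responses, including errors, default to JSON
    }
}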
Bug 2: 5xx Responses For Unhandled Exceptions
An API should never knowingly return a 5xx error status in its response – especially not if the error is fixable by the user. However, the same generative test finds a different example that violates this expectation:
Status: 500, Body:
{"timestamp":"2025-08-15T05:00:51.642+00:00","status":500,"error":"Internal Server Error","path":"/meeting"}
As before, the test engine finds a shrunk sample for which this failure happens:
Shrunk Sample (6 steps)
-----------------------
method: "POST"
path: "/meeting"
contentTypeHeader: "application/json"
acceptHeader: "text/html"
body: "{ "meetingId": 1, "invitees": [] } "
The server logs include the following stack trace, indicating that a parameter called “duration” was expected in the input but was not passed. Instead of returning a 400 error with a message explaining this, the server returns a 500 with a generic payload – wrongly indicating that the problem was on the server side.
Cannot invoke "me.mourjo.quickmeetings.web.dto.MeetingDuration.from()" because the return value of "me.mourjo.quickmeetings.web.dto.MeetingCreationRequest.duration()" is null
To fix this, we need to add global exception handlers that construct proper error messages.
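A minimal sketch of such a handler – the class name, exception mapping and message below are illustrative rather than the repository's exact code:

import java.util.Map;

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Hypothetical global handler: translate errors caused by missing request fields
// into a 400 response with a descriptive message instead of a generic 500.
@RestControllerAdvice
class ApiErrorHandler {

    @ExceptionHandler(NullPointerException.class)
    ResponseEntity<Map<String, String>> handleMissingRequestFields(NullPointerException e) {
        return ResponseEntity
            .badRequest()
            .body(Map.of("error", "A required field is missing from the request"));
    }
}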
Bug 3: Meetings Starting in Daylight Savings Gaps
The last test for the presentation layer is to make sure the API endpoints accept valid dates as parameters. The following test ensures that for valid parameters, the response status is 2xx.
@Property
void validMeetingRangeShouldReturn2xx(
        @ForAll("meetingArgs")
        MeetingArgs meetingArgs
) {
    createMeetingAndExpectSuccess(
        meetingArgs.fromDate(),
        meetingArgs.fromTime(),
        meetingArgs.toDate(),
        meetingArgs.toTime(),
        meetingArgs.timezone()
    );
}
However, this test also fails. The failure is triggered during daylight savings transitions – when a meeting starts during a daylight savings “gap”.
In some timezones, there are “gaps” due to daylight savings. For instance, on 30 March 2025, at 2 AM, clocks in the Netherlands were turned forward 1 hour to 3 AM. The hour between 2 AM and 3 AM does not exist and is called “a gap”.
In the example that the generative test finds below, the meeting starts at 2:30 AM and ends at 3:00 AM on 30th March 2025. However, since 2:30 AM falls in this “gap”, the default behaviour in Java date-time calculation is to move the clock forward, as stated in the documentation: “the resulting zoned date-time will have a local date-time shifted forwards by the length of the Gap”.
Although the meeting creation parameters are valid, the requested start-time of 2:30 AM falls in the gap and is considered by the Java standard library to be 3:30 AM, which is after the meeting’s end-time, triggering the bug:
Shrunk Sample (6 steps)
-----------------------
meetingArgs:
MeetingArgs[fromDate=2025-03-30, fromTime=02:30:00, toDate=2025-03-30, toTime=03:00:00, timezone=Europe/Amsterdam]
These kinds of peculiar cases are nearly impossible to manually enumerate in tests. This particular case would only happen when a meeting starts in a daylight savings gap but ends after the gap and if the re-calculated start-time is after the end-time.
Interestingly, this is not technically a bug because there is no obvious technical fix to daylight savings gaps. We have to decide how our application handles the situation. An option could be to disallow creation of meetings in these gaps. Alternatively, we could show a warning with the effective start-time of the meeting taking into account the clock shift.
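If we chose the first option, the gap itself is straightforward to detect with the Java standard library. The following sketch (not part of the sample application) checks whether a requested start time falls in a daylight savings gap:

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransition;

// Returns true if the given local date-time does not exist in the given timezone
// because it falls inside a daylight savings gap.
static boolean fallsInDaylightSavingsGap(LocalDateTime localDateTime, ZoneId zone) {
    ZoneOffsetTransition transition = zone.getRules().getTransition(localDateTime);
    return transition != null && transition.isGap();
}

// Example: fallsInDaylightSavingsGap(LocalDateTime.parse("2025-03-30T02:30"),
//          ZoneId.of("Europe/Amsterdam")) returns true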
By highlighting such edge-cases, generative tests pave the way to better software. Finding out that such a case can happen and that we need to design an experience for it is the first step to building more reliable and predictable software. Generative tests aid the discovery of such cases that otherwise could easily go unnoticed.
Searching for Bugs in the Database Layer
In the database interaction layer, we want to make sure that the queries we write are going to work for all kinds of possible data combinations in the database. As a meetings application, the system must not allow a person to be in two meetings at the same time. Because we are storing meetings in a database, we need to write a SQL query for this validation.
Bug 4: SQL Query Miscalculating Meeting Overlaps
The following query checks if a new meeting being created (:from and :to) overlaps with an existing meeting. But it has a bug – with no prior knowledge about the bug, it is quite hard to catch it at a glance, even more so to write a test for it.
SELECT ...
FROM ...
WHERE (
    (:from >= existing_meeting.from_ts AND :from <= existing_meeting.to_ts)
    OR
    (:to >= existing_meeting.from_ts AND :to <= existing_meeting.to_ts)
)
The logical reasoning behind this query is that if a new meeting starts or ends in between an existing meeting, then the new meeting is an overlapping meeting. However, it does not cover all possibilities. The bug in the above SQL query only happens when the new meeting starts before an existing meeting and ends after the existing meeting – i.e., a complete overlap:
There are four cases for a new meeting to overlap with an existing meeting. The buggy SQL query missed the fourth case.
With the above diagram, it is easy to notice that a fourth case exists. But when the query was originally written, the author may not have had this clarity, which is how the bug was introduced in the first place. The query works correctly for three out of the four scenarios above. So while it is not an entirely wrong query, it is only partially correct, demonstrating the difficulty of solving combinatorial problems – the more parameters in a query, the harder it is to ensure all cases are covered.
The following generative test found the bug in the query. For all meeting start times and durations, it programmatically verifies that the result from the database correctly detects overlaps.
@Property(afterFailure = AfterFailureMode.RANDOM_SEED)
void overlappingMeetingsCannotBeCreated(
        @ForAll
        @DateTimeRange(min = "2025-02-12T10:00:00", max = "2025-02-12T11:59:59")
        LocalDateTime meeting1Start,
        @ForAll
        @IntRange(min = 1, max = 60)
        int meeting1DurationMins,
        @ForAll
        @DateTimeRange(min = "2025-02-12T10:00:00", max = "2025-02-12T11:59:59")
        LocalDateTime meeting2Start,
        @ForAll
        @IntRange(min = 1, max = 60)
        int meeting2DurationMins
) {
    var debbie = userService.createUser("debbie");
    var meeting1End = meeting1Start.plusMinutes(meeting1DurationMins);
    var meeting2End = meeting2Start.plusMinutes(meeting2DurationMins);

    // Create the first meeting
    createMeeting("Debbie's meeting", meeting1Start, debbie, meeting1End);

    // Ask the repository if the second meeting has any overlaps
    var overlappingMeetingsDb = findOverlaps(meeting2Start, debbie, meeting2End);

    // Verify programmatically if there is an overlap - check the query result matches
    if (doIntervalsOverlap(meeting1Start, meeting1End, meeting2Start, meeting2End)) {
        assertThat(overlappingMeetingsDb.size()).isEqualTo(1);
    } else {
        assertThat(overlappingMeetingsDb).isEmpty();
    }
}
The test relies on the doIntervalsOverlap method to check if the query results are correct. It takes two pairs of meeting start and end times to programmatically check if either meeting starts while the other is ongoing:
- Meeting-1 starts while meeting-2 is ongoing
- Meeting-2 starts while meeting-1 is ongoing
Writing this logic in a SQL query accounting for all scenarios of start and end times for meeting-1 and meeting-2 is harder than imperatively verifying if a given pair of meetings overlap.
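A sketch of what such a check could look like (the actual helper is in the project repository):

import java.time.LocalDateTime;

// Two meetings overlap if each starts before (or exactly when) the other ends,
// i.e., one of them starts while the other is still ongoing.
static boolean doIntervalsOverlap(LocalDateTime meeting1Start, LocalDateTime meeting1End,
                                  LocalDateTime meeting2Start, LocalDateTime meeting2End) {
    return !meeting1Start.isAfter(meeting2End) && !meeting2Start.isAfter(meeting1End);
}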
The test fails and provides us with the following shrunk sample. Meeting-1 is an existing meeting and meeting-2 is allowed to be created although it completely overlaps meeting-1:
Shrunk Sample (9 steps)
-----------------------
meeting1Start: 2025-02-12T10:00:01
meeting1DurationMins: 1
meeting2Start: 2025-02-12T10:00
meeting2DurationMins: 2
The fix is to change the last AND clause in the SQL query for detecting overlaps (:from and :to are the starting and ending times of the new meeting about to be created):
| Original Clause | Updated Clause |
| --- | --- |
| :to >= existing_meeting.from_ts AND :to <= existing_meeting.to_ts | :to >= existing_meeting.from_ts AND :from <= existing_meeting.to_ts |
Arguably, the original clause, although incorrect, is more intuitive than the correct fixed clause. The original clause intended to detect overlapping meetings by checking if a meeting starts or ends during an existing meeting. At first sight this can seem valid and thus may have even been missed in code reviews. Our human tendency is often to prefer an understandable solution over a more correct one. Since generative tests do not require manual listing of test cases, they bypass the programmer’s cognitive biases.
The Combinatorial Problem of Multiple Users Taking Multiple Actions
Some of the most difficult bugs to catch through example-based tests are those that have a large number of possibilities – like scenarios involving multiple users and actions.
Bug 5: Accepting a Meeting Creates an Overlap
We first model users interacting with the system like they would in the real world. Our application has four actions: create, invite, accept and reject. Many users can exist in the system, using any of these operations.
With the help of Jqwik’s action chain, we start with an initial empty database and ask the system to combine different actions on behalf of one of its users. The details of this modelling are available in the project repository.
@Provide
Arbitrary<ActionChain<MeetingState>> meetingActions() {
    return ActionChain.startWith(this::init)
        .withAction(new CreateAction(users))
        .withAction(new InviteAction(users))
        .withAction(new AcceptInviteAction(users))
        .withAction(new RejectInviteAction(users));
}
With the above action chain, Jqwik generates different permutations of users and actions. We then verify that no sequence comprising different users and actions violates the system’s invariant that there can never be any overlapping meetings:
@Property
void noOperationCausesAnOverlap(
        @ForAll("meetingActions")
        ActionChain<MeetingState> chain
) {
    chain
        .withInvariant(MeetingState::assertNoUserHasOverlappingMeetings)
        .run();
}
The test fails – it finds a sequence of user actions that causes the overlapping invariant to fail:
Shrunk Sample (26 steps)
------------------------
chain: ActionChain[NOT_RUN]: 4 max actions
Invariant failed after the following actions: [
Inputs{action=CREATE, user=alice, from=2025-06-09T10:21Z, to=2025-06-09T10:22Z}
Inputs{action=INVITE, user=bob, meetingIdx=0}
Inputs{action=CREATE, user=bob, from=2025-06-09T10:21Z, to=2025-06-09T10:22Z}
Inputs{action=ACCEPT, user=bob, meetingIdx=0}
]
The above output highlights a minimal set of cross-user actions that triggers the bug:
- Alice creates meeting 0
- Bob gets invited to Alice’s meeting 0
- Bob creates a meeting 1 that overlaps with Alice’s meeting 0 (so far, it is okay since Bob has not confirmed that he will attend Alice’s meeting 0)
- Bob accepts the invitation to Alice’s meeting 0 (now this is a problem because the system allowed Bob to be in two meetings at the same time – i.e., Alice’s meeting 0 and Bob’s own meeting 1)
Many Inputs Can Trigger the Same Bug
There are other, more nuanced cases that trigger the same bug. For example, the following actions involve three users, instead of two in the above example. The same no-overlaps invariant is violated for different input conditions, and all of them are caught by the same single test:
Invariant failed after the following actions: [
Inputs{action=CREATE, user=alice, from=2025-06-09T10:21Z, to=2025-06-09T10:22Z}
Inputs{action=CREATE, user=bob, from=2025-06-09T10:21Z, to=2025-06-09T10:22Z}
Inputs{action=INVITE, user=charlie, meetingIdx=1}
Inputs{action=INVITE, user=charlie, meetingIdx=0}
Inputs{action=ACCEPT, user=charlie, meetingIdx=1}
Inputs{action=ACCEPT, user=charlie, meetingIdx=0}
]
The above output highlights another minimal set of actions that triggers the bug:
- Alice creates a meeting 0
- Bob creates meeting 1 at the same time as Alice’s meeting 0
- Charlie gets invited to Bob’s meeting 1
- Charlie gets invited to Alice’s meeting 0
- Charlie accepts the invitation to Bob’s meeting 1
- Charlie accepts the invitation to Alice’s meeting 0 (this is a problem because now Charlie has confirmed that he will attend two meetings at the same time)
Debugging Faster With a Higher Signal-to-Noise Ratio
Note that bugs only happen for a subset of conditions in a system. The above shrunk samples are easy enough to read and explain in this article because the test engine went through the process of removing the noise from the randomly generated inputs after a failure was detected.
Before the shrinking happened, the test found its first failure, which involved many more actions and is much harder to debug. Note how it involves the REJECT action, which does not appear in either of the shrunk samples we saw before – this is precisely because the REJECT action does not contribute to overlapping meetings.
Invariant failed after the following actions: [
Inputs{action=CREATE, user=bob, from=2025-06-09T10:35Z, to=2025-06-09T10:56Z}
Inputs{action=CREATE, user=bob, from=2025-06-09T11:00Z, to=2025-06-09T11:52Z}
Inputs{action=INVITE, user=alice, meetingIdx=0}
Inputs{action=REJECT, user=charlie, meetingIdx=1}
Inputs{action=ACCEPT, user=bob, meetingIdx=1}
Inputs{action=REJECT, user=charlie, meetingIdx=1}
Inputs{action=CREATE, user=bob, from=2025-06-09T10:38Z, to=2025-06-09T11:38Z}
Inputs{action=INVITE, user=charlie, meetingIdx=0}
Inputs{action=ACCEPT, user=bob, meetingIdx=0}
Inputs{action=REJECT, user=alice, meetingIdx=1}
Inputs{action=CREATE, user=charlie, from=2025-06-09T11:10Z, to=2025-06-09T11:15Z}
Inputs{action=INVITE, user=bob, meetingIdx=1}
Inputs{action=CREATE, user=alice, from=2025-06-09T10:21Z, to=2025-06-09T11:21Z}
Inputs{action=ACCEPT, user=bob, meetingIdx=2}
Inputs{action=REJECT, user=alice, meetingIdx=2}
Inputs{action=REJECT, user=charlie, meetingIdx=1}
Inputs{action=ACCEPT, user=alice, meetingIdx=0}
]
The test engine was able to detect this noise and remove the REJECT action from the shrunk examples we saw before, saving precious time for the programmer.
Bug 6: Extending Invariants – Rejections Cause Empty Meetings
Since we now have a way to multiplex user actions, we can extend our test suite with new invariants: we should not allow meetings to exist where there is no attendee. Note how we only define a new invariant this time, reusing the existing action chain:
@Property
void noOperationCausesEmptyMeetings(
        @ForAll("meetingActions") ActionChain<MeetingState> chain
) {
    chain
        .withInvariant(MeetingState::assertEveryMeetingHasOneConfirmedAttendee)
        .run();
}
This test fails with the following shrunk example highlighting the simplest case that can lead to an empty meeting:
Invariant failed after the following actions: [
Inputs{action=CREATE, user=alice, from=2025-06-09T10:21Z, to=2025-06-09T10:22Z}
Inputs{action=REJECT, user=alice, meetingIdx=0}
]
Invariant-based testing allows us to question the fundamental nature of the systems we build, and not get lost in the labyrinth of test cases and their variations. We initially set out to ensure overlapping meetings could not exist. Having achieved that, we were able to make the application even more robust by extending its invariants quite easily, relying on generative tests.
Tradeoffs With Example-Based Tests
While generative testing helps discover unknown bugs, like anything in software engineering, there are some caveats to be aware of:
- Cost of long runtimes: Generative tests take more time to run than example-based tests because of their exhaustive search. In a CI build, this runtime cost will slow down the build process if these tests are run on every commit.
- Non-reproducibility: Generating inputs from a random seed leads to non-deterministic failures. Sometimes a very rare failing condition will be detected by the test suite but not be reproducible in a different test run. Libraries like Jqwik report the seed used in a particular test run so that the run can be reproduced (see the sketch after this list), but because the detection itself is non-deterministic, this may require repeated investigation effort.
- Learning curve: Properties can range from narrow ones like “usernames should be unique” to broad ones like “the amount of money in the bank does not change even after account-to-account transfers”. The former is a property that can be ensured by a database and reaps little benefit as a test, while the latter is a good candidate for generative tests. Defining properties effectively requires time and often has a learning curve. Moreover, advanced features like modelling multiple users interacting with the system also involve a steep learning curve and maintenance effort.
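For instance, the seed reported in a failing jqwik run can be pinned on the property while investigating, so that the same inputs are regenerated on every run (the seed value below is the one reported in the earlier addition failure):

// Pinning the reported seed makes the failing run reproducible while debugging
@Property(seed = "5966123421694918588")
void propertyBasedAdditionTest(@ForAll double a, @ForAll double b) {
    // ... same assertions as before ...
}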
WYSIATI: Unknown Unknowns Cannot Be Tested With Examples
Despite its caveats, the greatest benefit of having generative tests is to counter human biases. When we rely solely on example-based tests, we fall victim to a cognitive bias known as “what you see is all there is” (WYSIATI). Popularized by Daniel Kahneman, it highlights the human tendency to make judgements based only on the information immediately available to us, overlooking and often ignoring the possibility of unknown unknowns.
The fundamental weakness of example-based tests is that they require prior knowledge or intuition of a bug’s existence. A programmer from India (a non-daylight-savings country) might never think of writing a test case for daylight savings gaps, resulting in a system vulnerable to bugs that were not intuitive to its authors. The requirement that bugs need to be anticipated before they can be tested is paradoxical – if we knew a priori of a bug’s existence, we would not have introduced the bug in the first place.
Generative tests counter the flaws of human knowledge by not requiring an enumeration of test cases as their starting point. They start with a specification of the system, i.e., the system’s properties or invariants. Since test cases are generated by the test engine, they bypass the mental biases of human-written tests. The true power of generative tests lies in aiding the programmer to find the unknown unknowns of the system and thus ultimately preventing the accidental quality of the systems we build.
