Scaling API Independence: Mocking, Contract Testing & Observability in Large Microservices Environments


Transcript

Tom Akehurst: This talk is about the promise of microservices versus the reality. The reason we choose to build systems in this way is fundamentally about being able to work in a decoupled and independent way: for teams to be able to build and ship things and add value to their business in a way that’s not highly dependent on other people inside or outside their organization. I suspect that those of us who have done this for a while realize that the reality doesn’t always measure up, that we often find ourselves in heavily interconnected environments where dependency still exists and coupling still exists.

The result of this, in terms of the engineering experience and the impact on engineering productivity, is that we’re not doing a lot of the things we want to be doing. We’re fighting broken environments. We’re fighting with data not being correct in the environments we’re working in. We’re waiting for people to ship new API features that we need before we can move forward. There’s a plethora of problems that leads to things moving slowly and everyone feeling frustrated and not enjoying themselves very much. What I’m going to try and convince you of in this talk is that using API mocking or API simulation can help solve these problems, even when used in the large-scale microservices systems that we build nowadays, but it requires some additional supporting techniques and thinking in order to make it work well.

I’ve spent lots of years building software. It’s often been in enterprise environments where there are lots of integration problems to solve. I found myself wrangling with APIs and fighting with them, in some cases, a lot of the time. For a number of years, I’ve been working on this open-source project called WireMock, which is an API mocking tool, as you might expect. I’m also the co-founder of a company building a cloud product on top of WireMock. I’m not going to talk about that. Suffice to say that the things I’m talking about in this are things that we’re actively researching and building and that we’re interested in. I’m going to mainly talk about the tools and the techniques in the abstract. When I do refer to tools, including WireMock, they’re all going to be open-source, things you can go and play around with easily yourself.

Decoupling Strategies

I was tempted to call this slide coping strategies rather than decoupling strategies. The problems we have when building these systems fundamentally stem from being coupled to other bits of the system, usually via APIs. Since these are our sources of pain, there are a number of strategies that I’ve seen organizations use in order to try and mitigate this in various ways. It’s not always a true decoupling strategy, but I’m going to go through a few of these first before getting into the mocking part of the conversation. One is what I refer to as process and gatekeeping. I think this is definitely a coping strategy rather than a genuine decoupling strategy. It’s really where you’ve taken your old-world environment management approach and grafted it onto a microservices environment.

This is where you have a small number of fixed environments that are shared by a lot of people. They’re very heavily contended, and everyone’s breaking them all the time. So what do you do with them? You put a load of process and human beings in front of them and you control very strictly what can be deployed, what data can be put in there, who can use them at a given moment and all that kind of thing. WireMock was born into this environment. I was in a team in the middle of a very big digital transformation, and it was kind of microservices, but before either of those two terms existed; it was basically that. Our product needed to talk to lots of different APIs that were being built by different bits of the organization and other vendors and all that kind of thing. It was a mess.

Nobody really knew how to do this stuff, particularly back then. We were constantly fighting with all the things that I talked about, so environments not working, APIs being shipped that didn’t conform to spec, waiting weeks to get the right data in there, all that kind of stuff. The powers that be on this project implemented this very strict regime, partly with automation and partly with human gatekeepers to try and mitigate the bad quality. This slowed everything down to a crawl. It was incredibly frustrating. The irony was that it didn’t actually produce the outcome desired. It persuaded everybody to game the system and to try and get things through this labyrinthine CD pipeline and into an environment, rather than really focusing on fundamental quality. It didn’t even work.

A slight evolution on this approach that I’ve seen a lot of organizations take is the “let’s build lots of environments” one. If we give every team their own environment, which is a full replica of production, then they can do what they like with it. They can put their own data in it. They can deploy code to it. They can switch it off and on again, whatever. It doesn’t affect everybody else, and the gatekeeping goes away. Then you have other problems, most obviously operational cost. I’ve spoken to a number of organizations who are paying more in aggregate for their dev and test environments than they are to run their production environment. They have literally dozens of these things throughout the organization. It’s operationally complex. You’ve still got to have a platform that can support these things.

Sysadmins and platform engineers end up having a lot more work to do in these kinds of environments. There’s a cognitive overload element to it as well. If you’re running an environment and you’ve deployed everybody else’s services into it, just so you can get your work done, you have to know a lot about how those services work. You have to know about their implementation details as well as their interfaces, if you want to get data into them, if you want to fix them when they fall over, deploy new versions, all of this stuff. It takes you a long way out of your own work when you’re figuring out how to do this stuff. Again, not ideal.

Then there’s a further evolution on this approach of using fully integrated environments, which I’m calling smart environments. This isn’t a term of art, I don’t think, but I’m using it as a bit of a catch-all for a number of things. There’s a newer class of products that help with this, where you can build and destroy ephemeral environments very quickly from configuration. There’s also this remocal strategy. Daniel is really the expert on this, if you want to find out more. This is about composing environments from bits of stuff running locally, close to you, and some of it shared, and composing the two together so that you don’t create full-scale replicas of everything. This is, for hopefully fairly obvious reasons, an improvement on the produce-a-full-environment-for-everybody strategy. It’s limited in its scope of applicability, you could say.

I think if you’re a tech unicorn or if you’re a scaleup and you have little to no legacy, then this can be very effective. Chances are most of us doing microservices are doing so in an environment that’s in the middle of a digital transformation that probably has several layers of legacy tech where the microservices bit is only one part of the puzzle. I think even in modern environments, you increasingly have lots of third-party integrations. There are certain types of industry vertical where there’s really heavy third-party integration as well, which isn’t going away. This strategy doesn’t really help you in that kind of situation. Good for certain types of problems, but not so much for others.

Then that brings us to this thing, so mocking, or simulation, or virtualization. To be clear here, I’m not talking about the in-code object mocking, the Mockito type thing. I’m talking about simulating an API on the network, so that your app that you’re testing is still talking over the network to it, but it’s talking to a simulation rather than the real thing. How many of you are using or have used this type of mocking that I’ve just described? How many of you have used it where you deploy this stuff into your environment and used it that way, so used it beyond just the narrow integration testing or unit-like testing thing? Doing this can help with this problem a lot. By mocking, you can build self-contained environments of the thing that you’re testing. You can do it a lot more cheaply than running copies of everything else. You can mock legacy APIs. You can mock third-party APIs. You’re not restricted in that way.

Also, there’s the cognitive load problem I talked about, the points at which your understanding has to extend in order to get things done. With this, you bring those points in to the contracts at your boundary, without having to go any further than that. It’s helpful in that way as well. It’s definitely not a free lunch. There is work you need to do in order to make this work well at large scale. This is really what this talk is going to be about.
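
To make that concrete, here’s a minimal sketch of the kind of network-level mock being described, using WireMock’s standalone admin API; the payments endpoint, port, and payload are invented for illustration.

```bash
# Run WireMock standalone (the official Docker image listens on 8080 by default)
docker run -d --rm -p 8080:8080 wiremock/wiremock:latest

# Register a stub over the admin API; the /payments resource is hypothetical
curl -X POST http://localhost:8080/__admin/mappings -d '{
  "request":  { "method": "GET", "urlPath": "/payments/123" },
  "response": {
    "status": 200,
    "headers": { "Content-Type": "application/json" },
    "jsonBody": { "id": "123", "amount": 10.50, "status": "PENDING" }
  }
}'

# The application under test talks to the simulation over the network,
# exactly as it would talk to the real API
curl -s http://localhost:8080/payments/123
```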

Let’s get into that a bit. These are really the main objections or challenges that people bring up regularly when you suggest the idea of doing mocking at scale like this. The first one is, mocks aren’t realistic enough. They’re simplistic. They don’t emulate the behavior of the real system. I test things with mocks. My builds are green. Everything looks good. Then I go to production, and something goes wrong that is revealed through a real integration. My response to that is that it’s possible to get to a high degree of realism with current mocking tools if you know how to use them, if you use them in the right way. You can observe behavior of real systems, and you can port that behavior back into your simulations very effectively if you choose to use the mocking tools this way.

The second point is just the effort involved in creating and maintaining mocks. If you’re doing it the old-fashioned way, if you’re hand-coding all of your mocks, taking API designs and turning them into code, and you’re having to do this, every team is doing this for every API that they’re building, then, yes, this can be pretty labor-intensive. It’s the kind of work that nobody really wants to do over the long term. It’s fun when you first discover the tool, and then it turns into this miserable toil after that. What I’m going to talk about in this talk really is techniques for reducing that labor and that toil down to at least a manageable level so that you can work with significant amounts of this stuff over the long term without it turning into this grind.

Related to that, another concern is that mock APIs drift away from their real counterparts over time. People tend to build mocks and really focus on how they work at the point in time that they need them, and then neglect them after that. It’s partly a maintenance work issue like the previous point, but there’s also a communication element. How do you know when the thing that you’re mocking has changed? How do you know whether it’s changed in a way that you actually care about or whether it’s incidental? It’s not always all that easy to do that. These are all real problems. Again, I’m going to talk about some techniques for how we work around them. These are all surmountable problems. What I’m trying to say is don’t throw the baby out with the bathwater.

High-Level Concepts

I just want to introduce some high-level concepts first to set some context before getting into tools and workflows. I want you to think of these three artifacts that we’re going to compose together in order to get our work done. The first I’m going to call an observation. Essentially, this is a piece of captured traffic or a captured observation of an interaction between a client and an API. For a REST API, it would be an HTTP request and response. Then we’ve got a simulation, based on a tool like WireMock. I’m going to use WireMock as the exemplar, but there are dozens of open-source tools that will also do this. Simulations are behavioral. They’re actually expressing behavior rather than merely structure.

Then you’ve got contracts, which are the most detailed and complete view of structure. They’re the syntactic description of an API: what its operations are, how its data looks, what the constraints are around its use, all of that kind of thing. They’re not behavioral. This is an important characteristic of contracts. There are a few useful things we can do when we combine these things together. We can take observations and generate simulations. We typically call this recording. We can take contracts, such as an OpenAPI document, and generate a simulation as well. We can validate. We can contract test. Essentially, this means taking observations and checking that they conform with the contract. The observations can be of traffic to and from a real API or from a simulated API, so we can validate both for conformance with the contract and tie everything together that way. There’s a key point to note about this, something that I keep tripping over more and more doing this kind of stuff.

With all of these artifacts, it’s a bit like looking at system architecture, where you can’t tell the full story in one artifact; you have a bunch of different artifacts that each give a useful perspective on the architecture. It’s similar here. OpenAPI will tell you things about data structure and format and so on that you won’t find anywhere else. Then the simulation will give you behavioral information, business rules and functional rules around how the API works, that the contract won’t. A good example of this would be something like rate limiting. You can put in the contract information about the headers that you’ll receive and the fact that you’ll get a 429 response if you exceed the rate limit. What you can’t express is: what is the actual rate limit? What behavior would I have to engage in to trigger it? There’s syntactic information, but not behavioral information. There’s a whole plethora of other stuff like this that is only partially expressed.

There are also things that are just not really expressed at all, things that you can only really know by looking at the code or looking at the internals of the system. Some of these things we can make guesses about from sets of observations, but they then just become educated guesses rather than certainties. The consequence of this is that when you’re converting between things, you have to bear in mind that what you’re getting isn’t complete.

You can take an OpenAPI and generate a simulation from it, and you’ll get something which is syntactically correct and really useful as a baseline. It’s a huge labor saving to do this, but it probably won’t be sufficient for the things that you need to achieve in terms of development and testing. You nearly always need to take this and start expressing the nuances of behavior that you care about, that you depend on, in order to meaningfully test your system. There are lots of tools that will do these very convenient translations, particularly from OpenAPI to a mock. I think maybe some of these tools’ reputation for being oversimplistic comes from this kind of thing: essentially, I’m getting the same canned response every time I test against it, and that isn’t sufficient. That’s something to bear in mind.
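
For example, one open-source way to get that syntactically correct baseline is Prism’s mock mode, which serves example-based responses straight from an OpenAPI document; the file name and port here are placeholders.

```bash
# Spin up a mock server directly from the contract
prism mock payments-openapi.yml --port 4010 &
sleep 2

# Requests get schema-valid, example-based responses - a useful baseline,
# but typically the same canned shape every time until you enrich it
# with richer, behavior-specific stubs (for example in WireMock)
curl -s http://localhost:4010/payments/123
```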

Tools

Now that we’ve looked at the concepts, I’m just going to talk a bit more about the tools that implement these at a high level. Let’s address this thing about naming first of all. I think of mocking, simulation, and virtualization as fundamentally the same concept. I know there are some people out there that will tell you these things are different, but fundamentally they’re doing the same thing. It’s mimicry of an API along some spectrum of realism. I think people tend to think of mocking as the things more towards the left, the simple, canned, very localized responses that we set up. Then things like service virtualization and simulation are more to the right of the spectrum where you’re investing more effort in the simulation in order to make it more realistic. Fundamentally, it’s all a single concept, and I tend to use these terms fairly interchangeably nowadays.

Next up is API observability. This is not the kind of observability that Honeycomb have popularized. This is really just the ability to go and make observations of API interactions. We can do this passively. We can just pull things off the network in an environment as things are happening, or we can do it actively where we intentionally generate interactions of the kind that we need so that we can go and then capture that traffic and look at how it played out. There’s a bunch of different ways we can do this. We can do this in code. We can instrument code directly and avoid the network altogether. It tends to be an active mode of doing things. We can do this inside a test runner while we’re running a bunch of integration tests. There’s proxying. This is probably the most common way that you see for doing it, so forward and reverse proxying.

MITM proxying when there’s TLS, when there’s HTTPS involved, is painful because you’re essentially breaking HTTPS security in order to get in the middle of what should be an encrypted connection and read it. It is possible to do, and there are fairly widespread tools for this. Then there’s PCAP, so packet capture, which is not in such widespread use. Again, this is unwieldy to use with encrypted data. Tools like tcpdump and Wireshark are good for this, but not so good if you’ve got encrypted data, so it doesn’t tend to get used so much. Then there are these two new kids on the block, relatively speaking, which are interesting, emerging solutions for this. There are eBPF-based tools. eBPF allows you to deploy code in a controlled way into the network stack and into processes running in your environment, and then pull things out of it. Crucially, you can do this beneath the encryption layer. It’s a neat solution to the problem of otherwise having to break security mechanisms in order to observe traffic.
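
As one sketch of the proxying approach, a reverse-mode mitmproxy in front of the API can write captured flows to a file for later use; the hostname and ports are placeholders, and this only works where you’re permitted to terminate TLS in the middle.

```bash
# Capture traffic passing through a reverse proxy in front of the (hypothetical) API
mitmdump --mode reverse:https://payments.example.com -p 8081 -w payments.flows &

# Drive traffic actively through the proxy, e.g. from a test suite or curl
curl -s http://localhost:8081/payments/123
```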

Similarly with service mesh, so service mesh routing layers that sit on top of the network and pass traffic around can also be made to take copies of events that happen in the mesh and put them somewhere so we can use them. Then, finally, there’s contract testing. I’m defining this, like I said before, in this case as taking observations and checking their validity against the contract. The observations can be of traffic to a real API or traffic to a simulated API, it doesn’t matter. There’s a variation on how you can do this, actually, which is using generation to generate snapshots at different points in time of a contract, and then diffing the contracts, and then applying a bunch of rules saying if certain types of difference appear, then this is a breaking change or this is a significant change. This is fundamentally what it is.

For clarity, a lot of people tend to think of contract testing as being synonymous with consumer-driven contracts, so tools like PACT or Spring Cloud Contract. That’s not what this is. I actually canvassed a few people in that community that I knew, to say, are people using these tools in this way to validate mocks? I got a pretty uniform no, they don’t. It’s not that.

Finally, there’s this. I think it’s technically illegal now to speak about software in any way without mentioning AI in the conversation, so I’ll address the large mammal in the corner of the room. In the experimentation that we’re doing around using AI for this stuff, there are actually some, I think, serious opportunities for it being used as a productivity lever. LLMs are quite good at generating open formats, OpenAPI being an obvious one. There’s a ton of that around the internet and therefore in the training data. WireMock’s JSON format is open source and has also been around for a long time, and LLMs are fairly proficient at generating that on demand as well. What I’ve found this is particularly good for is the problem of when you’ve generated a baseline mock or simulation and you now need to do that enhancement to get from that baseline to specific data variations, behavioral variations, all that kind of thing.

Then you can get LLMs to do that kind of thing fairly effectively. You say, here’s a pre-generated example of what this looks like, vary it for me in this way. An important thing is if you’re building the plumbing for contract testing already, then this can serve as a guardrail. When your AI coding assistant goes off piste and generates something weird, you can put it into this loop where you say, go and check this against my contract, and if it’s wrong, respond to the feedback, fix the thing that wasn’t right. You can gain some confidence that you’re not generating things which are subtly wrong when you’re doing this using the same contract testing techniques we’ve just introduced. Obviously, AI is pretty good at taking sets of observations and making educated guesses about them. This is another thing which is quite labor intensive to do as a human being and quite well suited to LLMs.
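
A rough sketch of that guardrail loop, reusing the same contract-testing plumbing described later in the talk; file names and ports are placeholders, and the LLM call itself is out of scope here.

```bash
# Load an LLM-generated stub mapping into WireMock
curl -X POST http://localhost:8080/__admin/mappings -d @generated-stub.json

# Replay the relevant request through Prism, which validates request and response
# against the contract and, with --errors, fails the call on a violation
prism proxy payments-openapi.yml http://localhost:8080 --errors --port 4010 &
sleep 2

if ! curl -sf http://localhost:4010/payments/123 > /dev/null; then
  echo "Generated stub violates the contract - feed the violations back to the assistant"
fi
```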

Types of Workflows

Let’s talk about types of workflows. There’s fundamentally two situations you might find yourself in. One is where all of your API producers, the APIs that you’re integrating with as a consumer, are producing contracts that you can rely on that are validated or maybe they’re generated directly from code or something like that, and they’re publishing them somewhere reliably where you can go and find them. This is a great place to be because then you can subscribe to those changes and you can regenerate your own simulations from it. You can validate the interactions with your own simulations in your own tests against a contract provided by the producer. This can be very labor-saving and gives you a lot of confidence if you trust what the producer is giving you.

The last time I did this, the company I was working for was integrating with a third-party partner, and they were building a big, complex API for this partnership. They didn’t really have a lot of experience doing this, and they didn’t give us a sandbox to work against at all, so we had to build a mock just to have something to test against. They kept just emailing us a Word document every two weeks, and that was the documentation. Needless to say, it wasn’t complete and it was quite often wrong as well. We did persuade them to start exporting an OpenAPI doc automatically from their code, so whenever they did a new build, they used their framework’s generation tool to spit out an OpenAPI doc and they’d send that as well. We plugged that into WireMock.

There was an extension for WireMock that someone at Atlassian had written at the time, which would do the in-code capture and validation of traffic. We just plugged that in with the OpenAPI that they sent us, and that quickly revealed all of the discrepancies between reality and what they’d sent us in the Word doc. That meant that when we did finally deploy this product, the misalignment between reality and what we’d built was much less than it would have been if we’d trusted the Word doc alone.

Unfortunately, most of us find ourselves in this situation. Most companies I speak to don’t do OpenAPI institutionally. You have a few teams here and there doing it well, and then some doing it not so well. As a consumer, you can’t reliably get hold of a contract that you can really trust, so you have to generate it yourself. This is perfectly doable. You can generate a contract representing just the paths, or the parts of the API, that you care about. Then you can otherwise use it in the same way I just described. You can use it to validate interactions, and you can generate a simulation from the data or from the contract. Then you can refresh the contract on an ongoing basis and use that to validate your simulation, so you can get that same confidence that way as well.

I’m just going to run through, showing you a somewhat concrete example of how you put this together. It’s a small project that I put together. It’s on GitHub, so you can go and look at it yourself. It’s really just a bag of npm projects and shell scripts, but should illustrate the point nicely. Let’s say we have this fictitious payments API that we want to integrate our app with. We want to build a mock and we want to capture a contract that we can use to validate it. We might use curl to make a test request, proxied through these tools to this payments API, and then capture the results. There are two tools in here. Obviously, WireMock, I’ve introduced already. That’s capturing observations and turning them into a simulation. Then there’s this tool called Optic, which will capture observations and build contracts, build OpenAPI. In our first step, we make this request, and we build these artifacts initially.
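
In outline, the capture step looks something like this; URLs, ports and file names are placeholders rather than the exact commands from the repo.

```bash
# Start WireMock recording against the real payments API
curl -X POST http://localhost:8080/__admin/recordings/start \
  -d '{"targetBaseUrl": "https://payments.example.com"}'

# Make the test request through the proxy so the interaction is observed
# (in the demo the same traffic also goes through Optic, which builds the
# OpenAPI document; see the Optic docs for the capture invocation)
curl -s http://localhost:8080/payments/123

# Stop recording; WireMock persists the captured interactions as stub mappings
curl -X POST http://localhost:8080/__admin/recordings/stop
```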

Then we’ve got stuff that we can work with, we can test against. Here’s what that would actually look like. Pay close attention to that amount field, which, as you’ll notice, is currently a number. Now some time has passed, and we suspect that the payments API has changed in some way. We rerun our traffic. We rerun our curl requests. This time, we just run them through Optic, and Optic will capture a new version of the OpenAPI spec. This is just recapturing. You’ll notice that now the amount field has changed to a string. I know you might think, why would anybody do this in a financial API? It happens, believe it or not. This has changed in a significant way. We want to go and validate our simulation. Remember, our simulation was built from the previous run, and it’s got that old version of that piece of data in there.

Now we bring another tool in. Prism, amongst other things, is a proxy that will validate traffic against OpenAPI. It will do the contract testing part of the equation. We now make our requests through Prism, but to our simulation this time, with the OpenAPI we’ve just generated in the previous step as an input, and that generates a validation report as output. We do this, and as we can see, we get this error message. It tells us that, you gave us the number, and the contract says it should be a string. I mentioned this diffing thing earlier on. This is something else that we can do also with Optic, although there’s a bunch of other tools that will do this now as well, where you can take the old and the new version of the OpenAPI, you can diff them, and it will give you this nice report where it will tell you the things that it thinks are significant changes or breaking changes. This is it running in code.
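
The validation step, in sketch form; file names and ports are placeholders, and the Optic flags are an assumption that may differ between versions.

```bash
# Proxy the test traffic through Prism, pointed at the WireMock simulation and
# armed with the freshly captured contract; violations are reported per request
prism proxy openapi-new.yml http://localhost:8080 --errors --port 4010 &
sleep 2
curl -s http://localhost:4010/payments/123

# Diff the old and new contract snapshots to classify the change
optic diff openapi-old.yml openapi-new.yml --check
```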

Hopefully, fairly obviously, you could run all this in CI quite easily. You could have scheduled jobs, or you could have some other trigger that would kick all this off and automate it. Publish these reports to your CI server’s report-publishing place, wherever that is, and automate a lot of this process so that you’re proactively catching these problems as they happen, rather than only discovering in staging or in production that the mock and the real thing have diverged.

Futurology

I’ve shown what’s possible with the current state of the art, but I want to look a little bit into the future as well. I think there are some things happening, evolutions of tools and standards, and so on like that, that are going to make this easier and expand the range of possibilities. I’m going to talk a little bit about that now. I mentioned before that there’s still a lot of white space in what we can really adequately describe about an API, at least using standard tools. There are lots of things that we can’t say about an API using the tools at hand at the moment, and that we’re forced to either guess at through observation or go and dig around in the code in order to understand. That situation is changing, and two big contributions to that are these sibling standards, which were both released last year by the OpenAPI standards team.

The second one, Overlays. Lorna is chiefly responsible for that. Arazzo, the workflows standard, is a way of describing how you can achieve a multi-step interaction across one or more APIs. It describes things like expressing invariants while you’re doing that and also how to move data between calls. I think this is going to be really interesting, particularly as the tool support for this picks up, because it’s describing information that we didn’t previously have.

If you have a contract that also includes an Arazzo description, then there are inferences you can make or certainties even about the behavior of the API that can map straight onto how we simulate it. This should reduce some of the types of work that we have to do either manually or with an AI tool in order to behaviorally enrich our mocks. Overlays, I think are interesting. Overlays are a way of deconstructing and recomposing OpenAPI documents. They tend to get very big anyway. There’s a way that you can add these vendor extensions on, so non-standard properties to describe things that aren’t in the standard. These are very useful, again, for filling in these bits of white space.

For instance, take the rate limiting example I gave earlier. There are organizations that will add non-standard rate limiting information in there that tells you the actual rate limiting parameters that will trigger the limit. Overlays make it a lot easier to work with these without your OpenAPI description becoming completely unwieldy. This is a bit more speculative, but I’m hoping that this will spur more innovation around these small formats and standardizing on them as ways of describing things like rate limiting, pagination, and so on.
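
As a hedged sketch, an Overlay document could layer an invented x-rate-limit vendor extension onto the payments contract; the extension fields are made up for illustration, and only the overall overlay/actions structure is meant to follow the standard.

```bash
# Write an Overlay that patches rate-limit details onto the base OpenAPI document
cat > rate-limit-overlay.yml <<'EOF'
overlay: 1.0.0
info:
  title: Payments rate-limit overlay
  version: 1.0.0
actions:
  - target: $.paths['/payments'].get
    update:
      x-rate-limit:            # invented vendor extension, for illustration only
        limit: 100             # requests allowed...
        window-seconds: 60     # ...per window
        on-breach: "429 with Retry-After header"
EOF
```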

I’ve already mentioned eBPF and service mesh. At the moment, there’s not a huge amount of integration between these types of tools, and I think you have to do the work yourself in order to take observations from these tools and use them in a context like building simulations. They’re not maybe as ubiquitous as I think they might be in a few years. I think it could be the case in a few years’ time that this is just part of the substrate on which we build things, and there will be a way of grabbing observations of, for instance, HTTP traffic in any cluster you’re working in. This will lower the cost and effort and hopefully improve accessibility to observations of how APIs are behaving and thus make everything else that we do with that easier.

Then, finally, there’s MCP, which I think is right at the peak of its hype cycle at the moment. Whether or not you think MCP is interesting in its own right as an API standard, the great thing about it is that there was just nothing before it, and now everybody has coalesced around the same way to integrate AI tools with everything else. The effect of this is that it opens up all sorts of possibilities for tying your existing AI-based coding workflows into new tools like API mocking tools. This is something we’re experimenting a lot with at the moment. The things you can accomplish, particularly in making composite changes to projects, so changing the things that use the API along with changing the simulation and making sure they all tie together, show a lot of promise. I think this is going to be another major driver of productivity improvements.

Summation

Modern mocking tools can produce realistic simulations, realistic enough for most development and testing activities. I’m not suggesting that you should throw away integrated testing altogether. It’s always a vitally important part of your overall testing strategy, but it should be reserved for the things where there’s genuine integration risk. If done well, that should be a fairly small minority of the work, and the majority you can do without having to resort to it, making life a lot easier for yourself. The combination of the techniques I’ve talked about, so API observability, generation, and contract testing, is a way to reduce the labor involved in doing this stuff and increase your confidence in the results. Finally, using contract testing as a guardrail for GenAI is a great way to work with it and, again, get that productivity gain with a baseline of confidence and support.

Questions and Answers

Participant 1: You mentioned early on that a lot of people, the first time that they try some of these automated mocking tools, just find they get the same response back every time. What’s your first step advice for those people? Do they go straight to asking the LLMs to do better or adding more example responses? What’s your first-line defense?

Tom Akehurst: It’s a good idea to try and understand something yourself before you ask an LLM to do it. I hope most people would agree with that, that vibing things, particularly if you’re in a complex distributed environment, is maybe not the best idea. What I’ve done a lot of when I’ve been working with these tools is to try and find ways of making observations of things happening, either in production or in environments where things are integrated and working, and then trying to figure out from those observations, what is this telling me? If I make this date parameter in the future, I get a happy path response back.

If I make it in the past, that’s not valid, so I get an error. If I change pagination parameters, then this is the impact on the output that I get. Or, this resource can go through a finite set of states, so a payment might be initiated and pending, and in fraud check and then completed. The data looks different in these states. I think just taking observations of the thing happening and then porting those observations back into your simulation is a good way to start. You can do some of that by recording, so you can get an approximation of it. Then, some of it you can do by hand. Then, hopefully, when you understand how to do it by hand, you can then ask an LLM to industrialize that for you.
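
For instance, an observed payment lifecycle can be ported back into a WireMock scenario so that repeated polls walk through the states a real payment would; the paths, fields and state names here are illustrative.

```bash
# First poll: payment is pending, and the scenario advances
curl -X POST http://localhost:8080/__admin/mappings -d '{
  "scenarioName": "payment-123-lifecycle",
  "requiredScenarioState": "Started",
  "newScenarioState": "Completed",
  "request":  { "method": "GET", "urlPath": "/payments/123" },
  "response": { "status": 200, "jsonBody": { "id": "123", "status": "PENDING" } }
}'

# Later polls: payment has completed
curl -X POST http://localhost:8080/__admin/mappings -d '{
  "scenarioName": "payment-123-lifecycle",
  "requiredScenarioState": "Completed",
  "request":  { "method": "GET", "urlPath": "/payments/123" },
  "response": { "status": 200, "jsonBody": { "id": "123", "status": "COMPLETED" } }
}'
```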

Participant 2: In the scenario where you’re building an API and you have the luxury of starting from the contract, how would you make sure that there is coherence between the contract and the contract tests, that they align? Because you may start with building the contract, but then you need to make sure that the fields present in the contract are also present in the contract test, while at the same time not repeating yourself, because it doesn’t make sense to repeat those things. How would you approach that?

Tom Akehurst: One thing to clarify about the contract testing I’m advocating for here is that it happens as a byproduct of other types of testing. This is another distinction between this and consumer-driven contracts, where you’re producing a specific artifact for those tests. When you put contract validation in front of an API simulation or a real API, you’re running a bunch of other tests. You’re running a bunch of functional tests which, if you’ve built them correctly, will fail if pieces of data are missing, if the contract isn’t right from the perspective of the caller.

Bryant: What are your thoughts on how LLMs will affect end-to-end testing now? I’ve seen a bunch of folks saying, I can record the screen and what I’m doing and verify off that, that kind of Holy Grail, which I’ve been chasing for years. Do you think it will push more folks to do literally frontend testing and not test the APIs?

Tom Akehurst: I think you see that happening a lot already. Whenever there’s a new generation of frontend-driven testing tools, you see this proliferation of thousands of slow, brittle, not-so-great tests being built with it. I think this is going to be no exception. There’s a plethora of tools out there already that will just read your DOM, generate some tests, and encourage you to spam them out there and not think about them too much. There are some people taking this seriously as a problem and looking at how LLMs can help build test infrastructure: building the rudiments on which you build your tests, whether that’s page objects if you’re going through the UI, or means for interacting with APIs; being a little bit more intelligent about where the test goes in and out, and where mocks are used versus real end-to-end testing; and providing that supporting infrastructure so you can build good-quality tests on top of it, rather than just mindlessly generating the worst tests.

Participant 3: I was wondering if you’d go into a bit more detail about how these tools help with mocking stateful behavior and things like that. The context is that when I mock these API behaviors, I sometimes feel like I’m recreating the API. That’s a feeling that I don’t want to have. I don’t want to feel like I’m wasting effort there, basically. I wondered if you could go into a bit more detail.

Tom Akehurst: That’s probably one of the less solved problems in this space, and one that we’re very actively working on with WireMock Cloud at the moment. That’s a hard one. We’re looking at ways that we can look at recordings. Again, it’s this thing about making inferences about behavior and then encoding it in the simulation. We’ve done a little bit of work in this direction already, looking at essentially a snapshot of an API and saying, this looks like a stateful REST endpoint, and this looks like the input data that creates the state.

Then we infer that this is the state that actually gets constructed, so the input state usually plus a load of synthesized state that gets added to it. Then we build a stateful model around that. It’s not a perfect approach in that it assumes the API follows typical CRUD conventions for a REST API. We’re going to be actively examining how we can do this, how we can, reasonably confidently at least, take observations and turn them into stateful behavior. Again, I think this is probably where the LLMs will ultimately end up helping, because that inference can be a little bit more nuanced rather than a hard rules-based kind of thing.

 
