During her KubeCon Europe keynote, Christine Yen, CEO and co-founder of Honeycomb, provided insights into how observability can help cope with the rapid shifts introduced by integrating LLMs into software systems, a change that has transformed not only the way we develop software but also the way we release it. She explained how to adapt your development feedback loop based on production observations.
She begins by emphasising that even though software engineers are used to “black boxes”, LLMs are different, mainly because they take reliable behaviours that we are used to and make them more complicated.
Current software development practices rely on deterministic properties like testability, reproducibility (mockability), and explainability (debuggability) to ensure correct behaviour. Yet the whole reason for incorporating LLMs into an application is to capture the unpredictable nature of human language.
…most of us here today have been building software before this LLM boom, and we know the hard part about building software systems isn’t writing them. It’s testing, maintenance, tuning and debugging that comes afterwards…
Besides the changes in how software is built, product release practices have shifted as well: rather than a staged approach in which limited groups of users get access to alpha or beta versions, we now have early access programs, which expose your systems to unpredictable user interactions from the start.
Yen: …should we all give up and embrace prompt engineering? No! That’s why we are here…
Even though the shift might seem daunting, she argues that the practices the industry has adopted lately are reasonable steps in the right direction:
- Continuous deployment and feature flags – provide a proper mechanism for rapid feedback loops.
- Testing in production – forces us to embrace the chaos and engineer systems designed to fail quickly and gracefully.
- High-cardinality metadata – gives us the means, and the obligation, to reflect on complexity.
- High-dimensionality data – provides the mechanism for running highly parameterised experiments and investigations (see the sketch after this list).
- Service-level objectives (SLOs) – capture user experience as the arbiter of quality, making the customer’s experience the most essential measure of the system.
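As a rough illustration of the high-cardinality and high-dimensionality points above, a common pattern is to emit one wide, structured event per request, with as many fields as are useful to slice by later. This is a minimal sketch in Python; the field names and the `call_llm` helper are hypothetical, not something Yen showed:

```python
import json
import time
import uuid


def call_llm(model: str, prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"[{model}] echo: {prompt}"


def emit_event(event: dict) -> None:
    """Write one structured event as a JSON line; a real system would
    ship this to an observability backend instead of stdout."""
    print(json.dumps(event))


def handle_request(user_id: str, prompt: str, model: str) -> str:
    start = time.monotonic()
    response = call_llm(model, prompt)
    emit_event({
        # High-cardinality fields: unique or near-unique per request.
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        # High-dimensionality: many independent fields to query across.
        "model": model,
        "prompt_length": len(prompt),
        "response_length": len(response),
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
        "feature_flag.new_prompt_template": True,  # hypothetical flag
    })
    return response


if __name__ == "__main__":
    handle_request("user-42", "How do I get a refund?", "gpt-4o")
```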
According to Yen, we already have a model for measuring, debugging and “moving the needle” on unpredictable, qualitative or quantitative experiences: observability compares expected behaviour against what we see in production, “in front of our live users”. This helps in the chaos of early access programs, when something unpredictable happens, when users behave differently than anticipated, or when a bug fix breaks something else; tests can’t protect you the way they used to in plain API systems.
Observability helps embrace some of that unpredictability, enabling feedback loops that let you learn what happens with your code and incorporate those observations into your development process. Because LLMs have “an infinite” set of inputs and outputs, tests alone are insufficient. Evaluations (“evals”) are more flexible, allowing us to define intended and unintended behaviour, and are developed while you are building the application itself. They provide the mechanism through which you can observe how the “magical box” behaves for different inputs and feed that back into the development loop.
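Yen did not show code, but a minimal eval might look like the sketch below: instead of asserting exact outputs, each case checks properties the response should or should not have. The cases and the `call_llm` helper are hypothetical:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for the real model call."""
    return "You can request a refund from the billing page."


# Each case states intended and unintended behaviour as properties,
# not exact strings, since LLM outputs vary between runs.
EVAL_CASES = [
    {
        "prompt": "How do I get a refund?",
        "must_contain": ["refund"],
        "must_not_contain": ["password"],
    },
]


def run_evals() -> None:
    for case in EVAL_CASES:
        response = call_llm(case["prompt"]).lower()
        ok = all(s in response for s in case["must_contain"]) and not any(
            s in response for s in case["must_not_contain"]
        )
        # Scores are tracked over time rather than treated as a
        # pass/fail test suite gating every deploy.
        print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']!r}")


if __name__ == "__main__":
    run_evals()
```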
As with other black boxes we may have already integrated into our systems (payment providers, for example), LLM calls need to be instrumented to capture the details that allow us to understand how the system behaves. As in classical systems, we record the outgoing request to the third party and capture its response. This lets us understand how input impacts output and the user experience; these records then translate into behavioural patterns of our system, enabling you to rapidly investigate any outlier user.
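In Python, that instrumentation could be a span around the outgoing call, recording the request and response as attributes. This sketch assumes the OpenTelemetry API; the attribute names and the `call_llm` helper are illustrative, not a standard:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")


def call_llm(model: str, prompt: str) -> str:
    """Stand-in for the real provider call."""
    return f"[{model}] echo: {prompt}"


def traced_llm_call(model: str, prompt: str) -> str:
    # Wrap the outgoing third-party call in a span so that each
    # input/output pair can be inspected later.
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt", prompt)
        response = call_llm(model, prompt)
        span.set_attribute("llm.response", response)
        span.set_attribute("llm.response_length", len(response))
        return response
```

Without an exporter configured, the OpenTelemetry API is a no-op, so a wrapper like this adds effectively no overhead until a backend is attached.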
Yen concludes by expressing her optimism about the future of software engineering, given how rapidly developers have adapted to the growing unpredictability of software systems’ behaviour. With the focus shifting to a system’s impact on the user experience, observability is more important than ever, especially in the age of Generative AI.