From Dashboard Soup to Observability Lasagna: Building Better Layers

Transcript

Martha Lambert: We’re going to talk about how we went from a really chaotic dashboard soup to a layered observability lasagna. I’m sure that doesn’t make any sense to any of you right now. Maybe you’re expecting a cooking lesson. We’ll dig into what it actually means today. I’m Martha. I’m a product engineer at a company called incident.io. We build a product that handles your end-to-end incident management. That means your alerts firing, paging you, all the way through to writing a postmortem. I work across the stack, but I focus a lot on the reliability of our product and the observability that enables that. Today you’re going to leave with a process to unsoup your dashboards, which I promise is a very technical term. We’re going to understand the importance of layering your observability stack and how to guide an engineer through the debugging process. Then, finally, we’re going to dig into a load of technical and practical tips for getting all of this to work really smoothly.

Backstory

First, we’re going to start with a story. Our story starts in early 2024. We’d just finished building an on-call product, so something that handles your alerts and pages you, wakes you up in the middle of the night when your software goes wrong. I’m going to use on-call as an example throughout this talk because I’m sure most of you know what it means to be paged.

Therefore, you probably also know how important it is that you can trust your pager and know that it will actually wake you up in the middle of the night. Reliability is really important. That’s where our story begins with a really big challenge. It’s January 2024, and we had a really bare-bones product, and we were due to release it to the market in just two months. Our reliability expectations are really high. It’s not ok to just cross our fingers and see what happens when we release it. It’s really quite meta because if our software is broken, we can’t page anyone, and no one else will know that their own software is broken, so we really need to avoid that. We designed this system with reliability in mind from the start, but we’d not actually tested it, and we had two months to find that reliability and prove that it worked. For me, reliability boils down to two things. Proactive, so you need to know that your system can handle all of the things you expect to throw at it on an everyday basis.

Then, reactive, so we can’t prevent every incident, and we can’t know everything that could possibly happen. It’s really important that when something does go wrong, your team find out about it really quickly and can resolve the issue just as quickly. Getting confidence in these two sides is what let us sleep at night when we released this new, unproven product to the market. What I learned is that the core thing underpinning both of these is great observability. Great is really vague. What does that mean? How is it any different from good? Luckily, getting your observability there doesn’t mean reinventing the wheel. I’m not going to spend today telling you to buy a load of fancy tools or rewrite your whole stack. It’s much more of a strategy and an approach to observability that changed how our stack feels to use.

A really common problem is soup. What do we actually mean by dashboard soup? It’s when you’ve got loads of single-use dashboards. Whenever you’re in an incident, you’re spinning something up that’s really useful, and afterwards, it’s getting lost to the soup. You’re swimming around in all these visualizations. There are definitely some really useful ones in there, but you’re never sure where to find them or where you put them after the last incident. It can feel like you’re doing all the right things, you’ve got the right tools, but it’s just not sticking.

Track back to the start of 2024, and that was me in the soup. I was working in this system where I didn’t feel at all confident using or extending our observability tooling. Previous companies I’d worked in hadn’t got this right either, and I thought observability tended to be this big, unwieldy mess that you just had to work out how to deal with in emergencies. Our story ends nicely, and today we’ll talk about how I got here. We’ve now got a really reliable product, and it’s been that way ever since we launched. We’ve got a lovely observability stack that feels great to use. Along the way, I’ve become a big observability evangelist. The most important thing is that this is all really widely adopted by our team, so it’s not just a few keen engineers that really care in the corner.

Unsouping Our Stack

This talk is for you if maybe you’ve got all of the right tooling, but for some reason it just feels bad. Maybe you’ve invested really heavily in observability and you care a lot, but you’re not sure why your team aren’t adopting it. Or, perhaps you’ve got a system that you really need to make reliable and trust, and it’s not clear where to start. Maybe we’re now all really excited about observability and we want to go and jump into building dashboards, but don’t, because that’s exactly how the soup happens. Just jumping into building dashboards or metrics or logs with no structure is how everything gets messed up in the first place.

The key to unsouping our stack is an iterative process that we follow every time we introduce new observability to key areas of our system. It’s a process that allows us to categorically know that the observability we build is useful. It’s a bit like a TDD, or test-driven development, approach to observability. I’m sure everyone has varying opinions on whether that works in practice for code, but when it comes to observability, it’s a really useful approach to know that you’re measuring the right things that you actually care about.

Following this process is what gives us confidence that we can use our observability in incidents when things go wrong. We’ve got three steps. First is predict. We state something about our system that should be true. Then we try and prove it. We apply a real load to our system to see what actually happens, and we can never believe it until we see it. Then we measure. From those above steps, we’ve probably realized that we can’t actually understand everything about what’s happening inside our system, so we build the observability to show us. We continuously cycle through these until we have confidence that our system is mostly working how we expect, but the key thing is that observability would always tell us if it wasn’t. We’re going to go through each of these steps and what they look like.

First is predict. This is where we state various assumptions about how our system should behave and establish what good looks like for it. We’ll leave this step with a whole list of cases that we actually want to apply to our system and see what happens. First, to make predictions, we have to have a really clear picture of what we’re looking at. If you’re working with an established system, you probably already have this, but if you don’t, get your whiteboard out now. We’ll have a look at our on-call system.

Essentially, our on-call tool ingests alerts from third parties, so Datadog or Sentry, and then wakes people up, makes their phone ring. Alerts in, notifications out. Internally, it looks a lot more like this. This is our real system diagram that we used when we were working on this. The important thing is to split up your system so each box has a single clear job that can succeed or fail independent of all the other sections. For on-call, we have our alert ingestion, where we pull in alerts from our third parties.

Then alert routes, where we apply rules for whether a user would like us to create incidents or page people. Escalations is our paging system. Schedules is how we know who’s on call. Then, finally, notifications at the bottom, where we call people or notify their app, depending on their preferences. Back to predicting. We need to take each box on this diagram and think of all of the ways it could possibly experience strain. If we zoom in on alert ingestion, we need to write down predictions or questions about our alert ingestion flow. These shouldn’t be things that are testable with unit or integration tests. We want to think of real load that our system would experience in the wild, so end-to-end system behavior. For us, these were some examples. Alert storms.

If some really common shared infrastructure like AWS goes down, everyone’s probably having a terrible day, and many of our customers will be pouring alerts into our system. We need to know that we can handle that and keep going. What if someone sends us a really large payload? Could that affect us? Or if a customer decides to parse really complicated values out of their alert, we need to know that that won’t bring us down. These are just a few examples of the kinds of things you want to know about and what you should be testing. Things that are only really visible when your system is properly running out in the wild.

We have this collective understanding now in our team of a load of things that we need to know whether our system can handle, which means it’s time to move on to proving them. Proving that when we actually exert real load on our system, it behaves exactly how we expect. Would anyone feel comfortable deploying critical infrastructure without testing it first? Hopefully not. You should treat your observability stack and your dashboards in exactly the same way. If your team just builds observability, people tend to measure the things that are really easy and common to measure. What we want to do is know that we’re measuring the things that actually matter and not just these hypothetical situations that might. You really don’t want to find out that your dashboards are measuring the wrong things during an incident, so do this before and test them out first.

The way that we proved our predictions was through a process we called drills. I’ll talk a little about what that looks like first. Drills meant getting in a room in person with four engineers and writing out on a whiteboard all the possible ways that we could break our system. Then we’d actually apply that load to see what happened. Doing this with your team is what gets everyone really bought in, and we rotated different team members in each day to circulate that system knowledge. It’s actually a really fun session, and everyone loves it. You get to try and break things that you’re not normally allowed to and chuck loads of different load at your system and see what happens.

Then, each day, we’d wrap up with a message like this. When you’re running drills, it’s really important to track a few things. How are you feeling about your system at the end? What are you confident about? What are you still worried about? Specifically, what have you tested, and at what volumes? What have you changed and improved both within your code and also your observability stack? It’s really useful to be able to look back on these, know exactly what you did and how your system was performing at a specific moment in time.

That’s the process for how we run our drills, but let’s have a look at a worked example of how we actually prove one of our predictions. Taking our prediction from earlier, we need to know that we can handle multiple customers pouring massive quantities of alerts into our system at once. We need to actually apply that load and prove it. It’s really important not to just say it’s probably fine. Actually, the interesting word here is handle. What does handle mean? Our success criteria is knowing that our observability tooling will show us that we’re fine. Fine can mean a lot of different things. For me, handle meant we’re doing exactly what our users expect all the time.

The rest of our app is unaffected, so we’re not pulling on resources that we shouldn’t be. If we needed to, we could probably handle a little bit more load comfortably. Now we want to actually apply that load and see if we are really handling it. We run our load tests using Grafana’s k6 tool. We had a really simple use case, which was just send increasing numbers of alerts into our system to ingest via HTTP and see what happens. k6 allows you to send distributed load into your system. Here you can see we’re splitting it across multiple geographic regions. It also allows you to really easily vary the rates that you’re sending it at. Here we’ve started with a low baseline of 1,200 requests a minute that we can easily vary. There are loads of ways that you can load test. You don’t need to use any particular framework.
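To make that concrete, a minimal Go sketch of the same idea (not the k6 script described here; the ingestion endpoint and payload are made up) looks something like this: ramp the request rate, fire alert payloads at the endpoint, and record the response codes that come back.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// rampAlerts fires alert payloads at an ingestion endpoint for one minute at
// each rate, then steps the rate up and repeats, printing the response codes
// it saw. The endpoint and payload here are hypothetical.
func rampAlerts(endpoint string, startPerMinute, step, rounds int) {
	client := &http.Client{Timeout: 10 * time.Second}
	payload := []byte(`{"title": "CPU high", "priority": "critical"}`)

	for round := 0; round < rounds; round++ {
		perMinute := startPerMinute + round*step
		ticker := time.NewTicker(time.Minute / time.Duration(perMinute))
		deadline := time.Now().Add(time.Minute)

		var (
			mu    sync.Mutex
			wg    sync.WaitGroup
			codes = map[int]int{}
		)
		for now := range ticker.C {
			if now.After(deadline) {
				break
			}
			wg.Add(1)
			go func() {
				defer wg.Done()
				resp, err := client.Post(endpoint, "application/json", bytes.NewReader(payload))
				if err != nil {
					return // only successful round trips are counted; errors show up as missing responses
				}
				resp.Body.Close()
				mu.Lock()
				codes[resp.StatusCode]++
				mu.Unlock()
			}()
		}
		ticker.Stop()
		wg.Wait()
		fmt.Printf("rate=%d/min responses=%v\n", perMinute, codes)
	}
}

func main() {
	// Start at the 1,200 requests a minute baseline and add 1,200 each round.
	rampAlerts("https://alerts.example.com/ingest", 1200, 1200, 5)
}
```

k6 adds a lot on top of this, like distributing the load across regions and summarising latencies, but the core loop is the same: more requests, every minute, until something interesting happens.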

The important thing is that you want a clear way to throw increasing numbers of requests at your system, track responses, and see what happens. When we did this with our prediction, we found that we were missing answers to loads of questions, and we weren’t actually sure whether we were handling this load at all. Think about yourself in an incident. What questions would help you really quickly triage the impact and know how bad something is? For us, that looked like not knowing where we were rate limiting and blocking requests, not knowing how delayed things actually look to our users, and being unsure of where our bottlenecks were. If something was slow, we weren’t sure which part of the system was slowing it down.

We’ve simulated these conditions for real. We’ve worked out that our observability is missing the answers to a load of questions that we care about. Now it’s time to actually measure. Now we finally get to build some dashboards. We want to use our observability to confirm that our system behaved how we expected in our prove and predict steps, which probably didn’t happen at least on the first try. In this section, we’re going to talk a whole load about dashboards, but dashboards can be really controversial for a good reason, because it’s so easy for them to suck. Before we actually build any, how do we avoid that dashboard soup? There are two classic mistakes with dashboards that we need to avoid.

The first one is that they can only answer very specific questions. Dashboards are often a point-in-time picture of something that you knew would be interesting once. Naturally, you can only plot known unknowns. Charity Majors has written a load of great blogs about this. We can’t change this. We can’t give ourselves the ability to look into the future and plot unknown unknowns. Instead, we have to be really clever about the questions that we ask. For us, these dashboards are a triaging tool. We do ask a question, and that question is, how broken does my system seem to users? We want to make sure that when we’re using dashboards, we’re asking questions that will continue to matter and not fade into irrelevance in a few months.

The second mistake is that they’re often really static and disconnected from the rest of your stack. If I receive an alert saying database CPU is really high, that sounds pretty bad. What do I actually do? Maybe I’m receiving more alerts in parallel and things are looking a little bit scary. I look at my panel, and I see this increasing line, but I’m not actually sure where to go next, and I can’t find specific examples of what’s going wrong. It’s so easy to create dashboards that just leave you there without a thread to pull, and that’s what we really want to avoid. There’s a core principle that helps us try and avoid all of these, which is that we always treat our dashboards as a product. The way we make our dashboards good is by constantly putting ourselves in the shoes of an engineer using them. You need to be thinking about UX throughout. Incidents are really stressful, and you need to know that your tooling will let you work really quickly with high confidence when things go wrong.

The Observability Lasagna

The key way that we make our dashboards feel like a great product is through a consistent framework applied across our stack, which is our lasagna. A great observability stack will walk you all the way from a bird’s-eye view down to individual requests. This is our lasagna, so we have four key sections. We’ve got a really high-level overview dashboard, a more specific system dashboard, then logs, and then traces, which are our most granular detail on specific requests. This consistent framework means that engineers don’t need to have seen every single dashboard before they jump into a new one. It’s so important in an incident to be able to dive into unknown observability territory and feel right at home. The engineer’s journey through these layers is the most important thing. Actually, the really key piece of our lasagna is the arrows.

These layers are ineffective and won’t be adopted by your team if it’s not incredibly easy to navigate from top to bottom. That’s why another key principle of great feeling observability is connecting your layers. That means drawing all of those arrows yourself and always knowing exactly how an engineer debugging will arrive on the next step down. This is such a big productivity boost and is what makes our dashboards useful and sensible and heavily used by our team. Every single time you build a visualization or implement something new, you have to ask yourself where you would go next and then give an engineer a massive big button to click that’s really obvious. When we think about the stuff that underpins this, on a technical level, that means having really clear connections between metrics, logs, and traces. I’ll talk a little bit more about how we do that later.

Let’s have a look at what those dashboards actually look like. We start with our overview dashboard. This is the most high-level view of our system, and it sits at the product level. We have overview dashboards for each major part of our product, such as on-call. This dashboard is every engineer’s first stop when something is wrong. It provides a traffic light for each constituent subsystem, so signals that you can understand at a glance. Ultimately, it’s a triaging tool. This dashboard won’t and shouldn’t give you the full details of what’s going wrong. It should be glanceable and act as a signpost to point you towards where the actual issue is. You should put your overview dashboard everywhere. We have it as our Grafana homepage. We bookmark it in Slack channels. Your entire team should know that this is the first stop to assess system health whether they’re in an incident or not. Let’s have a look at ours.

Our overview dashboard contains a whole load of rows, each one aimed at quickly telling us if there’s a problem with an area of our system. We start with infrastructure health. What is the state of the core pieces that our system relies on? Diving into that, we run on Kubernetes with a Cloud SQL Postgres. That means we plot CPU and memory for all of the relevant pods, and database health, so database CPU and how saturated the database pools our system relies on are. Next is queue health. We need to quickly identify whether we’re behind on any important async work. We do that by plotting the count of unacked messages for each of our core subscriptions. Basically, how many items are in the queue and how behind are we? This looks pretty messy and not very interesting right now, and that’s good, because everything is behaving as normal.

If there was an issue, we’d see a really clear and obvious spike that would let us quickly triage and dive into what was going wrong. Then we have rate limits. What traffic are we blocking, and where is that happening? Then for each of those boxes I showed on our system diagram earlier, we have a section. For each row, we want to give a really clear picture of whether our system’s single job is happy or sad, without going into too much detail. Alert sources are our external facing alert ingestion, so we track latency and response codes. Alert routes is us handling those alerts according to someone’s rules, so we plot outcomes and our throughput.

Then, escalation, so are we paging people? Taking a closer look, our escalation system runs on a Postgres queue, which means we need to know that we’re processing it at a really healthy rate. We track the oldest item in the queue and how much of our capacity we’re using at any point (there’s a rough sketch of how a panel like that can be fed below). It’s really important that this dashboard looks scary and red if anything is wrong, even if it’s minor. False alarms are ok. Here you can see it looks a little bit dodgy because we’re running a load test right now, and there are quite a lot of escalations due. Going back to our connecting layers principle, this tells me that I need to dig deeper, and I can click right there on the escalations title and it will take me to a more in-depth system dashboard. Each of these rows will always link to somewhere where you can dig in a lot deeper. That’s it, that’s our overview dashboard. It’s a really simple traffic light of which areas of the product are healthy, and it encourages us to really quickly triage and dig deeper where we do see issues.
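A hedged sketch of how a panel like “oldest item in the queue” can be fed: poll the Postgres-backed queue on a timer and expose the age of the oldest unprocessed row as a Prometheus gauge. The table, columns, and metric name here are made up for illustration.

```go
package queuemetrics

import (
	"context"
	"database/sql"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Age of the oldest unprocessed item in the Postgres-backed escalation queue.
// Metric, table, and column names are illustrative.
var oldestEscalationAge = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "escalation_queue_oldest_age_seconds",
	Help: "Age of the oldest unprocessed escalation in the queue.",
})

// PollQueueAge refreshes the gauge every 15 seconds so the overview dashboard
// always has a current answer to "how far behind are we?".
func PollQueueAge(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var oldest sql.NullTime
			err := db.QueryRowContext(ctx,
				`SELECT min(created_at) FROM escalation_jobs WHERE processed_at IS NULL`,
			).Scan(&oldest)
			if err != nil {
				continue // keep the last value rather than reporting a misleading zero
			}
			if !oldest.Valid {
				oldestEscalationAge.Set(0) // empty queue: nothing is waiting
				continue
			}
			oldestEscalationAge.Set(time.Since(oldest.Time).Seconds())
		}
	}
}
```

The panel then just plots that gauge; if the line climbs, we know we’re falling behind.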

Once we’ve identified a problem, such as our escalation system being backed up, we move to our system dashboard. We come to a system dashboard when we already know there’s an issue with an area of the system, but we don’t yet know the impact or what’s actually going on. Our system dashboard provides an exhaustive picture of system health, so how well it’s doing its single job against time. If an engineer is here, they probably know something is up, so we want to plot all of the interesting things that describe how well it’s doing, which should be your SLIs or service level indicators. SLIs are a whole separate talk, but there are a few useful frameworks here, such as RED or the Four Golden Signals. Basically, the key thing to aim for is plotting how healthy this feels to your customers and whether you’re doing the right things internally.

Finally, these should link straight to your logs. A dashboard is absolutely not useful if you can’t give people a point to clearly jump off and dig into granular examples of what went wrong. Diving into our escalation system dashboard. As we said, our escalation system is a Postgres queue, which means we need to make sure we’re keeping up with load and processing it sensibly. These system dashboards should and will look really different depending on your system’s job, so don’t try and make them all fit a mold. Think about what’s interesting and useful to track. We start off with a section on our capacity and how much of that we’re using. We really need to know how saturated our system is, so we can quickly decide how fast it might get through a backlog. That means we plot the amount of work we’re doing against our capacity. I’m going to dive deeper into this metric later on. Next is throughput. How much are we processing and how quickly?

Then, user-observed delays. When an alert fires, how long is it until I get paged? Then, outcome. What’s actually going on with my escalations internally? This is way more granular than success or error. I want to know what type of actions my system is taking and what’s actually happening to the data internally. Finally, the last section of every system dashboard is always logs-based queries. This allows us to connect our layers and give someone a really easy place to go next. We have a filter that lets you narrow down to a single organization and view logs just for them. That allows you to really quickly find targeted issues when you’ve identified an issue in the metrics-based panels higher up.

We used our overview dashboard to identify that we had an issue with escalations. Now we’ve used our system dashboard to understand the scale of the issue, and we want to dig into logs to find a real example of that. Logs are a really useful debugging tool when you actually know what you’re looking for and have been pointed there by your higher-up dashboards. Your logs are single units of work. They show your engineer the journey of a request with specific details on what happened and when. Your logs should be consistent across your systems. You really don’t want to be in a situation where you have five different bookmarks for all the different log queries you should remember. Having a really solid pattern for your logs, and clear, obvious queries is really important. You should make it almost impossible to write badly formatted logs in your code base that don’t have useful metadata.
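One way to make well-formed logs the path of least resistance is a tiny helper that every unit of work calls, so the fields the saved queries rely on are parameters rather than optional extras. A minimal sketch using Go’s log/slog, with illustrative field names:

```go
package obs

import (
	"context"
	"log/slog"
)

// LogEvent emits a consistently shaped log line for a single unit of work.
// The fields the saved log queries depend on are parameters, not optional
// extras, so it is hard to forget them. Field names are illustrative.
func LogEvent(ctx context.Context, event, outcome, organisationID, traceURL string, extra ...any) {
	args := append([]any{
		slog.String("event", event),
		slog.String("outcome", outcome),
		slog.String("organisation_id", organisationID),
		slog.String("trace_url", traceURL),
	}, extra...)
	slog.InfoContext(ctx, event, args...)
}
```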

Then, finally, once we’ve found an interesting log, we’ll always be linked to the relevant trace by a trace URL field on that log. Traces are the final piece of our puzzle. They give us the most zoomed-in look at a single request to enable really up-close debugging. You probably don’t trace all your requests for cost reasons. Maybe they’re sampled or just errors. That means that when we do have them, we want to make sure they’re tracking all of the important stuff. Traces connect the dots on where your request spent its time. They’re particularly useful for debugging slow requests.

You should make it unavoidable to add clear tracing to anything that calls a third party. You need to know that a 1-second external API response was the reason for your system slowing down. We do that by having a single shared base client across all of our third-party API requests. We also add tracing by default to our database queries. We do some smart stuff there, like separating our spans for actually processing the query versus waiting for a connection, so we can really easily identify whether our query is just slow or whether our database is simply under high load. Finally, just like logs, we have a really simple UX in our code base for instrumenting tracing in our code. All the engineers in your team should know how to add good traces and spans around the functions they’re using and when to use that tool.
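A minimal sketch of what a shared, always-traced base client can look like, using OpenTelemetry’s otelhttp instrumentation (illustrative wiring, not the exact client described here):

```go
package thirdparty

import (
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// NewBaseClient returns the HTTP client every third-party integration is
// expected to use. The otelhttp transport creates a span for every outbound
// request, so a 1-second external API call shows up clearly in the trace.
func NewBaseClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: otelhttp.NewTransport(
			http.DefaultTransport,
			otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
				return "thirdparty." + r.URL.Host // e.g. thirdparty.api.datadoghq.com
			}),
		),
	}
}
```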

That’s our lasagna. This framework is what makes our observability UX really consistent and great. The layers are what let an engineer know exactly where to jump in and when. This is a journey we perform all the time. I got paged for high CPU on one of our on-call pods. I head to our overview dashboard, because that’s a pretty vague alert, and I don’t know what’s going on. I quickly diagnose that it’s an issue with our alert routing system, because I see high latency on that row. I dive into our system dashboard, and I quickly find out that it’s an issue with a single organization. Then I can jump straight to our logs, and I have a big button to go and click on them. Then, finally, I can really easily navigate to our traces and find out that it’s an issue with a lock that we’re holding on our database, which is pretty easily fixed. That’s an example of having a really vague issue and having my hand held by our stack to go all the way down to the specific thing that fixes it.

Practical Tips – How Do We Actually Implement This?

There are loads of technical details that actually make this work nicely, so we’re going to finish by diving into a load of practical tips for what we do behind the scenes. The first important thing to do is make user impact your lens. If you’ve ever been paged for database CPU hitting 70%, you’ll know that means nothing on its own. It’s a warning sign. It could mean everything’s about to blow up, or maybe something’s just consuming a lot of resource and is fine. Classic metrics like CPU, memory, they’re really useful. What makes your observability actually amazing for triaging incidents is this addition of business logic and what actually matters to your users. You’re the ones who know your system. You understand what’s actually interesting to track for your users. Think about those questions and answer them. There are a few ways we layer that in. First is an outcome field on our metrics. We already said each of our systems has a single key job.

On that job, we always track an outcome. The key thing is, you can be way more granular than success or error. You know your system, and you know what the interesting outcomes are. Here, for our escalation system, we track what actually happened. Are users being paged? Are the escalations expiring before they reach them? This is really easy to add and cheap to do. We’re already tracking duration metrics on our system’s key jobs, so we just layer in an extra field of this outcome. We really understand within our team what that means and where these escalations are going.
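A minimal sketch of that pattern with the Prometheus Go client; the metric name and outcome values are illustrative rather than the real ones:

```go
package escalation

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One duration metric for the system's single job, with an outcome label that
// is far more granular than success/error. Names here are illustrative.
var executionDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name: "escalation_execution_duration_seconds",
	Help: "How long it took to execute one escalation, by outcome.",
}, []string{"outcome"})

// trackOutcome is called once per escalation, with outcomes like "notified",
// "expired", "cancelled", or "no_one_on_call".
func trackOutcome(outcome string, start time.Time) {
	executionDuration.WithLabelValues(outcome).Observe(time.Since(start).Seconds())
}
```

Because the outcome is just a label on a duration metric that already exists, a dashboard can break down both throughput and latency by what actually happened.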

The next thing we do is tracking user-observed times. Plotting the latency of a function is useful, and it’s the default for a reason, but these traditional metrics often don’t tell you what a user is actually experiencing. Really challenge yourself to think of it that way. Think wider than how long did my specific function take to run, and more how delayed does this feel to my user. In modern software, things are often going through multiple async processes and queues before they actually finish. For us, this means we track an execution delay. Basically, what was the time between an alert firing and my user actually being paged? This graph here was on our dashboard earlier during our load test. While some of our panels looked a little scary, with high CPU and a lot of load in our system, this graph is really comforting because we know the worst case is 2 seconds between an alert firing and a user being paged. In every dashboard I go to, this is always the first type of panel I look at, because I really quickly want to know whether we look really broken to users before I can triage how bad things actually are.
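To make the distinction concrete, a hedged sketch (names are made up): the alert carries its original firing time through every queue and retry, and the delay is observed at the moment the page goes out, so it captures queueing and everything in between rather than a single function’s runtime.

```go
package notification

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Alert carries its original firing time through every queue and retry, so
// any stage can report user-observed delay rather than its own runtime.
type Alert struct {
	ID      string
	FiredAt time.Time
}

// Time from the alert firing to the user actually being paged. The buckets
// here are illustrative; pick ones that match your promises to users.
var executionDelay = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "alert_to_page_delay_seconds",
	Help:    "Delay between an alert firing and the user being paged.",
	Buckets: []float64{0.5, 1, 2, 5, 10, 30, 60, 120},
})

// recordUserObservedDelay is called at the moment the page goes out.
func recordUserObservedDelay(a Alert, pagedAt time.Time) {
	executionDelay.Observe(pagedAt.Sub(a.FiredAt).Seconds())
}
```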

Our next tip is connecting metrics to logs. We already said connecting those lasagna layers is the most important thing for the UX of an engineer debugging. The way that we do that is making sure that we never have a metric that’s stranded on its own; it’s always connected to a relevant log. One way we do that is with event logs. Every metric that is tracked on an individual request level should have a corresponding log that you can search for.

For us, we do that with event logs or anchor logs; they’re also called high-cardinality events. They’re a pretty common pattern in observability. It’s basically a single consistently formatted log that you log at the end of your system’s single job alongside tracking your metric. You need to make sure that you’re tracking all of the values in your metrics, but then adding all of that extra detail that you get with logs, so specific IDs, anything that would make your debugging journey easier. The really important thing about event logs is that you always need to log them. Even if they’re erroring, you need to make sure that they’re showing up. You don’t want to be in the nightmare scenario where you’re in an incident, and you know you have a really useful log, but you just can’t find it because the code panicked.

In Go, that means we wrap these in a defer so that they’re always logged. Whatever your language, just make sure that your code base has a really clear pattern for putting these alongside metrics and guaranteeing that they always get logged.
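A minimal sketch of that defer pattern (field names and outcome values are illustrative): the event log is registered before any real work happens, so it still fires if the code below returns an error or panics.

```go
package escalation

import (
	"context"
	"log/slog"
	"time"
)

// executeEscalation does the system's single job. The event log is emitted
// from a defer registered before any real work happens, so it still shows up
// if the code below returns an error or panics.
func executeEscalation(ctx context.Context, escalationID, organisationID string) (err error) {
	start := time.Now()
	outcome := "unknown"

	defer func() {
		r := recover()
		if r != nil {
			outcome = "panic"
		}
		slog.InfoContext(ctx, "escalation_executed", // the event/anchor log
			slog.String("escalation_id", escalationID),
			slog.String("organisation_id", organisationID),
			slog.String("outcome", outcome),
			slog.Duration("duration", time.Since(start)),
		)
		if r != nil {
			panic(r) // re-raise so callers still see the panic
		}
	}()

	// ... decide who to page, send the notifications, track metrics ...
	outcome = "notified"
	return nil
}
```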

Then the second thing we do is exemplars. Behind me is an example of using exemplars on a metrics-based visualization of latency. Those little dots are single trace representations of my latency value, and I can click on them and then really easily view the relevant trace. This is a really useful tool to bridge that gap between metrics and traces and give you really specific examples of outliers that you might want to look into when you see a really slow request. Using that in our code looks something like this. We track exemplars alongside those places where we are already tracking our event logs and metrics. We use them mostly for slow requests, but we have a really standard pattern for this, and that’s a really key takeaway: for all of the observability tooling in your code, make sure it’s really easy to use and people know exactly how to use it.
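In spirit, that looks something like the sketch below: when the current request has a sampled trace, attach its trace ID as an exemplar on the observation, and otherwise fall back to a plain Observe. This assumes client_golang’s ExemplarObserver and the OpenTelemetry trace package; the helper name is made up.

```go
package obs

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// ObserveWithTraceExemplar records a duration and, when the current request
// has a sampled trace, attaches the trace ID as an exemplar so the dot on the
// latency panel links straight to that trace.
func ObserveWithTraceExemplar(ctx context.Context, obs prometheus.Observer, d time.Duration) {
	spanCtx := trace.SpanContextFromContext(ctx)
	if eo, ok := obs.(prometheus.ExemplarObserver); ok && spanCtx.IsSampled() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{
			"trace_id": spanCtx.TraceID().String(),
		})
		return
	}
	obs.Observe(d.Seconds())
}
```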

Our final tip is visualizing limits. Charts are so much more useful when they give you context of how bad things actually are relative to your capacity. It makes it a lot easier to quickly know how long you have before something actually explodes and how bad things are. This isn’t always possible. Not everything has a ceiling or a clear breaking point, but when they do exist, make sure you’re putting them on your graphs. One way we do that is through plotting work, and we had a look at this visualization earlier. Work is basically a metric of how occupied my system is at any moment, measured in seconds of work per second. In our escalation system, we have two escalation workers per pod and three pods, so that’s 6 seconds of concurrent work available every second.

That means our capacity is that line at the top. We are absolutely limited at 6 seconds, and it’s impossible for us to do any more work than that. You can see the work plotted in the green and red bars. This is during our load test, which is why there’s such a big spike. You can see here that we’re entirely saturated, so we know that we can’t process any more or go faster. The really interesting thing here, though, is making your limit a metric and not a static value. You can see that it dips, and those blue lines there are our deploys.

If we just plotted a static value at 6 seconds, I’d be really confused if I came here and saw that our capacity and work dipped. Because we also see the 6 seconds value dip, I know that shortly after a deploy, it took us a second for pods to come back up, and that’s the reason that we weren’t doing as much. Having these live limit metrics gives you the confidence that you’re actually working with the capacity that you think you have.
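A rough sketch of the two metrics that can sit behind a panel like that: a live capacity gauge that each pod reports for itself, and a work counter incremented with the seconds each job took. Names are illustrative.

```go
package escalation

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

const workersPerPod = 2 // 2 workers per pod x 3 pods = 6 seconds of work per second overall

var (
	// Each pod reports its own capacity: how many seconds of work per second
	// it can absorb, i.e. its worker count. Summed across pods this is the
	// live limit line, and it naturally dips while pods roll during a deploy.
	workerCapacity = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "escalation_worker_capacity_seconds_per_second",
		Help: "Seconds of concurrent escalation work this pod can process per second.",
	})

	// Total seconds spent doing escalation work on this pod.
	workSeconds = promauto.NewCounter(prometheus.CounterOpts{
		Name: "escalation_work_seconds_total",
		Help: "Cumulative seconds spent executing escalations on this pod.",
	})
)

func init() {
	workerCapacity.Set(workersPerPod)
}

// recordWork is called after each escalation job finishes.
func recordWork(d time.Duration) {
	workSeconds.Add(d.Seconds())
}
```

On the dashboard, the work series is roughly the per-second rate of the counter summed across pods, plotted against the summed capacity gauge as the live limit line, which is why the limit dips during deploys instead of sitting at a static 6.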

Those are our tips. They make it super easy to build useful dashboards, and they’re things that you want every engineer to know about when they’re contributing to your code base. When are you actually done with this process? I’m sure all the engineering managers are going to cringe when I say you’re never done. Software changes and dashboards are living. When we put this project down, we didn’t make any major changes for a year, and everything still feels super relevant, and these dashboards are used daily. There are a few things that should tell you it’s time to trust your system and your observability. You’ve probably thrown all sorts of shit at your system. You’ve known exactly how it’s coped every time, and the impact on your users.

The really key thing is that your team can all do the same, and you feel really confident that they’d handle this in a real incident. The final tip on that is don’t do this on your own. Anyone can build a beautiful set of dashboards and logs on their own, but there are many people contributing to your code base and fixing your incidents, so it’s really important you’re not the only person that knows how to handle this when things go wrong.

Doing this whole process with our team is what gave us confidence in our system. Reliability isn’t just the code you write; it’s whether anyone in your team can wake up in the middle of the night and quickly understand whether this is a serious issue. The expectation isn’t that everyone is an expert in these tools, but it’s really important that your team can quickly understand, using your tools, whether this is a quick blip for a single customer or a full-scale production outage. Doing this entire process, running our drills, building our dashboards, and exercising our system with our team and rotating the team through, is what got us trust in and buy-in to the tooling. This tooling is for your team, and you need them to feel like they own it; doing it with everybody is the easiest way to do that. One extra tip on that is game days. We run game days as quarterly drills of our incident management process, so we try and break our system and see how our engineers handle it.

If you think of the drills that we ran earlier as a test with the textbook open, where there were no wrong answers and we just wanted to see what happened, game days are much more about being thrown in the deep end. There are no answers, and you’re expected to just work out what’s going on and fix it. Mostly we use game days as an opportunity to test our incident response and communication, but they’re a really good way to see how your team is using your observability tooling, and whether there are any gaps in knowledge or in your tools. Monitor these, spot places where people are struggling, and work out where you need to invest.

Summary

That’s our lasagna. Basically, good observability isn’t going to come for free with tools; it’s everyone’s job to really understand your system and what’s interesting to track. Doing that from theory alone makes it almost impossible to build a really useful stack. Using our predict, prove, measure flow and exercising our system is what can make you confident that it’s actually useful. Once you’ve got that down, make sure you’re applying great UX principles to your dashboards; connecting layers and holding the hand of an engineer through the debugging journey is a really important way to do that. It’s really important that we’re always thinking of the user at the end of this and not just technical measures that might be interesting. Then, finally, do it with your team, involve everybody, and make sure that everyone knows exactly how to handle these tools and jump in in the middle of an incident.

Questions and Answers

Participant 1: I’m intrigued by your game days. Are you causing non-production incidents? Is this chaos engineering in production? What’s the actual drill that you’re taking the teams through?

Martha Lambert: It depends. We’d love to say we can always test in production, but with our on-call system, we don’t want to really break it. Recently, we’ve been running these on staging. There are all types of different scenarios we test. It depends on what we think the team needs to get better at at any given time. It might be thinking about our disaster recovery process. It might be taking down our database. It might just be having someone do something really dodgy in a console and seeing how long it takes to find it. Mostly, recently, we’ve been doing those in staging, but there are basically loads of different ways we’ve run them before.

Participant 2: What are your thoughts on continuous profiling? Because at my organization, we’re doing metrics, we’re doing logs, traces. Have you incorporated any of the continuous profiling tools?

Martha Lambert: It’s not something we’ve dug loads into. I think the key thing with this entire talk is that it’s really technology and tool agnostic. Yes, we were using Grafana and we were using metrics in a certain way, but with all of this, it’s really about the approach of just thinking what tools you have available to you, where you point an engineer next, and the key thing is how they all link together. Anything you want to use, do it, and just make sure that they’re always connected and pointing people in the right direction.

 
