From Grassroots to Enterprise: Vanguard’s Journey in SRE Transformation

News Room | Published 6 October 2025

Transcript

Christina Yakomin: I’ll be telling Vanguard’s story of our journey in SRE transformation. My name is Christina Yakomin. I am a senior manager at Vanguard. I currently lead the performance and resilience optimization team, which is a small team that partners with our application, product, and platform teams across the organization to validate the performance and the resilience of our most critical and complex user journeys. Up until this past November, I’d spent my first 10 years at Vanguard climbing up the technical ladder. My prior role was senior architect, so I’m not too far removed from the IC experience either.

Much of what I’m going to share in this talk was accomplished by ICs, either work that I did myself or that my peers were driving. Throughout that career, I’ve had experience with full-stack application development as well as cloud infrastructure automation, though I have specialized in reliability engineering for about the past 6 or 7 years. I’ve been a speaker at lots of conferences in the past; you may have seen me at prior QCon events, the Enterprise Technology Leadership Summit, All Day DevOps, o11yfest, or others. A lot of the time I’m talking about chaos engineering, since I’m a bit of a chaos engineering enthusiast, or about other aspects of reliability engineering and DevOps.

Outline

We’re going to start with the business side: a brief Vanguard overview, in case you are not already familiar with Vanguard and what we do. Then we’ll dive right into the SRE program origins, how we built this program from the ground up, starting from almost no resources to what is today a thriving job family across the entire organization, including how we stood up a coaching team to help us drive that along the way. Then, we’ll talk about how we’re tackling some modern challenges for SRE through a technical lens, and I’ll shed some light on some technical solutions that we have been driving more recently.

Vanguard Overview

Vanguard is one of the largest global asset managers. The numbers fluctuate day to day as the market does, but as of the end of March, we had about $10 trillion in assets under management. We’re best known for our mutual funds and ETFs; you may have a brokerage account or a 401(k) with Vanguard or one of our competitors. We are a globally distributed organization, though our headquarters is in Malvern, Pennsylvania. That’s where I am based. We also have an IT presence in Charlotte, North Carolina, in London, in Melbourne, Australia, and soon we will have a presence in India as well.

SRE Program Origins

Let’s talk about where things were approximately 10 years ago when I was getting my start at Vanguard. These practices, they may not have been fully outdated at the time, but we can certainly look back now and say, yes, these things are probably not what we would consider the best practices in the modern era. We were doing quarterly performance testing windows ahead of monolithic releases. Just about every single thing that we deployed about 10 years ago was on monoliths. We did four releases a year. It was a really big deal when we got to five, and then eventually six releases per year. All development would stop for a good few weeks to make sure that we could have time for performance testing.

All of that performance testing was conducted by a central team where all they did was performance test. They conducted all the tests, and pretty much all they were doing was sending a bunch of load at the monolith and comparing results to what they looked like last quarter.

If it was the same, or just not worse, or really under a certain threshold that we’d set, you were good to go. There was your performance test. They didn’t really worry too much about the specific business context of what they were testing because they weren’t familiar with it. Neither was our operations team. We still had Dev and Ops quite siloed from each other. Once things made it into production, operations was a totally separate ballgame. They were the ones responsible for all of the hosting. We even had centralized production support and triage teams that did not have deep application or business context in what they were supporting, even though they were on the front line. They knew enough to be able to walk the site and tell you which page was having an issue, but definitely not enough to truly troubleshoot. There were a lot of handoffs, making things pretty inefficient for operations.

Enter DevOps, which obviously solved every single one of these problems with no complications whatsoever. We broke down our monoliths into much smaller microservices iteratively over the next several years, and migrated them to the cloud in the process. As we did this, we also federated the responsibility for things like testing and operations and production incident response to the product teams who are now just responsible for their individual slice of the pie, their microservice. As we did this, we were also able to enable continuous delivery.

Now instead of things releasing quarterly, or five or six times a year, things are releasing five or six times a week. A lot of our product teams deploy multiple times per day. It’s pretty normal at Vanguard these days. This drastically increased the pace of software delivery and it had a lot of really positive impacts. It increased our agility. It allowed us to deliver business value much more quickly. The entire SDLC accelerated. There are some things that started to get missed as well.

This is around the time that we started talking about SRE and what it means to do SRE. SRE as an enabler of this DevOps transformation. It’s a little bit counterintuitive, and one of the lessons we learned is maybe we didn’t start in exactly the right place, but we were hearing a lot about this chaos engineering thing. Heard about it from Netflix, some other companies that were leading the charge with chaos engineering. We were really interested in this. A colleague of mine who was an individual contributor at the time in an engineering role was really passionate about the value that we could get out of chaos engineering, something that we had previously never tried at Vanguard. He wrote a white paper and made the pitch, got senior leadership on board, and they told him, “Go prove it. Make something happen. Make it work”. It was him and an intern. Then eventually also me.

Then another developer, and then another developer, as we continued to prove out the value of chaos engineering as a practice. The way that we did that is we built a self-service tool. We evaluated some vendor tools. It was pretty early in the market at the time, so there were a few, but they weren’t super feature rich and didn’t really meet the needs for the environment that we had at the time. We built one ourselves. It was pretty simple. We kept the blast radius of experiments really small, then iteratively added new features, like the ability to schedule experiments and to add assertions to experiments so that you could tell the system what your expectations were and automatically determine if those expectations were met.

Then we added more failure modes as well, including some with slightly larger blast radii as time went on. For more information about the chaos tool that we built internally, I’ve got lots of other talks out there that you can find, usually under the title ‘Cloudy with a Chance of Chaos’, that dive super deep into the tool that we’ve built. We still use this tool today. We’ve even added integrations to third-party providers like AWS Fault Injection Service.
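For illustration, here is a minimal sketch of what a self-service experiment definition might look like; the internal tool’s actual schema isn’t public, so every field name below is hypothetical.

```python
# Hypothetical sketch of a self-service chaos experiment definition.
# Field names and values are illustrative only, not the internal tool's schema.
experiment = {
    "name": "ecs-task-termination-checkout",
    "target": {
        "service": "checkout-api",                    # assumed service name
        "environment": "perf",                        # production-like lower environment
        "blast_radius": {"percent_of_tasks": 10},     # keep the impact small
    },
    "failure_mode": "terminate_tasks",
    "schedule": {"cron": "0 14 * * 2"},               # e.g. weekly, Tuesdays 14:00 UTC
    "assertions": [
        # expectations the tool evaluates automatically after the run
        {"metric": "availability", "operator": ">=", "threshold": 99.5},
        {"metric": "p99_latency_ms", "operator": "<=", "threshold": 2000},
    ],
}
```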

Following this, we looked at performance testing as well. We’d made it completely self-service to do your chaos engineering, but we were finding that there wasn’t as much value in the practice if there wasn’t load coming into the system. We weren’t quite comfortable with running our chaos experiments in production yet. We were running them in a lower environment that was as production-like as possible. One of the things that was not like production at all was the traffic patterns. We needed to find a way for the people who were trying to run these chaos experiments to also generate some load reliably. Enter performance testing as a service. Very similar concept: a self-service tool. It’s pretty much an orchestrator for existing frameworks like JMeter and Locust. Today we primarily use the Locust framework for our testing.

It spins up infrastructure, including load generators, just in time for a test, takes in a test configuration, including a Locust script, sends a bunch of load at your service, and then spins things down. Over time, we’ve added more features there as well: the same scheduling, plus some automated reporting that aggregates data from multiple sources and even overlays onto that report the details of whether you ran a chaos experiment during that test and during what times, so that you can see right there on a single report if there was any impact to the application behavior while the chaos experiment was running.
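As a rough sketch of the kind of artifact a team hands to such an orchestrator, here is a minimal Locust script; the endpoint, host, and load figures are placeholders.

```python
# Minimal Locust script of the kind a team might submit to a PTaaS-style
# orchestrator. The endpoint and host are placeholders.
from locust import HttpUser, task, between

class BalanceUser(HttpUser):
    # pause 1-3 seconds between tasks to approximate human pacing
    wait_time = between(1, 3)

    @task
    def get_balance(self):
        # hypothetical read-heavy endpoint under test
        self.client.get("/api/accounts/123/balance", name="get_balance")
```

A headless run of that script might look like `locust -f balance_test.py --host https://perf.example.internal --headless -u 200 -r 20 -t 30m`, with the orchestrator supplying the load-generator infrastructure and the report afterwards.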

There’s one more piece that we were missing as well, and that’s strong observability. We probably did this backwards, to have chaos engineering, then performance engineering, and then the really strong observability. If I could go back and do it over, I’d do it in reverse. When we talk about resiliency testing, it’s that combination of the performance testing with the chaos engineering, with the goal of validating a system’s ability to handle failure scenarios under load. I’ve talked about our PTaaS tool. I’ve talked about our Climate of Chaos tool. The key with observability is that you need it to really be able to understand what’s even happening in the system when you are sending that load and then injecting that failure.

For us, what that looked like was standardizing around OpenTelemetry and getting some data out of just unstructured logs. We had been sending just about everything as unstructured logs that were really unwieldy to query. We started sending metrics and trace data to the appropriate tools in the appropriate formats through that standardization on OpenTelemetry, so that they’d be easier to query depending on what you needed to interpret. That OpenTelemetry standardization also gives us future flexibility: we’re going to be able to swap out the tools that we’re using as the industry landscape of what’s available changes, because the industry really seems to have standardized around OpenTelemetry as well.
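A minimal sketch, assuming the standard opentelemetry-sdk and OTLP exporter packages, of what that standardization can look like in application code; the service name and collector endpoint are placeholders.

```python
# Minimal OpenTelemetry tracing setup in Python; service name and collector
# endpoint are placeholders, not any particular organization's configuration.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "checkout-api"})
provider = TracerProvider(resource=resource)
# ship spans to an OTLP-compatible collector; the backend can be swapped later
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("get-balance"):
    pass  # application work happens here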

One other thing that we changed in terms of observability is the way that we talk about availability. Back 10 years ago, when everything was monolithic, it was pretty much: you either had an open incident or you didn’t. You were down or you were up. That was how we calculated things, all in terms of uptime and downtime. Now we talk more in terms of service level indicators and objectives and error budgets. This allows us to inject some nuance into things and, right there in our observability metrics, better understand the urgency of a situation. You could have two incidents where, in one, the error rate is 3%.

In that case, maybe it’s not even an error rate, maybe the failure rate is 3%, but the type of failure that you’re experiencing is things are a little bit slow. Those 3% of requests are taking 4 seconds to complete instead of the desired 2 seconds. Then maybe in the other incident, you have a 100% failure rate and the failure case is the client is getting a 500 error. Those are wildly different incidents. Previously, they would have both been lumped in under downtime. Now that nuance with service level indicators and objectives and error budgets allows us to more accurately assess the urgency of responding. You’re probably responding in both scenarios, but taking a much different approach with how you respond.
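The arithmetic behind that nuance can be sketched in a few lines; the SLO, request counts, and incident sizes below are illustrative, not Vanguard’s numbers.

```python
# Illustrative error-budget arithmetic (all numbers are examples).
slo = 0.999                      # 99.9% of requests should succeed within target latency
window_requests = 10_000_000     # requests observed over the SLO window
error_budget = (1 - slo) * window_requests   # 10,000 "bad" requests allowed

# Incident A: 3% of requests are slow (4s instead of the 2s target) for ~30 minutes
bad_a = int(0.03 * 200_000)      # assume ~200k requests during the incident
# Incident B: 100% of requests return 500 errors for ~30 minutes
bad_b = int(1.00 * 200_000)

for name, bad in [("slow responses", bad_a), ("hard failures", bad_b)]:
    burned = bad / error_budget
    print(f"{name}: consumed {burned:.0%} of the error budget for the window")
```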

We’ve adopted the CASE method for effective alerting. This is an acronym. It stands for Context-heavy, Actionable, Symptom-based, and Evaluated. Really what it means is that when an engineer on-call receives an alert, they have something they need to do about it. We’re not paging someone in the middle of the night because some server somewhere rebooted and it’s fine now. Most of the time, you don’t need to be woken up for that. It’s actionable, and within the alert you have all the context you need to be able to figure out what to do next, whether that’s a link to a runbook or descriptive information about what is broken.

I used to get pages that were just a server name and no further details. It was like, “Nice. Is it just saying hi? Is it down? Did it reboot? I don’t know. Time to go check the dashboards”. We have reevaluated all of our alert portfolios to make sure that they’re complying with this CASE methodology. I did mention a little bit before that we have built some automated reporting into our PTaaS tool, the performance testing tool, so that any time a test completes, you’re getting an informative report that aggregates not just what you’d automatically get out of the box with Locust, but other key metrics as well, like the saturation metrics for the underlying components of the applications under test.
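As a hypothetical illustration of those CASE properties, not the schema of any particular alerting tool, an alert definition might carry something like the following.

```python
# Hypothetical alert definition illustrating the CASE properties
# (Context-heavy, Actionable, Symptom-based, Evaluated). All fields,
# URLs, and names are placeholders.
alert = {
    "name": "checkout-availability-burn-rate",
    "symptom": "checkout SLI error rate above 2% for 10 minutes",  # symptom, not cause
    "severity": "page",
    "context": {
        "dashboard": "https://observability.example.internal/d/checkout",
        "runbook": "https://runbooks.example.internal/checkout/error-rate",
        "recent_changes": "link to change records for this service",
    },
    "action": "follow runbook step 1: check dependency health and recent deploys",
    "last_evaluated": "2025-09-30",  # reviewed at the last on-call handoff
}
```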

SRE Coaching

We rolled out all of these tools. We’ve now got really strong observability options. We have the ability to run performance tests and chaos tests completely independently without a centralized team, and everyone is responsible for their own testing, their own operations, everything. We’re not really seeing strong adoption of the tools. We’re not seeing major improvements to availability, and teams seem confused.

As an engineer on the teams building those tools, I spent the vast majority of my day answering questions about how to do this and why to do this, not building new features. Out of necessity, we actually spun up an SRE coaching team that, beyond just the evaluation and rollout of tools, built a self-study curriculum full of aggregated internal documentation of all different styles, including activities to complete, videos to watch, documentation to read, and documentation to bookmark for later. The goal of this was to give the engineering population a one-stop shop for everything they needed to know to get smart on how to be better at doing operations for the products they supported.

Many of them were coming from a full-stack engineer background, but it was development heavy, and DevOps was still a bit of a transition for them. The SRE coaching team then also got involved with the strategic vision as well. What is the long-term goal here? Is it to have these application teams do all of the SRE practices that we’re talking about? Maybe it looks like having full-time SREs throughout our organization. That lived under SRE coaching as well.

This is an outline of what our curriculum looked like. I’m not going to go line by line through every single thing here, but just here to show you the way that we structured it and why. We had it build upon itself, so if you just wanted the absolute fundamentals, you would complete level 1. Level 1 was about the basics of how to talk about availability in terms of SLIs, SLOs, and error budgets, how to structure your alerts, and what tools we had available for telemetry, for your logs, your metrics, traces. What’s the difference? When do you want to use each one? This is fundamental for every engineer in an on-call rotation, and it was also beneficial to some of our non-engineers.

We ended up adapting level 1 to have a version that was palatable to product owners, non-technical crew members, managers, so that everyone could get involved in talking the talk about availability and error budgets, and making prioritization decisions based on them. That’s been very successful. Level 2 is more in the weeds technically. This gets into architecting reliable systems, highly available architecture patterns, resilient deployment strategies, also gets into capacity planning, and mentions performance testing there. Then, finally, in level 3, this is for the folks that are really passionate about getting the most out of failures or being proactive about preventing them. We talk about the practice of failure modes and effects analysis and chaos engineering. We talk about incident response. Of course, we talk about blameless post-incident reviews.

SRE Operating Model

This is the SRE operating model that came out of the strategic visioning that the team did. This is where we’ve landed today for how we have SREs deployed across the organization. The product teams are pretty much the same two-pizza product teams that we’ve had, with full-stack engineers on them. We very rarely have a fully dedicated embedded SRE within those product teams. The scenarios where we would have one is if there is some very specific knowledge that you need to be able to support that application. Something that we don’t have all over the rest of the organization. A really good example, one of the first places we put a dedicated SRE was on the team responsible for our native mobile application.

Then we have SRE leads, one level up. In our organization, they operate at a similar level to architects, in the way that an architect might have multiple different product teams that they work with to consult on any major new components that we’re adding, any major architectural changes that we’re making, modernization efforts, things of that nature. The SRE leads are going to be responsible for helping these teams get their resiliency testing done and evaluate their alert portfolios. They can swing into incident response as needed, though they may not be embedded in the on-call rotation. It depends on the area of the organization. One layer above them is the SRE champions. These we have basically per division, per line of business for us.

The SRE champions are usually just a small team of individuals that are responsible for things that are cross-cutting. Though we have product teams with federated responsibility and accountability for their slice of the pie, sometimes, often, a user journey spans multiple product teams. It may even span multiple lines of business.

In the case of logging on to vanguard.com, vanguard.com is owned by one line of business, but all of the security tooling that powers the ability to log on is in our security division. Those two groups need to work together. Their SRE champions facilitate that cross-division coordination. They partner with me and my team. Where we had the SRE coaching team, we now really have a centralized resiliency program office. This encompasses the teams that own our observability tool suite, change management, incident management, my team in performance and resilience optimization, our resiliency testing tool teams. It’s grown quite large over the years. For the specific practice of testing, it’s often my team that gets pulled in for the most cross-cutting, most complex, and most critical testing efforts, especially where testing hasn’t been successfully accomplished before for a given reason.

Challenges in SRE and Scaling SRE

Some additional challenges that we face in SRE and scaling SRE, one of the big ones is demonstrating impact. In particular, at a company like Vanguard, where the product that we are providing to our clients is a financial product. It’s our mutual funds. It’s our investments. Yes, we are providing those most frequently through digital channels, but we’re not selling software. IT remains a cost center for us. We’re not able to directly tie outcomes achieved in IT to revenue generation without making a few jumps.

An SRE in operations, that’s even harder. I really have a hard time selling people on, ok, so we ran this test in a lower environment, and because we ran this test, we changed a configuration that if a certain scenario had happened in production, it would have caused a problem, and if that problem had happened, it would have been really bad, and we would have lost money, kind of. It would have affected our reputation, and that’s worth this dollar amount to us, so we saved money. It’s just too many hops. It doesn’t really work. We have to get creative about the ways that we demonstrate impact.

At this point, we stop at proving improved availability because the entire organization bought into the idea that our client experiences are only as good as they are available. It doesn’t matter how many amazing features we deliver to our clients, if they can’t reach them, it’s worthless. We focus on that and try to prove out a tie to better availability from the teams that are doing this testing. We also look for where are we gleaning findings from the testing that we’re conducting.

Another big challenge, though, especially when it’s hard to articulate your impact is budgeting. Not just budgeting for things like the cost of running tests, but the cost of infrastructure capacity to do things like run in multiple regions to keep yourself safe from the risk of a regional outage, but also the budgeting for crew members to have employees dedicated in SRE lead or champion roles. That brings me to staffing from another angle as well, not just budgeting for staff, but hiring and training them in the first place.

Back when we started hiring SREs, probably in the realm of 5 or 6 years ago, this kind of talent was really hard to come by, and we really felt like we needed these SREs to come in and hit the ground running. It was hard enough just to find the skill set, then bring those people in, and also train them on Vanguard’s context, which is why we spent so much effort building that curriculum and training the skill set from within, so that we were at least working with people who already had some business or technical context about how to operate at Vanguard. This remains challenging today, but of the three things on the slide, it’s probably the one that’s gotten a little bit easier as SRE practices have proliferated across the industry. There are more folks out there who have been in an SRE-type role before and are looking for work.

Technical Achievements – Tackling Modern Challenges for SREs

We’ve got an architecture diagram here. For the rest of the talk, I’m going to be focusing on some technical achievements that we’ve had in much more recent times. Most of these are from about the past year: things that we’ve done to tackle some modern challenges for our SREs.

First thing I want to talk about is region failure game days. I’ve alluded a little bit to the importance for us of having multiple regions to protect us against a regional outage. The way that we validate that we are able to withstand that regional outage is with a tool called AWS DNS Firewall. This is specific to the applications we have hosted in our AWS environment, which is not everything, but it is a lot of things. AWS DNS Firewall is a component of Route53 Resolver, and it’s there to regulate egress traffic from a VPC. I don’t want to go so far as to say we’re not using it as it is intended, but we’re definitely not using it as it is advertised. It’s really more advertised as a security feature intended to prevent DNS exfiltration attacks. The way that we’re using it is to block access temporarily to region-specific URLs to simulate a regional outage.

By doing that blocking, anything that’s running, for example, in the us-west-2 VPCs won’t be able to reach back out to anything in the us-east-1 VPCs. Part of what makes this possible for us is we have quite a bit of standardization around our URL patterns in our cloud environment. We can take advantage of some strategic wildcarding and have things like *us-east-1 versus *us-west-2 in the configuration for this experiment, and block any traffic coming from us-west-2 back over to us-east-1.
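A sketch of how that blocking could be set up with boto3 and Route 53 Resolver DNS Firewall follows; the rule group ID, domain pattern, and naming are placeholders, and the real game-day tooling is more involved than this.

```python
# Sketch of setting up a temporary DNS Firewall block to simulate a regional
# outage. IDs and the wildcard domain pattern below are placeholders.
import uuid
import boto3

r53r = boto3.client("route53resolver", region_name="us-west-2")

# 1. Domain list containing the region-specific URL pattern to block
domain_list = r53r.create_firewall_domain_list(
    CreatorRequestId=str(uuid.uuid4()),
    Name="gameday-block-us-east-1",
)["FirewallDomainList"]

r53r.update_firewall_domains(
    FirewallDomainListId=domain_list["Id"],
    Operation="ADD",
    Domains=["*.us-east-1.example.internal."],  # placeholder wildcard pattern
)

# 2. BLOCK rule in a rule group that is associated with the us-west-2 VPCs
r53r.create_firewall_rule(
    CreatorRequestId=str(uuid.uuid4()),
    FirewallRuleGroupId="rslvr-frg-EXAMPLE",    # placeholder rule group ID
    FirewallDomainListId=domain_list["Id"],
    Priority=100,
    Action="BLOCK",
    BlockResponse="NXDOMAIN",                   # cross-region lookups simply fail
    Name="block-cross-region-egress",
)
```

Reverting the experiment is then a matter of removing the rule (or the rule group association) once the game day is over.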

The reason why this is important for us is that it allows us to then send some traffic to us-west-2 when it is unable to access us-east-1, and make sure that there weren’t any unanticipated dependencies back on the other region when this region is isolated. We want to make sure that every single one of the dependencies of the critical user journeys are deployed in the isolated region so that the entire user journey end-to-end can be completed. We use our PTaaS tool during these tests to generate the load. We don’t just use PTaaS for the high load tests. It’s able to store scripts and provide that automated reporting, so it ends up being really helpful for low load generation as well.

Another thing that I want to talk to you about is request rate scaling, which is an approach to autoscaling that we developed out of necessity. Basically, the traffic patterns that we were observing at Vanguard were pretty consistent, but with really sharp increases at the time that the market opens for trading at the start of the business day, so right around 9:30 a.m. Our CPU and memory-based scaling policies just weren’t working quickly enough to account for that traffic because of how long it took to provision new tasks or instances. By 9:27, if you’re not already scaling up, and scaling up significantly, you’re going to be out of capacity by 9:33. Obviously, that wasn’t going to be acceptable.

Instead, what a lot of teams were doing is just staying overprovisioned around the clock because they didn’t trust autoscaling at all. Some teams were leveraging some form of scheduled scaling to maybe spin a whole bunch up at 9:00, and then spin back down off hours, but it was still more capacity than we needed most of the day. That spike in traffic is really consistent right at 9:30, and then it drops off for a bit in the middle of the day, peaks up again toward the end of the day, and then drops off again outside U.S. business hours.

One of the things we looked at was how tightly coupled the incoming requests per second per ECS task, for example, were to the CPU utilization of that ECS task, which was being used as a rough indicator of the expected performance of that task. In the graph you see on the screen, there was an application that had extremely tight coupling: as the request rate increased, so did CPU utilization. We were then able to set a pretty conservative desired threshold for CPU utilization, something like 30% even, to keep it really low and be confident that the system would remain healthy. That would give us a number for the expected requests per task per second that that application could handle. What we then are able to do is look at the incoming rate of requests and the rate of change in that request rate.

Based on that, we can predict how many tasks we think we’re going to need in the next 5 minutes, 10 minutes, 15 minutes, and start scaling them up, dynamically provisioning a new set of tasks based on our expectations of what we need. Since those expectations were based on a conservative estimate, we were usually just slightly overprovisioning and then correcting back the other way. This enabled us to leverage scaling in a way that we really weren’t comfortable doing before.
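A minimal sketch of that projection step is below; the per-task capacity, lookahead, and headroom values are assumptions for illustration rather than the production policy.

```python
# Sketch of request-rate-based scaling: project the request rate forward and
# convert it into a task count. All constants are illustrative assumptions.
import math

def desired_task_count(current_rps: float,
                       rps_per_minute_delta: float,
                       lookahead_minutes: float = 10.0,
                       rps_per_task: float = 50.0,   # capacity implied by a ~30% CPU target
                       headroom: float = 1.2) -> int:
    """Predict how many tasks are needed `lookahead_minutes` from now."""
    projected_rps = current_rps + rps_per_minute_delta * lookahead_minutes
    return max(2, math.ceil(projected_rps * headroom / rps_per_task))

# e.g. 400 rps now, growing 60 rps/minute as the market open approaches:
# desired_task_count(400, 60) -> 24 tasks, sized for ~1,000 rps ten minutes out
```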

For the first time, autoscaling was as reactive as we needed it to be to meet our peak load demands and significantly reduce our overall AWS bill. Even for those teams that were using scheduled scaling before, they were able to more accurately react to that individual day’s traffic needs and not just be prepared for a peak that may not have been coming that day.

There is some variation, day-to-day, and a lot of that depends on how volatile the market is on any given day. Some of our highest peaks in the past year have been things like a day that the Federal Reserve meets and makes an announcement, or the day after the U.S. presidential election. While it’s not a surprise, we don’t always know exactly how the market is going to react, just that it will.

Then we certainly don’t know how our clients are going to react and what that means for what types of peaks we can expect. With everything that I explained on the last slide, with that request rate and change in request rate, we also started using something called Intelligent Scheduler that we built to predict the anticipated volume for the day on a curve, based on that same business day in prior weeks. Today we would have looked at: what did things look like last Tuesday? What did things look like the prior Tuesday, and the Tuesday before that? From that we get a pretty good estimate of what the typical traffic patterns and typical capacity needs are going to be, and proactively provision tasks based on that. What we do here, which you can see in the graph, is we can also determine early on if there are deviations between our expectations for the day and what’s really happening, before we even get to the big peak. In the graph, orange is the real traffic for this particular day and blue is the model.

You can see there was already a gap really early on, more like 6 a.m., 7 a.m. It’s small because, overall, traffic is pretty small at that time, but it’s already more than we thought it was going to be. It’s more by a factor of 2x. When we notice the model is underpredicting, we calculate by how much is it underpredicting the volume, and then multiply the model by that factor. The gray line that you see there is the adjustment that the model made based on the fact that traffic seemed much higher than usual and much higher than expected on that day. It ended up much closer to the real traffic line.
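A simplified sketch of that correction step, with illustrative data rather than the production model:

```python
# Sketch of the model-correction step: if the day is running hotter than the
# modeled curve (built from the same weekday in prior weeks), scale the rest
# of the day's forecast by the observed ratio. Data shapes are illustrative.
def adjust_forecast(observed: list[float], modeled: list[float]) -> list[float]:
    """observed/modeled: requests per interval so far today vs. the model's curve."""
    n = len(observed)
    ratio = sum(observed) / max(sum(modeled[:n]), 1e-9)
    if ratio <= 1.0:
        return modeled            # model is at or above reality; leave it alone
    # underprediction: multiply the remainder of the day's curve by the ratio
    return modeled[:n] + [v * ratio for v in modeled[n:]]

# e.g. by 7 a.m. traffic is ~2x the model -> the forecast for the 9:30 peak doubles too
```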

Really where you see deviations, it’s mostly over-anticipated load. This was another enhancement to that request rate scaling that allowed things to be much more trustworthy when it comes to autoscaling, and brought us more comfort and confidence with not overprovisioning all the time. If we had to stay overprovisioned to handle a day like this all the time, even using scheduled scaling, we’d be carrying capacity we don’t need every single day at 9:30. Now the fact that we can trust our autoscaling to react accordingly in real time is a really good thing for us.

I want to talk a little bit more about resiliency testing, which is more the wheelhouse of my team. I mentioned, back 10 years ago, we were doing this about quarterly with a centralized team. Now that we deliver software every single day, multiple times a day, what does that mean for how often we should be running things like performance tests? Should we be doing them every single day multiple times a day? Sounds unwieldy to me, but how can we get closer to that reality so that every time we’re deploying new software, we have some amount of confidence in its ability to retain the same performance that we’ve observed previously or to improve it.

Some of the things that make this really challenging are, first and foremost, the time it takes to run the test. You aren’t really getting much value out of a performance test that you’re running for less than 30 minutes. Many of our performance tests run for more like 2 hours. Let’s say I was able to cut it down to a really efficient and effective performance test in about 30 minutes. If I went and told all the developers at Vanguard that I was going to integrate this cool new performance testing feature into their CI builds, but now they were all going to take 30 minutes longer every single time they ran a CI build for their main branch, I would lose my job. That’s just not going to happen. They would protest and run me out of the building. We need to find a better way to do that. There’s also the stability of the test environment to contend with. We need to think about the fact that in these integration environments, these lower environments where we’re running the tests, unless we implement all these unwieldy change management processes around them, iteration is happening.

Developers are building things, and they’re not purposely breaking things, but sometimes testing is how you learn. That’s why you learn it in the lower environment before it goes to production. As much as we’d like it to be a production-like environment, sometimes the functionality is broken. Something downstream affects your ability to run a test on a given day. There’s also the cost of maintaining prod-scaled resources. Quite a lot of our critical workflows, we have multiple copies running in production across multiple regions. To now maintain another production-scale copy in a lower environment is getting really costly, really quickly.

Some of the solutions that we’re exploring to address those challenges right now, one is post-deploy hooks. Rather than integrating in our pipeline as part of a build, thinking about where we can kick off a performance test asynchronously after something’s been deployed to the lower environment, but before it gets deployed to production, so that the builds can still happen as frequently as we need them to and as quickly as we need them to.

Then maybe the test kicks off after all of that is done, while the developer is moving on to a different feature or already tinkering with the next iteration. Service virtualization is something that we’ve already got in place today, and we’re working on making it a little bit easier for developers to integrate with. What this allows is very similar to how you would mock in a unit test, but you’re segmenting and isolating your components under test by mocking, in the live lower environment, the dependencies that you don’t want to call out to.

If you have one of those downstream dependencies that’s acting up in the test or perf environment on a given day, you can just mock the response from it, get a guaranteed positive response that looks like what you expect it to, and move right along. This is helpful for things like test data management, for stability of a specific downstream service, or just for completely isolating an individual service so that we can do a component level performance test in a way that we haven’t really been able to do before.

Then, finally, we’re also building out automated environment scaling to limit the incurred cost associated with maintaining this lower environment. Right before a test is run, we will automatically scale up all of the components that we need to their production scale based on the infrastructure as code configurations that we can see for how they look in prod, scale them up just in time, run the test, and then scale them back down.
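A sketch of that just-in-time scaling with boto3 and ECS is below; the cluster, service names, and task counts are placeholders, and in practice the production-scale numbers would come from the infrastructure-as-code definitions.

```python
# Sketch of just-in-time environment scaling around a test using boto3 and ECS.
# Cluster, service names, and counts are placeholders for illustration.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

PROD_SCALE = {"checkout-api": 24, "pricing-api": 12}   # derived from IaC in practice
IDLE_SCALE = {"checkout-api": 2, "pricing-api": 2}

def scale_environment(cluster: str, targets: dict[str, int]) -> None:
    for service, count in targets.items():
        ecs.update_service(cluster=cluster, service=service, desiredCount=count)
        # wait until the service reaches a steady state before proceeding
        ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

scale_environment("perf-cluster", PROD_SCALE)   # scale up right before the test
# ... run the performance test via the orchestrator ...
scale_environment("perf-cluster", IDLE_SCALE)   # scale back down afterwards
```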

The Rise of AI

Now we’ve got a slide where the title is the rise of AI. I will try not to be too woo-woo about AI and take a rather nuanced stance on it. I want to talk about AI in two different contexts. The first is, how do we resilience test the tools that are backed by AI? As we are building products for the clients of Vanguard that will now have dependencies on models, how do we make sure that those features are resilient the way that we’ve been testing everything else we’ve ever built, when those models can really seem like a black box? In many ways they are. There are new failure modes that we need to think about, like not just model latency which can be a bit unpredictable, but also outages of knowledge bases that you may have varying levels of observability into the availability of.

Then thinking of load as not just the number of concurrent users but also the number of tokens that each user is submitting, like the length of their request. I am guilty of being particularly wordy when I talk to anything with generative AI, so I know I am part of the problem that I have to solve here with thinking about concurrent users and how much they want to yap with your AI.
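A Locust-style sketch of shaping load by prompt length as well as concurrency follows; the chat endpoint, payload shape, and token estimate are placeholders.

```python
# Sketch of shaping load by prompt length (approximate tokens) as well as
# user concurrency. Endpoint, payload, and token math are placeholders.
import random
from locust import HttpUser, task, between

SHORT = "What's my balance?"
WORDY = " ".join(["Could you walk me through my account history in detail?"] * 10)

class ChatUser(HttpUser):
    wait_time = between(2, 8)

    @task
    def chat(self):
        # mix terse and "yappy" users so token volume varies, not just user count
        prompt = random.choices([SHORT, WORDY], weights=[0.7, 0.3])[0]
        approx_tokens = len(prompt.split()) * 4 // 3   # rough words-to-tokens estimate
        self.client.post(
            "/api/assistant/chat",
            json={"message": prompt},
            name=f"chat_{'long' if approx_tokens > 100 else 'short'}_prompt",
        )
```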

Some examples of places where I have already started working on resiliency testing these technologies: the biggest one is in our contact center. We are working on enhancing the contact center experience, and take that with a grain of salt, by automating some of the easier tasks to reduce the wait times for the more challenging tasks that need a person to respond to them. When all the folks that have never used an iPhone and barely know how to log into their desktop call every day to ask, “What’s my balance?”, we can put them through a chatbot. The folks who say, “I need to do a very complicated change of ownership,” go right to a person who’s not been bogged down with password resets and balance requests all day long. Simulating voice calls and chats for these types of tests is really difficult.

To simulate the voice calls, the way that we’re performance testing this today is that there are actually saved WAV files of recorded audio from engineers, who have been creating essentially a new form of test data for these tests, with gaps waiting for the response from the model, and then we run those all through simultaneously. It’s really difficult. The tool landscape out there today for doing this kind of thing, especially with voice calls and chats and simulating human behavior as the input, is really not as advanced as I would like it to be. We have a solution that works today. I’d love for a vendor to solve that problem for us eventually.

Further, this is not specific to AI, this is just in general but an important thing to think about, different communication channels have different performance needs just based on the psychology of the human expectation. If I am on the phone, whether I am talking to a human being or to a bot, if there is 10 seconds of dead air, I haven’t been put on hold, I’m going to start wondering if I’ve been disconnected, if someone hung up on me. If I’m sitting there for 10 seconds waiting for a model to respond to me, I am not satisfied.

If I’m in my browser and I have a chat window pulled up, as long as you pop up the little typing indicator with the three dots, I’m probably going to sit for 10 seconds and be like, yes, they’re working on it. They’re getting back to me and it’s going to be ok. Because of those client expectations differing between channels, we need to think about whether there are architectural decisions that we can make to prioritize the more latency-sensitive requests, especially if the same underlying models are responding to both channels.

Then, likewise when you’re constructing the performance tests, you’re probably going to want to have your test spanning multiple channels, but the definition of success for each might look different. Where you may be only tolerating 2 to 3 seconds of latency on the phone, you’re going to be able to tolerate much more latency for the chats.
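A small sketch of channel-specific success criteria for one multi-channel test; the thresholds are illustrative of the phone-versus-chat expectations described above, not actual SLOs.

```python
# Illustrative per-channel success criteria for a single multi-channel test run.
CHANNEL_SLOS = {
    "voice": {"p95_latency_s": 3.0},   # dead air past a few seconds feels like a dropped call
    "chat":  {"p95_latency_s": 10.0},  # a typing indicator buys much more patience
}

def channel_passed(channel: str, observed_p95_s: float) -> bool:
    return observed_p95_s <= CHANNEL_SLOS[channel]["p95_latency_s"]

# the same underlying model can pass for chat while failing for voice
assert channel_passed("chat", 6.2)
assert not channel_passed("voice", 6.2)
```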

Then the other angle I wanted to take with AI is, what can we do to use AI to make the engineers’ lives easier who are trying to run these tests? We need to make these tests more efficient especially because as AI is helping us crank out code more quickly, if we don’t speed up the pace of validation of the production readiness of that code, we’re going to become the biggest bottleneck to the delivery of value.

Some of the things we’re doing: we’re building a chat interface to help engineers understand how to do the testing. It does a lot of the work that I would ordinarily do, or that our coaching team would ordinarily do, helping people navigate our documentation more effectively than our search systems can, if that’s their preference. Documentation is not getting replaced by a bot; it’s just another option that we’re providing for the people that are getting more comfortable with, and having more of a preference for, interacting with tools backed by generative AI. We’re also looking at using AI to analyze the test results after a performance test or a chaos test and make recommendations based on the prior performance tests you’ve run for this app, or prior performance tests run against similar apps. Maybe they’re on the same platform or have a similar configuration to yours. It can make recommendations for how you can better tune your infrastructure configurations, and also for how to improve your test quality.

One of the biggest barriers to entry that we’ve had for getting teams to run performance tests independently has actually been producing the Locust script in the first place. Either they’re not super familiar with Python or they just don’t want to put in the effort. We have a HAR converter, which records what you’ve done and then turns that into a script. We’ve actually found that a more effective way to generate that Locust script reliably is with generative AI, where the input is the OpenAPI spec for the API that they’re trying to test. This is something we’ve tinkered with more recently, and the pilot that we ran was really successful. We’re working on baking this into our performance testing application right now.
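A sketch of that generation flow; `complete` stands in for whatever LLM client is actually used, and the prompt wording is illustrative rather than the production pipeline.

```python
# Sketch of generating a Locust script from an OpenAPI spec with generative AI.
# `complete` is a hypothetical callable (prompt string in, code string out).
import json

def build_prompt(openapi_spec: dict) -> str:
    return (
        "Write a Locust (Python) load test for the API described by this "
        "OpenAPI spec. Use HttpUser, weight read-heavy endpoints more, and "
        "parameterize the host.\n\n" + json.dumps(openapi_spec)
    )

def generate_locust_script(openapi_spec: dict, complete) -> str:
    """Return generated Locust code, after a basic syntax sanity check."""
    script = complete(build_prompt(openapi_spec))
    compile(script, "<generated_locustfile>", "exec")   # raises if it doesn't parse
    return script
```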

Then, finally, I’m exploring the development of a chatbot to facilitate the failure modes and effects analysis exercise. This is something I do a lot of at Vanguard: go and meet with teams, look at their architecture diagrams, ask them questions about common failure modes that I’m familiar with for those given components, and just be inquisitive. There are so many common failure modes of given platforms and component types that we can probably train an AI to do a good bit of what I’m doing for these teams, so that I will just have to supplement it and I can scale myself and my team quite a bit.

Key Takeaways

First, start small. A grassroots effort can grow into a really major initiative. Don’t forget that everything I mentioned in this entire presentation started with one guy who was really passionate, wrote a white paper, and got senior leadership on board; it grew to a small team, and now it has proliferated throughout the entire enterprise. Learn from our journey. If you haven’t taken any of these steps yet, start with observability, then performance testing, then chaos engineering, rather than putting the cart before the horse a little bit.

Consider scaling SRE expertise across your organization with a hub-and-spoke model like we did, to make it sustainable and federate that knowledge effectively. Efficiency and automation are going to be key to success as the SDLC continues to accelerate and code is being delivered faster than ever before. Whether you’re a fan of it or not, think about leveraging AI to make your apps more resilient. Also, make sure you’re considering the new failure modes that AI is introducing.

Questions and Answers

Participant 1: Currently in your monitoring, are you tying releases to incidents and signals?

Christina Yakomin: Yes. We are tracking all changes to production in the same system where we are tracking incidents, making it easier to correlate the two. Simultaneously, separate from all of the resilience testing applications of AI that I mentioned, we’re building an AI analysis engine right now that’s going to really have access to that system and do more automated determinations of, how can we tie changes to incidents, try to better identify what the incidents are related to. Was it a change a vendor made, a change that we made? Was it load related? All these different things, and do some analysis on where we need to focus our efforts to become more resilient at the macro level.

Participant 2: You mentioned that you set up a model that essentially correlated like changes in CPU usage to potentially the number of requests per minute you’d be anticipating in the future. How easy was it or how difficult was it to essentially figure out nonlinear rates of change?

Christina Yakomin: There is a lot of really fancy footwork being done there, primarily with CloudWatch Events and CloudWatch alarms. I believe there is right now still a dependency on Lambda as well for some of the calculations. If, for example, we had a Lambda outage going on at the time, given how dependent all of this is on it, we might need to dive in and do some manual scaling. It’s tricky math, admittedly, but it didn’t take too long to get off the ground. From ideation to getting a use case into production, I think, was about 6 weeks.

Participant 3: I think you had a slide about the CASE method. Do you have metrics about alarm quality, and how do you improve that?

Christina Yakomin: Not at the macro level. The CASE methodology is awesome. That’s not original to us; it’s borrowed from another great mind in the industry. What we do in most cases is encourage the practice of alert portfolio review as part of on-call handoff. Every single time you hand off the on-call, you go through, with at the very least the next person on-call, if not the entire team, the alerts that you received during your rotation and make sure that every single one fit the bill: that it was actionable, that you had enough information. If it didn’t, then either you took care of that yourself during your on-call or it immediately goes into the backlog to get handled during the next on-call rotation, so that you are keeping that alert portfolio fresh reactively.

Participant 4: As part of the performance testing, are you executing functional regression testing as well? Are you measuring the performance of the functional requirements as part of your suite?

Christina Yakomin: For the most part, any validation of functional requirements is happening as part of your unit and integration tests during the CI build, and the performance tests are more focused on your non-functional requirements. Because this is completely democratized self-service, yes, there are certainly areas where the tool is being used just as a means to generate load for an end-to-end validation of a functional requirement like that.

 
