Transcript
Bethea: My name is Cooper Bethea. I’ve been an infrastructure engineer, an SRE, a production engineer. I worked at Slack and Google. I worked at a couple smaller places too. I worked at Foursquare, and Sift, formerly Sift Science. I’m a software engineer, but I’m bad at math. I like people. I’m better at people than I am at math. I ended up in this place where I’m doing a lot of work on site reliability and uptime, mostly because I was willing to hold a pager.
It turns out, in the end, this all adds up to me enjoying leading large infrastructure projects where the users don’t really notice anything, except at the end, everything works better. This talk is about one of these projects. I was an engineer at Slack, and I led a project to convert all of our user-facing production services from a monolithic to a cellular topology. How many of you use Slack at work? Clearly, this project needs some justification, and that’s what my VP thought too. I’m going to start by describing how we came to start this project.
Then I’ll give you an overview of how our production environment looked before, and then how it looked after, and these will look similar. You will wonder why this was so hard that I got to do a talk, because we didn’t even invent an algorithm or anything. This project was hard. I think projects like these are hard. We tried to do this at Slack once before, and it never really got off the ground. Part of what I’m going to talk about is how and why our rendition of this project actually succeeded.
One thing I would ask you to consider as you’re listening to this talk is that these large migration projects often fail or run aground, even when you thought they would be simple, just changing from one system to another system. We’re not even writing new software. A lot of the time, it’s just these reconfigurations of software. Why is this hard? This is some good old AI-generated slop that I made about a ship voyage, because a lot of these pictures are copyrighted. Similarly, we can ask, why did it used to be hard to sail a ship across the ocean? You know where you’re starting from, you know where you want to end up, but it was still hard. I think that big projects like these are similar, like these exploratory voyages that we used to do, where you know where you’re going, but you don’t know exactly how you’ll get there. I think that’s very confusing. It can be hard for organizations. Like all the best projects, this project was born from fear, anger, lost revenue, and a bad graph.
One day, we’re just slacking along, doing our projects, running the site or whatever, and this happens. This is a graph of TCP retransmits split by availability zone in Amazon us-east-1. For us, TCP retransmits have always been a pretty clear sign that something is going wrong. Packets are getting dropped somewhere, whether it’s on one of the endpoint nodes or in the network hardware in between; something is not good, and we need to take a look at that. If you look at the monitoring and do a little mental math, you can see that we have more of these dropped packets in one AZ than the others. You can see that the one AZ’s count sums to the same as the other two combined. We got the idea that this was about packets being lost traveling from that AZ into the other two AZs. We were like, that’s bad.
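To make that mental math concrete, here is a tiny sketch (with made-up numbers, not the real graph data) of the check described above: flag the AZ whose retransmit count roughly equals the sum of the other two.

```python
# Hypothetical sketch of the "mental math" described above: given per-AZ TCP
# retransmit counts (made-up numbers, not the real graph), flag an AZ whose
# count roughly equals the sum of the others -- a hint that packets are being
# lost on the way out of that AZ.
retransmits = {"use1-az1": 950, "use1-az2": 480, "use1-az3": 470}

def suspect_az(counts, tolerance=0.15):
    total = sum(counts.values())
    for az, n in counts.items():
        others = total - n
        if others and abs(n - others) / others < tolerance:
            return az   # this AZ's retransmits ~= everyone else's combined
    return None

print(suspect_az(retransmits))   # -> "use1-az1" with these made-up numbers
```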
We basically called AWS, we did the thing where you file your support ticket, and they were like, we’re looking at it, and then we’re all sitting on Zoom. Then they found the network link and they fixed it, and all our errors went away. Then a few hours later, an automated system put the link back in service, and it went down again. We called AWS again, and they processed the ticket, and they fixed it again. We were tired. We were just like, why do we have to go through this? Why do our users have to endure this? We built this distributed system, and it really ought to be able to deal with a single AZ being entirely unavailable. Why did we serve errors? Why didn’t we just steer around the bad availability zone? It’s in the name, availability zone. You’re supposed to be able to build availability out of multiples of these things. We do a lot of work trying to detect errors in distributed systems, so why didn’t that save us here either?
Slack’s Architecture
First, let’s take a look at the architecture of Slack. It’s a little different from what you’ll see at other sites, but not that weird. What we do is we have these edge regions where we terminate SSL close to users. We’ll get you a WebSocket out there. We’ll serve files, stuff like that that makes sense to push close to the users. Then the most important work on the site all actually happens inside us-east-1. That’s where we retrieve channel histories.
If you post a message, that’s where the fanout starts. All that happens in one region, us-east-1. What we do is we forward the traffic into this internal load balancing layer that you can see here that fronts each availability zone, and then these internal load balancers direct the traffic into the webapp, which is what you’re thinking: it’s written in Hack. It processes HTTP requests for users with the help of backend services and datastores. You would look at this and be like, we could just stop sending traffic to that AZ, and this would just work. We should just have done that when we had the outage. That didn’t work, because this slide is a lie: this is actually what everything looks like behind the webapp servers. You can see what’s going on here. This is spiritually descended from the three-tier architectures that we all know. You’ve got a reverse proxy.
Then you’ve got an app server that terminates HTTP. Then you’ve got some databases behind it that it gets data from to answer these questions. We’ve just got this extra layer of reverse proxies because we’re crossing a regional boundary. We’ve got a whole bunch of different services on top of the database, including Memcache. There’s a Pub/Sub system that does fanout. There’s a search tier, stuff that is useful to the webapp in answering these queries. You’ll notice that most of these services are actually being accessed cross-AZ, because that was how we built them out, basically. We were counting on failover happening. We weren’t paying attention when we set up this part of the stack.
It turns out we had even more problems than that. Our main datastore is Vitess, which is a coordination layer, a sharding layer on top of MySQL in the end, and it’s strongly consistent. It needs a single write point for any given piece of data. You have to manage failover within the system. You can’t just stop sending Vitess traffic in one zone. If that zone contains the primary for a shard, you actually need to do an operation in Vitess to fail over primariness, mastership of that shard, to a different availability zone. We’ve got our work cut out for us at this point. We couldn’t pop the AZ out of frontend load balancing, but maybe we could do this automated thing I was talking about: just have the computers figure it out themselves and be like, that’s not really good anymore. We could just have the app servers monitor their backends and fail away from the impacted AZ. This is a simplified waterfall. There’s a chunk of a waterfall graph between the webapp and its backends.
Once an API request from a user gets in there, there’s actually a lot of fanout. It’s not just these five RPCs, it’s maybe 100 RPCs. Waving our hands a bit: if you need to do 100 RPCs to your backends and you’re living with a 1% error rate, you’re probably not going to ever assemble a whole HTTP response without a bunch of retries, things getting slow and messy. Then, conversely, once the app server is trying to handle things, the only reasonable response to, I’m serving a lot of errors, I’m missing a lot of my backends, is for it to just lame duck itself to the load balancer and get taken out. This is viscerally unnerving to me, because we are forced to face this idea that if these webapp servers all have the same backend dependencies and they’re starting to lame duck themselves because one of these backends that they’re fanning in on is failing, we’d have a big possibility for cascading failure across the site.
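A quick back-of-the-envelope version of that fan-out argument, taking the 100-RPC and 1% figures from above at face value:

```python
# Back-of-the-envelope version of the fan-out argument: if one API request
# needs ~100 backend RPCs and each independently fails 1% of the time, the
# chance of assembling the whole response without any retry is small.
rpc_count = 100
per_rpc_success = 0.99

p_all_succeed = per_rpc_success ** rpc_count
print(f"{p_all_succeed:.3f}")   # ~0.366, so roughly two of three requests would need retries
```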
Another consideration is Memcache. You mostly shard your data via clients hashing over the backends, using consistent hashing. Clients in the affected AZ had a different view of the backends in the Memcache ring, so we would get cache items duplicated in the ring, little consistency issues, items that appear to be missing when they actually exist. That’s a recipe for database pressure, which again, fanning in, overloads the site in a bad way.
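Here is a toy illustration of that ring-view divergence, assuming a simple consistent-hash scheme (not Slack’s actual Memcache client): a client that has lost sight of one node maps some keys to a different backend than everyone else.

```python
import hashlib

# Toy consistent-hash ring to illustrate the divergence described above: a
# client that has dropped one Memcache node from its view maps some keys to
# different nodes, so the same key can look missing or get duplicated.
def ring_lookup(key, nodes):
    # Not Slack's real scheme -- just the smallest-hash-above-key rule.
    points = sorted((int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in nodes)
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    for point, node in points:
        if h <= point:
            return node
    return points[0][1]

full_view = ["mc-az1", "mc-az2", "mc-az3"]
degraded_view = ["mc-az2", "mc-az3"]       # a client in the impacted AZ lost sight of mc-az1

key = "channel:C12345:history"
print(ring_lookup(key, full_view), ring_lookup(key, degraded_view))
# When these differ, one client reads and writes a different node than the others.
```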
We thought about all this and we were like, it’s actually hard for the computers to do the right thing here. We want, perfectly, all the webapps in that one AZ to lame duck themselves and none of the webapps anywhere else to lame duck themselves. Then we’ve got to do some stuff with Vitess, which we’ll talk about later. It just felt bad. We were humans and we could see the graphs. We were sitting on the Zoom call being like, maybe we should drain this availability zone. If we could have smashed the button, we would have smashed the button, and we would have done it. We were also worried that if we removed traffic entirely from one AZ, it would maybe overload the other two AZs and send the site into cascading failure. We’re limping along at this point.
Originally, we’re serving like 1% errors. We are not in our SLO. It’s not as bad as it could be. We were afraid that we could send it into cascading failure and make it worse. We had some ideas, but we didn’t really trust them. It’s really scary to do things for the first time in an incident. What actually happened is we just sat on Zoom and talked about it for a while, and then AWS fixed it while we were talking about it. Then we felt bad. I felt bad enough that I wrote a post in Slack because that’s what we did there. It was called, we should be able to drain an AZ. Everybody was like, yes, we should. Then we made a plan.
Cellular Design
I’m going to start by talking about cells. In our implementation, a cell is an AZ: a cell is the set of services in an AZ, and that set may be drained or removed from service as a unit. This is what we ended up looking like. Here are the goals that we needed to satisfy. We actually worked this backwards: we arrived at the cellular design by considering the actual use case that we had, which is that we wanted to be able to drain an AZ as much as possible.
Our goals are: remove as much traffic as possible from an AZ within five minutes. This is just table stakes when you’re in four or five nines and higher territory; you need tools that work fast because you only get to have a little bit of downtime. Drains must not result in user visible errors. One of the things about being able to drain a data center or an AZ is that it’s a generic mitigation. As long as a failure is contained within a single AZ, you can just be like, something is weird in this AZ, I’m going to drain it. You can do that without understanding the root cause. It offers you a chance to make things better for your users while you’re still working on understanding what went wrong. If we add errors by draining, we’re probably not going to want to do that because it will make our SLOs worse. Drains and undrains must be incremental.
By the same token, what we want is to be able to pull an AZ, fix whatever we think broke inside that AZ, and then leak a little bit of traffic back in there, just 1% or something, and be like, are we still getting errors? Did we fix the thing? It’s much better than just dumping all the traffic back in there. We wanted to be very incremental, down to a 1%-ish granularity. Finally, the drain mechanism must not rely on the impacted AZ. Sometimes, especially when we have these big network outages, you’ll see that some of the first things we did when we were trying to make this work was to SSH around to the servers and make them lame duck themselves. We just did not want to be in a place where we were trying to do that on the end of a bad network link in a real big incident.
How Things Were Done, and Choices Made
Now I’m going to talk about the ways that we did it and back into how we chose to do those things. The first one is pretty straightforward and it’s the thing that you’re probably already thinking about: siloing. We have servers in one AZ and they just talk to upstream servers in that same AZ. It’s pretty simple, and for services that it works for, it works great. This is mostly a matter of configuration for services that are ok with this. We added some support in the service discovery layer and our internal service mesh to help with this, but you don’t really need to do all that stuff.
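A minimal sketch of the siloing idea, assuming a service-discovery lookup that returns (address, AZ) pairs; the names here are illustrative, not Slack’s real service mesh API.

```python
# Minimal sketch of siloing at the service-discovery level, assuming a lookup
# that returns (address, az) pairs. Names are illustrative, not Slack's real
# service mesh API.
LOCAL_AZ = "use1-az1"

def silo_endpoints(endpoints, local_az=LOCAL_AZ):
    """Prefer upstreams in our own AZ; fall back to everything rather than
    returning an empty set if the local slice happens to be empty."""
    local = [ep for ep in endpoints if ep[1] == local_az]
    return local or endpoints

upstreams = [("10.0.1.10", "use1-az1"), ("10.0.2.10", "use1-az2"), ("10.0.3.10", "use1-az3")]
print(silo_endpoints(upstreams))   # only the use1-az1 upstream survives the filter
```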
Other services are at the other end of the difficulty scale. We have Vitess, where, as I mentioned, we have to manage the draining internally. Anything that’s strongly consistent in your architecture is probably going to need to have writes processed through a single primary point. You’ll have to do some failover. In our case, we’re actually able to do this faster than it sounds in Vitess, because each of these shards is independent. As long as there’s not too much pressure on the database overall, you can start flipping the shards in parallel and actually go quite quickly. We ended up building features into some autohealing machinery that we have for Vitess that would let us do this really quickly.
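As a hedged sketch of the "flip the shards in parallel" idea: Slack’s real version lived in their Vitess autohealing tooling, but something shaped like the following, shelling out to vtctlclient’s PlannedReparentShard, captures it. The exact flags vary by Vitess version, and the replica-picking helper is a stand-in.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hedged sketch of flipping shards in parallel. Slack built this into their
# Vitess autohealing machinery; here we just shell out to vtctlclient's
# PlannedReparentShard. Exact flag names vary by Vitess version.
SHARDS = ["main/-80", "main/80-"]     # hypothetical keyspace/shard names
TARGET_CELL = "use1-az2"              # where shard primaries should land

def pick_replica_in_cell(keyspace_shard, cell):
    # Placeholder: real tooling would consult the Vitess topology (for example
    # via ListShardTablets) and pick a healthy, caught-up replica in the cell.
    return f"{cell}-0000000101"

def reparent(keyspace_shard):
    new_primary = pick_replica_in_cell(keyspace_shard, TARGET_CELL)
    return subprocess.run(
        ["vtctlclient", "PlannedReparentShard",
         f"--keyspace_shard={keyspace_shard}", f"--new_primary={new_primary}"],
        check=True)

# Shards are independent, so as long as overall database pressure is ok,
# they can be reparented concurrently rather than one at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(reparent, SHARDS))
```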
Then we started thinking, what do we choose for each service and why? I’ve mentioned some things about state and strong consistency. It would be nice to have something a little more principled to think about here. Like I said, we like siloing best where we can do it. Service owners don’t have to manage drains for themselves. They get to have three or four isolated instances of their service, so the blast radius of failures is smaller. In practice, they can’t all do that. Which services can and can’t do this sort of breaks down along the line of stateful versus stateless. Is the service a system of record for some piece of data? If so, it will probably be harder to just silo.
Further than that, we can look at the CAP theorem a little bit to get an idea. We know the CAP theorem tells us that a system can be available during a network partition, or it can be consistent during a network partition, but not both. Actually, in practice, this ends up being a little more of a continuum. There are different pieces of storage software that satisfy this differently with different properties. When you look here, we’ve got the webapp, which is our canonical stateless thing. It’s just processing based on stuff it gets from datastores. Then we’ve got Vitess on the other end, and that’s strongly consistent.
In the middle, we actually have stuff like our Pub/Sub layer or the Memcache. It turned out that we were using Memcache as a strongly consistent datastore, which is the default thing that you do, because you’re just always writing to one Memcache node and then reading out of that Memcache node again. This didn’t really work out great for us: we found that there were many places in our application where we had trusted that we would have strongly consistent semantics. One of the things that we ended up doing for Memcache, and this is me trying to give you a little bit of flavor around the decision-making process, is we set up a series of per-cell Memcaches. Then we maintained a global Memcache layer, and we slowly, call site by call site, migrated from the old system to the new system.
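One way that call-site-by-call-site migration can be modeled, as a rough sketch with a hypothetical cache wrapper (not Slack’s actual client):

```python
# Rough sketch of the per-cell vs. global Memcache split. The client classes
# and the cell_local flag are hypothetical; the point is that each call site
# opts into the per-cell pool as it gets migrated.
class CellAwareCache:
    def __init__(self, cell_client, global_client):
        self.cell_client = cell_client       # Memcache pool scoped to this AZ/cell
        self.global_client = global_client   # legacy cross-AZ pool, shrinking call site by call site

    def get(self, key, *, cell_local=False):
        client = self.cell_client if cell_local else self.global_client
        return client.get(key)

    def set(self, key, value, *, cell_local=False):
        client = self.cell_client if cell_local else self.global_client
        return client.set(key, value)

# A migrated call site opts in explicitly, e.g.:
#   cache.set("channel:C123:topic", topic, cell_local=True)
```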
Let’s look at one other thing here. We’ve got this other idea of services that need to be global, but they’re built on these strongly consistent datastores. For example, we used Consul for service discovery. Consul, underneath it all, has a consensus database that is strongly consistent. A Consul cluster tends to fail as a unit, and totally. What we used to do is we had many of these Consul clusters, and they were striped across all the availability zones.
Eventually, we were able to bring the Consul clusters in each availability zone down to a high priority and a low priority cluster. Then we use our xDS control plane. xDS is the Envoy API for dynamic configuration. We were also able to use that for service discovery information, since that’s just a subset of what it’s already doing. We have this eventually consistent read layer, where you see the Global xDS Control Plane over at the top, and it sits on top of these per-AZ Consul clusters, assembles the information, and presents it to clients. If you have to run global services, I really recommend that they be eventually consistent as much as possible. The data model just makes sharding and scaling much simpler. It’s a pretty safe choice. You can also see that we’ve brought down the number of Consul clusters. I’ll talk about that later. In doing this project, we were able to do some things that we’d been putting off, some housekeeping.
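A rough sketch of that eventually consistent read layer, assuming the per-AZ Consul clusters are reachable over Consul’s standard HTTP health API; the addresses here are hypothetical.

```python
import requests

# Rough sketch of the eventually consistent read layer: poll each per-AZ
# Consul cluster over its HTTP API and merge the healthy endpoints into one
# view for the xDS control plane to serve. Addresses are hypothetical.
CONSUL_PER_AZ = {
    "use1-az1": "http://consul-az1.internal:8500",
    "use1-az2": "http://consul-az2.internal:8500",
    "use1-az3": "http://consul-az3.internal:8500",
}

def merged_endpoints(service):
    endpoints = []
    for az, base in CONSUL_PER_AZ.items():
        try:
            resp = requests.get(f"{base}/v1/health/service/{service}",
                                params={"passing": "true"}, timeout=2)
            for entry in resp.json():
                endpoints.append((az, entry["Service"]["Address"], entry["Service"]["Port"]))
        except requests.RequestException:
            continue   # a down Consul cluster degrades the view; it doesn't break it
    return endpoints   # slightly stale is fine: this layer is eventually consistent
```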
Finally, there’s the drain mechanism itself. We do traffic control via the Envoys and xDS to get to the drains. As mentioned, in the first implementation, we would have servers health check themselves down, lame ducking. That’s why you see a duck emoji. We decided that that wasn’t incremental or gentle enough. There’s a very nice feature in the Envoy configuration called traffic weighting. We can just say, by default, everybody gets 33%. Then when we drain, we set the cell to 0%. It just works.
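In Envoy terms, the traffic weighting looks roughly like the sketch below: a route whose weighted_clusters entries the control plane can dial per cell. Field names follow Envoy’s RouteAction.weighted_clusters; the cluster names are invented.

```python
# Sketch of the drain weighting in Envoy terms: the control plane serves a
# route whose weighted_clusters entry per cell can be dialed from its default
# share down to 0 (drained), or back up in small steps (1%, 10%, ...).
# Field names follow Envoy's RouteAction.weighted_clusters; cluster names are invented.
def weighted_route(weights):
    return {
        "route": {
            "weighted_clusters": {
                "clusters": [
                    {"name": cluster, "weight": weight}
                    for cluster, weight in weights.items()
                ]
            }
        }
    }

normal  = weighted_route({"webapp-az1": 33, "webapp-az2": 33, "webapp-az3": 34})
drained = weighted_route({"webapp-az1": 0,  "webapp-az2": 50, "webapp-az3": 50})
```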
Now I’ll back out. Here’s our original diagram. This is where we started. Again, here’s where we ended up. I’m going to flip back and forth a couple times. You can see we have fewer lines now. Things are nice and orderly. I’m glad that the slide is laid out this way. We’ve controlled the cross-AZ traffic in some pretty nice ways. We just have a few services that are talking cross-AZ now, and we’re being choosy about them. We can treat these services with more care, because we’re letting them break the cellular boundary, and there’s some potential for badness, like for them to poison data or put load on other cells.
Fortunately, these are already our databases and our systems that are the system of record for data. We’re already being careful about them. Everybody who runs databases says it’s hard. I’ve always avoided it. This also gives us a model for bringing up a new availability zone if we want to, because we can just build one of these things out, build the data storage systems, start the replication underneath, and then build the other layers up from the bottom, and eventually enable them in load balancing. If we want to move things around, run more or fewer AZs, stuff like that in the future, it becomes a much more tractable operation.
The Success Drivers
I did say we tried and failed to do this before. This looks pretty simple, and I’ve explained it to you in a nice way. Why did we succeed this time? This is the part of the talk where things get messy, and we’re going to talk about people’s feelings and why doing stuff is hard. The first time we talked about doing this, we were going to make these empty isolated cells and build all the infrastructure into them, and at some point, go lights on and cut over. Among other things, there were going to be solid network boundaries between these cells. There was going to be a bunch of data repartitioning. It gave us a lot of nice properties. In the end, it was too hard to do. The reason for that is that we couldn’t stop working on features. Customers do not care that you’re doing this. None of your customers care that you are rebuilding your production environment, I promise.
At the same time, we’re talking about building this very beautiful system here. This is a lot different from the old system. The whole time we’re building it, we’re maintaining two very separate environments and production topologies. This is a big drag on productivity because you have to make everything work twice and then keep it working. The environment that you’re not using is going to decay over time and stuff. Not everything is going to be covered perfectly by integration tests. Weird things will fail. God help your monitoring. There’s a lot of divergence that happens when you’re working in this mode. The mental overhead is really high. We were like, we couldn’t even do this in a quarter. You can’t take a quarter off work for your whole infrastructure organization to do stuff. There’s no way we could flip it over really quickly. The resource cost is also an issue.
If we’re paying double for the site the whole time, we’re going to go broke. In the end, we abandoned this design, not because the end state was flawed, but because we couldn’t really find good answers to the concerns about how we actually develop and roll it out. The core property this solution is lacking is incrementality. We can’t really move smoothly back and forth from the old regime of things to the new regime. We can’t keep working on other stuff at the same time without doubling effort.
Conversely, we don’t realize any value from this project until we cut everything over to the new system. There’s a very real risk that it’ll get canceled along the way. We’ll start trying to cut over to the new systems and realize that we are worse off than before. Then we’ve just made a big mess and everybody is mad at us. None of these problems are really, again, entwined with the beginning or end states. In these large projects, it’s the messy middle where all the complexity and risk is lying. You have to respect the migration process. It’s the most important and the most fraught part of any big project like this.
At this point, we went to the literature, actually. This is what a lot of people call the slime mold deck. This guy, Alex Komoroske, at Google, wrote it. You can see the URL here, https://komoroske.com/slime-mold/, but also if you just Google, slime mold deck, it’s the first thing, ahead of a bunch of Reddit stuff about slime and decks. It’s actually about how these coordination headwinds can slow progress on projects in large organizations. As Alex states the problem, coordination costs go up as organizations grow larger. That’s what this is actually all about, coordination costs. Our original plan required us to tightly coordinate across every service at the company to build these empty cells all the way up before we could turn them on. We can’t really set things up so that engineering teams can take bites out of the problem for themselves quarter over quarter.
Every team is going to support two separate environments until everybody is done. What do we do instead? We embrace this approach where we’re doing bottom-up development. We went service by service and really figured out what made sense for that service; remember that rubric with AP and CP I told you about. The people working on the core project team just sat down with each service and worked on developing a one-pager: maybe you can’t just be siloed or maybe you can’t just fail over now, but can we get to that place? What do we need to do to do that? Then also, one of our tactics is that we really embraced good enough. One service not being compliant with this does not risk the whole project. We can’t not turn the thing on because one service got delayed a quarter. People have outages. Priorities change. We need to not coordinate so tightly. We need to make it so that things converge.
The concept laid out in the slime mold deck is this idea that instead of doing a moonshot, a straight shot for a very ambitious project, you should do a series of roofshots, where you’re not necessarily going the most direct route, but you’re locking in value at each step of the way. Each service in my example here that becomes compliant locks in value for the project and reduces our execution risk. We don’t go straight to the goal; we get there service by service. Another way to look at this is we’re willing to sacrifice some efficiency in return for a reduced risk of failure overall. We operated at Slack in this way where the infrastructure teams were in the same organization, but were largely independent. I think this is pretty common at large organizations now. I believe that most services operate at a local maximum of efficiency.
Some people show up and are like, “This service is terrible. That must be awful. You must’ve designed it in a dumb way”. I don’t think that’s true. All the people that you’re working with are actually smart just like you are. They have just been operating under different constraints. Services evolve over time. Every service owner in particular has a laundry list of technical debt that they want to clean up and they never have time to. That is because tech debt is the price of progress. It is not some engineering sin. Some of these things that service owners want to do and never get the time to do will make cellularizing things easier. In our big project, we can give these teams air cover to reduce some complexity before they add it back.
For example, in the service discovery system, returning to this example, we used to operate all these functionally sharded Consul clusters, and they spanned AZs. The idea here was that we would keep these internal Slack services from stepping on each other’s Consul clusters. That wasn’t actually the greatest idea because it turns out, once again, customers don’t care. Any of these critical services being down basically meant that Slack was down for everybody. We just knew which service to point the finger at when we had these outages. In the spirit of reducing complexity, as part of doing the work, we just moved to these high and low SLO clusters, each within an AZ. Then we assembled the service discovery information over the top using the xDS layer that I talked about before. We were able to collapse all these functional clusters into high and low priority clusters.
The high priority clusters aren’t better and we don’t run them better. We just expect more from the customers of the high priority clusters. If you’re a critical service, you need to be attached to a high priority cluster. Then you need to be careful about how you use that shared resource. Again, this is one of these things, like selecting the datastores, where we’re able to zero in on these extremely critical services and expect a little more from them. As it turns out, teams love this. Even if the team that you’re working with isn’t really sold on cellularization as a project, they almost certainly think that their team has a lot of better work to do, and they’re right. They can get some value for themselves just from cleaning up the tech debt. You can see how this leads to a place where the teams have both the flexibility and enthusiasm to work on this larger project.
Project Cadence
I’m going to talk about the cadence that we used in this project. I started with this one-pager, which is, we should be able to drain an AZ. I circulated it to a bunch of engineers from different teams that I knew and trusted. The goal here is really to identify anything that just obviously won’t work. You don’t need to plan this project all the way to the end from the beginning. Most of the time, you’re looking for reasons to stop. You can bring this to institutional design review processes, just circulate it to people and listen to everybody’s objections about why it won’t work. At this point, you haven’t really spent a lot of effort. We haven’t spent a lot of effort organizationally on making this project work. If it doesn’t work, that’s fine. We didn’t waste too much effort here. At the end of this phase, in return, you should have enough confidence that you can go to the engineering management, go through whatever processes you need, and get agreement to run a pilot for a quarter or two. There’s a theme here where we’re gradually committing more as we gain confidence.
At that point, you want to start with several high-value services and engage deeply to make progress. It’s important that they are high value and critical services, because those services are probably the most complex, and because they will pay off faster: once we get them cellularized, there we go, a big chunk of value from doing that service. We also learn whether the complexity of the problem is tractable, and we get some ideas about the common infrastructure that we’ll build to help.
At the end of this phase, we’ll attain organizational buy-in again through the management chain to expand to all critical services. This is going to be the longest phase of the project. You can imagine there’s a graph of engineers over time, and this is definitely the fattest part of the graph. Then we’ll start tailing off later. At the end of this phase, what we really want is that all services have items in their roadmap, or we’ve decided they’re just not worth doing. We start tracking our progress really heavily during this phase. One thing that we did is we would regularly do drains of the AZ to see how much traffic we could remove. Then, week over week, we can make a graph that’s like, we removed more bandwidth, and it goes up, hopefully. Then we can also build the infrastructure. I mentioned we can do some things to help people in the service mesh, but there’s also deployment tooling, job control. This is the part where your shared infrastructure tooling and systems will need to start changing to accommodate.
Finally, in the last phase of this project, we’re going to wind down, where we set things up so that the happy path for new services is to be siloed by AZ. You have to go outside some guardrails in the configuration to make a global service. We make sure that our long-term work for non-compliant services is tracked in roadmaps and OKRs. Then for any work that doesn’t make the cut, we have incident review processes. As long as we’re tracking these things that we decided not to do, we can always bring them back up in response to a need.
When is Good Enough, Good Enough?
Now I’m going to go and talk about what I mean by doing things just good enough. At some point, the juice isn’t worth the squeeze. We had data warehousing systems at Slack that were basically just used by data analysts. Maybe there’s some batch processes that run, but user queries basically never touched those data warehouses. We don’t really care if they become cellular or not. That’s fine. If we decide it’s really important at some point, sure, but we’re here on behalf of the customers. It’s probably fine to stop before we get to those. Again, as with the rubric before, we want some structured way to think about this. Our two concerns are really, how do we figure out where to stop, and how do we keep what we’ve gained despite not doing literally everything in our environment? We figured out where to stop by working with each service to figure out the criticality of that service. chat.postMessage, for example, is the name of the API method that you use to post a message into Slack.
If chat.postMessage needs to touch your service before it can return, you are very critical. You’re really important for us to convert. Then, side by side, we want to have some idea about difficulty. That’s where getting together with each service team and writing this one-pager comes in: what it would mean, what we think the best solution for that team is, and how much effort it will take us to get there. You can use this to gauge difficulty. Then this tuple of criticality and difficulty is a good way that you can interface with engineering management about this project. You can be like, this is important and hard. This is important and easy. This is not important and easy. This is not important and hard, and do this Eisenhower decision matrix thing. We did a lot of these in phase one and two. Then when we got to phase three, which is the very wide phase, everyone was used to looking at and talking about these things. There wasn’t a lot of back and forth about whether we were doing this the right way or not.
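A toy version of that criticality/difficulty bucketing, with invented service names and classifications, just to show the Eisenhower-style sorting:

```python
# Toy version of the criticality/difficulty bucketing used to talk with
# engineering management. Service names and classifications are invented.
services = {
    "message-post-path": {"critical": True,  "difficult": True},
    "presence":          {"critical": True,  "difficult": False},
    "emoji-picker-api":  {"critical": False, "difficult": False},
    "data-warehouse":    {"critical": False, "difficult": True},
}

QUADRANTS = {
    (True, True):   "important and hard: plan deeply, start early",
    (True, False):  "important and easy: do first",
    (False, False): "not important and easy: opportunistic",
    (False, True):  "not important and hard: probably skip",
}

for name, attrs in services.items():
    print(f"{name}: {QUADRANTS[(attrs['critical'], attrs['difficult'])]}")
```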
Again, incrementality is most important. You should expect that some of these services will need maybe a year, or more than a year, to really get everything done, because sometimes your most critical services are the most backlogged services, and the most difficult to maintain. We need to maintain an awareness of where the moon is as we roofshot along quarter to quarter. We made it so that each team’s engineering roadmap actually had compliance with this program as a goal. Quick sidebar: I believe that engineering roadmaps are just completely crucial for infrastructure teams in larger companies. If you just have a short document that says, “This is what our service is like now. This is what we think our service will be like in a year, and here are some ways that we are getting there”, it’s a very powerful tool to communicate outward from your team into the rest of the engineering organization. It’s a good way for team members to understand the value and importance of their own work. I love a roadmap.
Finally, we measured our progress with weekly drains. If you’ll remember, we were worried about the capacity implications of it. We were like, every Friday for a while, around noon Pacific, we would drain the AZ and see how far we got. Then we watched the number go up and up over time. We’d do like 10%, and then we’d be like, can we do 20%? Then keep pulling traffic until we got scared. I think we only broke stuff one time. It was really good because it was a great signal for us to give to the company that we were getting something done. It’s meaningful. It’s the amount of bandwidth that’s left. It is a reasonable stand-in for user requests going through there. We were able to make it move. When it stopped moving, we could have conversations about that, and when we could expect it to move again.
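Something like the following loop captures the shape of those weekly drain exercises; set_cell_weight() and error_rate() are placeholders for the real drain tooling and monitoring queries, which the talk doesn’t show.

```python
import time

# Sketch of the weekly drain exercise described above: step traffic out of one
# cell in increments, watch error rates, and back off if anything looks hot.

def set_cell_weight(cell, drained_percent):
    print(f"{cell}: removing {drained_percent}% of its traffic share")  # placeholder for drain tooling

def error_rate():
    return 0.0005   # placeholder for a monitoring query (site-wide error ratio)

DRAIN_STEPS = [10, 20, 35, 50, 75, 100]   # percent of the cell's traffic removed
ERROR_BUDGET = 0.001                      # bail out if the error ratio exceeds this

def weekly_drain(cell):
    for pct in DRAIN_STEPS:
        set_cell_weight(cell, drained_percent=pct)
        time.sleep(300)                   # let traffic and caches settle (long in real life)
        if error_rate() > ERROR_BUDGET:
            set_cell_weight(cell, drained_percent=0)   # undo fast, then investigate
            return pct                    # how far we got before we got scared
    return 100                            # fully drained without incident

weekly_drain("use1-az1")
```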
Where Does That Leave Us?
This is where we actually ended up. Siloed services are drainable in 60 seconds via the load balancing layer. The Vitess automation reparents the shards away from the drained AZ, as fast as replication permits. We didn’t get all the critical services there 100%, but they’ve all got their roadmaps. One thing that you can use, if you feel like things are getting deprioritized for too long, is: is it that important if they’re not having outages? Maybe it’s not. Conversely, if they have outages, maybe cellularizing would help. This is a really good thing to include in your incident review process. We’ve built into all our relevant infrastructure systems this default happy path, which is siloed-per-AZ configurations. We’re doing regular drains. At some point, we were like, we’re just doing these enough now that we don’t need to do them every Friday. Once you’re in the cycle, you start having a good feeling about your capacity planning and you’re able to do it, and this is something that you can do if there’s an outage. Finally, we got a little bit of a savings because we reduced cross-AZ data transfer costs. That was nice.
Do We Actually Use This Thing?
The question that people always ask is, do you really drain? Yes, we do. You get to use it more over time. You can use it for these single AZ AWS outages, but then you can also use it to help you do deploys. The same database that powers the frontend drains, we just opened up to internal service mesh customers. Then they can roll forward, roll back their services. It actually opens up this new blue-greenish deploy method, where instead of just having blue-green, you have AZ 1, 2, 3, 4.
Then if you want to deploy some software, you just drain one of those AZs, pop the new software on it, undrain to 1%, 10%, 20%, whatever, and step it up that way. You can do that. It can give you a cleaner rollback experience than just popping binaries, like some people do. Siloing is helpful in other ways too. Sometimes you can have a poisonous deploy of a service, where the service itself is ok, but it’s causing another service to crash somehow. Siloing just helps naturally there: you can only endanger your upstream services in your own AZ. In general, we got to a place where if something looks wrong and it’s bounded by an AZ, we reach for the generic mitigation: let’s just try a drain and see if it gets better. Drains are fast. Drains are easy to undo. We don’t have to root cause our problems anymore.
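Here is a sketch of that drain-based, blue-green-ish deploy flow; deploy_build(), set_cell_weight(), and healthy() are hypothetical stand-ins for real deploy and monitoring tooling.

```python
# Sketch of the AZ-as-blue-green deploy flow described above, reusing the
# drain machinery: drain a cell, ship the new build there, then let traffic
# back in gradually. All helpers below are hypothetical placeholders.

def set_cell_weight(cell, drained_percent):
    print(f"{cell}: draining {drained_percent}% of its traffic share")   # placeholder

def deploy_build(cell, build):
    print(f"{cell}: deploying {build}")                                  # placeholder

def healthy(cell):
    return True   # placeholder for a real monitoring check

RAMP = [1, 10, 20, 50, 100]   # percent of the cell's normal traffic share to restore

def deploy_via_drain(cell, build):
    set_cell_weight(cell, drained_percent=100)        # take the cell out of rotation
    deploy_build(cell, build)                         # roll the new binaries onto it
    for pct in RAMP:
        set_cell_weight(cell, drained_percent=100 - pct)   # undrain gradually
        if not healthy(cell):
            set_cell_weight(cell, drained_percent=100)     # rollback is just draining again
            return False
    return True

deploy_via_drain("use1-az1", "webapp-build-1234")
```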
Lessons Learned
Finally, what do we learn about running infrastructure projects in a decentralized environment? You really have to listen to people and you really have to move gradually and incrementally. You have to remember that every service is operating at a local maximum. The way things are is not because people are foolish, it’s because of the evolutionary pressures that shape the development of your system. Projects like this can actually provide potential energy to help these services escape from their local maximum to something better.
Questions and Answers
Brush: Have you had someone have an outage that this would have helped and they hadn’t completed their roadmap yet? Has that happened yet?
Bethea: Yes. That actually was very satisfying while we were in progress. People would be like, we had this outage, and we’d be like, siloing would help you, actually. Have you considered doing our program?
Participant 1: At my company, we do something very similar to this every week, where we’re draining, it’s not in AWS right now, but we often run into challenges convincing leaders that this is a smart thing to do because sometimes putting strain on the system will have some impact. Did you all have that and how did you all overcome it, or address it?
Bethea: Especially when you first start doing this, people will get anxious. You are removing capacity from a system. You can either do it the inductive or the deductive way, where you either pencil it out and be like, we should be able to do this. Or you can do what we actually ended up doing, where we would do these drains and walk them up slowly, and then back off if things got hot. Also, there is, I think, a compelling argument to be made that if you can’t do it when things are pretty normal, then you really shouldn’t be doing it when things are really bad.
Bethea: Did going to silos impact our continuous deployment? Tremendously, actually, because we ended up redoing the way that we did deploys entirely. We used to have a deployment system that did a lot of stuff, but in the end, it went from maybe 10% to 100%, spread across all AZs. We actually reworked it to fill up an AZ first and then spread across the other AZs on the idea that we could just simply revert that AZ by draining.
Participant 2: Then, as it came up, you would ship the traffic to a new AZ, 1%, and then bring the others up, or you would have old version and new version of this traffic at the same time?
Bethea: I think we were doing a little bit of a hybrid version just for historical reasons where we would roll that AZ forward by popping the binaries, by just flipping the binaries. If we needed to revert during the process, we would just pull traffic.