Humans in the Loop: Engineering Leadership in a Chaotic Industry

News Room | Published 24 November 2025 | Last updated 7:36 AM

Transcript

Michelle Brush: It is an interesting time to be in tech, for many definitions of interesting. I’m a site reliability engineer. My day job is to be somewhat of a pessimist. I like to say that I’m paid to worry. Really what I do is I analyze the way systems fail, and then I do engineering work to mitigate the risk, either by trying to prevent the failures or, more often than not, trying to make them recover more quickly. Then when things do fail, I do the incident management to get things back to working as quickly as possible. Then we learn from what happened. You might think someone who’s paid to worry might see a lot of doom and gloom with the industry right now. I don’t. I’m actually optimistic about the industry right now, so much so that I encouraged my son to take AP computer science and digital logic circuits.

Digital logic circuits is this weird mix of Boolean algebra and electrical engineering. It’s the lowest level programming that you tend to do because you’re basically stitching together AND, OR, and NOT gates to make something more advanced like addition or subtraction. I loved this class. It was one of my favorite classes. I never thought I’d use it, though. It was fun for little puzzles, but I just thought I would walk out of college and it would give me this great foundation of understanding how computers work, and that would be the end of it. However, to my surprise, I did use it. About a year out of college, I was working in a legacy C++ code base. It had 10,000-line classes with 1,000-line for loops and if statements that took up half a page. Of course, there were no unit tests and it was poorly commented. I had almost a little bit of fear every time somebody asked me to go make a change to the inscrutable business logic.

Then one day I had an epiphany. In the digital logic circuits class, we would sometimes try to simplify the circuits we were making in order to reduce the amount of wiring we had to do because we hated wiring. To do that, what we’d do is we’d take a very complex Boolean expression and then we would write a truth table of all the possible inputs, 1s and 0s, the outputs, 1s and 0s, and then we’d stick it into this funny thing called a Karnaugh map. Basically, with a Karnaugh map you arrange the truth table into a grid of patterns, and it’s very strict about the bit ordering you have to use.

Then basically you look for groupings of 1s and 0s, and when you find a grouping, you can figure out a less expensive or an easier way to represent the logic. This may look like Severance to some of you. I started using this all the time to simplify all these if statements I could never understand, and then all of a sudden, the logic became more understandable to me and I could make changes without fear. It became a tool I used a lot. There are a lot of things that I learned that I never thought I’d use again. I have used calculus in my day job, definitely discrete math. I’ve had the misfortune of having to use assembly language twice in my career. Then the speed of light is something that I like to say plagues me constantly. It comes up in my job all the time.
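To make the Karnaugh-map trick concrete, here’s a minimal sketch in Python (my own illustration, not code from the talk): enumerate the truth table for a hairy boolean condition and verify that the simpler expression a Karnaugh-map grouping suggests is equivalent.

```python
# Illustrative example: verify a Karnaugh-map-style simplification by
# enumerating the full truth table. The expressions here are made up.
from itertools import product

def original(a, b, c):
    # The kind of inscrutable condition you might find in legacy business logic.
    return (a and b and not c) or (a and b and c) or (a and not b and c)

def simplified(a, b, c):
    # What grouping the 1s on a Karnaugh map reveals: a AND (b OR c).
    return a and (b or c)

# Check every input combination; the two must agree everywhere.
for a, b, c in product([False, True], repeat=3):
    assert original(a, b, c) == simplified(a, b, c)
print("Simplified expression matches for all 8 input combinations.")
```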

The New Tech Landscape

Having a solid foundation of math, physics, and computer science has really helped me in my career over the last now 25 years. I find myself in this new tech landscape where we’re seeing things like the end of zero interest rate policy, which has led to companies wanting to cut costs. We have a whole collection of geopolitical and economic challenges. There’s this promise of AI/ML, and how large language models are going to make a single engineer able to do the work of 10 or whatever, and all of that is also creating high demand and resource constraints on computer chips. I have to pause for a minute and acknowledge that this isn’t the first time we’ve felt seismic shifts in tech.

Raise your hand if you were in the tech industry during the 2008 financial crisis? Keep it up if you were in there for the dot-com bubble bursting. For those of you with your hand up, like me, it felt like a lot of big ups and downs during those times. However, that last graph was not drawn to scale. The graph of employment over that time looks more like this. While there are dips, things keep going up. Here are the numbers. In 2000, we had about 1.25 million people in some sort of software or hardware role. In 2002, it was 1.15 million, so there’s a drop there. If you were one of those folks that lost your job as a result, it sucked. I don’t want to be dismissive of the real impact these things have on people. What’s interesting is we recovered. By 2008, it was 1.5 million. By 2020, it was 2.75 million. Now we’re sitting roughly around 3 million with projections that it’s going to continue to grow.

In fact, just recently, the Wall Street Journal said roughly 200,000 new jobs were created in tech in the month of March. I do have to acknowledge these numbers are rough. This stuff is hard to count. I actually downloaded data from the U.S. Bureau of Labor Statistics and I had to open all the zip files and pull out all the spreadsheets and then count things. It was not easy. Do you want to know why? Job titles kept changing every year. What it meant to be a software engineer or a computer professional changed all the time. That’s the point I’m going to make in this talk: our jobs are changing. That’s why it feels weird right now: some jobs in the industry are going to shift to completely different work.

This is the thing that happens when new technology comes along and makes things more efficient, makes people more productive. Jobs change. As a thought experiment, let’s imagine what happens if AI/ML really does live up to its promise. Let’s just say that using coding assistants and large language models, we are able to produce so much more code. Engineers are so much more productive that really you have this magnifying effect where one engineer can now do the work of 10 or 20. What would you think would happen? The Jevons Paradox tells us that as we make some resource more efficient to use through technology, we actually create more demand, and people use more of that resource. We can think of software or code as the resource in question. Since Jevons Paradox says efficiency increases demand, what we’re going to see is that people are going to find more reasons to use something like software when it becomes more efficient to do so.

As an example, I take a lot more pictures with my phone than I ever did when I had to pay for film. If we follow this thought experiment through to conclusion, what we’re going to see is that as software gets cheaper to build, whole new applications and features we never thought possible will be built now. Those new applications and features will create demand for engineers.

The Curse of Knowledge

This is not an AI/ML talk. It is what was promised, which is a talk about engineering leadership through these times of change. Obviously, I can’t talk about what’s going on right now if I don’t talk about AI/ML, because it is a big influence on how our jobs will look different. To talk about how our jobs are going to look different, I’m going to start with laying a little foundation. I’m going to talk about a concept called the curse of knowledge. The curse of knowledge is a very simple model for how to think about human knowledge. In this case, it’s both knowledge and skills. It has two dimensions. The first dimension is your competence. On one side of it you have incompetence, on the other side competence. You fall somewhere on that spectrum. Then you also have your consciousness. This is not like, how awake are you? It’s more about, how aware are you of your own skills and knowledge?

At the bottom, you’re completely unaware, you have no idea. At the top, you’re really self-aware. In this case, incompetence is not a judgment. It’s not a bad thing. It’s just that we all have some things in life that we’re incompetent at. For example, I suck at basketball. I am aware of that, though, so that’s good. I also used to work in medical software, so I know just enough about medicine to be annoying to doctors, but not enough to not have to go. Then, I think I’m really good at cooking a chicken. We take that two-dimensional model, and we actually stack it like a pyramid. At the bottom of the pyramid is all of the things that we are unconsciously incompetent about. These are all the things that you don’t know what you don’t know. Because the world is so big and there’s so much stuff out there, this is most things, let’s just be honest.

If you know that you don’t know something and you develop curiosity, you can move up the pyramid to the things where you at least know you don’t know something, you’re actually aware. If you keep learning, you can move up to the conscious competence part, which means now you understand something, you know it, you have the skills, and most importantly, you can explain how you know it. This is really important when you’re mentoring or teaching others. You have to be able to explain the knowledge that you have.

Unfortunately, if you do something long enough, you start to build mental heuristics. You start building muscle memory. You start chunking. As a result of that, you become unconsciously competent. Which means you know things, but you can’t really explain why anymore. Any senior engineer has probably been staring at a design and thought, “Something’s wrong with that. I don’t know what. I can’t tell you, but it just feels wrong to me”. This is your unconscious competence speaking. The problem is, when you get to that stage, because you can’t explain it anymore, you end up having to be the person who always shows up and fixes things when things break.

That leads me to my next point. In our jobs, things break a lot. Software has bugs, hardware fails. When these things happen, we often have to make use of our intuition built on years of experience to at least point us in the right direction. Like we get a feeling that the problem might be over here. We have to take the time to troubleshoot and debug and understand the problem in order to say, ok, now I actually understand how to truly fix it. Our jobs require us to constantly move between this state of conscious competence and unconscious competence.

The mental model that works really well for me is to think, generally speaking, large language models are unconsciously competent. There was actually an Apple paper, very timely, that talks about this. Basically, large language models by their nature, they know things, but they don’t know why they know them. They’re not really good at explainability. We are really good at the conscious competence part. Comparatively, we’re really good at this. We’re also really good at knowing what we don’t know, so we sit really well at that conscious incompetence.

The biggest challenge with large language models is they sit some amount of time in that unconscious incompetence. That’s how I think about hallucinations. The model doesn’t know what it doesn’t know, so it gets creative and it fills in a blank. The good news is models are getting better and better. Hugging Face even has a leaderboard you can go look at. When I looked and created this slide, I think the best model on their leaderboard hallucinated about 0.7% of the time. That is amazing. That’s probably better than human error most of the time. It’s still enough to give us pause and think, we can’t trust these things 100%. I think that number will go down, but I don’t know that it will ever get to 0.

Given that understanding, that we have these things that are going to write some of the code for us but they’re not entirely trustworthy and they’ll have that small rate of error, we have to look at what our jobs are going to look like in that new world when we’re using them a lot. The way I think about it is there are going to be roughly two categories of jobs. One is making AI/ML more useful to others. Basically, these are the folks who are building features, building workflows that make use of large language models. This is going to involve actually understanding when large language models would be useful. It’s going to involve actually evaluating the model quality to see if it actually meets the use case.

Then, whenever there are issues, there’s going to be some quality assurance work, maybe improving training data, maybe adding a fact-checking system on top. Then, because things always have to perform, even inference, there’ll be some work to optimize the model, make it perform better on the hardware. Then, of course, anything we do in software, we have to make sure the thing we build can survive the reality of production and real users.

On the other hand, there are going to be folks that are using AI/ML to be more productive. The folks in the first category are probably in this one, too; if you’re doing that work, you’re probably doing both. Most of us will end up being on this productive side. This means we still have the hardest part of engineering of all, which is figuring out what to build, and then describing it to something else so that we get what we wanted. Then, again, there’s all this work around quality: maybe we want to refactor the code to make it more understandable, more maintainable. Obviously, we’re going to want to get user feedback. We might have to do some load or failure testing. Obviously make it perform. Then, again, we need to make sure the thing we built can survive the reality of production and real users.

The Ironies of (Partial) Automation

Since this isn’t an AI/ML talk, I’m going to spend most of my time talking about those folks that are just using AI/ML to be more productive. The way I think about those folks, myself included, is that our jobs are partially automated away. Partially. A piece of it. Not all of it. When work is partially automated away, this thing called the ironies of automation comes into play. This paper by Bainbridge, it’s a great paper. I recommend anyone read it, but particularly SREs, you must read this. What Bainbridge tells us is that when you automate some piece of work, the job that you leave behind for humans to do is actually harder.

The first reason is because you give the automation the stuff that is easy, comparatively. You give it the stuff that can be described as rules, can be described as patterns, can be described as heuristics. Then humans keep all the work that is hard and requires judgment. The second reason is because when you automate something away, you still have to worry about what happens when the automation breaks. Humans have to come in and they have to fix things when it breaks.

Unfortunately, because they stopped doing that task that they used to do, they are less familiar with that work now over time. It becomes this black box to them. When they have to go fix something, they have to page back in what that work used to look like. Additionally, this creates a class of work of monitoring the automation. We have to figure out whether or not the automation is going to break. We have to detect when it does. Since humans are bad at staring at things for 30 minutes or more, we tend to automate the monitoring through another system, which again is just more automation. The cycle continues. The irony is that when you automate part of the work, the original job gets harder.

To really illustrate this, I want to talk about dishwashers. Dishwashers are great. We all use them, or most of us probably do. We love that we don’t have to wash dishes anymore. We didn’t get rid of all of the work. We still have to load the dishwasher properly, which a lot of people debate about what is proper. We have to add the soap. We have to start the load. We have to put away the dishes. However, there was also this other work that got created when we started using dishwashers. Something as simple as, now we have to notice when the dishwasher was done. We didn’t have to do that before. We used to just know when we were done. We also have to identify which dishes didn’t get clean. We have to inspect them and evaluate them and do a little bit of troubleshooting. Did we load the dishwasher wrong? Was the water temperature wrong? Did we not clean the filter? Then we have to decide what to do about that.

Additionally, we know that dishwashers can break in all sorts of weird ways, and we need someone to come fix them when they break. That’s either us watching YouTube videos and trying to figure it out on our own, or it’s calling a repair person. This might seem like a silly example because it’s a dishwasher. The same pattern has played out in other industries. For example, commercial aviation has been made better by autopilot. We can all agree on that. When the autopilot fails, the pilot still has to detect that it failed, assess the situation, and react and take corrective action. That’s harder.

Requisite Engineering Skills

If you take nothing else away from this talk, what I want you to understand is this slide, particularly if you’re an engineering leader. The Jevons Paradox says that as we use large language models to build more software, we’re going to create demand for more engineers. Unfortunately, the ironies of automation say that the engineering job is going to be harder. The end result of this is we’re going to need a lot of skilled engineers. What type of skills are we talking about? Obviously troubleshooting and debugging. I love this old Brian Kernighan quote. This quote did assume that you were the one writing the code you were troubleshooting.

Most of us already right now have to debug and troubleshoot a mix of code that might have been written by somebody else, someone who quit the company. It’s going to be even worse because now we’re going to have code that large language models wrote in the mix. It’s not just this. What’s going to happen is our brains are going to start working on higher and higher abstractions. The time we might have spent trying to figure out some boilerplate code to go call some API or to figure out how to create some object on the frontend that we wanted to see, that time is going to be used on bigger problems. What that means is our brain does this thing called chunking. Chunking is basically like the brain’s version of encapsulation. We take a bunch of related stuff and we group it together, and then we store it in memory as a single concept.

Then we just work on that concept until we need to drill into the details. That’s how we’re going to be working a lot more. We’ve actually been doing this our whole careers. This is not new. We don’t work in digital logic circuits anymore because the tech industry gave us an abstraction. We got to move up the abstraction level. We don’t write machine code. We don’t write assembly language. Most people work in higher-order programming languages, but it goes beyond that. A lot of people just think about chunks. A chunk can be a database. A chunk can be a framework. A chunk can be an API method you call. We stop thinking about the details of that part. What I think is going to happen is that the new world is going to accelerate this more than it already has. We have to start thinking about, how do we engineer quality systems without deeply understanding the parts?

Skill 1: Systems Thinking

The first skill that I think is really important is systems thinking. My favorite introduction to this topic is “Thinking in Systems”, by Donella Meadows. To give you a very quick overview, to me, what systems thinking is, is seeing the sociotechnical system. All the systems we work with have a mix of hardware, software, processes, and people. Then when we think about those things, we think of them not in terms of individual pieces of code. We think about them in terms of flows. How does information move through the system? How are behaviors changed through the system? We think about, where does that information start and where does it land? We think about the feedback loops involved in the system and what those feedback loops can trigger.

Then we actually have to think about change. How do things change in the system? When humans are making changes, how fast can they move? What is the release or development velocity? How does our data grow? Then, of course, hardware failures again. To build up these skills, like to understand sociotechnical systems better, I tend to look towards the idea of mechanical sympathy, which is basically how well do you understand the hardware and how it behaves? I also look to an area called safety science, which there’s a ton of research in from Sidney Dekker, Nancy Leveson, a whole bunch of folks. To understand how people work, I tend to lean on behavioral economics, the work of Kahneman and others, because it tells us how humans don’t always make rational decisions, and we have to understand the decisions they will make.

For this way of thinking in terms of flows and feedback, I tend to lean on control theory, which is what gave us observability but is much bigger than that. Then there’s this fun little area called cybernetics that really specializes in feedback loops. Then, when I want to understand change, I tend to lean back on my understanding of physics, because it’s a good analogy when you start thinking about how things change. Then I use a lot of statistics. Of course, I want to be data driven, so I still regularly go out and figure out how I can query and get the data I want to help me make decisions.

To show how this might all fit together in practice, I’m going to tell a story. Back in 2019, the services I was supporting at that time ran in three data centers, kind of like running in three regions. We thought that we were good enough running in three data centers, because we thought, what are the odds of a single natural disaster or infrastructure failure taking out even two data centers, let alone three, because these data centers were more than 800 miles apart? Lo and behold, of course, in 2019, Google had an outage where the automation that was meant to update and maintain some of the networking hardware in the data centers ran wild. Before a human could hit the stop button, it took out the networking control plane in two data centers, which happened to be the two of the three we were running in. You say, great, that’s why we had the third one. That’s why we decided that.

Unfortunately, a lot of other teams had the same idea. The full scale of everyone failing over to that third data center took it down too, and we were down. From this, we realized we had made some bad assumptions, and that we needed to be spread out more. We can’t just be in three data centers. We’ve got to be in more data centers. We decided after the outage to do the work of reconfiguring our entire service footprint. Thinking about how we’d want to accomplish this, the first thing that came to our minds is, we have to ask the question, what happens when you move things that are used to being 800 miles apart to being 8,000 miles apart? Of course, that means there’s increased latency. We thought about, what can happen with this increased latency? What I realized, because I know how people code, I know what happens when engineers are trying to solve a problem, is that probably somewhere in our code base, I didn’t know where, there’s some for loop.

That for loop is calling an RPC and doesn’t know it, because it’s abstracted through a function. You take that for loop that’s calling this RPC abstracted through a function, and that’s probably sitting in some other function, abstracted again, and that’s sitting in some service, which is its own abstraction layer. It’s being called by some other component with a magical value picked years ago, probably something like 1 second. When that magical value is triggered, when it runs over, what’s going to happen is some error handling code, again, written years ago, that someone probably didn’t test all that well, is going to be triggered, and that’s going to cause a cascading outage. I knew nothing about the actual code when I was making this call. I wasn’t thinking about it at all, I just knew. I knew this was in the system somewhere.
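As an illustration only, not code from the actual system, the shape of the problem looks something like this: an innocent-looking loop, a helper that hides an RPC, and a magic deadline picked years ago.

```python
# Hypothetical sketch of the pattern described above; the names and numbers
# are made up for illustration, not taken from any real code base.
import random
import time

RPC_DEADLINE_SECONDS = 1.0  # the "magical value" picked years ago

def fetch_account(account_id):
    # Looks like a local helper; actually crosses the network. Simulate the
    # round trip: fine at 800 miles apart, marginal at 8,000 miles apart.
    latency = random.uniform(0.2, 1.4)
    time.sleep(min(latency, RPC_DEADLINE_SECONDS))
    if latency > RPC_DEADLINE_SECONDS:
        raise TimeoutError(f"deadline exceeded for {account_id}")
    return {"id": account_id}

def enrich_report(account_ids):
    # The innocent-looking loop. When the deadline starts firing, error
    # handling written years ago (and rarely tested) is what cascades.
    return [fetch_account(aid) for aid in account_ids]
```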

Thinking about this as a system problem, we had to say, how do we programmatically discover the instances of these? Because it’s not going to work if we go interview a bunch of engineers. It’s not going to work if we trigger a bunch of code reviews. Code reviews are terrible at finding issues like these. We said, of course, we’re going to have to do systematic latency injection. Luckily, because we had a layer where we could just inject this into every service call, we said, let’s just do that. Let’s just pick random times to inject latency in different service calls and do the full coverage and see what breaks, of course, in a pre-production environment. When we did this, we found all sorts of cases. We actually found a lot. Then we would go to the teams and we’d work with them, either asking them to change their magic timeout value or we would talk to them about batching up their calls so they weren’t making these loops.
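A minimal sketch of what that kind of latency injection can look like, assuming you have a single interception layer that every outbound call goes through (the wrapper and constants below are my own illustration, not the actual tooling):

```python
# Illustrative latency-injection wrapper for a pre-production environment.
import random
import time

INJECTION_PROBABILITY = 0.05            # inject into roughly 5% of calls
INJECTED_LATENCY_SECONDS = (0.5, 3.0)   # plausible cross-region penalties

def call_service(rpc_fn, *args, **kwargs):
    # Central choke point: every outbound RPC in the system goes through here,
    # so delay can be added without touching any team's code.
    if random.random() < INJECTION_PROBABILITY:
        time.sleep(random.uniform(*INJECTED_LATENCY_SECONDS))
    return rpc_fn(*args, **kwargs)
```

Run with full coverage across service pairs, the calls that blow past their magic timeouts surface on their own, without interviews or code reviews.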

The second problem we had in this effort was that we had this imperative rollout system that we used to deploy our jobs. We described things as a series of steps in Python. That was fine when we were in three data centers, but we were really concerned about what would happen when we moved to more. The system was called Sisyphus, which was a great name for a rollout system. We didn’t want to write a bunch of Python code to do this. We thought and we said, how do we get something that is easier to manage, easier to maintain, is less likely to have errors where somebody just misses a step? We asked the question every engineer should ask, which is, has someone already solved this? Of course, the answer was yes.

There was an existing intent-based rollout system that would allow you to describe the state you wanted production to be in at the end. It would do the work of checking safety constraints and gradually making the changes and making sure everything worked. We decided to use this instead. You might be thinking, you didn’t build any code, you didn’t write any systems, how is this engineering? It’s this chunking process.
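As an aside, to show the difference in flavor between the two rollout approaches, here’s a hypothetical illustration (every field name and threshold is invented, not taken from the real system): instead of Python steps, you declare the end state and let the rollout system plan the safe path to it.

```python
# Hypothetical declarative rollout intent; all names and values are made up.
DESIRED_STATE = {
    "service": "checkout-frontend",
    "version": "2.14.3",
    "regions": ["us-east", "us-west", "europe-west", "asia-south"],
    "min_healthy_replicas_per_region": 6,
    "rollout_policy": {
        "max_regions_in_flight": 1,      # change one region at a time
        "bake_time_minutes": 30,         # watch health before continuing
        "abort_if": ["error_rate > 1%", "latency_p99 > 800ms"],
    },
}
```

The rollout system then computes the difference between reality and this declared intent and applies it within the stated safety constraints, so nobody has to hand-maintain a script of imperative steps.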

As you move up in engineering problems, you start thinking in these bigger chunks. It’s less about writing code and more about solving just the problems you have. This does not absolve us from the need to still understand what happens under the covers. We still have to be able to get into the details when needed. It just means that we can do that more opportunistically. We only do it when we know we need to work in a particular part of the system. We know we’ll need to do this. We’ll need to drill in and dig in because, of course, all abstractions leak, and particularly our hardware abstractions. This is why we need mechanical sympathy: because no matter how much we move away from that hardware layer, that hardware layer is real and it’s physical and it’s not going away.

Getting back to this landscape, what we’re seeing right now is a couple of things. One, the end of zero interest rate policy is making a lot of companies want to cut costs. Sometimes that is hardware costs. Maybe indirectly and abstracted away through cloud spend or something else, but they’re wanting to look for ways to tighten their budgets and save money. We’re also seeing a lot of high demand and resource constraints on computer chips. The combination of these things means a lot of engineers right now are being asked to get closer and closer to understanding the hardware. It didn’t used to be this way; we got away with it for a long time because of Moore’s Law. We had this sense that as our problems got bigger, we would just have smaller but more powerful computers, and then that would solve the problem for us.

Unfortunately, Moore’s Law has been experiencing a slowdown. In fact, even Gordon Moore said it would die in 2025. We had the slowdown, but we’ve been getting around that too because we all became distributed systems engineers. Who cares if we can’t get powerful CPUs, we can just have more computers. We found a way around it. Jevons Paradox, though, is still at play, which is as computers got smaller and cheaper and more efficient, we started using more of them. We started bringing them into our lives, from smartphones to smart light bulbs.

This created this infinite appetite for chips, which was compounded by Moore’s second law or sometimes called Rock’s Law, which is the cost of a semiconductor chip fabrication plant doubles every four years. What this means is it’s getting harder to make chips. There’s this great book about this called “Chip War” by Chris Miller. I love this book. Everyone should read it. Basically, it goes into detail of the history of how we got to where we are and why people are getting concerned about potential hardware supply chain capacity issues.

To give a very brief summary, what’s happening now is that there are not a lot of companies that actually make chips, especially given the demand. You might think there are, because you hear a lot about companies making chips, like Google makes TPUs, but usually the company is designing the chips and then sending them off to a chip fabrication plant, sometimes called a foundry. We don’t really have enough foundries to keep up with the demand right now, and it’s very expensive to build them. What we imagine is that as time goes on, we’re going to be asked more and more for what I call creative efficiency, which is we’re all going to be asked to do more with less. Meaning, how do we get the most out of the hardware we have? Weirder, though, and this is really interesting to me, is that the whole industry spent all this time moving to the idea that computers are cattle, not pets. Stop treating your computers as pets.

Lately we’ve been needing these very specific machines to do things, machines that are good at matrix multiplication, GPUs and TPUs. Those machines are so expensive and hard to get access to, given everything going on right now, that suddenly they’re like fancy pets again. All of a sudden, you’re being asked to say, get the most out of those expensive machines we just bought. Make sure they’re utilized. Make sure they’re efficient. Make sure you can get what you want out of them. It’s like reversing direction a little bit. There’s a lot of pressure to make the systems we run more hardware efficient and get the most out of what we have. I think this is going to change a little bit how we think about our architectures and designs as a result.

Skill 2: Non-Abstract System Design

Jon Bentley wrote this great book, “Programming Pearls”, and in it there’s an essay called ‘The Back of the Envelope’. It tells stories about engineers who, before they even wrote any code, could just do the math and figure out whether or not their architecture would scale, given different parameters.

At Google we do do this, we call it non-abstract system design. How this works is that we make a map of our architecture, just basic old box and arrow diagrams. We understand that since most systems are distributed systems and most successful systems eventually encounter scaling issues, and all systems are eventually hardware constrained, we actually want to understand that before we go build and deploy the system.

We basically take that map we created, which is the reality of our system in time and space, and then we add real numbers to it. We look at the queries per second we expect. We look at the latency costs, like the one I described before. We look at the algorithmic growth. We look at the size of objects we want to store. We look at even old things like my nemesis, the speed of light. We basically get an idea in numbers of what’s going to happen when we actually run this thing in production. We started doing this to find reliability issues, but in this new cost pressure world, it’s actually been really useful to find things like, we can’t afford to store stuff at this rate, so we want to come up with a more efficient strategy like sampling or maybe aggregating some data.
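As a flavor of what those numbers look like, here’s a back-of-the-envelope sketch with made-up figures (none of these quantities come from a real system):

```python
# Back-of-the-envelope sketch: attach rough numbers to the boxes and arrows
# before building anything. All quantities below are invented for illustration.
PEAK_QPS = 20_000                 # expected queries per second at peak
BYTES_PER_EVENT = 2_000           # average stored object size
RETENTION_DAYS = 90
REPLICATION_FACTOR = 3

daily_events = PEAK_QPS * 86_400
stored_bytes = daily_events * BYTES_PER_EVENT * RETENTION_DAYS * REPLICATION_FACTOR
print(f"Storage needed: {stored_bytes / 1e12:.0f} TB")   # roughly 933 TB

# Speed of light in fiber is roughly 10 ms of round trip per 1,000 km, so a
# serial chain of 5 RPCs across ~8,000 km cannot beat ~400 ms no matter what.
RTT_MS_PER_1000_KM = 10
print(f"Latency floor for a 5-deep call chain: {5 * 8 * RTT_MS_PER_1000_KM} ms")
```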

In fact, just recently, my team found that this work we were doing to migrate from a monolith to smaller services was actually going to dramatically increase the capacity footprint, because each service, its minimum footprint, added all up, was larger than what we were paying for the monolith. Because we had real numbers, we could actually have a constructive conversation about tradeoffs. It wasn’t just these vague debates around, but velocity, but reliability, but efficiency. It was like we could say, no, we understand the actual math of the tradeoffs we’re making.

Skill 3: Reliability Engineering

As I mentioned, we started doing this to help with reliability. That brings me to the next skill that I think is going to be really important. We still need reliable systems, even if we’re not always writing all of the code. When I think about reliability, I don’t always jump straight to the SRE book that Google wrote. You might think I do, but I don’t. I actually go to Dr. Richard Cook’s work on how complex systems fail. If you’ve seen this, it’s written almost as a series of theorems. However, I tend to think of it visually. I think of it as starting out with, all systems start as a small system that mostly works. I say mostly because they have bugs. They run on hardware. Small is a little bit subjective because most small systems still have a database. They still have multiple services. They still have users, maybe operators, sometimes policies. It’s not necessarily simple, it’s just small.

In those small systems, despite them being small, because it’s made up of all these things, there are still hidden assumptions. There are still emergent behaviors. The collection of those assumptions and emergent behaviors almost makes like an aura around the system that’s bigger than the system. I think of that as the complexity boundary. That complexity is often what causes incidents, in my experience, because a lot of outages happen even when things are working perfectly as intended. It’s because assumptions amongst the parts don’t work out and then you have this weird outage. When we have these outages, people show up and they fix the problems. They fix the assumptions. They repair the bugs.

Oftentimes, they actually want to come in and address other aspects of the problem. Like maybe they add controls. Maybe they add automation. Maybe they add more monitoring. That increases the size of the overall sociotechnical system. It’s now bigger. We shift the complexity boundary. The complexity grows with it. The cycle continues. We have another outage and humans show up and maybe they bring in more advanced recovery concepts like automated failover or circuit breakers or backup systems. The system gets bigger, and so it gets more complex.

The one thing that does worry me is that when we start using large language models to build more and more of our software, they are rate limited by inference speeds and how fast we can come up with the right prompts. Your average engineer is rate limited by their words per minute and how quickly they can think through all of the code they want to write. I think that the large language models are going to really accelerate how quickly our systems grow and how complex they get. Since complexity is the source of outages, that’s what scares me. Then, because we’ll have automated away whole chunks of the system, we’ll be less equipped to jump in and debug it when things go wrong. I have this saying I love, which is every line of code is potentially load-bearing, and that’s true even of code written by large language models.

Going back to ironies of automation, we’re going to have this harder job now because we’re going to have to respond to failures when we don’t always understand the parts or the in-depth aspects of the system. To do that, we’re going to need to take this map of the reality of the system. We need this map because it gives us an orientation, a framing of what the system looks like. Then we’re going to have to do this thing called generic mitigations. Generic mitigations is a way of systems thinking in incident response.

Basically, it’s this idea that instead of trying to find the root cause of an issue or debug it, you just have a set of tools in your toolbox that when the system’s having a problem, you can use a sledgehammer and whack the system back into place to give you time to actually do the deep troubleshooting. A good example is there is one generic mitigation that I bet everyone has done, turn it off and turn it back on again. That solves all sorts of weird issues, like memory leaks, deadlocks, resetting bad state that’s in memory. All sorts of stuff that’s fixed by that.

See, the thing is you have to design your system so you can do that. If you just start restarting a bunch of systems that weren’t designed for that, you’re going to have an even worse failure. It’s this combination of building tools, these generic mitigations, but also making sure our systems will respond well to them. You don’t need to have a lot of generic mitigations, because the whole idea is that there are millions of ways a system can fail and you want a small set of tools that will get it right again. Some of my favorites are, of course, rollback.

Rollback is another really common generic mitigation. Migration, meaning a live migration or moving something over to different hardware, will allow you to understand very quickly whether the hardware is the problem. Again, your systems have to be able to be live migrated without having problems for that to work. Another great generic mitigation is picking certain features that you will gracefully degrade in order to allow other features to go forward because they’re more critical. All sorts of things. The idea is you have to be prepared, because you can’t wait to page in all that context about the system you didn’t write. You’re going to have to have other tools that let you respond.
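A minimal sketch of the idea, with illustrative stub actions rather than any real incident-response tooling, might look like this: a small, pre-built toolbox an on-caller can reach for before the root cause is known.

```python
# Illustrative generic-mitigation toolbox; the actions are stubs, not real tools.
def restart(target):
    print(f"Restarting {target}: clears leaks, deadlocks, bad in-memory state")

def rollback(target):
    print(f"Rolling {target} back to the last known-good release")

def migrate(target):
    print(f"Live-migrating {target} to different hardware to rule hardware out")

def degrade(target):
    print(f"Disabling non-critical features of {target} to protect critical ones")

# A handful of blunt tools that cover millions of failure modes, usable before
# anyone has paged in all the context about code they didn't write.
GENERIC_MITIGATIONS = {"restart": restart, "rollback": rollback,
                       "migrate": migrate, "degrade": degrade}

def mitigate(action, target):
    GENERIC_MITIGATIONS[action](target)  # buy time now; troubleshoot in depth later

mitigate("rollback", "checkout-frontend")
```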

Skill 4: Complexity Theory

Where I’ve been trending with this is that really, as I said, things are going to get really complex really fast if everything plays out as folks expect. You’re going to need to get really good at managing complexity. You already need this now, you just might not know it yet. Understanding complexity usually relies on leaning into complexity theory. What complexity theory basically tells us is that most systems have emergent behavior that is somewhat unpredictable. It is nonlinear. Sometimes it feels non-deterministic. It may even feel chaotic to some of us. This is true even when we think we have a somewhat simple system. Because if you’ve ever looked at chaos theory, you understand that even very simple functions can yield very extreme results depending on the variability in the inputs. Our systems, more and more, are going to look like complex adaptive systems.

In complex adaptive systems, the cause and effect isn’t always obvious, and behavior isn’t always predictable because the people, the processes, the things within the system can adapt. As we build more recovery mechanisms, that’s an adaptation mechanism for the system. The system’s going to change. Understanding the parts is no longer sufficient to understand the whole because of all those interactions. We have to take different approaches to how we approach our work. Ideally, you would reduce complexity. This is actually my first suggestion on anything, is like, can you get rid of some of that complexity? This might be refactoring. This might be eliminating unnecessary service calls, whatever. Just try to get rid of some of the complexity. Sometimes you can’t get rid of all the complexity, because there’s this idea, and it came from Frederick Brooks, that software has aspects that are essential and aspects that are accidental.

The essential things are the things that have to be there if you’re solving a problem. For example, if you’re working in medical software, you can’t ignore the complexity of clinician-patient interactions. That’s always going to be there no matter what happens. The essential stuff is the stuff like the business logic, the fundamental algorithms. This creates essential complexity. The accidental stuff is the stuff that we added as engineers when we were trying to solve the problem through hardware and software. This is like our technology choices, the boilerplate code we write, the design tradeoffs we make. In this sense, accidental is not meant to be unintentional. A lot of this is actually intentional decisions. It’s meant more in the philosophical sense in which Frederick Brooks used it, which is more like, can you remove it and will the thing still be true?

Basically, we are always going to have at least the essential complexity and probably a little bit of accidental complexity all the time in our software. Since predictability is not a feature of complex systems, we can’t plan work as like, do A, B, C, D, E, F, and so on. We have to discover the work because it won’t be known to us at the beginning. We have to take approaches that allow for that, which usually means experimentation. I mean this in the purest sense of have hypotheses, do a series of work to see if your hypotheses are true, and then change the work you planned as a result. Some of this could involve doing back of the envelope engineering first. That could be an experiment. You could do rapid prototyping. You could do scalability-first testing. This is where you stand up a skeleton of the architecture and you scalability test before you add any of the logic, just to see if even the architecture is going to hold up or if it’s going to fall over.
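As one concrete flavor of that, here’s a sketch of what a scalability-first skeleton might look like, assuming a plain HTTP service and using only the standard library (the port and payload size are invented for illustration):

```python
# Scalability-first testing sketch: stand up handlers that do no real work so
# load tests exercise the shape of the architecture, not the business logic.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class SkeletonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No business logic yet: a canned response roughly the size we expect,
        # so the test stresses networking, serialization, and fan-out.
        body = b"{}" * 256
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep load-test output quiet

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), SkeletonHandler).serve_forever()
```

Load testing this skeleton answers the architectural question, does this shape hold up, before any of the real logic exists to confound the result.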

Basically, this work of devising these experiments is going to be really hard work for people to do, because it means you reorder the way you thought about the project. It’s not going to follow the most intuitive linear path. The actual experiments might get easier because you can look to the large language model or the coding assistant to do a lot of the work there, but you still have to drive the learning forward and change what you’re going to do based on it. When you plan to learn, your projects proceed nonlinearly. This can be uncomfortable, because we like things that are linear. A mistake engineering leaders often make is that they demand a simple and predictable solution when dealing with a complex system, and they don’t empower their teams to work in a way that yields discovery.

I think this happens because a lot of engineering leaders started out as an engineer working on that first system that mostly worked. They think, back in my day, I could just do blank and that would all work. They make the mistake of just thinking that you’re still working with that small system. You’re not. You have this system to deal with today. It is much different than what they had. This system is complex, maybe even chaotic. It does not respond to waterfall planning and Gantt charts at all. You cannot tame this system with playbooks and best practices.

This thing requires almost like a conversation to work with it. You have to do risk management. You have to do experimentation. You have to do discovery as part of your project planning efforts. This is what we were doing with that example I gave of moving the services around, is that we didn’t understand fully what would happen in the system, and so we created an experiment, which was the latency injection, which was like poking the system with a stick and seeing what happened on the other end.

Final Note on Engineering Leadership

All the stuff I just talked about, this experimentation focused way of developing, systems thinking, working in these chunks that you don’t really understand anymore, that sounds like a lot to expect of an engineer. Often, when I talk about this, the next question I get from a lot of people is, does this mean we will no longer need junior engineers? We will only need experienced engineers in the industry? No. In my opinion, it would be a big mistake for engineering leadership to start thinking that. Because, yes, we’re going to need folks with a lot of experience and a lot of skills and a lot of understanding of these higher order concepts. We also have another problem. We’re not going to work forever. We’re not going to always be here.

Those of us that learned what we did the hard way, writing boilerplate code, dealing with early troubleshooting issues and very low-level stuff, in my case dealing with assembly language, are going to have to give way eventually to engineers that have had coding assistants and large language models from day 1 of their careers. We’re going to need them to become the experts that we are. None of us was born an expert. None of us came out when we were born and knew how to write code. We learned it. Making any expert requires first acquiring a novice. Then, of course, we should make sure they have some foundations. They need to have the background that helps them navigate these complex systems we deal with. Of course, we need to level them up in these higher order skills that we need, like systems thinking, and non-abstract system design, and reliability engineering, and, of course, dealing with complexity.

We’re going to start here. They’re going to be in that unconscious incompetence state where they’re not even going to know what they don’t know. Then we need to move them up through time and investment to the point where they understand what they don’t know, then they learn things, and then eventually maybe even get to that unconscious competence part. We had it easy, because this path just came about naturally and organically for us; we’re going to have to recreate this path for others.

First, we’re going to have to invest in mentoring. The best way to mentor is not through advice. Advice is the worst way to mentor. The best way to mentor is actually to think out loud, which is you bring someone along as you’re solving a problem, and you talk through your thought processes and why you are solving it the way you did. It’s more like an apprenticeship. We’re going to need to do this more deliberately. The other thing we’re going to have to do is make sure folks have opportunities to learn. We have to make time and space. It’s very easy for us, when we’re experienced folks, to basically say, it would just be faster if I did this myself. If we do everything ourselves or if we give everything to the coding assistants, we’re not creating space for these early career folks to come in and learn how these things all work. We have to be able to do delegation, like actual delegation. I was almost tempted to call this the fifth skill because so many of us have a hard time doing this. We just want to do things ourselves.

Then, of course, we have to make sure there’s space to learn. Not just for the folks coming into the industry, but also for ourselves, because we’re going to have to grow and adapt and learn as we go. The way we approach projects is going to be more learning focused and less predictable. Developing the next batch of experts is going to be just as big a part of what we need to do going forward as building all these amazing software systems that we never thought we could build before, because we’re going to need a lot of engineers capable of managing all this complexity that’s going to come at us really fast, at this accelerated rate. If we don’t do that, and we don’t manage the complexity that’s coming at us, our systems are going to devolve into chaos.

 
