Transcript
Michael Stiefel: Welcome to the architects’ podcast where we discuss what it means to be an architect and how architects actually do their job. Today’s guest is Randy Shoup, who has spent more than three decades building distributed systems and high-performing teams. He started his architectural journey at Oracle as an architect and tech lead in the 1990s, then served as chief architect at Tumbleweed Communications. He joined eBay in 2004 and was a distinguished architect there until 2011, working mainly on eBay’s real-time search engine.
After that, he shifted into engineering management and worked as a senior engineering leader at Google and Stitch Fix. He crossed the architecture and leadership streams in 2020 when he returned to eBay as chief architect and VP for eBay’s platform engineering group. He’s currently senior vice president of engineering at Thrive Market, an organic online grocery in the U.S. It’s great to have you here on the podcast, and I would like to start out by asking you, were you trained as an architect? How did you become an architect? It’s not something you decided one morning and you woke up and said, “Today, I’m going to be an architect”.
How Did You Become An Architect? [01:41]
Randy Shoup: Great. Well, thanks for having me on the podcast, Michael. I’ve listened to so many of the, in general, InfoQ podcasts, and particularly your interviews with Baron and Lizzie, and various other ones, so super excited to be here. Yes. How did I become an architect? Yes. Well, I woke up one day and said, “Today, I will architect”. No, my background is multidisciplinary. So, when I went into university, I was not expecting to be a software engineer. I always loved math and computer science, but I was planning to be an international lawyer. When I was in college, it was the late 1980s. I graduated in 1990. So it was the height of the Cold War, and the U.S. and the Soviet Union had tens of thousands of nuclear weapons pointed at each other, and I wanted to stop that.
Anyway, long story short, through my university career, I studied political science, with a particular focus in international relations and East-West relations. And from that, I took an appreciation of nuance and seeing the big picture and really understanding the problem writ whole. I won’t go through all the boring details, but I also, while I was in college, interned at Intel as a software engineer. So I worked building software tools for Intel’s mask shop, which is one of the things that you need to do to make chips, and I love that.
So when I graduated from university, I was planning on being an international lawyer, that was the mainline career, and I ended up double majoring in political science and then mathematical and computational science because then I was like, “Well, I’m not going to go straight to grad school”. So I worked for two years as a software engineer at Oracle, then I was like, “Okay, time to take the GRE and the LSAT and go to law school and international relations school”, which I did start. For reasons, that didn’t work out. I wasn’t as interested in that as I thought I would be, and the secondary career side gig of software was really pretty fun, and so-
Michael Stiefel: So, in other words, don’t quit your night job.
Randy Shoup: Exactly, yes. Well, very well said, yes. If you have a side gig that you love more than your main gig, maybe you can make it your main gig, and so that’s what I did. So, eliding over lots of details, which we could talk about or not, I wasn’t excited about the international law career for reasons and then begged for my job back at Oracle. My friend and mentor and manager at the time welcomed me back with open arms. And so it’s really the combination then of the engineering skills and doing software with the typing, and also being able to see the big picture.
What makes a good architect? [04:01]
And so, to me, I think we’ll talk about this: a good architect is somebody who can go all the way up and all the way down. And Gregor Hohpe, whom I hope everybody who is listening to this knows, has this wonderful phrase that’s in his book. It’s called The Architect Elevator. So, from the boardroom, you can talk to executives, all the way down to the engine room where you can actually tweak the machinery. So, how did I become an architect? It was crossing those two things. I always was a techie and a fuzzy, as we used to say in college. So, both on the technical side, really enjoyed the math and the science, and also on the other side, fuzzy, like liberal arts and so on.
Software Interacts with the Real World [04:40]
Michael Stiefel: And lo and behold, here you are. So, one of the things that we’ve talked about, and I know both of us are interested in, is realizing how fragile our critical systems are. And very often, the engineers, which is surprising, and the public, which is not as surprising, do not realize it, or the consequences of the fact that these systems are fragile. The most recent one was the example of the CrowdStrike failure. But even things such as… Let me pick an example that people are probably most familiar with. You go to Amazon, you’re told that there are three books left, but you actually have to wait until you get an email, later on in the process, to find out you’re actually getting the book.
Randy Shoup: Right.
Michael Stiefel: And even if they tell you, you get the book, what happens when in the process of getting the book, it gets damaged in the warehouse, in the process of sending it to you. Do they cancel? They ask you, “Do you want it?” So the interaction between the software and the messy world is something that I don’t think people understand.
Randy Shoup: Yes. And the real world can be messy, as you just said, like, “Okay, physical goods can be damaged in warehouses”. They could not make it in shipping. They could get rain damage. They could not be there, even though the system says they are. And also, even more often, the software can screw up. The software thinks that we decremented… Software forgot to decrement a counter, and they should have because of reasons, and all sorts of stuff. People can’t see me because this is a podcast, but I do not have hair. But once, I did. And so I like to joke that, when I entered software and then architecture in particular, I did have hair, that’s actually a true statement. And yes, with all the failures and trying to deal with them and trying to design around them, that’s in large part how my hair… how I went bald.
Designing For Failure [06:55]
Michael Stiefel: Well, I think you just said something very, very important that a lot of people don’t think about, and I’ve gotten pushback from people when I used to talk about this. I used to have a whole talk about designing for failure.
Randy Shoup: Yes.
Michael Stiefel: People didn’t like it. Well, some people did appreciate it, of course. I’m not saying nobody liked it. But especially if you got to the higher ups, they didn’t like it because, “Why should we think about failure? We want to design for success”. And-
Randy Shoup: Michael, you’re designing for failure to start. What are you doing?
Michael Stiefel: Right. But I think we see this also in software engineers, and this is in security, this is in lots of places. They like to design for the happy path.
Randy Shoup: Yes.
Michael Stiefel: They think this will work and this will then work. And when it comes to an error, “Okay, we’ll just throw an exception”, without thinking about where the exception goes, what the program state is, to be caught by the great exception handler in the sky. So how do we get developers and the business leaders, and the public at large to understand that software failure actually is a fact of life?
How to explain software failure to executives [08:09]
Randy Shoup: Yes. In your question, you already went boardroom to engine room, which I love. So, let’s start with the boardroom and then drop to the engine room. So, how do you have this conversation with executives? And look, if you look at talks I’ve given, sections in them, even if not, entire names of talks, are all about designing for failure and everything fails, which is absolutely true. How does one have that conversation with executives is just making it clear that this is not about, “I want it to fail”. This is about things go wrong in the world and we have to deal with them, and so the designing in the face of failure. It’s not designing in order for it to fail, you were never saying that, but it’s designing to be resilient. So that’s the term of art these days is resilient in the face of failures.
So there are lots of ways to do that, but having that conversation is, “Look, as architects, we should be able to say, must be able to say, ‘Hey, we get the happy path. There are many things that can go wrong along the way, and here is the way that the system is going to handle it when those things go wrong.’” And that could be the system retries things, that’s legit. That could be the system times out, that’s legit. That could be the system reconciles, comes back around, like in banking. You send things back and forth between banks in real time, but then there’s this reconciliation at the end of the day, at least traditionally, where you looked at the 999,000 on the one side and the 1 million on the other side, and you sorted it out with rules or with humans.
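The end-of-day reconciliation described here can be sketched in a few lines. This is only an illustration, assuming each side keeps a ledger keyed by a shared transaction id; the function name `reconcile`, the ledger shape, and the result keys are hypothetical and not any bank's actual process.

```python
from decimal import Decimal


def reconcile(ours: dict[str, Decimal], theirs: dict[str, Decimal]) -> dict[str, list[str]]:
    """Compare two ledgers keyed by transaction id and list the discrepancies.

    Anything reported here would then be resolved by rules or by a human.
    """
    shared = ours.keys() & theirs.keys()
    return {
        # We recorded it, the other side did not.
        "missing_on_their_side": sorted(ours.keys() - theirs.keys()),
        # The other side recorded it, we did not.
        "missing_on_our_side": sorted(theirs.keys() - ours.keys()),
        # Both sides recorded it, but with different amounts.
        "amount_mismatch": sorted(t for t in shared if ours[t] != theirs[t]),
    }


if __name__ == "__main__":
    ours = {"t1": Decimal("10.00"), "t2": Decimal("5.00"), "t3": Decimal("7.50")}
    theirs = {"t1": Decimal("10.00"), "t2": Decimal("5.50"), "t4": Decimal("1.00")}
    print(reconcile(ours, theirs))
```

The point of the sketch is that reconciliation does not prevent the mismatch; it makes the mismatch visible after the fact so that it can be corrected.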
How to get engineers to care about software failure [09:31]
So, in all these situations, the architect needs to imagine… I guess, now, I’m in the engine room already. The architect needs to imagine all the things that can go wrong. And what I think trips up a lot of engineers is it’s scary to think of, “Well, gosh, so many things could go wrong”, and that’s true, so many detailed things can go wrong. But if you think about it from a higher level, there are usually not a million different classes of failure. There are a million different instances of failure along the way. But even a relatively complicated workflow, like, I don’t know, a payment system or a checkout, or something like that, there are a handful of classes of things that can go wrong. Resources could be unavailable. How do we deal with that? Logic could not work. How do we deal with that? Things could be slow or down. How could we deal with that.
So what trips up people that are new to this, which is fine, people are beginners, I’m a beginner in lots of things in my life, is that it feels overwhelming. There are a million things that can go wrong. How could I check for all of them? And the answer is don’t. The answer is, think about the three or four classes of things that can go wrong and have a pattern of how to deal with those individual things.
Michael Stiefel: In other words, think about the system state, thinking about what the safe point is, where you go back to.
Randy Shoup: Sure.
Michael Stiefel: To give, maybe, a concrete example is there’s lots of reasons why the credit card service you use may not be available, but you only care about the fact that the credit card service is not available.
Randy Shoup: Beautifully said. Exactly right, yes. I’m going through my checkout flow and I want to charge the payment method, and I can’t. We can imagine, if we thought about it, 25 different reasons together off the top of our heads in a minute why that could be true, network down, my payment system is down, their payment system is down, my credit card… blah, blah, blah, blah, blah. But from the perspective of the logic of that thing, it only matters that I cannot charge this payment method at this time. So, the wonderful thing about working with credit card systems is they’ve been around for a million years and they’re clunky, but the interface does tell you what happened. So, you can get something back that says, “Oh, the card was declined in an unrecoverable way. This is a fraudulent card”, “Okay, we’re not going to retry that one”.
But the more common case is it’s some kind of transient failure and like, “All right, we just retry it”, or if a human is there, we could say, “Hey, human, we couldn’t charge this method. Want to give us another card?” As every single person listening to this has had the experience in buying something at the grocery store, or whatever, and for whatever stupid reason, their particular card in the moment isn’t getting taken, so, “Hey, do you have another one?” “Okay, sure”. Again, there are patterns to deal with this exception handling, all the way up to the top of the system, which in this case, if someone is there, is a human: pop it up and then retry that part of the workflow, retry the charging-payment-method step. There’s lots of ways to deal with it.
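The idea of a handful of failure classes, each with its own pattern, might look like the sketch below. It assumes the payment call reports one of three outcomes; `ChargeOutcome`, `charge_with_retry`, and the returned action strings are hypothetical names for illustration, and a real retry loop would also wait between attempts.

```python
from enum import Enum, auto
from typing import Callable


class ChargeOutcome(Enum):
    SUCCESS = auto()
    DECLINED = auto()   # permanent: e.g. a fraudulent card, never retry
    TRANSIENT = auto()  # network down, processor down, timeout, and so on


def charge_with_retry(charge: Callable[[], ChargeOutcome], max_attempts: int = 3) -> str:
    """Call `charge` until it succeeds, is declined, or attempts run out.

    The 25 possible root causes collapse into two classes the logic cares
    about: permanent (stop immediately) and transient (retry).
    """
    for _ in range(max_attempts):
        outcome = charge()
        if outcome is ChargeOutcome.SUCCESS:
            return "charged"
        if outcome is ChargeOutcome.DECLINED:
            return "ask_for_another_card"
        # TRANSIENT: fall through and try again (backoff omitted for brevity).
    # Retries exhausted: pop the failure up a level, to a human if one is there.
    return "escalate_to_user"


if __name__ == "__main__":
    outcomes = iter([ChargeOutcome.TRANSIENT, ChargeOutcome.SUCCESS])
    print(charge_with_retry(lambda: next(outcomes)))
```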
Business Rules Help In Handling Failure [12:28]
Michael Stiefel: Also, there’s very often a business rule that enters in. So, for example, if the credit card service is not available and this is a customer that you know very well, and it’s $5 or $1,000, or whatever is small in your business, you may say, “Go ahead”.
Randy Shoup: Yes, yes. And actually, I’m sure you know this, but not everybody listening might, this is exactly how ATM machines work.
Michael Stiefel: Yes, that was the exact thing that popped into my mind when you said that.
Randy Shoup: Yes, yes, yes. And again, just to say… either of us could say it, but I’ll start saying it, the way ATM machines work is, “Hey, they have network connectivity to the office systems of your bank”, just like anything else, and they have rules in there that say, “If you cannot connect to that person, you feel free to give Randy $20, but do not give him $2,000”, that kind of thing.
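The ATM rule can be written down as a small decision function. This is a sketch under stated assumptions: the `OFFLINE_LIMIT` value simply echoes the $20 example, and `Decision` and `authorize_withdrawal` are hypothetical names, not how any real ATM network is implemented.

```python
from decimal import Decimal
from typing import NamedTuple, Optional

OFFLINE_LIMIT = Decimal("20")  # hypothetical business rule, echoing the $20 example


class Decision(NamedTuple):
    approved: bool
    needs_reconciliation: bool  # True when cash was handed out without the bank's say-so


def authorize_withdrawal(amount: Decimal, bank_reachable: bool,
                         bank_approves: Optional[bool] = None) -> Decision:
    """Decide on a withdrawal, falling back to a business rule when offline."""
    if bank_reachable:
        # Normal path: the bank's answer is authoritative.
        return Decision(approved=bool(bank_approves), needs_reconciliation=False)
    # Bank unreachable: accept a small, bounded risk and reconcile later.
    approved = amount <= OFFLINE_LIMIT
    return Decision(approved=approved, needs_reconciliation=approved)


if __name__ == "__main__":
    print(authorize_withdrawal(Decimal("20"), bank_reachable=False))
    print(authorize_withdrawal(Decimal("2000"), bank_reachable=False))
```

The failure class ("cannot reach the bank") is handled by a rule the business chose, and the `needs_reconciliation` flag records that the decision still has to be squared with the bank afterwards.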
Michael Stiefel: And also, this is a situation where, because you may be on a network that’s not your bank’s, there has to be reconciliation among all the banks at some point in time, which is what you made reference to before.
Randy Shoup: Right.
Michael Stiefel: So, it’s a combination of business rules and software judgment.
Randy Shoup: Yes, I’m going to riff off that in two ways. I really love the way you said it. So, again, to restate, a million individual instances of things could go wrong. There are typically a small number of classes of failure. For each of those classes of failure, we may have a business rule that says, “Hey, this is how payments are supposed to work. When this kind of failure happens, you’re supposed to retry for three days and then you can give up”, or whatever. So, to your point, there could be, in an ideal situation, a business rule that says what to do. In any case, we have to solve it. Regardless of whether there’s a business rule that told us how to solve it, we, as engineers, have to solve it. That problem will occur and we need to decide what to do. We can’t throw up our hands. And so, again, we should apply some kind of pattern to that.
The other thing that, when you mentioned business rules, I really love and I want to reinforce for people, we’re talking about a payment workflow or a checkout workflow, but this is a generic comment I’m about to make. The initial instinct of the naive engineer, and I’ve been naive for more time than I should admit, is to hide that failure; better is to make it part of the interface. Again, if you’ve not worked with payment systems or checkout systems, the interface is not, “Please make the payment, yes, no”. It is, “Please start this payment”, and there’s a workflow with states behind it, and those states are visible outside. So it’s visible to say, “Hey, the payment is started. It’s pending. It’s authorized. It’s completed”.
And my point there is, as an architect or a software engineer dealing with one of these domains, oftentimes, when you have these failures or when it’s possible to have these failures and you can’t resolve them immediately, as in this example, pop that up a level. That’s now part of your interface. Part of the workflow and part of the payment system that we just sketched is the idea that payments can be in this intermediate transient state. They can be accepted, but not yet done. And in the U.S., for those of us who live here, we still have to wait three days for… If we’re going to pay Michael for something, typically, the banks give themselves three days to get everything all sorted, even in the modern world.
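A payment whose intermediate states are part of the interface could be sketched like this. The state names follow the ones mentioned in the conversation (started, pending, authorized, completed); the `FAILED` state, the transition table, and the `Payment` class are assumptions added for illustration.

```python
from enum import Enum


class PaymentState(Enum):
    STARTED = "started"
    PENDING = "pending"
    AUTHORIZED = "authorized"
    COMPLETED = "completed"
    FAILED = "failed"  # assumed terminal state for unrecoverable errors


# Which states may follow which; anything else is rejected.
ALLOWED = {
    PaymentState.STARTED: {PaymentState.PENDING, PaymentState.FAILED},
    PaymentState.PENDING: {PaymentState.AUTHORIZED, PaymentState.FAILED},
    PaymentState.AUTHORIZED: {PaymentState.COMPLETED, PaymentState.FAILED},
    PaymentState.COMPLETED: set(),
    PaymentState.FAILED: set(),
}


class Payment:
    """The caller asks to start a payment and can always read its current state."""

    def __init__(self) -> None:
        self.state = PaymentState.STARTED

    def transition(self, new_state: PaymentState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state


if __name__ == "__main__":
    payment = Payment()
    payment.transition(PaymentState.PENDING)
    print(payment.state.value)
```

Callers never receive a bare yes or no; they see "pending" or "authorized" and can decide what to show the user while the payment settles.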
The “It’s Never Going to Happen” Fallacy [15:54]
Michael Stiefel: Just as an amusing side point, I remember the days when I ran a business that took credit card payments, but this was before real computerization. You’d have a stack of credit card slips, the one the merchant had, and you go call the bank and list off all the numbers, and they would tell you, in other words, whether it was good or bad. So, in other words, part of the problem is that I think a lot of people have never worked with these manual systems and so don’t realize that, actually, the probability of failure is higher than people think. People very often think, “It’s not going to happen”. I’m sure you’ve heard engineers say that. “We don’t have to worry about that”.
Randy Shoup: Yes. The wonderful and terrible thing of working at places like eBay, Google is the things that occur one in a million times occur thousands of times a day.
Michael Stiefel: Because you have 10 million times.
Randy Shoup: Yes, yes, yes. And the things that occur even one in a billion times, at Google scale, occur thousands of times a day. So, there’s no hiding. And you didn’t really ask me to do this, but I’m going to say it anyway. It is not professional software to ignore failure. You are not being a professional. You know this, and I know you believe it, but just for the listeners, you are not being a professional software engineer if you don’t handle, in some way, failures, and handle could be, again, retry, or reconcile, or something automatic, or it can be simply fail, fail, fail all the way up. But one way or another, you can’t ignore, I was going to say, the possibility, the guarantee. I guarantee you that everything in your system will fail at one point or another. Guaranteed, absolutely guaranteed.
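The scale argument is plain arithmetic: expected occurrences per day equal the daily event volume divided by the odds. The volume used below is an illustrative assumption, not a figure from eBay or Google, and `expected_per_day` is a hypothetical helper.

```python
def expected_per_day(one_in: int, events_per_day: int) -> float:
    """Expected daily occurrences of an event that happens once per `one_in` tries."""
    return events_per_day / one_in


if __name__ == "__main__":
    # Assumed volume: five billion operations a day.
    print(expected_per_day(one_in=1_000_000, events_per_day=5_000_000_000))
```

With those assumed numbers a one-in-a-million failure shows up about five thousand times a day, which is why "it's never going to happen" does not hold at scale.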
Michael Stiefel: Yes. And you have to leave the system however you handle a failure in a stable state.
Transactions and Workflows and Sagas, oh my! [17:44]
Randy Shoup: Correct. And I know you and I have chatted about this, so maybe this is going to make me want to take it in a transactional way. So, a way that we traditionally have approached this problem, and it’s a good one, it’s a great tool to have in our toolbox, is transactions. So, the conceptual idea… I know everybody knows, but the conceptual idea is I have several things I want to do and let’s wrap them all in a transaction and make them all happen together or none at all. So atomic, consistent, isolated, and durable, the ACID properties. And when you have a system where that can work, it implies a single database. When you have a system that can work, that’s great, that’s a wonderful tool in your toolbox you absolutely should use.
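For the single-database case, the all-or-nothing behavior can be shown with Python's built-in `sqlite3` module. This is a minimal sketch; the `accounts` table, the account names, and the helper functions are hypothetical.

```python
import sqlite3


def make_accounts_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move money between accounts; both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT id, balance FROM accounts"))


if __name__ == "__main__":
    conn = make_accounts_db()
    transfer(conn, "alice", "bob", 30)
    print(balances(conn))
```

This works only because both rows live in one database, which is the limitation the conversation turns to next.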
And also, the systems, even the “simple” systems that we were just talking about, payment systems, checkout, et cetera, are not like that. There’s not a single database between Bank of America and Deutsche Bank when they exchange stuff. There’s not a single database or a single distributed transaction between my local grocery store and the credit card processor.
So, how do we deal with that? It’s asynchronous, and so we make it a workflow. And workflows are made up of asynchronous messages and we figure things out, and we have the SAGA pattern, which we could talk about in detail if we’re interested, but the conceptual model at the higher level is, “Well, when I can’t make it a transaction where it’s all or nothing in this moment, I need to then think about it as a workflow or a state machine, and it’s incumbent on me, as the software engineer, to make sure that we enter that state machine from a safe system state and that we transition through the states in that state machine and we exit the state machine in a safe state”.
Do Not Hide Transient States [19:22]
And again, what I was saying before about making these transient states visible, when you’re in one of these situations, it really behooves you to not hide the fact that there’s all these state transitions and transient stuff happening. Instead, it should be explicitly part of the external interface, if that makes any sense.
Michael Stiefel: I think I can give you a simple example of that where it should be visible not only to the system, but to the end user. Let’s say you’re signing up a customer and they have to provide social security number, all kinds of information.
Randy Shoup: Right.
Michael Stiefel: They may not have all the information at once.
Randy Shoup: Of course.
Michael Stiefel: So what are you going to do? Throw them out and make them reenter everything all over again, or keep it in a semi-complete state, which may enable them to do certain things, but not other things.
Randy Shoup: Right. What a wonderful, visceral, evocative metaphor. That’s gorgeous because everybody can say, “Well, yes, when I’m getting my passport”, as my son just did, “there are steps and there’s a whole workflow”. And he doesn’t have his birth certificate and all these things all at the same exact moment and transactionally enter them all or not. And even if he did, to your exact point, he spent two hours entering all this various biographical information about himself and various things and proving his identity, and so on. And if, at the end, some stupid thing at the U.S. government went wrong,
Michael Stiefel: The Internet goes down.
Randy Shoup: … and you had to go reenter all those two hours again, that would have been insane. It was annoying enough as is.
Workflows are Resilient to Failure [20:57]
So, anyway, I love that metaphor because it doesn’t mean it’s a failure… A workflow is resilient to failures. A well-designed workflow is exactly resilient to the kinds of failures we’re talking about. And also, it is “resilient”, there’s probably a better way to say it, to, “You know what? I don’t have that information right now.” So, thinking about filling out a government form, or doing a payment process, or even the software engineering that we do, thinking about it as a workflow is really freeing and it is the problem.
Align the Architecture With the Problem [21:31]
One of the other… It’s a meta principle for me, and this is the way I look at it, but I think the very best architected and designed systems I’ve ever worked with are very directly aligned with the problem. This is exactly domain-driven design. But when you can take your overall problem and reify the real world into the software directly, if that makes any sense, there’s a thing that’s part of the world like, “Oh, we have eBay”. “Okay, people buy and sell things online”. Okay, well, just imagine what happens when you’re buying and selling things offline.
Every one of the steps that you do, like I walk into the store, I go choose the thing, I pull out my wallet, all the steps that happen should have, do have, an exact analog in the software system that we build. And if we can use the inspiration and the, often, thousands of years worth of human knowledge about how to do those things in the real world and just put those into the software essentially, then we’re in much better shape.
The example I always give in this is I’ve never worked for Lyft, or Uber, or Grab, or whatever, and I don’t know their history, but I should look it up, I guarantee though, because this is how every system evolves, they started as a monolith. Each of them, I bet. And then what’s the natural domain decomposition for an Uber? It’s driver side, rider side. So, the rider side has a bunch of concerns and apps, and so on. And then totally separately from that, there’s the driver’s side. And then totally separate from that, there’s the back office like, “Well, how do you show the rider what drivers are available?” So where I’m going with that idea is it behooves us as architects to really fully understand the real problem. What’s the real thing we’re trying to do? And in this case, it’s obvious like, “Okay, I want to get a ride from point A to point B, and somebody is going to drive me”, and then reify that or express that in the software.
And then to the point back that we were talking about, about how to deal with failure, it becomes pretty obvious what the patterns are. We have to still type and stuff, but it becomes pretty obvious what the patterns or conceptual mechanisms, if that makes any sense, to deal with these things. What happens? What’s supposed to happen when I schedule a ride and they don’t show? Well, because they’re late or whatever. Well, I don’t know. When you’re hailing a cab in New York City, how does that work? Well, okay, it’s exactly the same thing.
Michael Stiefel: Although some things become a little more difficult in the virtual world. For example, to go back to your eBay example, if the merchandise is right in front of me, I can inspect it.
Randy Shoup: Yes.
Michael Stiefel: That’s a more complicated problem in the virtual world, or to go to your example of Uber and Lyft, in the past, there was a taxi commission, and I knew the police ran a security check on the drivers.
Randy Shoup: Right.
Michael Stiefel: So, in other words, how, as you say with the reification, sometimes it’s not a one-to-one mapping.
Randy Shoup: Totally. So, we are strongly agreeing, but it won’t seem like that for one moment. Yes, the conceptual problems are the same like, “Hey, I want to see if this merchandise I want to buy is good”. That is a real problem. To your exact point, the solution in the virtual world is different from this. I can’t touch and feel it, so what else can we do? And you weren’t asking, but I’ll actually tell because it’s cool, eBay has at least three mechanisms I can think of off the top of my head for that. Number one, eBay’s feedback system, that’s been around for the 28 years, or whatever, of eBay. And this is gameable, but you can develop over time a trust system for the seller and for the buyer. Okay, so that’s number one.
Number two, in terms of the specific merchandise, for various things, I think it might be broader now than when I was here before, there’s a money-back guarantee. If you get something and it doesn’t meet the description, there’s a mechanism to return it and get your money back. And also, for particular items that are very often counterfeit, think sneakers, watches, handbags, a bunch of these things, particularly for Gen Z are like traded assets, essentially, people get the… I’m going to say it wrong, but people get the Michael Jordan super sneaker, or whatever, and only made 10 of them. They got gold stars on them, or whatever.
Anyway, for that, eBay has started actually bringing those things to physical warehouses and physically inspecting them and putting a virtual stamp of approval, if that makes any sense. So, a long-winded way of saying, 100%, there is that same problem statement, which is, “Hey, I want to buy this thing. Is it good quality and is it the thing that I actually want to buy?” And in the virtual world where we can’t touch it, we actually need to do a bunch of different other schemes, essentially.
Michael Stiefel: But if you think about it, and I’m going to date myself a little bit here, this is the exact problem that Sears, Roebuck and Montgomery Ward had with mail order.
Randy Shoup: Totally.
Michael Stiefel: So, in other words, there was a reputation in the company there, as opposed to the person. But again, there are more analogies than one might imagine to help you think about this problem.
Randy Shoup: Yes, that’s great. I love it, I love it. Yes. Exactly mail order, yes.
Workflows and State Machines [27:05]
Michael Stiefel: I have found that people have trouble with workflows and state machines, especially where there are approvals involved and the process is long-running. Before, when we were talking about state machines, we were talking about a program where there’s failure; it may be asynchronous, but, more or less, the software is waiting on itself, so to speak.
Randy Shoup: Sure.
Michael Stiefel: But when you have things like getting approvals, which is another way where workflow comes in, when it’s long-running, people have trouble with that. Especially when you have to deal with choreography or event-driven stuff, that becomes another level of abstraction that you have to put on things.
The SAGA Pattern [27:50]
Randy Shoup: Yes, 100%. In fact, I’m dealing with this exact issue in my day job because, “Hey, we’re an online grocery and we take people’s payments, and we need to ship them things, and that’s a workflow”. So, the most widely known well-functioning pattern for this is called the SAGA. Anybody who wants to Google that will find stuff from Chris Richardson and Caitie McCaffrey on what are called SAGAs. And so just very high-level, it is, we got this workflow and lots of different… not one system does it all. It’s interactions between different systems.
So, A sends a message to B, B does some things, B sends a message that’s received by C, they do some things, and the SAGA is just a way of representing that at a bit of a higher level. And then if there’s a failure at C or D, then you do compensating operations. So you do individually transactional, individually durable operations along the way, but they’re separated in time. So A happens, and then at a totally different time outside that transaction, the B thing happens. And totally after that, the C thing happens. And if there’s anything that goes wrong with this workflow, you do what are called compensating operations, which essentially undo things in the reverse direction. So there’s a lot of literature and techniques around that SAGA pattern. That’s a great one for people to… Oh, why did people even invent that? And exactly because this is hard.
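The compensating-operations idea can be reduced to a small in-process sketch. It assumes each step is a pair of plain callables, an action and its undo; `run_saga` and the step names are hypothetical, and a real SAGA would exchange durable messages between separate systems rather than run in one process.

```python
from typing import Callable

# (name, action, compensation); each action is assumed to be individually durable.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse order."""
    completed: list[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            # Compensate only what actually finished, newest first.
            # (A production saga must also cope with a compensation failing.)
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


if __name__ == "__main__":
    log: list[str] = []

    def charge_fails() -> None:
        raise RuntimeError("payment processor unavailable")

    ok = run_saga([
        ("reserve", lambda: log.append("reserve stock"), lambda: log.append("release stock")),
        ("charge", charge_fails, lambda: log.append("refund")),
        ("ship", lambda: log.append("ship"), lambda: log.append("recall shipment")),
    ])
    print(ok, log)
```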
Orchestration and Choreography [29:13]
Relatedly though, and this is something that I’m just looking into but lots of people know a lot about, is an open-source system called Temporal. It is a way of making these workflows durable in a very easy to program setup. And I’m not going to do it justice because I haven’t actually done this stuff yet, but we’re going to do it real soon now. So, you mentioned choreography. So there’s choreography versus orchestration. Choreography is these events fire and traverse themselves, but there’s no central coordinator. Orchestration is where, just like a conductor in an orchestra, there’s a central “controller” that makes sure that A and B, and C, and D happen or don’t happen and controls the workflow going back and forth.
So, Temporal is an orchestrator concept, and you program that orchestration logic in a regular programming language, Python, Java, PHP, whatever, Go. They support a million of them, and the system stores where you are in that workflow. And if you have failures along the way, it brings the system back to the state where you were, and then you just keep going.
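The general idea of an orchestrator that remembers where a workflow is and resumes after a failure can be illustrated with a toy checkpointing loop. To be clear, this is not Temporal's API or implementation; `run_workflow`, the `store` dictionary standing in for durable storage, and the step names are all hypothetical.

```python
from typing import Callable, MutableMapping


def run_workflow(workflow_id: str,
                 steps: list[tuple[str, Callable[[], None]]],
                 store: MutableMapping[str, int]) -> None:
    """Run steps in order, checkpointing the index of the next step to run.

    `store` stands in for durable storage. If a step raises, the exception
    propagates; calling run_workflow again resumes at the failed step
    instead of starting over. Steps should be idempotent, because a crash
    between a step finishing and its checkpoint would re-run that step.
    """
    start = store.get(workflow_id, 0)
    for index in range(start, len(steps)):
        _, step = steps[index]
        step()
        store[workflow_id] = index + 1  # record progress after each step


if __name__ == "__main__":
    log: list[str] = []
    attempts = {"charge": 0}

    def flaky_charge() -> None:
        attempts["charge"] += 1
        if attempts["charge"] == 1:
            raise RuntimeError("transient failure")
        log.append("charge")

    steps = [("reserve", lambda: log.append("reserve")),
             ("charge", flaky_charge),
             ("ship", lambda: log.append("ship"))]
    store: dict[str, int] = {}
    try:
        run_workflow("order-1", steps, store)
    except RuntimeError:
        pass  # the first run stops at the failing step
    run_workflow("order-1", steps, store)  # resumes at "charge", not "reserve"
    print(log)
```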
I’m not giving it full justice. But actually, if anybody googles for Temporal, they have a great website with lots of sample code and lots of great explanations. And also, there are literally 100 or more YouTube videos about how it works and all the companies that use it. And so I won’t even be able to list them all, but every Snap story is a Temporal workflow. At Coinbase, every Coinbase transaction where you’re moving crypto back and forth is a Temporal workflow. Netflix uses it as the base of Spinnaker, which is their CI/CD system. I’m actually forgetting a bunch of things, but it’s used very… oh, Stripe. Every Stripe transaction, which is also money, is also a Temporal workflow. So it’s open-source, then there’s a cloud offering by the company that supports it.
Anyway, I mentioned this only because I have this exact workflow problem in my day job and I wanted to make it easier, and I was all set to teach everybody about what I was just saying, event-driven systems and SAGAs, and compensating operations, and state machines, and so on, and those are the real thing. They actually work that way, that’s actually how the systems ultimately work at the base. But I was searching for a way to make it easier, and I think that Temporal is it. And don’t trust me, trust Stripe and Coinbase, and…
Michael Stiefel: Of course, people sometimes have trouble deciding when to use choreography and when to use orchestration.
Randy Shoup: Yes.
Michael Stiefel: But as you say, you really have to think about what’s important in the problem that you’re going to solve.
Randy Shoup: I have my own answer. Because again, as a reminder for people, if these are new to you, choreography is like a dance where there are lots of events that are happening, but there’s no central coordinator that is saying, “You step here, you step there”. Whereas by contrast, again, orchestration is the orchestra conductor, tap, tap, tap on the lectern, or whatever you call it, and getting everybody to play in rhythm. So, when to use both? If this workflow is very important and is complicated, then you want orchestration, for sure. So payment processing, checkout, all these things that we’re talking about, those, in my view, absolutely should be orchestrations. Why? Because you have a state machine that you need to make sure executes durably, reliably, and completes successfully one way or another, either all done or all the way back to the beginning.
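The "either all done or all the way back to the beginning" behavior can be sketched as a small orchestrator that pairs each step with a compensating operation. This is an illustrative Python sketch only; `run_saga` and the checkout steps are hypothetical names, not taken from any system mentioned in the conversation.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
SagaStep = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[SagaStep]) -> bool:
    """Run every step; on failure, undo completed steps in reverse order.

    Returns True if all steps completed, False if the saga was rolled back.
    """
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Walk back through the steps that did succeed, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


events: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier rejected the shipment")


checkout: List[SagaStep] = [
    (lambda: events.append("reserve item"), lambda: events.append("release item")),
    (lambda: events.append("charge card"), lambda: events.append("refund card")),
    (fail_shipping, lambda: events.append("cancel shipment")),
]

succeeded = run_saga(checkout)
```

In this run the third step fails, so the card is refunded and the item released, in that order. The sketch assumes compensations themselves never fail; a production system would need to retry them durably.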
You use choreography in those cases where you don’t… I don’t want to say don’t care, but you don’t care as much. It’s not as much of a state machine as an informing of other systems to do a thing, and you’re like, “What do you mean, Randy?” Well, I’ll give you an example. So, I’ve been at eBay twice. eBay has been using an internally built Kafka-like system for many years, almost 20 years. And a thing we learned, everybody else learned at the same time, too, is choreography is best in those cases where you don’t have a state machine. You just want to inform people and have them do stuff.
So an example is, when you list an item on the site, we absolutely want to make sure that all the payment and the exchange of stuff actually happens, so that stuff is orchestrated. But when you list an item on the site, there are literally tens of different other things that happen. So, you list a new item, eBay takes the image that you gave them and thumbnails it, and all these different things. It gets checked for fraud. It increments and decrements a bunch of counters about the seller’s account. Right now, the seller has sold a thousand things and, yay, they now get a gold star or a platinum star. So all these different things, and none of those is a workflow in the sense that we could and should continue the mainline work of accepting that item and putting it on the site as those other things happen in parallel.
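The listing example can be sketched as choreography: the mainline code announces that an item was listed, and independent subscribers react. The `EventBus` class below is a tiny in-process stand-in for a Kafka-like system, and every name in it is hypothetical. It delivers events synchronously for simplicity, whereas a real event system would deliver them asynchronously.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """A tiny in-process stand-in for a Kafka-like event system."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.failed: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        # Each subscriber reacts on its own; one failing does not stop the rest.
        for handler in self._handlers[event_type]:
            try:
                handler(event)
            except Exception:
                self.failed.append(handler.__name__)


thumbnails: List[str] = []
listing_counts: Dict[str, int] = defaultdict(int)


def make_thumbnail(event: dict) -> None:
    thumbnails.append(f"thumb-{event['item_id']}")


def check_fraud(event: dict) -> None:
    raise TimeoutError("fraud service is slow today")


def update_seller_counters(event: dict) -> None:
    listing_counts[event["seller"]] += 1


bus = EventBus()
bus.subscribe("item_listed", make_thumbnail)
bus.subscribe("item_listed", check_fraud)
bus.subscribe("item_listed", update_seller_counters)

# The listing itself is accepted first; then the event is announced.
bus.publish("item_listed", {"item_id": 42, "seller": "alice"})
```

No component here knows the full list of things that happen when an item is listed, which is the point: there is no central state machine, only subscribers that each do their own part.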
Michael Stiefel: Because if no one gets the gold star, nothing else is dependent on that gold star.
Randy Shoup: Yes. And I want to be clear that, ultimately, you do get the gold star, but it doesn’t have to happen in a state machine-y way.
Michael Stiefel: Yes, yes.
Randy Shoup: For this example, I think, there is a reconciliation so that, if we didn’t process that event, every so often we come around and go, “How many things did you actually sell?” Anyway, I hope that explanation makes sense.
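A periodic reconciliation of this kind might look like the following Python sketch. It assumes, purely for illustration, that the sales records are the source of truth and that the per-seller counters are a derived value that can drift when an event is missed. The function and data names are hypothetical.

```python
from typing import Dict, List


def reconcile_sold_counts(sales: List[dict], counters: Dict[str, int]) -> Dict[str, int]:
    """Recompute per-seller sold counts from the sales records and fix drift.

    Returns the corrections applied, as {seller: actual - recorded}.
    """
    actual: Dict[str, int] = {}
    for sale in sales:
        actual[sale["seller"]] = actual.get(sale["seller"], 0) + 1

    corrections: Dict[str, int] = {}
    for seller, count in actual.items():
        recorded = counters.get(seller, 0)
        if recorded != count:
            corrections[seller] = count - recorded
            counters[seller] = count  # bring the derived counter back in line
    return corrections


# Hypothetical data: one "item sold" event for alice was never processed.
sales = [{"seller": "alice"}, {"seller": "alice"}, {"seller": "bob"}]
counters = {"alice": 1, "bob": 1}
corrections = reconcile_sold_counts(sales, counters)
```

Because the job recomputes from the records rather than replaying events, running it twice is harmless: the second run finds nothing to correct.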
Michael Stiefel: Yes. So, I want to summarize this part by saying, and maybe this is something that will appeal to both business people and engineers. We talked about how designing for failure is a reification, or an abstraction, or an implementation of what happens in the real world. Well, if you think about the real world, failure happens, and the point is you dust yourself off when you have a failure, and you get up and go on.
Randy Shoup: Yes.
You Will Fail – How Will You Respond to Failure? [35:07]
Michael Stiefel: So the real issue is not did you fail, but how do you respond to that failure.
Randy Shoup: Yes, it’s exactly about the resilience to failure. The wonderful framing, which is not mine, I think it’s John Allspaw and the whole resilience movement, but I’m going to say what these acronyms mean in a second, it’s not about maximizing MTBF, it’s about minimizing MTTR. So, what do I mean? MTBF is mean time between failures. And so if you’re a hardware manufacturer, a thing you want to say is, “Hey, these hard drives that I ship, they don’t fail very often”, and you’re like, “What do you mean by very often?” “Well, our mean time between failures is one in a million, whatever, or four years for this Seagate hard drive, or whatever”. But that’s not the right way to think about software. It’s not about working so hard to never have anything go wrong, because that doesn’t work. Things are going to go wrong. Instead, MTTR, mean time to restore. So, instead, think about, when things fail, how can we recover as quickly as possible and get us back into a correct state?
Again, whether that is retrying and trying to move forward, or rolling back, or undoing, or whatever, and trying to get us back to the beginning, either way. The correct thing, and we’ve learned this over time in the industry in the last, let’s call it, decade is systems are easier to build, much more reliable to operate, and much cheaper if you don’t try to avoid failure. But instead, you try to respond to failure and be resilient to it. So this is exactly the insight behind cloud computing. It’s not have one big system that never, ever, ever goes down, that’s mainframe era thinking.
Instead, in a modern data center, there are literally 100,000 machines, and you don’t try to make sure none of them fail because, at any given moment, handfuls, maybe hundreds, are down, thousands maybe. But whatever, who cares? Because we put stuff in three different places and we can move things around quickly. And so all these patterns at the higher level are all around letting individual components or individual steps fail, and we don’t care about that because we have a higher level correctness that we’ve layered on.
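A rough way to see why recovery time and redundancy matter so much is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The formula is not quoted in the conversation, and the numbers below are purely illustrative. The sketch also assumes replicas fail independently, which real systems only approximate.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def any_replica_up(single_availability: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies is up."""
    return 1.0 - (1.0 - single_availability) ** replicas


# Illustrative numbers only: a component that fails about every 1,000 hours.
slow_recovery = availability(mtbf_hours=1000, mttr_hours=10)  # about 0.990
fast_recovery = availability(mtbf_hours=1000, mttr_hours=1)   # about 0.999

# Three copies of a 99%-available component, assuming independent failures.
three_copies = any_replica_up(0.99, replicas=3)
```

With the failure rate held constant, cutting recovery time from ten hours to one removes most of the downtime, and three modest replicas together are up far more often than any single one.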
Resilience is Not A Castle With Moat, Alligators, and a Drawbridge [37:26]
Michael Stiefel: I think last QCon San Francisco, there was a session that we both attended. It was about security. And I apologize for not remembering the speaker’s name, but they had this metaphor of it’s not about building a castle with a moat around, and then alligators, and a drawbridge to make sure no one gets in.
Randy Shoup: Right.
Michael Stiefel: That’s not the right metaphor because the invaders will get in.
Randy Shoup: Yes.
Michael Stiefel: What do you do to section them off, and deal with the failures that inevitably will happen, because the castle will be breached.
Randy Shoup: Yes, exactly. It’s not the hard shell and the soft center. It’s instead zero trust where you componentize and isolate all the individual things. You assume you’re overrun. Your invaders are there, they’re in the house, but what can we do to make sure the room I’m in is safe, or even if they get in the room, they can’t harm me because I wear an Iron Man suit, or whatever? So that mental model is great. The other equivalent, I hope it’s equivalent, is the componentization. Like you say, the isolation. So that’s circuit breakers, that’s bulkheading, that’s all those kinds of patterns. And thanks to Michael Nygard for writing those all up in his fantastic book, Release It! Please buy, try and read that.
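As an illustration of the circuit breaker pattern mentioned above, here is a minimal Python sketch. It is a simplified version written for this transcript, not code from Release It! or from any particular library, and the class, parameter, and function names are assumptions. After a run of failures the breaker "opens" and rejects calls immediately, which isolates the failing dependency. After a cool-down it lets one trial call through.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency is isolated; failing fast")
            # Cool-down elapsed: allow one trial call (the half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("downstream service is down")


outcomes: List[str] = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the cool-down has passed
recovered = breaker.call(lambda: "ok")
```

The third call never reaches the failing service, so callers are not left waiting on a dependency that is already known to be down.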
Michael Stiefel: I recommend everybody read that.
Randy Shoup: Yes. So Michael Nygard and Release It! popularized… Because he wouldn’t even himself say he invented these things, but popularized circuit breakers, bulkheading, et cetera, which are exactly these isolated components of the system that are safe. But the other way to think about it is every mitigation or defense that we would put in place is Swiss cheese, but make sure that those holes in the layers of Swiss cheese don’t overlap. I’m making this up, but imagine five pieces of Swiss cheese and you orient them in such a way that at least one of those things blocks everything. So there’s no hole all the way through.
And also, probably related to the first, but I would say it separately, the zero trust where your mental model is you’re out there naked on the internet and you need to make sure that anybody you interact with is legit. So that’s mutual TLS, so end-to-end encryption in transit. That’s encryption at rest, that’s integrity of the messages, that’s authentication and authorization of the identities of the people that are talking to you.
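For readers who have not configured mutual TLS, the following Python sketch shows the server side using the standard library `ssl` module. The helper name and the file path parameters are hypothetical. The essential line is `verify_mode = ssl.CERT_REQUIRED`, which makes the server refuse any client that does not present a certificate signed by a trusted authority.

```python
import ssl
from typing import Optional


def make_mtls_server_context(cert_file: Optional[str] = None,
                             key_file: Optional[str] = None,
                             client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side TLS context that also demands a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # The "mutual" part: reject clients that do not present a valid certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity (hypothetical paths supplied by the caller).
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # The authority whose signatures on client certificates we trust.
        context.load_verify_locations(cafile=client_ca_file)
    return context


# Without file paths this only builds the policy; a real server needs all three.
context = make_mtls_server_context()
```

This covers only encryption and identity in transit; encryption at rest, message integrity, and authorization are separate layers.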
Michael Stiefel: All the things that the WS-star things tried to solve, that was a big industry struggle. But I think eventually, the industry has realized that these things are important and it’s not just about encrypting one transaction between the user and the system.
Randy Shoup: Yes.
Architecture and Team Satisfaction [40:05]
Michael Stiefel: I do want to ask you a question. I don’t know if we’ve ever discussed this before, but this is something that’s become interesting to me: how architecture can affect team performance and team satisfaction with their job. You must have come across this. At one level, it seems simple. For example, if you have a loosely coupled system, it makes it easier for individual teams to do their jobs, but I think there’s something deeper here. And I think because you’ve been both an architect and an engineering manager, and have had lots of different roles, you must have some unique perspective on this.
Randy Shoup: Yes. I don’t know if it’s unique, but I definitely have a perspective. Off the top of my head, I would say at least two, and it’s going to grow in a moment. So number one is, at the highest level, if your system architecture matches the problem, again, this is back to what we were saying before, if you take a domain-driven design approach and you can find a place in your system that matches a part of the real problem you’re trying to solve, that’s already good. Why is that good? Because it reduces the cognitive load for the people trying to solve problems, again, because it matches the problem. So, if you understand the problem like, “Oh. Well, where does the payment processing step belong?” “Oh, it’s in the payment processor”. “Okay, cool, that’s awesome”. So, matching the problem is number one.
Number two is, to your point, componentization, whether you think about that as microservices or components in some other way, not having to think about the entire system all at once, but instead only having to think about the payment processor part or the bank interchange part, or whatever, taking a big problem, which is the entire thing of eBay or Google, or whatever, and instead making it a much smaller problem. And then thirdly, architecture should be a tool that helps you think, and so having the tools and the patterns to do things easily. And so, we were talking about a bunch of those things like, “Hey, if it’s easy to do workflows...” Workflows are complicated, but if we, in our architecture, make it easy to do them because we have either built a system that makes it easy or we have other implementations of the SAGA pattern, or whatever, that you can go look at.
So a good architecture is one where, as an architect or as an engineer, I have a lot of different tools. And I don’t mean compilers. I mean components in the system or patterns in the system that allow me to do things. Because the best architectures that people have worked in are ones where you’re like, “Oh, let’s see, I need a data store”. “Okay, here’s this menu”. “Okay, cool, I’ll take that one”. “I need to do events back and forth”. “Okay, I’ll take that”, and having all those tools. And now, I’m going to add a fourth one, which is a paved path, like a Netflix or a Google, where all the pieces of this system are well-supported. So I can easily spin up a new service because there’s this template. Everything is in there. It’s integrated into the monitoring system, integrated into the RPC mechanism, integrated into CI/CD, blah, blah, blah.
Should Each Team Use Its Own Set of Tools? [43:14]
Michael Stiefel: When you’re talking about tools, one thing that always comes to mind is this struggle between the desire to impose that everyone uses the same tools, because it makes it easier to transfer between systems or hire people, and each team choosing the tool that’s most appropriate for them. And this extends to languages, to high-level tools. How do you feel about that?
Randy Shoup: Yes, I feel very strongly about that. You want to have both. So, the best places that I have worked and the most effective, call it, architectures or engineering organizations, or whatever, are ones where there is a paved path, or a very small number of them. Netflix and Google are great examples. Of the possible programming languages in the world, at least when I was at Google 10 years ago, there was good support for four: C++, Java, Python, and Go.
Other languages are allowed, but you have to roll your own, everything. It has to integrate with a monitoring system and integrate with the testing frameworks, blah, blah, blah, blah, blah. So, certainly at large scale. I’m going to make a different comment when you’re small scale, but at large scale, having a paved path that is well-supported by people whose job it is to support it. That was my team’s job at eBay, by the way, to support the frameworks, and also allow people to go off the reservation. So paved path, but you can bushwhack.
And why do you allow bushwhacking? It’s because, sometimes, that is the right thing to do to solve this particular problem. Like some machine learning problem, let’s say, they should do it in Python because that’s the language everything is written in. And if you’re doing some other kind of system, maybe that’s Erlang. So there’s a reason why WhatsApp was only eight people when they were acquired by Facebook, or whatever, because Erlang very much matches… and that whole system super naturally matches the messaging problem they’re trying to solve.
Anyway, my point is, again, paved path, plus the ability to do new things, and to do new things, again, is because it allows individual teams to match the exact problem they’re trying to solve, and also it allows growth and evolution of the common framework. So if you’re in this monoculture and you never look outside, you’re stuck. And there are lots of examples out in the world where companies have gotten themselves stuck in a rut where, “Okay, we’re only Java, and we’re going to keep our blinders on and never look anywhere else”, and that hasn’t been a super bad choice, but there are companies that are in the Microsoft ecosystem which is better now than it was 10 years ago. But you know what I mean? “Hey, we only do Microsoft”, and even the thinking about how to do distributed systems was very isolated, if that makes any sense.
Michael Stiefel: Yes, I lived in that world for a long time.
Randy Shoup: No shade on either of those ecosystems, I use them both, but you see where I’m going with that. So, I have a strong visceral belief that you shouldn’t have a monoculture at scale. Okay. Now, when you’re small, like I am, we have 100 people in the engineering organization at Thrive Market, where I work, we really should all be working on one thing. Some individual teams need to… Again, machine learning is a great example. They need to do stuff in Python, whether or not we did that elsewhere on the site. But when you’re small, it should definitely not be like every team for itself because you don’t have a lot of time to waste, a lot of resources to waste on doing things in multiple ways. So, that’s my quick thinking on standardization versus letting a thousand flowers bloom.
Michael Stiefel: I like that approach because it differentiates between the small and the large, and you can see what happens if you adopt what you say for small teams when you grow to larger scale. Because a lot of the problems, which we didn’t talk about and it’s another whole thing we could talk about, but we don’t have the time, and this podcast could be hours-
Randy Shoup: Yes, we’ll do another one, or something, if you want.
The Surprise of Large Scale [46:56]
Michael Stiefel: Right. Because what happens when, all of a sudden, you wake up and you were at small scale, and tomorrow, you got mentioned in the press and you now are at large scale all of a sudden.
Randy Shoup: Yes. Every company that we think of as large scale had that scaling event.
Michael Stiefel: Yes.
Randy Shoup: It is rare to have had that happen very slowly. It happens, but it is rare. Everything is an S-curve, but the way more likely is you’re chugging along, no one knows about you, and all of a sudden, kaboom, you hit something. Again, mentioned in the Wall Street Journal or reach a critical mass of people knowing about it and telling their friends, or whatever. And we don’t have time to talk about it here, but I have thought about for many years these phases of companies and products. And there’s a starting phase and a growth phase where the J-curve, as people talk about, or, really, the S-curve starts to get steeper and you go faster, and then it flattens out.
Michael Stiefel: So maybe on some other podcast, we’ll talk about scaling and whatever else comes up.
Randy Shoup: Sure, sure.
The Architect’s Questionnaire [48:03]
Michael Stiefel: This is the point where I like to go off and ask the questionnaire that I like to ask all my architects. I find it also adds a human dimension to the podcast.
Randy Shoup: Great.
Michael Stiefel: So, what is your favorite part of being an architect?
Randy Shoup: I mentioned it earlier, actually. It’s Gregor Hohpe’s Architect Elevator. So, I get a lot of enjoyment out of going up to the boardroom and down to the engine room. There’s something that’s just very energizing for me about being able to see things and help solve problems in the large, but also see things and solve problems in the small, and each of those lenses informs the other. What I don’t like-
Michael Stiefel: Right. What is your least favorite part of being an architect?
Randy Shoup: I haven’t had this experience a lot, but when it is not considered important or strategic to the organization or the company, I don’t like being not productive, or useful, or valuable. So, if I am in a situation where it’s not considered valuable to do this stuff, I should go somewhere else.
Michael Stiefel: Is there anything creatively, spiritually, or emotionally satisfying about architecture or being an architect?
Randy Shoup: Yes. This is where, again, I said I’m a multidisciplinary person at my core. I’m not just interested in the computer science side. I’m not just interested in the international side. We contain multitudes. And so the thing that I love about being an architect is being able to play, again, on both sides.
The thing that really resonates with me is I tend to be more of a deductive reasoner, rather than inductive. So what does that mean? Deductive is you have a set of principles and you apply them. My sense is more people are inductive where you look at a bunch of examples and then derive from there. I like to do both, but my go-to model is to have a model, if that makes any sense. I like to think in terms of… People see this if we look at my talks back almost 20 years, I like to state, “Here are the principles. We should split things up. We should be asynchronous. We should deal with failure.” So I like to take those principles and have a clear, almost platonic statement about what the principles are, and then apply them in the real world.
And then maybe orthogonally to that, I didn’t have this word when I was younger, but I’ve always tried to be a system thinker. I get enjoyment, and value, and, I don’t know, spiritual energy, I guess, from really seeing the whole board and then being able to do interesting things within that.
Michael Stiefel: Thinking about what you said, and I’ve done a lot of teaching, and what I have found is, certainly, what you say is true that most people proceed from the concrete to the abstract, rather than from the abstract to the concrete.
Randy Shoup: Right.
Michael Stiefel: But I think there’s a difference between giving a talk, as you mentioned, and teaching where you want to start from the principles, as opposed to learning where you want to start with the concrete example. Because very often, the abstract principles seem too vague.
Randy Shoup: Yes.
Michael Stiefel: Because presumably, when you give your talk, you state the principles, but then you explain them with concrete examples.
Randy Shoup: Yes. You’re not implying this, but I want to say there’s nothing wrong with either model. A, I use both. And so it’s not like, “Oh, you can only be an architect if you do deductive reasoning and thinking principles first”. I’m just saying you asked a great question, which is, “What resonates with you, Randy, at a deeper level about architecture?” And what resonates with me is this idea of coming with principles and then applying them. So, that’s what resonates with me personally. If you only did that, you would not be a very effective architect. And if you only did it the other way, if you only did inductive reasoning where you only looked at examples and then abstracted from there, you would not be very effective either. Both techniques are important, and I use them both all the time.
Michael Stiefel: So, what turns you off about architecture or being an architect?
Randy Shoup: When an architect behaves in an ivory tower way. Again, lots of people get excited about lots of different things, and that’s great. Again, it’s great that we have a diversity of people and approaches in this world. I do not like not being useful. And when I say useful, I mean… A lot of us do. I have the skill and capability to pontificate. I could do that, I don’t want to. Again, boardroom to engine room, I would much rather work it so that things actually matter. I’m not interested in producing documents for document’s sake. I’m interested in changing the world, or at least the company.
Michael Stiefel: In a science fiction world, you’d like to step into the UML diagram, into the box, and see what’s in the box.
Randy Shoup: Yes. And the purpose of diagramming, and the purpose of saying things and principles, and the purpose of doing architecture at all is to solve customer and business problems. That’s what we’re here for, and it is worse… You’re not implying otherwise. It is worse than useless for some fancy, well-paid person to pontificate about stuff and not have that very directly be connected to solving a business problem we couldn’t before, solving a customer problem we couldn’t before.
Michael Stiefel: Do you have any favorite technologies?
Randy Shoup: I do. I very much, obviously, am going to date myself. So, again, as I mentioned, no hiding, I started my career… graduated in 1990, started doing my internships a couple of years before that. I’ve always loved SQL. My starting thing was doing Oracle database related stuff, again, in my internship at Intel for a couple of years, and then I went to work for Oracle for seven years. So, nothing about Oracle database in particular, although it’s always been really good. But SQL, I think it’s just we have yet in the data world, and this is not a bad thing. We are still using 1970s relational algebra, whether we know it or not, doing data systems. So I think that’s wonderful.
Again, dating myself in terms of when I was last, and it was a while ago, hands-on keyboard as my primary job, but I’m really good at Java and C++, so I did a bunch of stuff there. Not so much technologies, but again, patterns and approaches, particularly at large scale. And don’t do it if you don’t need it, but a services or microservices approach, event-driven architecture, those are things that really solve real problems and I use all the time.
Michael Stiefel: What about architecture do you love?
Randy Shoup: I love finding an elegant solution to a problem. Actually, we just had this the other day. It would be too long to explain the details. Not that they’re secret, but just the other day, one of my teams at Thrive was going down a path that would work but wouldn’t be right. And so, going in and doing a little bit of Socratic method of, “Okay, let’s restate the customer problem”, which in this case, “Hey, what does the ML team need to be able to run their models in real time?” And like, “Okay, let’s explain what you need. What do you have and what do you want back?” “Okay, now that we know that, hey, let’s think about what the interface should be on the next level down”, and like, “Okay, we have a bunch of options, but this one is more natural”.
Anyway, I love being able to see a problem and helping to reframe it in a way that makes it easier and maybe sometimes even obvious to solve. It’s such a leverage point. It’s such a force multiplier to be able to… not by myself, but help us all to see, “We think this feels hard. We’re going down this thing, we’re bushwhacking, but there’s actually a really easy, or straightforward, or natural way of approaching this problem if only we think about it differently”. And so, that’s what I get a lot of enjoyment out of.
Michael Stiefel: So, conversely, what about architecture do you hate?
Randy Shoup: I don’t like deferred gratification. So, I like it when, if we’re going to really put an architect hat on and do a big architect-y thing, whatever that even means, and we can do this, but I want to make sure that we get value now, as opposed to, “Hey, I sketched out this fancy architecture”. “Oh, yes, we’ll action on that in two years because it’ll take us all this time to do X, and Y, and Z”, and so the deferred gratification there. And then similarly for “big” changes, again, hate’s too strong a term, but a thing that is hard is dealing with what I’ll call the activation energy to get the organization to think in a new way, start doing in a new way, yes. So I wouldn’t say I hate it, but if I can reframe it, what do I struggle with and not enjoy, that’s it, yes.
Michael Stiefel: So, what profession, other than being an architect, would you like to attempt?
Randy Shoup: Yes, I think I already hinted at that in our intro, but I think I would not do this today, but the other career which I thought was going to be my mainline career was international law. And thankfully, I found another way. The other thing, which would always have been true but is even more true now, is gourmet chef. So, my personal creative outlet in my life and big enjoyment is I’m a big foodie on the eater side, and so, therefore, I have learned to be a foodie on the chef side. So, I know because my sister-in-law does this herself, but it’s hard to work in a real restaurant. That’s actual real work, very, very hard and uncompromising. But just from an enjoyment perspective, gourmet chef.
Michael Stiefel: Do you ever see yourself not being an architect anymore?
Randy Shoup: Yes and no. I think I will not ever stop trying to frame things in a different, and hopefully natural, and hopefully elegant way. So, from that perspective, that’s a core part of me, I can’t turn that off. Is it possible that someday I will… You know what? No, I don’t think I ever will, to be honest. Even if I, as many of my friends have done so far these days, shift into more an advisory mode with various companies and individual people coaching, or whatever, which I love and do as a side gig as well, I will never shy away from talking about the architecture stuff, if that makes any sense. So, yes, I guess I’ll never stop.
Michael Stiefel: When a project is done, what do you like to hear from the clients or your team?
Randy Shoup: Yes. First and foremost, why do we even do any of these things? It’s because it solved a real problem. So, first and foremost, I want to hear we had a problem, or we had an opportunity, and we solved the problem, or we executed on the opportunity. So that’s number one. At the end of the day, if we don’t make things better, we should do something else. We can make something else better because there’s lots of opportunities for improving. The other though for me is if I hear back, “Wow, that was really elegant how we approached that problem, that was really extensible. I”, other engineer, not Randy, “can now see how we can evolve the system in this way, and that way, and the other way”. So, I guess that’s the other thing that I would really want to see, and this is from the team. It’s like, “Oh, man, we approached this problem in a way that opens more doors than it closes”.
Michael Stiefel: I like that.
Randy Shoup: Yes.
Michael Stiefel: Well, I know there are more topics we could talk about. As we talked through ideas that came into my head, we could go down this path, but thank you very much for being on the podcast. You’re a great guest, and I hope we can do it again sometime.
Randy Shoup: Yes, me, too. Look, between us, we have many, many decades of experience, and just being able to share some of those ideas together is great. So, thanks for having me on. Happy to do it again. Loved it.