Copyright © All Rights Reserved. World of Software.
News

Do microservices’ benefits supersede their caveats? A conversation with Sam Newman

News Room
Published 16 June 2025 · Last updated 16 June 2025, 12:31 PM

Transcript

Olimpiu Pop: Hello everybody, I’m Olimpiu Pop, an InfoQ editor, and I have with us today Sam Newman, as people call him, the father of microservices or the person who coined the term. Without any further ado, Sam, can you provide us with an introduction to you and what you’re currently doing?

Sam Newman: I work as an independent consultant and trainer. I work with companies all over the world, doing different things, really. I don’t focus solely on the area of microservices, but rather typically work in helping teams change their architecture or improve what they derive from it. I’m also in the process of writing a book on making distributed systems more resilient. I also need to be clear: I did not coin the term ‘microservices’. I was in the room when the term was coined, but I’m sure it wasn’t me who said it.

A decade of microservices [01:13]

Olimpiu Pop: Okay, great. Thank you for clarifying that. Nevertheless, whenever I think about microservices, I think about the book that you wrote a decade ago. I was checking earlier, and yes, it has been 10 years since the first edition of the microservices book. Whenever I have to think about a system, I think about your book. And probably the main reason is that it’s a very focused book, it’s not very thick, and it has a lovely cover. And if I look at the chronology, you published the first book 10 years ago, then four years ago the second edition. What changed during that period?

Sam Newman: The second edition is a thicker book, so I think that needs to be clearly stated. I mean, firstly, the first edition and the second edition, the core concepts are the same. So the focus on an independently deployable style of service-oriented architecture focused around business capabilities, treating microservices like black boxes, being technology-agnostic, avoiding database integration, focusing on loose coupling, the importance of domain-driven design, those are all elements that flow across the two books. And in terms of the core ideas and the core principles, those haven’t changed, but a couple of things happened.

Firstly, some of the things in that first edition were speculative, as in we were still drawing on a smaller number of case studies. Although service-oriented architecture has been around since the mid-nineties and microservices are just a type of SOA, we were looking at organizations using service-oriented architecture in the microservices style, and there weren’t as many case studies to draw upon for that work, so I didn’t have as rich a pool of experience to draw from.

These were the days when we thought containers might be a good idea, perhaps, and so I mentioned them in there. There were some areas where we clarified how we want to handle random deployments, for example. We also had a lot more case studies out there, and a lot more time spent talking to people, talking to companies, working with companies, seeing these ideas put into practice, and realising that a lot of what I thought was quite apparent was too implicit in the book. For example, in the first edition of the book, I discuss how to break apart your user interfaces. I say you should do that, and I give a bit of an outline, but I assumed people would do that work. And they didn’t. So in the second edition I wanted to firm that up.

Some old ideas just resonated more with me the more I sat with them. The importance of information hiding just kept rattling around in my head. I recall that Martin Fowler helped review the first edition of the book and mentioned it to me. I mentioned it in passing in the first edition, but the more I thought about it, the more I felt it was key to the discussion. And so I wanted to provide more explicit advice, draw upon additional case studies, and, to some extent, cover new technologies, while trying to keep the book timeless in that sense. I also wanted to introduce some firmer mental models for thinking about the practical modularity of microservices.

For me, the chapter I’m happiest with is the second chapter of that book, where I’ve done the best job I’ve ever done of bringing together the ideas of microservices, domain-driven design, coupling, cohesion, and information hiding. It all makes sense, for me anyway. And so that chapter alone was worth a second edition, even though the whole book ends up being twice as big. So the principles are the same, hopefully clearer, more explicit, and with more case studies. That is the main difference. And the second edition has been translated into a couple more languages than the first, so it will likely reach a few more people.

Olimpiu Pop: Nice. Thank you for that summary. Let me summarise the main points. You started with something that was initially very conceptual; you were just imagining how things might look in a distributed ecosystem and how they come together. Then people started using it, you came back with case studies, and the second edition became more concrete, extracting the key points from those.

Sam Newman: Yes, though maybe not conceptual so much as speculative in some areas. For example, the container technology was speculative because there were very few people playing around with things like LXC when I mentioned it in the book. Those were early days; it was pre-Docker when I wrote that edition, so some of that material is speculative. The testing chapter was speculative too. I had reached the point where I realised that end-to-end testing was highly problematic for systems with a much larger scope or degree of distribution. That chapter ended up being quite controversial in some circles because I suggested that you should probably stop doing end-to-end tests for larger-scope systems when you have more than two or three services; one team, fine.

That got a lot of pushback, certainly inside Thoughtworks at the time, and some people were furious at me for saying you shouldn’t do end-to-end testing. Additionally, numerous case studies became available. The two primary case studies that I drew on extensively were Netflix (Ben Christensen, who looked after performance and resiliency on their API side, was kind enough to review the book) and REA in Australia. And by working with more clients myself, you get to enrich that experience by hearing people’s stories.

A lot of the book is around trade-offs, and so I can say, here are some trade-offs, generically, about a particular idea or a certain way of doing something, but the power of a case study is that it’s a real story; it’s something that somebody did. I can discuss the trade-offs they made, what worked for them, and what didn’t. And I’ve realised, over the last 15 years or so, that being able to tell real stories is essential because they stick in people’s heads in a way that contrived examples or theoretical things don’t. And so again, I wanted more space for those stories to help them stick. And I’m incredibly fortunate that lots of people were very generous with their time in terms of sharing their stories. And this is just a call-out for anyone listening to this podcast: conferences need more case studies. If you’ve had a fun story at work, consider sharing it in a talk, as everyone can learn from your experiences.

Microservices are more about team autonomy than anything else [07:28]

Olimpiu Pop: Thank you. So, while listening to you, one thing was going round and round in my head. During this decade, I had been thinking: yes, wow, 10 years ago we didn’t have Docker, and now Docker is already moving out of the limelight while others are emerging. And then you mentioned database integration, and that’s a clear anti-pattern from multiple perspectives. What makes a microservice, to put it in a short definition?

Sam Newman: Well, okay, let’s talk generically about what a service is in a service-oriented architecture: it’s a program running on a computer which communicates with other services via a network. So a service is something that exposes interfaces over the network and talks to other things over the network. That’s a service, generically. A microservice is one of those which is independently deployable, so I can make a change to it and roll out new versions of it without having to change any other part of my system.

So things like avoiding shared databases are really about achieving that independent deployability. And it’s a really simple idea that can be quite easy to implement if you know about it from the beginning. It can be difficult to implement if you’re already in a tangled mess. And that idea of independent deployability has interesting benefits: it is obviously useful because it gives you low-impact releases, but there are loads of other benefits that start to flow from that. And so it’s a type of service which is independently deployable. That is a microservice.

Olimpiu Pop: Okay, thank you. Are there also organizational benefits? Because it seems that most people, when they speak about microservices, also bring Conway’s Law into the discussion, and the way the structure of their enterprises is reflected in the architecture of their systems. So what do you think? Does it have an impact?

Sam Newman: I think it’s almost a truism. It is an observation made back in the late 1960s which we have found to be true on so many occasions: the architecture and the organization end up matching, and typically it’s the organization that drives the architectural shape. Every now and then it works in reverse, but that’s rare. As far as I’m concerned, it’s as close to a fact as we get in computer science, even though we lack hard evidence for it, as we do for many of our ‘facts’.

So then the question is, what do microservices have to do with that? Well, the reality is that microservices are typically focused on the business domain, and we know our architecture follows our organizational structure. There’s been a shift over the last 10 years towards more product-oriented organizations, where we expect a delivery team to be more aligned with the business organization, and the microservices architecture is ready and waiting for that shift. So, it’s more about companies that want to move towards a more product-driven approach, with teams structured around business outcomes and more closely aligned to the business. That’s happening.

We’ve also got organizations recognizing the importance of teams having a higher degree of autonomy, because they’re recognizing that if they just add more people but keep the bureaucracy, they can sometimes go slower. So they want more autonomous teams to get more work done with the people they have. These are all things that are happening, and microservices, as an architectural style, are ready to help them happen. So, certainly with the enterprise organizations and the scale-ups I’ve worked with, the single most common reason for using microservices is to enable a specific style of organization.

The vast majority of people who tell me they have scaling issues often don’t have them. They could solve their scaling issues with a monolith, no problem at all, and it would be a more straightforward solution. They typically have organizational scale issues. And so, for me, what the world needs from our IT is product-focused, outcome-oriented, and more autonomous teams. That’s what we need, and microservices are an enabler for that. Team Topologies helped too: the DevOps topologies work was happening around the time of the first edition of my book, and Matthew Skelton and Manuel Pais moved it into the Team Topologies space around the second edition, which helped crystallize a lot of those concepts as well.

Olimpiu Pop: Thank you. I’ve heard a lot about the points that are important these days: product thinking, mainly because engineers tend to focus on technology, which misses the point altogether. This also aligns with domain-driven design and autonomy. So what you said, basically, if I got it correctly, is that microservices as an architectural style allow us to be more autonomous, deliver value for the organization, and make it leaner, without all the fluff around it.

Is the cost of microservices worth their benefits? [12:27]

Sam Newman: Yes. And that’s the case. The second question is: is the cost of microservices worth those benefits, and could we achieve some of those benefits without incurring that cost? What’s very healthy is that some of the concerns around the cost and complexity of microservices have led people to ask, what about this new thing called a modular monolith? Which is wonderful, because it is again a concept from the 1970s, or earlier than that: modules. Can you achieve the same kind of organizational autonomy with a modular monolith as you can with microservices? No. However, you might get enough of it while reducing the downsides.

I’m glad we’re now having a healthier debate around these issues because, in many ways, the software industry is still remarkably poor at modularity in general. Microservices, when done right, are a modular architecture where a decision has been made to put a module over a network boundary. For me, it’s a relief that we’re now having these conversations. I think the big problem is we have a lot of organizations, and this is not new, that just assume they have to use microservices. That there is no debate. That’s the architecture we’re picking. It’s the default state, and that a monolithic architecture must be bad, and that’s something I pushed against in the first edition, and that’s something I pushed against even more strongly in the second edition.

While I can discuss all the benefits that microservices bring, I also describe them as an architecture of last resort because they introduce a significant amount of complexity when building a more distributed system. And so if I can work with a company and help them solve a problem with a nice, simple, modular monolith, that’s absolutely where we’re going to start, because we don’t need everybody to be using microservices. Many people are using them who would be better off without them.

When not to implement a microservice? [14:13]

Olimpiu Pop: Well, I didn’t want to ask this question, but somehow the way you responded prompts me to, so I have to ask it: in what scenario should you not use microservices, even though many people are wrongly adopting them?

Sam Newman: The simple answer is when you don’t have a good reason to. Now, that sounds silly, but what I mean is that one of the first things I do with people is ask, ‘Why are you using microservices?’ And they can’t give me an answer, or the answers they give are all quite inward-looking. They say things like reuse. That’s not an answer, because reuse is a second-order benefit. Why are you doing this? If you are choosing an architectural style or transitioning to an alternative one, I think it’s essential that you understand the reasons behind it. What outcome are you reaching for? Do you know why you’re doing this?

Because then, when you get down to it: okay, we want to do this because we need our application to handle the load we’ve got, and that’s why I pick microservices. Okay, great. Did you look at a simpler way to solve that problem? Did you consider getting a bigger machine? There’s nothing wrong with a bigger box. Did you consider running two copies of your monolith behind a load balancer? Did you try that? That could have been a good first idea. So for me, it comes back to the goal. It comes back to what outcome you expect the architecture to bring. And if you don’t have a good reason, you shouldn’t be doing it.

Now, additionally, there are certain scenarios where microservices tend to be an even more suboptimal choice. Of course, all of this advice is general; there are always exceptions, and too many people think they’re the exception, but there are certain scenarios where microservices are an even worse idea. In the case of early-stage startups, it’s a waste of time. You haven’t got enough time or enough people, and you don’t know if anyone wants to use your product, so put your time and energy into that. Don’t put your time into building a scalable system that’s going to handle your success if you’ve got no idea whether you’re going to be successful. So don’t waste your time there.

The second area is where the domain is new. Determining module boundaries can be challenging if you don’t understand the domain. If the domain is very new or still emerging, consider sticking with a modular monolith early on. The third area, and this is increasingly a niche, is if you are shipping software to a customer who will install and manage it themselves. Microservices are a nightmare there in general, so avoid them. Those are the areas where I say, look, you have to do every single thing you possibly can to convince me this is a good idea. But for me, and this is what I do when I sit down with teams, it’s always: what is the outcome?

There are gradients of microservices architectures, and they shouldn’t be binary [16:40]

And we talk about microservices like it’s a switch: you’re either doing it or you’re not. It’s more like an extent, a degree, a dial you’re turning. You can go full microservices, or you can say, ‘We’ve got one or two’. There’s a lot of spectrum there, right? Look at a company like Monzo, where they’ve probably got 10 microservices per developer, one of the more extreme ratios I’ve seen, while other organizations have 10 developers per microservice. That is an ocean of difference in terms of the complexity and the challenges you are facing.

So for me, it’s always about being very focused on that outcome. What are the things that we need to achieve for the business? Have those things we need to accomplish for the company changed? Does that mean we need to change our approach? Rather than saying somebody laid down a decision about an architecture we were going to pick and we’re just sticking within those tram lines, no, no, no. It’s a constant reevaluation of what we are doing to achieve the outcomes. So maybe you don’t need to keep turning that microservices dial up, and perhaps sometimes the answer is to turn that microservices dial back down. Swinging wildly between two extremes is unhealthy.

Olimpiu Pop: So, to squeeze that in a phrase, what you’re saying is that it’s more about the journey rather than the destination. Depending on the state of your business and what your target is as a business, you might go more microservices or fewer microservices depending on the context.

Sam Newman: Yes, I like that phrase, the journey, because let’s be clear: if you view architecture as a destination, you’re starting with a fundamental flaw in your thinking, which is that architecture is fixed. If you view architecture as a destination we’ve arrived at, that implies it’s static and won’t change. And that is not the right way to think about architecture. Sure, architecture can be hard to change, but you should constantly be changing it. Architecture is always a journey.

So for me, that business outcome, what the users need, what my business needs, that’s my north star constantly, and we’re always weaving towards it. What do we do next to get there? Having a vision for where you think you might be in 12 months, 18 months, two years is absolutely fine, no problem at all. Recognize, of course, that you might be wrong about that. Architecture is always a journey. It is constantly changing. There should always be some sense of flux, and if there isn’t, you’re likely storing up pain for your future.

Start with a monolith and evolve towards what’s needed for your organisation [19:09]

Olimpiu Pop: So the bad news for the hardcore techie guys is that it’s more business-focused. We need to examine the business metrics and determine what we want to achieve with our software system. And because we touched on that as well, we need to look more into evolutionary architectures and the way they evolve over time. And that begs a quote from Neal Ford, which I also heard at QCon in Luca Mezzalira’s talk, where he says that reusability is also a way of tightly coupling our systems. So can we say that a potential journey, a potential path, towards a goal can start simple, with a monolith, with a proper separation of concerns in terms of modules, and then, depending on the context, you can break those modules out as microservices if you need to scale those parts?

Sam Newman: Absolutely. And my advice has been, for over a decade now, to start with a monolith first. If I were starting a project today, it would be a monolith. If I were starting a startup today, it would be a monolith. I need a reason; I have to be convinced that now is the time to split off a service, because running and operating a more distributed system incurs costs. Another analogy I’ve used for this is medicine. When you go to the doctor with something wrong with you, they might give you some medicine. “Oh, my leg hurts”. “Okay, well, I’m going to give you a pill, and that pill is going to stop your leg hurting”. Great, I’ve solved my leg hurting. And then the doctor says, “By the way, there are some side effects. There’s a one in 20 chance your leg might fall off”. And then you get to decide: how bad is the pain in my leg? Do I want a one in 20 chance of my leg falling off?

I don’t think there’s any drug that would do that to you, but there is this idea with medication that you have side effects. I think the issue is that because we’re not thinking about the outcome, we don’t have a sense of the benefit, so we don’t know about the leg pain, but we’re also not always being rational about the side effects. And when you don’t have clarity around those two forces, it becomes challenging to know whether I should take that pill or not. Which is also why I give strong advice to people who are thinking of adopting microservices: if you think it might be a good idea, but you’re not sure, there’s nothing wrong with starting small. There’s nothing wrong with doing one. You’ve got a monolithic application right now. What if you created one microservice, built using the right microservices principles, got it into production, got it being used, and learned from that process?

Actually, do it in anger. Because the other side to this, which is where it becomes challenging, is that I can talk theoretically about the challenges that microservices might bring, but I can’t predict exactly which one you’re going to hit on which day. Some problems you might not experience for years, some you might hit in week two; there are so many different variables. So if microservices might be good, start with one, get it out there, see what happens, and then learn from that.

I think the challenge, often in enterprise organizations especially, is that they say things like: we are going to do a microservice transformation, and I’ve told my boss we’re doing a microservice transformation, and I have got a lot of money to do a microservice transformation. Sam, please come here and talk to us about our microservice transformation. And I get there and say, “I don’t think you should use microservices”. And they’re like, “But I told my boss we’re doing microservices”. Then we’re trapped. I have gone into more than one client where I have been used as air cover to explain why they’re reversing course on the microservice transformation. Screw microservice transformations. Don’t talk about that. Talk about an outcome-driven approach.

So a couple of my clients say: our transformation is that we want to move to more autonomous teams. That’s the outcome. We’re moving towards more autonomous teams, and that’s our north star. What do we need to do that? That gives you a lot more air cover. More autonomous teams could mean microservices in some places, where it’s appropriate. In other places, it could mean reducing the need for tickets and making things more self-service. You then give yourself wiggle room, because if your north star is scaling the system or more autonomous teams, you can try microservices, and if they don’t work out, you can try something else.

You don’t have to go back to your boss and say, “We screwed up, we made a mistake. We’re not doing microservices anymore”, so you don’t have to burn that political capital. If you structure these things as outcome-focused, you give yourself so much more wiggle room to try things, to change your mind, to learn and adapt. And I know full well I’m doing myself out of work here. Please hire me for your outcome-driven initiatives. However, I dislike it when people say “it’s a microservice transformation”, because that’s like saying we’re implementing an enterprise service bus.

Olimpiu Pop: Who cares?

Sam Newman: Who cares? Right.

Architectural decisions should be driven by business-focused outcomes, not technology [23:45]

Olimpiu Pop: Exactly. Thank you. That’s nice, and that was my feeling at QCon as well. Before I attended QCon, I was thrilled when people were talking about user experience. You have these hardcore guys who, not long ago, used to be site reliability engineers and were thinking only in terms of boxes. Now, all of a sudden, they are talking about the user experience of that user, or that group of users, at that percentile. And I was like, wow, it’s actually happening, people are thinking about outcomes.

Well, it was a small part of the presentations, but hey, I’m thrilled to hear that people are finally getting to that point. Given that we spoke about two of your previous books, let’s talk about the current one. What was interesting was that you had three points that you focused on during your presentation at QCon: retries, timeouts and idempotency. And that was nice because, for me, especially with this generative AI hype, and whatever wave of hype follows, because it’s like a tsunami, one isn’t over before another one starts, it’s about getting back to the basics. It’s essential to understand the basics and to build on them again.

And the way you looked at it was very eye-opening for me: again, outcome-driven, focused on real data. I’m thinking now especially of the retries, and, for timeouts, not the number of them but the time span you use to configure your timeout. But what’s stuck in my head as a question is: what happens with data consistency, for instance, or all the other things that usually happen in a distributed ecosystem? Because in the end, data, well, people say it’s the new oil, the new gold, whatever, is what drives the application. What are the things you will touch on in your book that we still have to focus on? Can we come up with some heuristics on how to decide those things, leaving technology aside?

Sam Newman: The problem is that a lot of the discussion around the complexity of distributed systems is either ‘use my magical framework and I’ll solve all your problems; you don’t ever have to worry about there being a network’, which is fundamentally a lie. Hello, things like Dapr: oh, I’ll pretend the network doesn’t exist. It’s like, it does. Or, at the other extreme: let me do a whole hour-long presentation about Byzantine consensus protocols, and it’s like, just kill me now. This is going to break my brain.

And I think we get oversimplification in some cases and overcomplexity in others. The reason I say oversimplification is that a lot of our technical solutions try to hide the reality from us, to the extent that we have lots of developers who, through no fault of their own, are living in a world where they don’t understand what can go wrong. But at the same time, that doesn’t mean they should have to understand the nuances of the CAP theorem, or be able to debate harvest and yield, or talk about the nuances of Raft versus Paxos or anything like that. Use Raft, obviously.

So for me, I think about a persona for the book. The reason the second edition of Building Microservices is so thick is that the book is mainly about the pain of microservices: if you’re going to do this, here are all the things that could go wrong, and here’s how you handle them. That’s why the book’s so thick. In that spirit, there are lots more people building distributed systems than in the past, and the degree of distribution is increasing. So, if I were a developer who suddenly found myself on a project with a distributed system, what would I need to know? I want that book to be a journey: you start with the fundamentals, and I take you through.

Yes, we are going to talk about more complicated things. I’m just finishing up the chapter where we talk about why exactly-once delivery is potentially a fool’s errand and all that kind of stuff, and how message brokers work. The persona was my son: relatively new, working as a developer on a project where they’re doing all this stuff, thinking, what the hell? What’s going on? What is the rubber ring, in book form, that I can give to all these drowning developers being swamped by this challenge, this complexity?

What generates the pain of distributed systems? [27:46]

So I start at the beginning, and the book is in two parts. Part of it is about the human side, human factors, and I frame resiliency very much in the context of a socio-technical system, which is our software, people and technology working together. If you ignore that, you’re in trouble. So I discuss things like safety management, resilience engineering, and the differences between those concepts and everything else. But talking about the simple heuristics, the simple stuff: what is it about a distributed system that hurts? What is it, fundamentally, about a distributed system that causes all of this pain?

It’s three things. It used to be two, and I found out a third thing. One, it takes time for stuff to go from A to B, and that’s only partly under your control. Two, sometimes the thing you want to talk to isn’t there, and that’s only sometimes under your control. And three, resource pools are not infinite. Fundamentally, if you always have those things in the back of your mind, all the rest of the stuff we talk about is just a logical extension from that. So recognizing that those things exist, and then the question is, what can you do about it? So for me, it’s like I want to take you on that journey.

I don't start off talking about things like eventual consistency in that book necessarily, because that's a logical extension of those ideas. And so in my talk it's timeouts, retries, idempotency, because that's effectively the first two chapters of the technical side of the book: okay, if it takes time for things to go from A to B, and sometimes the thing you want to talk to isn't there, you need to give up. So let's talk about how you give up, which is timeouts. If you're going to give up, we need to talk about trying again, so let's talk about retries. If you're going to retry, you need to make retries safe. So let's talk about idempotency.
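The give-up, try-again, retry-safely progression Newman describes can be sketched in a few lines. This is an illustrative sketch only, not code from his book: the `send` transport function, the `Unavailable` error, and the server-side deduplication that the idempotency key implies are all assumptions for the example.

```python
import time
import uuid

class Unavailable(Exception):
    """Raised when the thing you want to talk to isn't there."""

def call_with_retries(send, payload, timeout_s=1.0, max_attempts=3, backoff_s=0.1):
    """Timeouts (give up), retries (try again), idempotency (retry safely).
    `send` is a hypothetical transport function supplied by the caller."""
    # One idempotency key shared by every attempt: the server can
    # deduplicate, so a retry after a lost response isn't applied twice.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload, idempotency_key=key, timeout=timeout_s)
        except Unavailable:
            if attempt == max_attempts:
                raise
            # Back off between attempts so retries don't hammer the service.
            time.sleep(backoff_s * attempt)
```

Note the ordering: the timeout bounds each attempt, the retry policy wraps the timeout, and the idempotency key is what makes the retry safe to issue at all.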

So I always try to write my books in a way that they're topic-focused. You could read one chapter and it makes sense, but I'm also trying to take you on that journey. Then we move on to things like rate limiting and load shedding: how do you handle having too much load? And then it gets more and more complex in that section. But I think it can be boiled down to just those three rules. It takes time for stuff to go from A to B, and it's not instant. Sometimes the thing you want to talk to isn't there. And by the way, the resource pool's not infinite. That's it. Really, that's it. And then this book is all about: what does that mean? What are you going to do about that?
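A token bucket is one common way to implement the rate limiting and load shedding Newman mentions, a direct answer to his third rule that resource pools are not infinite. A minimal sketch, with illustrative capacity and refill numbers:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: when the bucket is empty the
    request is shed immediately rather than queued, protecting the
    finite resource pool behind it."""

    def __init__(self, capacity, refill_per_s):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_s = refill_per_s
        self.last = time.monotonic()

    def allow(self):
        # Top the bucket up in proportion to the time elapsed.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed the load instead of waiting
```

Shedding (returning `False` fast) rather than queueing is the point: a queue just converts overload into the long waits the timeout discussion warns about.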

Olimpiu Pop: So what I hear from you is we need to make very simple systems, well, simple in the way we interact with them and making things simple is hard. And pretty much that’s it.

It should be easy to understand your system [30:09]

Sam Newman: Well, there's a more general point here, which is that we need to make it easy for us to understand what our system is doing. Whether you've got a complex system or a simple system doesn't really matter, because ultimately, if a system is simple technically but hard to understand, then it's a complicated system as far as I'm concerned. So I am talking about observability from a resiliency standpoint throughout the book. But my starting point is really more fundamental than that, which is the timeout. Oh, okay, I can do timeouts: I use Resilience4J, or I use Polly, or I use Dapr, and it's going to do timeouts for me.

What do you mean by timeouts? What does that mean? Oh, it handles timeouts. Okay, so does it magically know how long to wait? Does it know if waiting too long is going to cause resource saturation and cause your system to crash? Probably not. Does it know if waiting too long is going to cause your customer to be annoyed? No. If you retry too quickly, you hammer the service. So you could have a tool that does all that, but you need to know how to configure it. You need to take the expectations of your users, understand the resource saturation, because the resource pool is not infinite, and use that information, which you have to gather as a human being, to look at these great tools and know how to use them well.

Resilience4J is brilliant, Polly is great, and a lot of the Polly stuff is now in the latest versions of .NET. Great stuff in there, but it's a toolbox, and as a developer it's like, which bit do I pick up now? And I don't think it is that hard if you bring it down to those fundamentals. I think that firm foundation serves people well. But there's also a part of this, which is the stuff we've talked about in terms of my talk: if I need to know how to set a timeout, I probably need to know how to look at a histogram. I need to look at my system's behavior and understand what it's doing. And so, for me, that comes back to the fact that we need our systems, whatever they are, to be as easy to understand as possible. And that's why the observability stuff that I'm weaving in at the moment is going to be such a big part of the book.

Olimpiu Pop: What I was thinking is that, as I said, it sounds very simple: you have a histogram, and then you pretty much focus on the middle ground, more or less, and then you take some decisions. You have the requests that are very fast, you have the ones that are very slow, and then you just focus on the majority.

Sam Newman: Yes, but… The “but” is big, right? Because the key thing is you focus on the middle majority in terms of that’s where you start. And so the old classic thing is I might have tail latencies and maybe initially when I’m getting myself sorted, I might potentially set a timeout before those tail latencies. And sometimes that’s the right thing to do from a system health perspective. Once you’ve sorted your system out and your system isn’t falling over, then you should probably work out why you’ve got tail latencies, because the tail latencies are typically users.
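Picking a timeout below the tail is, mechanically, just reading a percentile off the latency histogram. A small nearest-rank sketch, with invented numbers, shows the idea:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of observed latencies."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# Invented latency samples in milliseconds, with two tail outliers.
latencies_ms = [40, 42, 45, 47, 50, 52, 55, 60, 900, 1200]

p80 = percentile(latencies_ms, 80)  # a timeout near here cuts off the tail
p99 = percentile(latencies_ms, 99)  # a timeout near here accommodates the tail
```

As Newman says, timing out ahead of the tail may be the right call for system health early on, but the requests beyond the cutoff are typically real users, so the tail still needs investigating rather than just being amputated.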

So, for me, there's going to be a discussion of tail latencies in the book: why they're there, what you should do, and why you should worry about them. But early on, if trying to accommodate tail latencies risks the health of the system, then yes, time out before those tail latencies, because those tail latencies typically are humans, right? That's why it's a journey you go on. I mean, you don't talk about request hedging in chapter one of a book like this; actually, most people don't do request hedging anyway. And developers who know this stuff can jump into the middle chapters, where it's more complicated and advanced, if they want to.

Ensure a good quality service to most of your users [33:33]

Olimpiu Pop: Yes, that's what I was thinking of: tail latency, more often than not, is more or less rotation-based. It's a round-robin. It's not always the same individual who is unlucky enough to get a timeout. How should we envision that? If I'm really unlucky and I'm always in the tail latency, where should the decision be taken? Because we spoke about outcomes, and obviously when you're discussing at the business level, you're discussing at a broader level. Do you have any thoughts on this?

Sam Newman: So if you are seeing a certain user or group of users that are consistently experiencing poor latency, then the first thing is, can I get that information out from my observability tool? And this is why observability platforms that support quite high cardinality can be useful because then I can associate a username or a user cohort identifier with those metrics and I can get that analysis out. So that’s why I like things like Honeycomb for that purpose.
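The cohort analysis this enables can be sketched as a simple grouping over span data. The event shape here (`cohort`, `duration_ms`) is an assumption about what a high-cardinality query might return, not Honeycomb's actual API:

```python
from collections import defaultdict
from statistics import median

def slow_cohorts(events, threshold_ms):
    """Group latencies by a user-cohort attribute and flag cohorts whose
    median latency exceeds the threshold. Each event is a dict like
    {"cohort": "sg", "duration_ms": 900} -- an invented shape standing in
    for what a high-cardinality observability query could give you."""
    by_cohort = defaultdict(list)
    for event in events:
        by_cohort[event["cohort"]].append(event["duration_ms"])
    return {cohort: median(durations)
            for cohort, durations in by_cohort.items()
            if median(durations) > threshold_ms}
```

The grouping only works because a cohort identifier was attached to the telemetry in the first place, which is exactly the high-cardinality point being made.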

So, say I'm seeing a user group that is consistently having higher latency. Well, then the next thing is: what information do we have about that user group that might help me isolate what's going on? Sometimes the issue is we don't have the data, so we need to add it in. But an example of where you might have high latency could be that you've got a sharded system and it's a problem with one of the shards, right? That's a pretty obvious one. And again, that's why the drill-down is essential. Looking at the average response time of a database cluster is useful, but you also want to be able to drill down and look at an individual shard.

The second reason could be location-based. Is it a part of the world you're serving that's not going so well? Could there be some caching that could help? Is it just that they're in Singapore and you happen to be in Iceland, and that's always going to suck, and maybe you say that's kind of life? I've had issues before with specific ISPs that were causing problems with their cookies, resulting in excessive redirection flows. The key thing, though, is that firstly you're spotting it. Is it a user group that is consistently seeing a poor experience? In which case, in some ways that's healthy, because then you can add the data to find out why.

If it's inconsistent and you're just seeing outliers, then for me it still comes back to looking at those outliers. And that's why, again, you want an observability tool that gives you the appropriate control over how you're doing your sampling. So, for example, make sure that you're doing tail sampling, where you decide at the end of capturing a trace whether you're going to keep it or not. That way, for anything that's going outside of your norms in terms of how long things are taking, you are capturing those traces. You can gain a lot of information.
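In practice this decision usually lives in something like the OpenTelemetry Collector's tail-sampling processor, but the shape of the rule (keep errors, keep slow outliers, keep a small random sample of everything else) is simple enough to sketch. The trace dict fields and the default rate below are illustrative assumptions:

```python
import random

def keep_trace(trace, slow_threshold_ms, base_rate=0.01):
    """Tail-sampling decision, made *after* the trace is complete.
    `trace` is a dict such as {"duration_ms": 2000, "error": False};
    the field names are invented for the example."""
    if trace.get("error"):
        return True          # always keep failures
    if trace["duration_ms"] > slow_threshold_ms:
        return True          # always keep the interesting outliers
    return random.random() < base_rate  # small sample of normal traffic
```

The point of deciding at the end, rather than at the start (head sampling), is that "this trace was slow" is only knowable once it has finished.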

Sometimes you see tail latencies because a request was just unfortunate enough to have a whole lot of cache misses along the way. That can be the reason sometimes. You can also have tail latencies where, say in a JVM-based system, you hit a node that was just spinning up, and it took a bit longer to respond. That happens. But again, if you're capturing the traces, you can dive in and find out what's happening. As you get more and more data, you might see more and more humps. So if you see clear patterns, we've got a nice bell curve, and then there's a big gap, and then there's quite a cluster, well, that tells me there's a bunch of stuff here that could be really, really interesting. And sometimes you get a hump on the other side as well, which is like: these were fast. What's going on there? We should double-check what happened.

And for me, and this is the key to observability as a concept, I need to have the data available to ask questions that I didn't know I was going to ask. So can I explore? Can I gather? Can I tease these things out? If I don't have the necessary data, can I add it into the system to start understanding what's going on? It's also about having a toolchain which allows you to build the views that help you get those insights quickly. And that's where the cold, hard "oh, we support OpenTelemetry" has to give way to "well, actually, what does that tool let you do with it?" For me it is that: can I explore what's going on here effectively enough?

By implementing observability on day one, you gain valuable insights into your system [37:35]

Olimpiu Pop: Thank you. It seems that observability is one of the key points here, having a lot of data and understanding of your system, but obviously, observability is not precisely the first thing that people think about when onboarding on a project or even for more mature projects, observability is hard to achieve. How would you recommend getting started with a new project to have that insight about your system and to be able to make those decisions?

Sam Newman: I'd start on day one with some fundamental things. I said this in both editions of my book: the very first thing I would ever put into a system is log aggregation. That's often a prerequisite. I'd follow that up by saying that you start from the very beginning using OpenTelemetry for capturing your metric and trace information. So, get an OTel SDK for your language, put it into your code, and use it from the very beginning for capturing trace information. With OpenTelemetry, you can have a variety of different vendors, open source and non-open source, and you can always change your mind later; get that in early.

For log aggregation, I still suggest using a separate system. Although theoretically OpenTelemetry supports logs, I think that in practice there isn't the degree of maturity we would like around OpenTelemetry and logging. It's less about the protocol and more about the tool vendors, if that makes sense. So that's definitely where I'd go. Logging can also be pretty helpful if you've decided to pick an observability backend that only handles low cardinality; again, choose one that can support high cardinality, hello Honeycomb, but if you're going low cardinality, your logs are even more critical.

So do those things early. Log aggregation, definitely day one: get every service logging in a consistent format. You can use log-forwarding agents to reformat logs, but don't, because it's costly to do that; ensure services log in the correct format in the first place. Start with structured logging rather than semi-structured. The world has moved on; structured logging makes your logs more queryable. That's day one. And from the beginning, again, OpenTelemetry inside your services for instrumentation. At the very beginning, you might only use OpenTelemetry for the ingress and egress of a service endpoint, so that you can track start and end times, response times, things like that, and outbound calls. Later on, you can add OpenTelemetry instrumentation for things like database calls if you would like.
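Day-one structured logging can be as small as a JSON formatter. A sketch using only the Python standard library; the logger name and field names are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, so the log aggregator
    can query fields directly instead of parsing free text."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields attached by the caller. Emit identifiers like
        # an order id here, never PII such as a customer name or email.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"fields": {"order_id": "o-123"}})
```

Doing this in the service itself, rather than reformatting in a log-forwarding agent, is the cheaper option Newman recommends above.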

Around the same time, be really clear and upfront about what information you're happy to capture. A good starting point is to avoid including anything personally identifiable in a log or observability tool. Just don't emit it in the first place. Don't use scrubbers; just don't put it out. If you don't put it in a log and you don't put it in OpenTelemetry, you don't have to worry about where it goes. So, again, that is something I would sort out at the very beginning and establish as a principle. Those are the three things I would start with: log aggregation, OpenTelemetry, and no PII in either. Just do basic ingress and egress, and you can always add more afterwards once you've got that in place.

Olimpiu Pop: Thank you. So what I hear from you is: start with discipline in the way you do it across the board, to have queryable logs, and start with those, because otherwise it will be more complicated. Think upfront to make sure that you don't capture PII, to avoid headaches down the road. And after that, continue on the road to a more complete OpenTelemetry implementation, where you can instrument database calls and the like.

Sam Newman: I would also say you want everybody using the same pane of glass. What you don't want is five different teams all using OpenTelemetry, but each with their own OpenTelemetry collector. You need one collector, because the whole point is seeing the correlations in the traces. Get that scaffolding in early, and you can always add information later. You realize, oh, I wish we knew about X. Well, now we've got the basics, we can add that in for next time.

So yes, alongside that, here are some other general principles. You want the same toolchain in all your environments. In your CI environment, or in any environment where you can see what's going on, a QA environment, a dev environment, whatever it might be, I want the same log aggregation tool and the same observability tool. It doesn't have to be the same instance, but I want a developer who gets experience and exposure using those tools in a dev environment to then be helpful in a production environment, to be able to sit with an ops person, if they're not also the ops person, and understand what's going on.

So you want to pick a vendor whose pricing scheme allows you to have the same tool everywhere. The reason I’m being clear about this is because some vendors, the way they price, means that you end up in situations where an organization picks one tool for production, and they can’t use that tool in dev or QA. It might be a brilliant production tool, but if the pricing means you can’t use it in dev and QA, get rid of it, find something that lets you do that.

Ephemeral environments are an invaluable tool for your organization [42:26]

Olimpiu Pop: Thank you. You mentioned flux and outcome earlier in our conversation, and that pushes me to think about continuous deployment: getting in front of your users quickly to have that fast feedback loop. And now you mentioned different types of environments. There are a lot of movements nowadays about getting rid of the QA and staging environments and just going straight to production with continuous deployment. What do you think about these things? Are they useful on that route or not?

Sam Newman: So there's a difference between static environments and ephemeral environments. When I talk about my CI environment, what I typically mean is an environment where, as I'm doing a build of, let's say, a service, I'm deploying that service, spinning up some infrastructure temporarily to support it, running a couple of tests on it, and then it gets shut down again. So that environment isn't static; it's gone. But that doesn't mean I don't still want the logs from it, or the telemetry from it. So, for me, ephemeral environments make an awful lot of sense.

Having a lot of environments between you and production doesn't make a lot of sense to me. You have to understand why those things exist. What is the purpose of all these environments? Sometimes the environment itself can be the problem, but typically the environment is a symptom of an organizational issue. Environments represent Conway's Law as well. Oh, the integrated test environment is really painful, it keeps being flaky, and we always have people working in it; it's a real issue, so we need to address it. No, no. Why does that environment exist? What is the delivery process that requires it? Because yes, we could make some improvements to it, but you need to understand why it is there. So I think often our environment issues are just symptoms of an organizational or process issue.

Olimpiu Pop: That’s nice. So that means it definitely helps to have an environment where you can just test it, to just probe how your system will behave, but it’s pointless to have an environment that is most of the time unused. Ephemeral environments are a good middle ground to avoid this kind of waste from that perspective.

Sam Newman: A couple of other quick things on that. If you've got an organization with 10 teams all building microservices, and one massive system, a developer should not have to run all the microservices. A developer should only ever have to run the microservices they're working on. I shouldn't have to run a thousand microservices just to do my local testing. That's wrong; you've got to fix that. So you've got to get good at stubbing approaches. Stubbing, most of the time, not mocking; people mock way too much. That's one thing.
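A stub, in this sense, is just a stand-in that returns canned answers, unlike a mock, which also verifies how it was called. A sketch with an entirely hypothetical `Catalog` dependency:

```python
class StubCatalog:
    """Stub of a hypothetical downstream Catalog service: canned answers,
    no network, so a developer can run only the microservice they are
    actually working on."""

    def __init__(self, items):
        self._items = items

    def get_item(self, item_id):
        return self._items[item_id]

def price_with_tax(catalog, item_id, tax_rate=0.2):
    # The code under test only needs *a* catalog, not the real service.
    item = catalog.get_item(item_id)
    return round(item["price"] * (1 + tax_rate), 2)
```

The service names, item shape, and tax rate are all invented; the point is only that local testing substitutes the dependency rather than booting it.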

Another place where ephemeral environments can work well, for some of my clients, is as sales tools. In situations where you effectively have to do single-tenanted customer installations, having ephemeral environments is fantastic. You pop in before the sales call starts, press a button, and five minutes later you've got a test environment for that customer. Those things are amazing. Ephemeral or not, though, the principles are still the same: all the configuration for how that environment is brought up, and how that environment is changed, needs to be version-controlled, automatable and traceable.

So, if you're adopting that infrastructure-as-code approach, and you're controlling and limiting, if not eliminating, any manual edits to that environment that bypass version control, those fundamental principles have not changed. That stuff has been there since the beginning of time; it's in the Continuous Delivery book. More people need to read the Continuous Delivery book. Kief Morris's book on Infrastructure as Code is in its third edition; please read it. That stuff has not changed. So, whatever your thoughts are about whether your environment is ephemeral or not, that stuff is still the same.

Olimpiu Pop: Thank you. Good. Is there anything else that I failed to ask you that we should address?

Sam Newman: Well, a couple of things. Firstly, I do a monthly newsletter; you can go to my website and sign up. It's pretty low traffic; I try to keep it nice and brief. I provide updates on how the book is going, share what I've got coming up, and share things I've found interesting that week or that month. I try to keep it high signal, low noise. If you want to read the current version of the book, which is in early access form, you can go to my website and there are links from there; you can access it as early access on oreilly.com. To find out what I'm doing, whether I'm attending a conference, or to learn more about the book, please visit my website.

Olimpiu Pop: Okay. Thank you, Sam. Thank you for your time, and I look forward to reading your book.

Sam Newman: Thank you so much.
