Effective Error Handling: A Uniform Strategy For Heterogeneous Distributed Systems

Transcript

Olimpiu Pop: Hello everybody. I’m Olimpiu Pop, an InfoQ editor, and I have in front of me today Jenish Shah, one of the engineers at Netflix who make sure that our movies are streamed each evening, keeping us on the couch. Thank you for that, Jenish. Can you give us a more fitting introduction of yourself?

Jenish Shah: Sure, sure. Thanks for the brief introduction, Olimpiu. Hello everyone. My name is Jenish. I’ve been working with Netflix for more than 5 years now, and before that, I worked for 3 to 4 years at Amazon, and before that, a bunch of other companies. From the official title: yes, I’m a software engineer – hardcore backend engineer. I’ve been developing highly scalable distributed systems for years now, and that’s my passion: solving real-world problems, helping my stakeholders, making incredible movies, and great content, so you guys can sit on the couch and enjoy that.

Olimpiu Pop: The company you mentioned has two products that are interesting and global in scale, and we each want to learn more about them. You could share a bit of your upbringing, your education, and how you managed to get to this interesting position.

Jenish Shah: Sure, absolutely. So I’m originally from India. I was born and brought up in India, back in the ’90s, ’95. Those were the years when people started using computers, and one of the first industries to adopt them was the financial industry. My dad used to work for a bank, and he got to start using a computer. He was pretty excited. I was good at mathematics and analytics, and my dad just told me, “Yes, it’s pretty exciting to work on it even if you want to play games”. And eventually, it was ’99 when my dad got a computer for my brother and me, and I started using it. Obviously, I wasn’t writing any code, to be honest. I started with playing games, then moved on to playing music. But while doing that, I was amazed by what other stuff computers can do.

In India, after you complete 12th grade, you decide which career path you want to take. 2000 was the year when I had to choose. It all aligned, and I was like, “Okay, if I’m getting an opportunity in computer engineering, why not?” I just go with it. And then I started getting into data structure, and it got me more and more excited. I know many people hate it, but I do love it. Right? And yes, one thing, then another. I completed engineering, got into good companies starting with Oracle, and eventually reached here. Along the way, I met many great engineers and learned quite a lot.

Olimpiu Pop: We had our first computer at home around the same time. That was the year for me as well. And actually, I was just thinking the other day. I was talking to some of my friends, explaining that I was probably around 9 or 10 when I got my first computer at home. And actually, my daughter doesn’t have access to one. We have a lot of laptops there, but she cannot say it’s one of hers; her school doesn’t encourage screen time. And then I realised that it’s about time to do something about it. So we did some programming last year, like Python, and we did some Scratch even before, so we’ve worked on that, but now it’s the moment when I’m just encouraging her to do stuff herself, mainly through Python, because they found quite good resources, and that’s quite cool. Thank you for sharing.

Microservices evolution from REST API over HTTP [03:59]

Taking a step back, Netflix is, at least for me, synonymous with, as you said, distributed systems for a long time. I was just looking, and I was amazed by what’s happening on the Netflix blog, the technology blog, and all the stuff through Chaos Monkeys and all the things. It was like pure innovation – things that are now used by a lot of other companies – but looking back over a long period of time, microservices were synonymous with REST, and then REST was synonymous with HTTP, and that’s a big misconception.

Still, it feels like in a village: somebody sees the movie, not everybody, and then somebody goes and tells the story, and it just keeps getting diluted, but you are at the source of it. How do things evolve? How are you looking at microservices? Is it actually tied to HTTP? Is it only REST? There is other stuff. Let’s try to create something like a small definition – not a blueprint – but a context from which we can evolve our conversation.

Jenish Shah: That’s a good question. How do microservices evolve? So initially, if you go back then, there was no concept of microservices. People used to create a single service that handled everything. Then people realised rightly that if you start building a more distributed system, not all the features coming out of your product require a similar kind of treatment.

Some components require high scalability, some require high reliability, and most require both. But you have a feature-specific domain, specific use cases, and what you want to do is carve out a separate small service, and that’s where people started thinking about microservices. Why don’t we have a microservice? And again, to your point, microservices are not just about protocol; it’s not just about one microservice talking with another microservice. They handle a bunch of business logic, and what eventually happened is that people started with REST, right?

REST was very standard to start with, and that used to be the default protocol. Hypothetically, say I work at Amazon or Netflix: if I had a third party calling my service to get any kind of information, or even an internal service, we always used to follow REST. And then people realised, okay, one size does not fit all, right? For external callers, REST gives a good structure, a JSON kind of output, right? It’s easy to understand, easy to read. That makes sense. But when you have a lot of small microservices, each handling a business use case, REST is not the optimal way. It’s not the wrong way, but it’s not the optimal way.

Olimpiu Pop: You provided a lot of nuggets of information. Let’s put them in context. First of all, when you’re talking micro, it means we are just focusing on a business topic. You have a business domain, and then a handful of microservices are attached to that business domain. Then, as you said, initially, HTTP was used because it was convenient, and it simplified things compared with what came before, which had a lot of issues around it. It was very hard just to change stuff, and obviously it was a good change, but it was not the most efficient way because HTTP at that point, at least HTTP 1.1, was not designed for those kinds of things. Just think about it: we were just coming from the age where you had just plain text and images, and then other protocols started appearing, and you were just about to begin with gRPC.

Jenish Shah: Yes, so that’s what happened. gRPC is a highly optimal, highly efficient protocol for transferring information. And what companies realised, and what software engineers realised, is that if I need to make my service efficient and my service is also getting information from other services, I want to make sure that the communication pipeline is also efficient and not just my business logic. And that’s why people said, “Okay, for internal communication, let’s start using gRPC, which gives us the best mileage to solve our use cases, to serve our customers and all that kind of stuff”. And that’s when, to your question, things started evolving.

Oh, now my same services have REST endpoints, and the same services have gRPC endpoints, because I don’t just have an external consumer, I have an internal consumer, which has a different kind of SLA expectation. And eventually, as the industry evolved, and I don’t know if you want to touch on GraphQL right now, people started thinking about a new paradigm for building applications.

Now, say you want to build an application as a user interface. Earlier, you used to go to one application and get that data. But now, because of microservices, there are many services for different kinds of business logic or business domains, and the UI wants to get part of the information from all of those services. And that’s when GraphQL came, right? GraphQL builds an interface on top of all those services. So the UI just calls GraphQL, and GraphQL is smart enough to pull the respective information from the different services, so that the UI can enrich the user interface.

Olimpiu Pop: Let me expand on a couple of points that you mentioned. gRPC came as a solution to make things more efficient, and more often than not, it is used for internal communication. That’s what you mentioned as well. And now, to use some fancy wording from the distributed landscape, this would be used for East-West communication between internal services, as it’s called by people who want to show that they know about distributed systems, right? Then, if you look at it for a moment, and I am expecting that, at least in my experience, there are places where you don’t want to have very chatty services, and you need to know how to go against that.

So, on the backend side, you want to be distributed to be scalable, so that each entity, or microservice in this case, can be scaled individually. Then you can use them wherever they’re needed. But at some points, probably when working on the mobile side, because mobile is more sensitive to chatty services, or even when it’s about cloud, you don’t want that, and you can aggregate. So you go against what you actually did on the backend side, and GraphQL was a solution for this, where the client just sends one request. Behind the scenes, through the middleware, the GraphQL middleware, it aggregates all the information and then provides it so that you don’t have to go back and forth, right?

Failing gracefully in distributed systems [10:45]

Olimpiu Pop: Great, thank you. I was just going through my mind while listening to you that it’s so funny and such a beautiful thing in terms of open source and what the open source technology brings, because we’re discussing the concept that got started in Thoughtworks, a company that is known for the nuggets of innovation in the technology industry that they provide, and that’s the microservice.

Then we’re discussing REST, which came from a PhD thesis a long time ago and was built on top of HTTP. That was an open standard for an extended period of time, and it still is an open standard. And then we are discussing gRPC. That’s a protocol put together by Google and shared with many people. And then we have GraphQL, which comes from another company, and that’s Facebook. And in all, a lot of the concepts of these systems were coined and built by Netflix on a potent Java stack that, well, Java is 30 years old now, so it’s something that came from Sun and is now Oracle’s. It’s such a nice ecosystem today.

What I liked a lot, and that’s very important for me, is the business part, because in the end, technology should serve business, and one part of what we’re discussing now is just plumbing. But the other problematic part is that you get tangled in the flow of the application, and so on. You keep mentioning SLAs and SLOs. Well, I’ll just stay on the SLA side of things, because SLOs are for you internally, so I shouldn’t care about that. But how do you make sure you handle things properly, with grace? Because in the end, that’s what counts for me. And that was one of the benefits of microservices: if the service degrades, I don’t know, I want to watch a movie, I know exactly which movie I want to watch, but the search isn’t working or something. How do you see it from that angle? How do you make things work as expected?

Jenish Shah: So when we think about microservice in today’s world, in the fast-paced world, I think what we are trying to solve every day, if not every day, every other day, we try to come up with one new business use case and then try to solve those use cases. And what happens is we will keep building new microservices. Now, with all these out-of-the-box technologies, it’s easier to develop microservices, and that’s what we have been doing.

Now, to your point, the degradation. So while most of our focus is on building the business logic, I want to understand my business domain and solve that problem. And when it gets solved, everything is fine. Nobody cares how it works internally. The problem is when things don’t, and that’s what I’ve realised, and I’ve been discussing with my colleagues, and I’ll get to it in more detail, but when things don’t work as expected, right?

Like okay, you made a service call, and something unexpected happened: you are interacting with the database and the database went down, or you’re expecting some input as a part of the request from the caller service, but they didn’t pass the appropriate input. How do we take care of this kind of situation? Because every business has a consumer, whether it’s an internal service or a user like you who is watching Netflix.

You don’t always want to show, “Oh, something went wrong”. Sometimes giving extra context is useful, and that is pretty important. While you try to solve business cases, you also need to make sure that you are not overlooking this, which honestly gets overlooked a significant number of times: when you are starting to build, you are more excited about solving business use cases and overlook what happens if things don’t go as planned.

So, providing the appropriate error code and relevant information when things don’t work out is extremely important from my perspective; it doesn’t matter whether your caller is internal or external. Just think, for example, you are filling out a form on a hotel reservation website. You put in a bunch of information, and because you missed one of the mandatory text fields, when you hit submit, it says, “Oh, something is wrong”. You get annoyed, right? You don’t understand what’s going wrong. You need a user-friendly error that nudges the user to take a specific action, and it could be anything. So that’s how I see it: you have to scale both ways, from the business perspective and from the experience perspective.

Olimpiu Pop: I was asking about that because there are two facets of it. One of them is you as a provider of the service, and then obviously, you need to communicate with the outside world about what happened. And then on the other side, you as a developer who is calling another service should know exactly what to do now: wait, retry, just drop it, and all these kinds of things. And I had a conversation earlier this year with Sam Newman, one of the people who is known for their microservices background, and that was what he was saying: that it all boils down to three things: idempotency, retries, and timeouts.

The challenges of handling errors in a multi-protocol setup [16:13]

How often should you retry, and what should be the timeout of the service, to know exactly how to do it and not overwhelm the service? And then, probably, you can drive this flow based on error messages, so that’s it. But if you’re thinking from a monolith perspective, well, that’s pretty easy. You have one thing that interacts with another one; it’s just plain calls. But when you are over a network boundary, how are you doing that?
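The three concerns named here (idempotency, retries, timeouts) can be sketched from the caller's side. This is a minimal Python illustration, not code discussed in the interview: `RetryableError`, `NonRetryableError`, and `call_with_retries` are hypothetical names, and `operation` is assumed to be idempotent and to enforce its own timeout.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker: the failure is transient, so retrying may help."""


class NonRetryableError(Exception):
    """Hypothetical marker: retrying the same request cannot succeed."""


def call_with_retries(operation, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempt budget is spent.

    Only safe when `operation` is idempotent, because it may run more than once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NonRetryableError:
            raise  # e.g. "not found": the caller has to change the request
        except RetryableError:
            if attempt == max_attempts:
                raise  # budget exhausted, surface the last failure
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
```

Bounding the attempts and backing off between them is what keeps a struggling provider from being hit even harder; the distinction between the two error types is what lets the caller stop early when a retry cannot help.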

Jenish Shah: And that’s where I would like to get into a little bit of technicalities. Let’s think, as you said, about provider and consumer use cases, and I am starting with the provider. As we were just discussing, over a period of time we began serving multiple protocols even as a part of one service, and each protocol has its own way to communicate information across the wire, across the network. Still, your business logic more or less stays the same. So you may have one API exposed over gRPC and another exposed over REST, but both might use the same business logic. Now, suppose something has gone wrong: let’s assume you are trying to book a hotel room and that room is not available, like a 404. You were trying to do something, and it is not available.

I need to give that information as a provider, so I want to make sure that if Olimpiu is my caller, he doesn’t keep retrying, because if you keep retrying, it’s not going to work out. After all, that information is not with me. So I need to give explicit instructions. Now I have business logic which is the same, which identifies whether that room is available or not, and I have to give that information back over REST, and I may have to give the same information over gRPC. Now, the interesting part is that with the adoption of different protocols, every protocol comes with a different industry standard. These are the error codes over gRPC; these are the error codes in the REST world, right? In the REST world, if I don’t find anything, I’ll return 404, right? It is a standard status code. 404 doesn’t mean anything in gRPC; it has a separate status code, NOT_FOUND, which means the resource was not found.

So what I need to do, and the onus is on me as the provider, and again, I keep telling you it’s an overhead, right? I need to write code to understand, okay, my call came via REST, and now this situation has happened, so I need to give the response code 404; or the same call happened on gRPC, and I need to give the status code NOT_FOUND. So as a good service provider, what I need to do is essentially create an interceptor which will realise, okay, the business logic failed, it is telling me something, and I need to convert that to a protocol-specific error code before I respond. So that is the provider side. Now let’s look at the caller side. My callers are looking for information that is as accurate as possible.

They’re looking for a 404 if the room is not available; if I give them 500, well, 500 is a generic internal server error. And if you don’t handle things properly, everything will run into that giant bucket of 500. And if you are my caller and you keep getting 500, you are clueless. To be honest, you don’t know what has gone wrong. Can you change something so that your call will work? Or is something temporarily wrong with my provider? What’s going on? And it’s not just you; you might have your own callers, and you want to provide that information to them, right?

Again, going back to the hotel reservation page: you keep getting, “Oh, something wrong, something wrong”. Okay, what are you going to do, right? So user experience is the vital thing, not only in the case of success, when you can stream a movie efficiently. If something goes wrong, or the user is doing something wrong while booking a hotel, you want to give that information back to the user. The error codes, along with a specific error message, are what is going to drive that information.
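The caller-side dilemma described above can be made concrete with a small, simplified policy. This is a sketch under stated assumptions: `caller_action` is a hypothetical helper, and treating 429 and 503 as retryable is a common convention rather than something stated in the conversation.

```python
def caller_action(http_status):
    """Suggest what a caller can do with an HTTP status (simplified policy)."""
    if http_status in (429, 503):
        return "retry later with backoff"
    if 400 <= http_status < 500:
        return "fix the request, do not retry"
    if 500 <= http_status < 600:
        return "unclear: could be transient or a bug"
    return "no error handling needed"
```

A precise code such as 404 gives the caller a definite next step, whereas a blanket 500 falls into the "unclear" branch, which is exactly the giant bucket the speaker warns about.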

Olimpiu Pop: So, usually, looking at the straightforward code, I don’t know, you are reading a file in Java, what you need to do is just wrap it in try-catch. Well, obviously, we’re discussing ancient Java because now Java is very cool and very thin and doesn’t have to do a lot of stuff. That’s what lands in the programming language. And then if you have this reading from a file system in the method, you have two options.

One of them is to catch the exception itself, or else bubble it up so it can be handled somewhere else. And now in the distributed world, it’s a lot more complicated, because internal communication is usually done through gRPC, outside calls come either through HTTP or GraphQL, and then all of a sudden you have three different protocols, and each is different. And because we are elegant programmers, we will not write something like “if it’s this protocol, do this; if it’s that one, do that”. So what’s the solution?

The service failures fall into four categories [21:27]

Jenish Shah: So what I did is that I developed a design pattern while at Netflix; it was three or four years back at some point in time. As a backend engineer, or as a distributed systems engineer, I want to have my primary focus on solving business logic and not on writing code which keeps transforming my error scenarios into protocol-specific error codes. How do I do that? What I came up with is this: in any system, you can divide the scenarios where things can go wrong into four major buckets. One is, okay, you got a request, and whoever the caller is, they are not authorised to make that request: an authorisation problem. Another bucket is that the callers are eligible to call, but they did not provide enough information, or they provided the wrong information, to execute the business logic: a validation problem.

So that is another bucket. The third bucket is when my system has done something wrong. Okay, everything looks fine from the caller’s perspective. We are trying to process business logic, and something has gone wrong, say in parsing one of the files which we are reading from the file system. So it’s my application, a business logic problem. And the fourth bucket: my service also depends on other services and other infrastructure pieces; it could be a database, it could be some other service. What if the database has gone down? So that’s a dependency exception. What I came up with is this design pattern where I create four different exceptions: authorisation exception, validation exception, application exception, and dependency exception, and the validation exception comes with enums. And again, those enums have nothing to do with any protocol, and I’ll get to that a little bit later.
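The four buckets can be sketched as a small exception hierarchy. This is a minimal Python illustration, not the Netflix library itself: every class name, the `ValidationType` values, and the `find_room` helper are hypothetical, chosen only to mirror the description above.

```python
from enum import Enum


class ValidationType(Enum):
    """Protocol-neutral reasons a request can be invalid (illustrative values)."""
    INVALID_INPUT = "invalid_input"
    NOT_FOUND = "not_found"
    OUT_OF_RANGE = "out_of_range"


class ServiceError(Exception):
    """Common base so a single interceptor can catch all four buckets."""


class AuthorizationError(ServiceError):
    """Bucket 1: the caller is not allowed to make this request."""


class ValidationError(ServiceError):
    """Bucket 2: the caller is allowed, but the request is missing or wrong."""

    def __init__(self, validation_type, message):
        super().__init__(message)
        self.validation_type = validation_type


class ApplicationError(ServiceError):
    """Bucket 3: the service's own business logic failed."""


class DependencyError(ServiceError):
    """Bucket 4: a database or downstream service failed."""


def find_room(rooms, room_id):
    """Business logic: no HTTP or gRPC concept appears anywhere in here."""
    if room_id not in rooms:
        raise ValidationError(ValidationType.NOT_FOUND, f"room {room_id} not found")
    return rooms[room_id]
```

The point of the design is visible in `find_room`: it states what went wrong in business terms and leaves the choice of status code to whatever layer sits between it and the wire.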

Olimpiu Pop: Okay.

Jenish Shah: It is like you have a validation type for what can go wrong: not found, out of range, or whatnot, all different. Now these are the exceptions. So when I’m writing business logic, or anybody in my org is writing business logic, they don’t care over what protocol the business logic is getting called. Suppose I find that a caller did not provide enough information. In that case, I’m going to throw a validation exception without worrying about what protocol is being used. That is what this design pattern is, and you can consider it as a library.

It is a library which I created, which exposes exceptions that anybody can throw. I also have interceptors already built for gRPC, GraphQL, and REST. And what happens is, if any application consumes this library and that application throws, say, an authorisation exception or a validation exception, the interceptor automatically comes into play. It’ll automatically identify, oh, this call was made over gRPC, and now a validation exception is thrown with invalid user input.

Now, what does invalid user input mean in the world of gRPC? It’ll automatically translate it to the appropriate status code and send a response, and the same thing will happen with REST. Suppose that call was made over REST, and the validation exception happened. Like, suppose the room was not found and the validation exception was resource not found, right? The interceptor will return a 404 and send it back. So that is what happens. And now this library, at least in Netflix, is used by more than 150 services. As a developer, you generally own multiple microservices. If this library were not there, you would be writing this code in all these microservices: all these interceptors, all these exceptions and all that stuff.
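The translation step described above can be sketched without any web framework. This is a simplified stand-in for a real REST or gRPC interceptor, not the library's actual code: `intercept`, `CODES`, `book_room`, and the exception classes are hypothetical, and the pairing of HTTP statuses with gRPC status names follows a common convention that the interview does not spell out.

```python
from enum import Enum
from http import HTTPStatus


class ValidationType(Enum):
    INVALID_INPUT = "invalid_input"
    NOT_FOUND = "not_found"


class ValidationError(Exception):
    def __init__(self, validation_type, message):
        super().__init__(message)
        self.validation_type = validation_type


class DependencyError(Exception):
    pass


# The only place that knows protocol-specific codes.
CODES = {
    ValidationType.NOT_FOUND: {"rest": HTTPStatus.NOT_FOUND.value, "grpc": "NOT_FOUND"},
    ValidationType.INVALID_INPUT: {"rest": HTTPStatus.BAD_REQUEST.value, "grpc": "INVALID_ARGUMENT"},
    DependencyError: {"rest": HTTPStatus.SERVICE_UNAVAILABLE.value, "grpc": "UNAVAILABLE"},
}
FALLBACK = {"rest": HTTPStatus.INTERNAL_SERVER_ERROR.value, "grpc": "INTERNAL"}
SUCCESS = {"rest": HTTPStatus.OK.value, "grpc": "OK"}


def intercept(protocol, handler, *args):
    """Run business logic and translate any failure for the given protocol."""
    try:
        return {"status": SUCCESS[protocol], "body": handler(*args)}
    except ValidationError as err:
        key, message = err.validation_type, str(err)
    except DependencyError as err:
        key, message = DependencyError, str(err)
    except Exception:
        # Unexpected failures fall back to a generic code without leaking details.
        key, message = None, "internal error"
    return {"status": CODES.get(key, FALLBACK)[protocol], "message": message}


def book_room(room_id):
    """Business logic that only knows about the protocol-neutral exception."""
    raise ValidationError(ValidationType.NOT_FOUND, f"room {room_id} not found")
```

With this sketch, `intercept("rest", book_room, "42")` yields status 404 and `intercept("grpc", book_room, "42")` yields `NOT_FOUND`, while `book_room` itself stays unchanged. Supporting a new, more granular code would mean adding one entry to `CODES` rather than touching each service.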

Olimpiu Pop: Let me see how we can summarise it, because you got me very excited about it, and then it’s harder to just push on. So you looked at it, still referring to HTTP, which is probably the most common one. You have these quite well-known error codes: the 4xx errors, the 500s, and so on. And rather than just focusing on that, because each protocol has its own way of doing things, you broke it down into logical, business-related categories. Usually, you have a problem with access, and that’s the authorisation side. Well, that’s oversimplified, because I will not go into the triple As, where we have multiple things, but that’s the authorisation part, and then there’s the validation side of things.

That’s when something is not accepted, like the example you gave of the resource not being available. Then there is the application logic, which I expect to be something in terms of computation: something is not working as you would like it to. And then the fourth bucket is the downstream dependency: that service is not available. So what you said is that these are the main categories, and the protocol doesn’t matter. That’s in terms of the logic part. And then in terms of the implementation, you’re using an interceptor that knows what protocol each call is using, and on top of that, it takes the exception, refines it, and provides the response as expected. So it’s bubbling up in the correct language.

Jenish Shah: Exactly, right? And consider that you get all that stuff as a part of one library. And that’s why I keep calling it a design pattern. It’s not like a great invention or something. It is something all of us would know. But this is a design pattern which I created. Okay, put all the stuff in a library so that every service owner doesn’t need to write the same code; they can start using this library, and from that point in time, they can concentrate on business logic and keep throwing these exceptions when appropriate. Another thing I want to call out is that industry standards are what they are today, but they keep evolving. You start getting new granular error codes, and that has also happened in the HTTP world. We started with a very low number of error codes, but now it has grown to handle more and more granular cases.

Now, if a new set of error codes comes, it doesn’t make any sense for companies like Amazon or Netflix, which have thousands of microservices, to go and change all of them to handle these new error codes. You need such a design pattern in a library, where you can just go to a single place and say, “Okay, now the validation exception has one more granular flavour”. So when people start throwing it, all these interceptors will understand this new flavour and will convert it to the new error code.

So it’s not just for me, and again, just to give you a little bit of my motivation: when I joined Netflix, I was working on a production suite. I used to manage eight microservices, and as I kept writing the same thing again and again, I was like, “Okay, now I’m doing something which is pretty much boilerplate”. It doesn’t give much value to the business, writing the same code every day and just spending time. Why not create something which will help not just me but all my colleagues in the whole org? And people acknowledged that, and that’s how it is being used now.

Olimpiu Pop: Probably, this is the most essential question of this conversation. If it’s a pattern, what do you call it? Chain of disaster?

Jenish Shah: In general, we just call it the exception library. People do. I’ve talked with a few consumers just to get their feedback, because at Netflix, we do believe in being pretty open. We get a good amount of feedback, we give a good amount of feedback, and all that stuff, to understand how people feel about it. And again, it was not like, okay, I developed a library three years back and there hasn’t been any change since. We have been changing and improving it to handle more use cases, like retries and all. So we had some new additions made last year: this library now also gives you a retryable exception, as another variety.

So anytime in your microservice’s logic you think that, okay, somebody can retry, it gives you more variety to offer your users. That kind of stuff. But yes, as an engineer, I feel good when I make something which solves a problem for the whole world; being at Amazon or Netflix, you always relate to the world as a whole. It’s not just the US or some single place. And even as an engineer solving problems for other engineers, that gives me reasonable satisfaction. So when you talk to people, they say, “Oh, now I just add a line in my build.gradle and then I start using your exceptions, and things just flow the way I would’ve expected them to flow”.

How does observability adapt to incorporate the information [29:52]

Olimpiu Pop: I was just thinking that, in terms of programming, you get a lot out of this. It’s like inputs on one side, then calling it, which allows you to focus on the business side of things, and all the other stuff is just pushed aside. But the hot word these days is observability. Obviously, in the microservices world, you need to pass the context from one side to another, and especially when the exception hits the fan, you would like to know about it. How does this change the perspective, if at all, on getting those nice, distributed logs and traces that will make your customers happier, quicker?

Jenish Shah: Yes, it’s a good point you raised. I completely missed that part. In observability, if things go fine, nobody cares; if things start breaking, everybody cares what has happened, because that’s when chaos and confusion occur. So this library’s interceptor is in the best place, because it has the context of the request and also the context of the response. And when it realises something has gone wrong, for whatever reason, it identifies what has gone wrong and logs an appropriate error message or warning message. And we also want to be sensible. Hypothetically, in the case of a validation exception, I don’t want to log an error as a provider in my service; I want to log a warning: somebody is using my service in the wrong way. And the reason I’m making the distinction between error and warning is that I might have stricter alerting on my side if an error is logged in my service versus a warning.

For warnings, “Okay, alert me or page me if there are a hundred warnings”. For errors, “Page me if there are even five errors”. So this design pattern, or this library, gives us that benefit also: based on the exception, it logs either warnings or errors. Another thing is, it’s not just about logs, it’s also about counters, right? How many times has a particular kind of exception happened, and not only that, from which caller? So you can have all this information combined, and you can start emitting counters, and what you can do is put up a nice chart at the end of a week, or end of a month, or end of an on-call shift, showing that this particular caller is misbehaving. Maybe they’re out of date, they’re doing something which they shouldn’t be doing, or they’re not understanding our business use case.

So you can say, this caller consistently hits our validation exception, and now you see it on a nice graph, so it’s easy to pinpoint. You don’t even need to go to the logs sometimes. You can go to the charts and see, okay, which caller is causing more problems, or whether there is something which we don’t know about that is silently failing on our end when these kinds of callers are calling us. So again, observability is a big, big part, right? Something which plays a significant role in maintaining the health of your service.
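The logging and counting behaviour described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the library's implementation: `record_failure`, the logger name, and the exception classes are hypothetical, and the `Counter` stands in for a real metrics client.

```python
import logging
from collections import Counter

logger = logging.getLogger("exception_interceptor")  # hypothetical logger name
error_counts = Counter()  # stand-in for a real metrics client


class ValidationError(Exception):
    """Caller mistake: logged as a warning."""


class DependencyError(Exception):
    """Provider-side failure: logged as an error."""


def record_failure(caller, err):
    """Log at a severity that matches the bucket and count it per caller."""
    level = logging.WARNING if isinstance(err, ValidationError) else logging.ERROR
    logger.log(level, "caller=%s type=%s msg=%s", caller, type(err).__name__, err)
    error_counts[(caller, type(err).__name__)] += 1
    return level
```

Because the counter is keyed by caller and exception type, `error_counts.most_common()` gives the kind of ranking a dashboard would chart, and alert thresholds can differ per severity, as in the hundred-warnings versus handful-of-errors example.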

Choosing the proper protocol for the job [32:52]

Olimpiu Pop: So, to just sum it up, it’s another level; it doesn’t matter. The interceptors are doing their job, and then the information bubbles up, and then you just see it nicely in the dashboards. For my final point, how do you choose? So now we are going back to the greenfield side. We are in front of a white piece of paper, and somebody says, “Look, you need to do another microservice that does something”. How do you choose the best protocol? Or what are the things that we should care about when doing that? Because somehow we managed to put it like, “Okay, if it’s coming from outside, we’ll go to use this protocol and that and that and that”. What should you bear in mind when choosing that protocol?

Jenish Shah: I touched upon it earlier, but my rule is fairly simple because, obviously, there is no right or wrong, to be honest, in computer science, as things evolve. If I were to expose something externally that is going to make use of multiple of my services, I would go with GraphQL. But obviously, there is a drawback to it. Pardon me if my knowledge is not up to date, but we have had a tough time using GraphQL to upload and download files and all that stuff. So that’s where we go with REST. If I’m building a new microservice, that’s my policy. If there are internal callers to my service and I’m exposing an API that’s going to be called at 100 TPS or something, by default I’ll go with gRPC.

And it’s not just about the efficiency of the protocol itself; with the built-in structure, you can easily drive reproducibility and all those kinds of things by property. Obviously, Netflix has evolved a lot, so we have an internal framework to handle it, but gRPC gives you all those things out of the box. I’m more open to retries and all that stuff for my internal services because I trust the quality of service, and I trust that they understand my use case better, or pragmatically better. And that’s why they’re going to use gRPC, so I can serve them the best.
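As a rough illustration, the rule of thumb above can be written down as a small decision function. This is one reading of the speaker's heuristic, not a Netflix policy: the ApiProfile fields and the order of the checks (internal callers first, then file transfer) are assumptions made for the sketch, and the source does not say what to pick for an internal API that also transfers files.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    GRAPHQL = "GraphQL"
    REST = "REST"
    GRPC = "gRPC"


@dataclass(frozen=True)
class ApiProfile:
    """Hypothetical description of the API being designed."""
    external: bool                 # exposed outside the organisation?
    transfers_files: bool = False  # uploads or downloads files?


def choose_protocol(api: ApiProfile) -> Protocol:
    """Apply the heuristic: gRPC internally, REST for files, GraphQL otherwise."""
    if not api.external:
        # Internal, trusted callers: gRPC by default.
        return Protocol.GRPC
    if api.transfers_files:
        # File upload/download was reported as painful with GraphQL.
        return Protocol.REST
    # External API that fans out to multiple services.
    return Protocol.GRAPHQL
```

For example, choose_protocol(ApiProfile(external=True, transfers_files=True)) returns Protocol.REST, while an internal API with default settings returns Protocol.GRPC.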

Olimpiu Pop: Okay. Any other point from your side? Something that I should have asked you but didn’t, and that would be worth listening to?

Jenish Shah: You covered the majority of the stuff. In fact, the observability part was pretty good. And while building the library, I spent a good amount of time thinking it through; not implementing, but thinking it through. The majority of the time in our world, implementing something doesn’t take much time, but thinking through the process, how we go about it, takes time. So that was good. And again, in general, like in competitions, this is how I feel: if you think that you are trying to solve a problem which is general and not limited to your own service, and solving it can help your organisation or your team, think of an idea or a way so that others can make use of it. Again, in today’s fast-moving world, it’s an advantage for everyone if they don’t write the same code.

Olimpiu Pop: Yes, well, how we can frame this is that even boring stuff and annoying stuff like error handling can provide an engineering challenge, and then if you think outside of the box, it would make life easier for you for tomorrow, but also for your colleagues. And that’s important to follow because in the end, we do have quite a significant impact around the world. I’m just thinking about CO₂ emissions now. We are pretty big as a software industry, and I believe everything that we are just doing together actually has an impact, as you said, for your colleagues but also for the environment. Thank you for your time, Jenish. It was really a good conversation.

Jenish Shah: And it was nice talking with you. To be honest, it refreshed everything in my mind from when I was thinking it through. Obviously, I speak with people at Netflix, but even from outside, not being part of Netflix, you also echo the feeling that, okay, exceptions are essential, user experience is necessary, and exceptions are one of the ways you can drive that user experience.

Olimpiu Pop: Well, that’s a very nice way to put it at the end. Just think about a car, or any kind of mode of transportation: it may run almost always, but if it doesn’t drive or work at exactly the point in time when you need it, that exception is problematic. That’s the case for most things, and that’s why outliers are essential, and you should handle them with care and gracefully, of course. Thank you.

Jenish Shah: Thank you.

