Transcript
Hugo Marques: We are going to talk a little bit about Java concurrency from the trenches. You might ask why I picked that particular phrase, "from the trenches". The reason is that this talk is based on my own learnings from a project I worked on over the last year, and everything I'm going to talk about here comes from that project at Netflix. Before you join me on that journey, let's talk a little bit about who I am. I have been a Dungeons & Dragons Dungeon Master for over 20 years, but unfortunately, or fortunately for some of you, we are not here to talk about Dungeons & Dragons.
In my free time, I have also been a software engineer for the last 15 years or so, most of which I have spent at well-known companies such as Amazon, AWS, and Twitter, and I currently work at Netflix. One interesting fact is that even though I worked at all these companies on platform engineering teams, search infrastructure, code review tooling, things like that, most of the time I consider myself a product developer. I am not a performance engineer. I am not a JVM engineer. I don't actually have deep expertise in those topics. You are going to see why this matters for this talk.
Another interesting fact is that I didn’t have to deal with the complexities of concurrency for most of the time. Most of the time I was just doing my application, and then my Spring framework was taking care of the threads for me, was taking care of all of that until it couldn’t. In this project, I had that idea that I had actually to learn a little bit about concurrency myself and try to do some experiments and learn things the hard way. I’m actually here to share those learnings and mistakes.
First of all, who is this talk for? If you're writing concurrent code in Java, I think this talk is for you. What I expect is that you're either going to recognize issues you have faced yourself, or you're going to see issues that I faced and that I hope you won't have to face anymore. That's my goal. The other thing is that if you're dealing with scale, especially with IO, this talk is also for you, because our problem is mainly an IO-bound problem. I'm going to talk about that later.
Finally, if you are not a Java developer but you are curious about concurrency problems and interested in the problem and the solutions, or you are here just to make fun of how we Java developers try to solve these problems, I think this talk is also for you. What is this talk not about? This is not a benchmark or a performance deep dive. Even though I did some comparisons and experiments and I have metrics to show, they are mainly there to explain my decisions and what I was thinking at the moment I made them.
This is not a deep-dive performance talk. I will also not cover JVM or low-level application tuning. Like I mentioned, I do not come from a performance engineering background, and there are a lot of great talks about that; I have some of them in the references. Finally, this is not a full reference for the Java concurrency API. The API is extensive and I could not cover everything. I'm using the API as a tool to solve my problems, so I'm only going to go through the small parts that I actually used.
Foundations
First, let’s start talking about some of the foundations so we have a common ground about understanding concurrency. The first thing I want to talk about is the difference between sequential, parallel, and concurrent code. Sequential is pretty straightforward. You have your CPU, and it’s going to execute your task, and you have to finish all your task 1 before it goes to execute your task 2. When you have parallel, things start to become a little bit more interesting, because now you have multiple CPUs or multiple machines, and you can execute those tasks actually at the same time. Concurrent is a funny one, because concurrent gives you the impression that it’s actually being executed at the same time, but what’s actually happening is that it’s just jumping pretty quickly between your tasks, and you think that it’s actually being executed at the same time, but it’s actually not. It’s just switching super quickly, and that’s how it goes.
The other interesting concept is synchronous versus asynchronous. With synchronous code, process A is executing something, it calls process B, and process A waits. It waits until it gets the response back from process B, and only then does it continue executing.
With asynchronous code, it's quite different. Process A sends the request and keeps doing some work. Later on, when it needs the response from process B, it asks: do you have the response ready for me? We are going to use these ideas a lot today. Probably the most important concept here is CPU bound versus IO bound. CPU-bound problems are the ones that depend on your CPU, your cores, your machine: doing math, compression, loops, transformations, things like that. IO bound is when you are waiting on the network, writing to files, calling your database, anything like that. This difference between CPU-bound and IO-bound problems is going to inform some of our decisions today.
What is the toolbox that Java gives us to attack these problems? The first thing we have is the good old thread. This is very old school. Even though it's the foundation everything is built on today, we rarely manipulate threads by hand. Instead, most of the time we use something called ExecutorService, which basically works as a thread pool. Threads are expensive to create, at least until the new thing I'm going to talk about later, so most of the time you use an ExecutorService to create your threads or hand you threads that already exist.
The other thing is CompletableFuture. If you are coming from a JavaScript background, it's similar to promises. We Java developers don't want to say that we copied JavaScript, so we decided to call it CompletableFuture, but the idea is the same. This is our way of doing async code: we ask for something to be executed, and later on we go and get the response. One more thing, not super new because it has been around since Java 8, is parallel streams. This is an easy way to do parallel processing of streams in Java. The shiny new kid on the block is virtual threads. I said before that threads are expensive; not so with virtual threads. The idea is that they are cheap: you create them, you destroy them, you don't have to keep a pool of them. They are pretty good for IO-heavy flows, and I'm going to talk about them today.
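As a rough illustration of these four tools side by side (this sketch is not from the talk's slides; it assumes Java 21+ for the virtual-thread part):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ToolboxSketch {
    public static void main(String[] args) {
        // ExecutorService: a pool of platform threads that we submit work to.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // CompletableFuture: ask for work to be executed now, grab the result later.
        CompletableFuture<String> future =
                CompletableFuture.supplyAsync(() -> "hello from " + Thread.currentThread().getName(), pool);
        System.out.println(future.join());

        // Parallel stream: one-liner data parallelism over a collection (uses the common ForkJoinPool).
        List.of(1, 2, 3, 4).parallelStream().forEach(System.out::println);

        // Virtual threads (Java 21+): cheap to create and destroy, no pool to manage.
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            vts.submit(() -> System.out.println("running on " + Thread.currentThread()));
        }
        pool.shutdown();
    }
}
```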
What Were the Problems, and Solutions?
Let’s talk a little bit about what was my problem. One thing to keep in mind, though, is that I changed the domain a little bit, but keep in mind this was a real problem. The traffic pattern, the code, everything else is the same. I just changed some names around. We have a scheduler, a job. This is not your application that’s actually getting requests. It’s actually a cron job that runs once every 30 minutes. That job has to process 10 regions in the planet.
For each region we process around 10K orders, and for every order we have to process around 5K products. We make gRPC calls, one call for every 100 products. If you think about the scale, that's 10 regions × 10K orders × 5K products = 500 million products, or about 5 million gRPC batches. Spread over 30 minutes, that's roughly 270K products per second to resolve, or about 2.7K requests per second if you count batches. You might think 2.7K RPS is not that high. True, but only if I actually distribute the load evenly over the 30 minutes, which most of the time is not the case. And keep in mind that as I add regions or get more orders, this number just keeps going up.
Our code architecture is pretty simple. I have my job, which runs through my regions. My regions call my order service; think of each of these boxes as a Java class. I have a products class that acts as a proxy for gRPC and calls it once for every 100 products. After all of that is done, everything comes back up to my job, which aggregates it and writes the output. You might ask, why not do this with events? Why do you have to collect everything at the end? Let's abstract those details away; just keep in mind that the result needs to be written all at once to this file.
My first instinct was to go simple and write sequential code. It's pretty straightforward: I have my regions, I loop over them, I go to my orders, I group the products into batches of 100, and then I start making my gRPC calls on those batches. Pretty easy; a couple of lines of code solves this. It works, the code is correct and does what I expect, and it's easy to understand; I could explain it in 10 seconds. It's memory safe: you are not sharing state between multiple threads, and you don't have multiple things accessing the same resource and corrupting your data. It's also pretty safe for your dependencies, because only one thread is calling your dependency at a time, so there is a good chance you are not going to overload another service. However, it's pretty slow.
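A minimal sketch of that first sequential shape (not the real code; the domain types are reduced to plain loops and the sizes are scaled down, but the structure of regions, then orders, then batches of 100 with one blocking call per batch is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialVersion {
    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        for (int region = 0; region < 2; region++) {                   // really 10 regions
            for (int order = 0; order < 100; order++) {                // really ~10K orders per region
                for (int batch = 0; batch < 50; batch++) {             // ~5K products / 100 per gRPC call
                    results.add(resolveBatch(region, order, batch));   // blocking call, one at a time
                }
            }
        }
        System.out.println("wrote " + results.size() + " results");    // aggregate and write once at the end
    }

    // Stands in for the blocking gRPC call that resolves a batch of 100 products.
    static String resolveBatch(int region, int order, int batch) {
        return region + "/" + order + "/" + batch;
    }
}
```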
The first time I ran this, I did not have the patience to wait for the job to finish. It also underutilized our resources: if you are on a machine with eight cores, you are only using one of them, and the other seven are sitting there doing nothing. I felt that if I kept waiting like that, I would be just like this lady, waiting forever for my job to finish. I thought, we have to do something.
My first idea was to look at the CPU-bound parts of the problem and how I could attack those. To help me there, I decided to use parallel streams. Why parallel streams? Because they are a one-line code change; they are pretty easy to introduce. For example, I have this list here; if I call parallelStream on it, you can see from the output that it automatically runs on multiple threads. Minimal change, multiple threads: an easy win for me. Why is this for CPU-bound work? Because the default common pool that parallel streams use is sized by the number of processors you have. You can run them in a ForkJoinPool that's bigger than the number of processors, but the common pool is the number of processors, so the thinking is, I want to use all my eight cores, so I'm going to use parallel streams. You can, as I mentioned, customize your ForkJoinPool.
In some cases you might want more threads than the number of processors you have. I have done that in the past for a one-off script; I think I once wrote something like this. Most of the time you don't do that, you just use the default parallel streams. You can also pass a JVM argument saying, I want my parallel streams to run with this number of threads. Just be careful, because you are changing the common ForkJoinPool for every library that uses it. Later in the presentation you are going to see a library that actually relies on that ForkJoinPool.
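A sketch of both options (note that running a parallel stream inside your own ForkJoinPool is a widely used trick, not an officially documented guarantee):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) throws Exception {
        // By default the common pool has roughly (number of processors - 1) workers.
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());

        // Option 1 (use with care): resize the common pool for the whole JVM.
        //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=16
        // Every library that relies on the common pool is affected, not just your code.

        // Option 2: submit the parallel stream from inside your own ForkJoinPool,
        // so its tasks run on that pool instead of the common one.
        ForkJoinPool customPool = new ForkJoinPool(16);
        int sum = customPool.submit(() ->
                List.of(1, 2, 3, 4, 5).parallelStream().mapToInt(Integer::intValue).sum()
        ).get();
        System.out.println("sum = " + sum);
        customPool.shutdown();
    }
}
```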
If you change the common pool like that, you might impact libraries that you do not want to impact. One place where I noticed I could apply parallel streams was right at the beginning, when I start processing my regions. I'm reading the data, and I have to parse it, validate it, and do some transformations on the regions. There is no remote call; everything happens on my machine, so it's basically CPU bound, and I can speed it up a little with parallel streams. The raw data is pretty big, so it looks like a good candidate to parallelize.
Problem 1: Nested Parallel Stream
Here is where our battle begins, because this is the first of a series of mistakes I made. Let's see how it goes. When I started using parallel streams, I got greedy. I wanted to parallelize everything. I said, I'm going to put parallel streams everywhere, because I'm going to go super-fast, blazing fast. I had my first loop, then in my toRegion method I had another loop, so, another parallel stream. If you have to parse or validate in more loops, you put in more parallel streams, because more is faster. Except it isn't. Remember that your parallel streams all run on one limited resource, the number of cores you have. Now you have two kinds of tasks being created and sent to the ForkJoinPool, more context switches, more tasks sitting in that queue, and more objects accumulating in memory. I actually ran some benchmarks, and it's interesting that for small inputs, a single level of parallel stream is actually faster.
For large inputs, the benchmark gives you the impression that the nested version is faster, but if you look at the spread between the maximum and minimum on the graph, you see that it fluctuates a lot. That's because it's allocating a lot of objects and doing a lot of context switching. The way to avoid this problem is to parallelize where it matters most. My raw data is the source of everything, where I have the most objects, and I want each of these transformations to run in a thread for each raw data object, so that's where I parallelize. Focus your parallel stream where it will have the most impact, and avoid nesting.
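A small demo of the anti-pattern versus the fix (print the thread names and you can see both levels competing for the same common pool):

```java
import java.util.List;

public class NestedParallel {
    public static void main(String[] args) {
        List<List<Integer>> rawData = List.of(List.of(1, 2, 3), List.of(4, 5, 6));

        // Anti-pattern: nested parallel streams. Both levels share the common ForkJoinPool,
        // so you mostly pay for extra task objects and context switches.
        rawData.parallelStream()
               .forEach(region -> region.parallelStream().forEach(NestedParallel::process));

        // Preferred: parallelize once, at the level with the most objects / most work.
        rawData.parallelStream()
               .forEach(region -> region.forEach(NestedParallel::process));
    }

    static void process(int x) {
        System.out.println(Thread.currentThread().getName() + " -> " + x);
    }
}
```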
Problem 2: A flatMap Surprise
The second thing that happened to me was another interesting finding. I don't remember exactly what I was doing, but at some point I had a flatMap transformation. I decided, again, I'm going to use more parallel streams; parallel streams everywhere is good. I put a parallel stream inside a flatMap transformation, and then I noticed that my code was pretty slow. What's happening? I added one line of code to print the thread name and realized that it wasn't running in parallel at all.
Then I went to talk with my fellow Brazilian devs on Twitter, and after some discussion we discovered that in OpenJDK, flatMap turns the stream you pass into the flatMap operation into sequential code. It doesn't matter if you pass a parallel stream to flatMap; it's going to run sequentially. I don't know the details of why it's done this way. There is an extensive answer on Stack Overflow if you are curious; I put it in the references, so you can go take a look. My fix was simply to apply the parallel stream to the outer loop, over my users, because for me the result would be the same. I remember the Stack Overflow answer also has a workaround that does not involve parallelizing the outer loop; I recommend reading that answer.
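A small reproduction of the surprise (on the JDK versions I am aware of, the inner parallel stream passed to flatMap is consumed sequentially; parallelizing the outer stream is the simple fix):

```java
import java.util.List;

public class FlatMapSurprise {
    public static void main(String[] args) {
        List<List<Integer>> users = List.of(List.of(1, 2, 3), List.of(4, 5, 6));

        // The inner parallelStream() is effectively ignored: the stream handed to flatMap
        // is consumed sequentially, so every element prints the same thread name.
        users.stream()
             .flatMap(u -> u.parallelStream()
                            .peek(x -> System.out.println("inner: " + Thread.currentThread().getName())))
             .forEach(x -> { });

        // What worked for me: make the outer stream parallel instead.
        users.parallelStream()
             .flatMap(List::stream)
             .forEach(x -> System.out.println("outer: " + Thread.currentThread().getName()));
    }
}
```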
Problem 3: Parallel Stream and Thread Context
The last thing I want to share about my learnings with parallel streams is that you have to be careful when dealing with thread context. On the first line of this code I'm setting the context to user-123. Later on, inside the parallel stream, I try to grab that context. If we execute this code, you'll see that on all the ForkJoinPool threads the context is null; only on the main thread do I get the context. There are two things here. First, you might be asking, why is the main thread showing up at all?
When you invoke a parallel stream, the ForkJoinPool tries to use the invoking thread as part of the pool to do work. That's why the main thread shows up here, and it's the only thread that has access to the context. None of the other threads have access to the context at all. If you are like my company, Netflix, we put a lot of things in that context: request information, authentication between services, things like that. You cannot rely on that from inside a parallel stream. For that, you probably need a custom executor, a custom pool, or something along those lines.
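A minimal reproduction with a plain ThreadLocal (the same thing happens with any context that is propagated via thread-local storage):

```java
import java.util.stream.IntStream;

public class ContextInParallelStream {
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static void main(String[] args) {
        CONTEXT.set("user-123");   // set on the main thread only

        IntStream.range(0, 8).parallel().forEach(i ->
                // Common-pool workers print null; only the invoking (main) thread,
                // which also steals work, still sees "user-123".
                System.out.println(Thread.currentThread().getName() + " -> " + CONTEXT.get()));
    }
}
```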
Executors and CompletableFutures
That solves my CPU-bound problem, but as we have seen, the bulk of my problem is actually IO bound. I'm talking to a gRPC service and resolving around 270K products per second through it. To solve that, I decided to use a combination of executors and CompletableFutures. Quick recap: executors are thread pools, and CompletableFutures are a way to go and grab the answers once I have them. The good old Java way of creating executors is to call these factory methods.
On this first one, I’m creating a thread pool of four threads. The second one, the new cache thread pool, I’m creating a thread pool that as much as I have work, new threads are going to be created, or it’s going to try to reuse threads that are idle. The last one is our single thread executor that only creates with a single thread. It’s actually pretty useful if you’re doing testing. I like to use that a lot. The way you use them with CompletableFutures, you can just pass CompletableFuture.supplyAsync, your task, and you pass the executor that we want to use. This is the plain old Java way of doing it. If you are doing it in Spring, you can create as a bin, and you can inject then directly on your component, and you can just do the same thing, CompletableFuture.supplyAsync, or run async if you want to. This is the Spring way, option one.
Another way with Spring is to annotate a method with @Async and pass the qualifier of the executor you want it to run on. This fetchAsync here behaves the same way as calling CompletableFuture.supplyAsync with my method; it's just syntax sugar so you don't have to write CompletableFuture.supplyAsync yourself. For that to work, you have to add @EnableAsync when you start your Spring application. The last thing about Spring is that it has a default executor of its own, which you can configure in your application YAML. To use that default executor, you just put @Async on a method without passing a qualifier, and you automatically get the default executor. We are going to use that later when we deal with virtual threads.
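A sketch of the Spring wiring described here (assumes Spring Boot; the bean and method names are illustrative, not from the talk):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.stereotype.Service;

@Configuration
@EnableAsync                      // required for @Async to be processed
class AsyncConfig {
    @Bean("orderExecutor")
    Executor orderExecutor() {
        return Executors.newFixedThreadPool(4);
    }
}

@Service
class ProductService {
    // Equivalent to wrapping fetch(id) in CompletableFuture.supplyAsync(..., orderExecutor).
    @Async("orderExecutor")
    public CompletableFuture<String> fetchAsync(String id) {
        return CompletableFuture.completedFuture(fetch(id));
    }

    // Plain @Async (no qualifier) would use Spring's default executor instead.
    private String fetch(String id) {
        return "product-" + id;
    }
}
```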
Now my solution is something along these lines: I loop through all my orders and hand them off to be executed in CompletableFutures. Then, when I make the gRPC calls for the orders and the responses come back, I process those responses on my executor. This is going to work; my code is going to fly, because I'm not blocking anymore. At least, that's what I thought was going to happen, that it was written in the stars. This is what actually happened. This was my first experience running this code.
Problem 4: Out of Memory (OOM)
What happened? What went wrong? This was my first out of memory. If I'm saying "first", I'm already giving away that it was the first of many. When I wrote this code, you can see what I did: I used my executor when I was enqueuing the work, and I used the same executor when I was receiving the responses back. To picture the model, it's almost as if my region service enqueues tasks, the blue ones, onto my executor, and when I call my gRPC service and get responses back, I enqueue the red tasks onto the same executor as well. The problem is that the red ones arriving go to the end of the queue, and my response objects are much larger than my request objects. They just keep accumulating, because my thread pool is still busy working through my request tasks.
Then it just blows up; I ran out of memory. This illustrates the problem a little better: the red tasks keep going to the end of the queue while I'm still in the middle of the queue trying to process all the blue ones, and it explodes. A quick way to solve this is to split the queues: you allocate a thread pool to each particular kind of work, one thread pool for your requests and another for your responses. Why is that second one the gRPC default executor? Because when I'm processing the responses, I realized they are actually pretty lightweight and fast to process, so I can just use the gRPC default executor I already have at hand to process them. How do I do that? The first way is to use the gRPC default executor directly.
Instead of doing a thenApplyAsync, I do a thenApply, because CompletableFutures have three versions of each of these APIs: thenApply, thenApplyAsync without an executor, and thenApplyAsync where you pass an executor. Three flavors. thenApply runs the code on the thread that ran the previous stage. If you remember, my previous stage was handling the gRPC response, which ran on the gRPC default executor; that's why I used thenApply. You can also solve the problem with the default thenApplyAsync, which runs on the ForkJoinPool, the same common pool I warned you at the beginning not to mess with via the JVM argument, because it affects things like this. Finally, you can use thenApplyAsync with an executor, but unlike what I did initially, you do not specify the same executor; you specify a different one, which I call the response executor.
Comparing them: thenApply runs on the same thread as the previous stage, thenApplyAsync on the ForkJoinPool, and the last one on the executor you pass. In terms of performance, thenApply is the fastest because you are not submitting new tasks; you just keep running on the thread that's already there, so there is no context switch. The other two are slower simply because of the context switching. The flexibility of thenApply, though, is low.
The thing is, sometimes you do not know exactly which thread handled the previous stage. A chain of CompletableFutures might arrive in your code where you don't know the thread and can't change it, and with a plain thenApply you have no control over where your code runs. The thenApplyAsync variants give you more flexibility; that's the tradeoff. With thenApplyAsync you know which thread pool is going to execute your code, but you pay for it in performance. They are also inverses in terms of simplicity. One interesting thing is that thread context is preserved with thenApply: as the chain progresses, if the thread carrying the chain has a context, the thenApply operations keep running with that same context. With thenApplyAsync, remember, it can land on any thread: maybe one that has the context you want, maybe a totally different one.
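A small sketch of the three flavors side by side (the executors here just stand in for the gRPC callback executor and the dedicated response executor):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThenApplyFlavors {
    public static void main(String[] args) {
        ExecutorService grpcExecutor     = Executors.newFixedThreadPool(2);   // stands in for the gRPC callback executor
        ExecutorService responseExecutor = Executors.newFixedThreadPool(2);   // a dedicated pool for responses

        CompletableFuture<String> call =
                CompletableFuture.supplyAsync(() -> "response", grpcExecutor);

        // 1. thenApply: runs on the thread that completed the previous stage
        //    (or the calling thread if it already completed). Fastest, least flexible.
        call.thenApply(r -> r + " via thenApply on " + Thread.currentThread().getName())
            .thenAccept(System.out::println).join();

        // 2. thenApplyAsync: runs on the common ForkJoinPool.
        call.thenApplyAsync(r -> r + " via thenApplyAsync on " + Thread.currentThread().getName())
            .thenAccept(System.out::println).join();

        // 3. thenApplyAsync with an executor: you choose the pool, at the cost of a context switch.
        call.thenApplyAsync(r -> r + " via thenApplyAsync(executor) on " + Thread.currentThread().getName(),
                            responseExecutor)
            .thenAccept(System.out::println).join();

        grpcExecutor.shutdown();
        responseExecutor.shutdown();
    }
}
```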
Problem 5: DDoS My Dependencies
However, there were still problems. These are the faces of my colleagues at Netflix when they saw me calling their services at 30K RPS, as fast as I could. They came to me and asked, what are you doing? That's the thing: my app was running fine for me, but looking at the entire system, I was basically causing a DDoS. How do you solve this? I have to self-throttle; I have to slow down. Remember when I said I want smooth traffic? That's the goal. In my case, I thought I could use two mechanisms. One is a semaphore, which hands out permits; say I give out 50 permits at a time, that means only 50 tasks are executing at any moment. Every time one of those tasks completes, I release the semaphore so another waiting task can get in flight.
Or I can use a rate limiter, which is a little different: it says, I allow this many permits per second. In my particular case, that's the one I used, because I wanted to protect their service by allowing only 3,000 requests per second. Sometimes you might need both. Comparing them: the semaphore controls the number of concurrent calls, the number of calls in flight, while the rate limiter controls the call frequency, the number of tasks you start per second. The semaphore is more about protecting your own resources: you only want so many things in flight so you don't blow up your own memory.
The rate limiter is about protecting your dependency: you don't want to overload it with too many tasks. Semaphores are really good when you have tasks that run for a long time. Imagine a task that takes 2 seconds: if your rate limiter releases 1,000 requests every second, they are going to pile up, and you are going to destroy the other service. The risk of leaving one of them out is that without the semaphore the depth of in-flight work can overload your own service, and without the rate limiter you can flood your external service. Sometimes, like here, you may actually need both.
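A sketch of the two mechanisms together (the talk does not say which rate limiter library was used; Guava's RateLimiter is shown here purely as one example, and Resilience4j has an equivalent):

```java
import java.util.concurrent.Semaphore;

import com.google.common.util.concurrent.RateLimiter;   // requires the Guava dependency

public class Throttling {
    private static final Semaphore IN_FLIGHT = new Semaphore(50);        // at most 50 concurrent calls
    private static final RateLimiter RATE = RateLimiter.create(3000.0);  // at most 3,000 call starts per second

    static void callDownstream(Runnable grpcCall) throws InterruptedException {
        IN_FLIGHT.acquire();       // protects us: bounds in-flight work and memory
        try {
            RATE.acquire();        // protects them: bounds the call frequency
            grpcCall.run();
        } finally {
            IN_FLIGHT.release();   // free the permit once the call completes
        }
    }
}
```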
Problem 6: OOM Strikes Back
So what happened? I added my rate limiter, I'm protecting my gRPC service, I'm calling it at only 3,000 requests per second. But remember that originally I was calling at 30K RPS, and this is what happened to my application: all the blue tasks started to accumulate in the classes upstream of the limiter, and I got another out of memory, of course. The solution was to introduce a semaphore where I create my big tasks, between my regions and my orders, that says, you can only submit 50 tasks at a time. That's enough to sustain the 3,000 RPS we want, so we don't fall below 3,000 and we don't keep too many tasks waiting. That's what I did first.
If we look at the solution now, it started with Regions, Order, Product, and gRPC. Now I have a semaphore in the middle set to 50, plus the 3,000 requests per second that I also have to control. This is how I feel talking about it to all of you right now, and this is how I actually felt explaining it in a meeting with my entire team: all the levers, all the controls, why I have all these thread pools arranged like that. A colleague of mine asked a very good question: "Hugo, can you make this simpler?" Interesting question, let me see. I took that challenge to heart and went to look at a non-blocking IO solution. Can I remove the rate limiter? Can I remove the semaphore? Can I do the entire thing as one chain, completely non-blocking, just using CompletableFutures? My solution starts at the top again: I go through my regions and start enqueuing work to my process-order step. Let's ignore the whenComplete for now.
Then I go to the next layer and enqueue the work towards my gRPC service, and in my product class is where I actually make the gRPC call. For that call I use the non-blocking client from gRPC, which returns me a future, and that future propagates all the way back up to the top, so it looks something like this. You can see that only at the top do I say, when my future is complete, start writing to the file. There at the top, at the regions level, is where I do my join; that is the only blocking call. It goes all the way down non-blocking and all the way back up. If you're asking, what's holding things back here, what's applying the backpressure? The backpressure is actually applied by the gRPC framework itself. Internally we have some semaphores and controls around gRPC, so I used those to my advantage and had a single lever controlling all my concurrency.
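A minimal, self-contained sketch of that non-blocking shape (the async "client" is simulated with a scheduler here; in the real code it was the non-blocking gRPC stub returning a future):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class NonBlockingChain {
    static final ScheduledExecutorService FAKE_IO = Executors.newScheduledThreadPool(2);

    // Stands in for the non-blocking gRPC call: returns immediately with a future.
    static CompletableFuture<String> resolveBatchAsync(int batch) {
        CompletableFuture<String> f = new CompletableFuture<>();
        FAKE_IO.schedule(() -> f.complete("batch-" + batch), 50, TimeUnit.MILLISECONDS);
        return f;
    }

    // One "order": fan out its batches and return a future that completes when all of them do.
    static CompletableFuture<Void> processOrder(int order) {
        List<CompletableFuture<String>> batches =
                List.of(resolveBatchAsync(order * 10), resolveBatchAsync(order * 10 + 1));
        return CompletableFuture.allOf(batches.toArray(CompletableFuture[]::new));
    }

    public static void main(String[] args) {
        // "Regions" level: enqueue everything without blocking; join only once, at the very top.
        CompletableFuture<?>[] orders = IntStream.range(0, 5)
                .mapToObj(NonBlockingChain::processOrder)
                .toArray(CompletableFuture[]::new);

        CompletableFuture.allOf(orders)
                .whenComplete((v, err) -> System.out.println("all done, write the output file"))
                .join();   // the single blocking call
        FAKE_IO.shutdown();
    }
}
```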
The result was actually pretty good. This entire thing ran on a single machine in something like 22 minutes, and I wasn't flooding the service. However, it has a problem. If you remember, my sequential code at the beginning was three or four lines; now I have all of this just to deal with the complexity of CompletableFutures and non-blocking IO. It becomes pretty complex and sometimes pretty hard to debug. These are my colleagues coming back to me again and saying, "I said simpler, not more complex". I had to go back to the drawing board.
Another solution I tried: what if, instead of semaphores, rate limiters, and all of that, I just use a bounded executor? A simple fixed thread pool with a limit on its queue; instead of a semaphore, I have a queue that only accepts a certain number of tasks. I wrote my own executor with a blocking queue capped at that queue size, and the highlighted line says: if the queue is full, block. It's basically doing the same thing the semaphore did, but it's embedded inside the executor. Now the code becomes a little simpler: you have your regions, you don't have all that extra machinery anymore, you process your orders.
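The talk does not show the executor itself, but one common way to build such a bounded executor is a ThreadPoolExecutor whose rejection handler blocks the submitting thread until the queue has room (a sketch; note the handler will also block briefly during shutdown):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutor extends ThreadPoolExecutor {
    public BoundedExecutor(int threads, int queueSize) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS,
              new ArrayBlockingQueue<>(queueSize),
              (task, executor) -> {
                  try {
                      // Queue is full: block the caller until there is room (built-in backpressure).
                      executor.getQueue().put(task);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      throw new RejectedExecutionException("Interrupted while waiting to enqueue", e);
                  }
              });
    }
}
```

It can then be passed anywhere a normal executor goes, for example new BoundedExecutor(8, 100) handed to CompletableFuture.supplyAsync.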
Finally, you can still see the @Async bounded executor right there. That solution works; it's actually great, but it could be better. How? We have a shiny new kid on the block that can simplify this code quite a lot and save us from managing pools and so on. I decided to give virtual threads a chance. With traditional platform threads you have a one-to-one mapping to OS threads; on top of them you can have a lot of virtual threads.
As virtual threads get blocked on IO, they get unmounted, and another virtual thread is mounted on the platform thread to do its work. It's pretty easy to enable them with Spring: you can just set spring.threads.virtual.enabled to true, or you can create your own executor by handing Spring a new virtual thread executor and injecting it. Now the code becomes something along these lines: I have my regions, I loop through them, and I use @Async, because now I'm using Spring's default executor, and since my default executor is now a virtual thread executor, I'm creating virtual threads here by default. I'm also creating virtual threads when I call my gRPC client, so I have two layers of execution there.
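Concretely, the two ways of enabling them look roughly like this (the property form assumes Spring Boot 3.2+):

```java
// application.yaml — makes Spring's default @Async (and web) executor use virtual threads:
//
//   spring:
//     threads:
//       virtual:
//         enabled: true

// Plain Java alternative: a virtual-thread-per-task executor, no pool to size or manage.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadsSketch {
    public static void main(String[] args) {
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            CompletableFuture<String> f =
                    CompletableFuture.supplyAsync(() -> "running on " + Thread.currentThread(), vts);
            System.out.println(f.join());   // prints something like VirtualThread[#NN]/runnable@...
        }
    }
}
```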
Problem 7: The Pinning Problem
However, again, more problems. The first one: the first time I tried to run this was last year, I think around July, and we were on Java 21. In Java 21, virtual threads have an issue that has become known as the pinning problem. Netflix has a good blog post about it; we found it blocking a lot of our operations. Very briefly, under certain conditions your virtual thread can get pinned to its carrier thread; it won't unmount. If that happens often enough, it can slow your application down, or, in my case, I actually saw it cause my application to completely stall; it just stopped doing work. This is a very quick piece of code that demonstrates it.
Every time a virtual thread reaches the synchronized block here, it gets pinned to its carrier thread. It's a good demonstration of the problem. The good news is that if you can update your JDK, you don't need to do anything to solve this; it has already been fixed for you in JDK 24. I saw that and got super excited this year, in March: JDK 24 is out, I'm going to try it. If you don't want to run 24, you can wait for 25 in September, which is going to be the LTS version.
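A quick reproduction along the lines of the slide: on JDK 21, blocking inside a synchronized block pins the virtual thread to its carrier (run with -Djdk.tracePinnedThreads=full to see the reports), so concurrency collapses to the number of carrier threads; on JDK 24 the same code runs with all tasks overlapping.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PinningDemo {
    public static void main(String[] args) {
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100; i++) {
                vts.submit(() -> {
                    Object lock = new Object();
                    synchronized (lock) {   // on JDK 21, blocking while holding a monitor pins the carrier thread
                        sleep(100);
                    }
                });
            }
        }   // close() waits for all the tasks to finish
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```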
Problem 8: Yet Another OOM
Then, when I tried again: another out of memory, of course. Why? If you remember how I got greedy with parallel streams, you might see a pattern here: I also got greedy with virtual threads. People told me they were lightweight, so I assumed they were magical and would solve all my problems. They were not. I created two nested levels of virtual threads: I'm creating them via @Async for the order task, and I'm also creating virtual threads for the gRPC task. Remember that it's one call per hundred products, so a lot of those gRPC tasks get created inside each order task, and I have 10,000 order tasks. That's a lot. You do not simply swap your platform threads for virtual threads and expect everything to work fine. So, back to my semaphore, to apply some backpressure and control the pace at which I create virtual threads. The code is pretty straightforward: just before the supplyAsync that I hand to the virtual thread executor, I acquire a semaphore permit.
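A sketch of that fix (the order processing itself is just a placeholder): acquire a permit before spawning each virtual-thread task and release it when the task completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class VirtualThreadBackpressure {
    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(50);   // at most 50 order tasks in flight

        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int order = 0; order < 10_000; order++) {
                permits.acquire();               // blocks the submitter once 50 tasks are running
                int o = order;
                CompletableFuture.runAsync(() -> processOrder(o), vts)
                                 .whenComplete((v, err) -> permits.release());
            }
        }   // close() waits for the remaining tasks
    }

    // Placeholder for the real order processing (parse, batch, call gRPC, ...).
    static void processOrder(int order) {
        try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```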
Results
What were the results of this adventure? First, the bounded executor, which I was actually running in prod: that solution was somewhat complex, but it worked really well. I have three nodes, each running my service, at about 12 minutes of execution time. What about the semaphore-with-virtual-threads solution? Essentially no change: it also runs in about 12 minutes. It still has its complexities; it's almost a mirror image of the other one. One uses a bounded executor with a queue, the other uses a semaphore to control the flow. In terms of pressure on the downstream service, with the bounded executor I still see peaks at 30K, 34K RPS, but it's just an initial spike; after that it stabilizes around 5K RPS. With virtual threads it actually got pretty good. I don't know exactly what happens under the hood that makes the peak lower, but I saw a peak of 16K RPS to my downstream service.
Then I asked the question again: can I make this simpler? Looking at my metrics, I noticed that the one right there is the number of in-flight requests to my gRPC service. The max is 70, and the low end is around 40. I figured that if I keep about 50 requests in flight at a time, I'll get very steady traffic to my downstream service. How can I get 50 in-flight requests with virtual threads and keep my code even simpler? The solution was to remove the inner virtual thread from my tasks and run each order task, the whole block of code inside the virtual thread, sequentially. It makes each of its gRPC calls one after the other, but there are a lot of other order tasks running in parallel. How many? I want 50, so it's easy: I just put 50 in my semaphore. Comparing my two virtual-thread solutions: the first one has more levers for controlling concurrency, but because I don't control how many products each order has, I don't have very fine control there.
With this one, I actually have pretty fine control. The first version has two levers, at the order level and at the gRPC level; this one only has the lever at the order level, the semaphore size. The gRPC-level control is just the internal one; there is an internal semaphore there as well, but because the numbers match, I don't have to worry about it. The first version is definitely better than my non-blocking IO version, it's less complex, but this one is even simpler. And look at the numbers: even after removing the inner virtual thread execution, the execution time is the same, actually faster, maybe by a minute or so. As for pressure on my downstream service, the peak went down from 16K RPS to 13K RPS while keeping the execution time the same. Now I can finally agree with my colleague: I have a good working solution that I can deploy.
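A sketch of that final shape (placeholders again): one virtual thread per order, the roughly 50 batch calls inside each order running sequentially, and a single semaphore of 50 as the only lever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class FinalShape {
    public static void main(String[] args) throws InterruptedException {
        Semaphore inFlightOrders = new Semaphore(50);   // ~50 orders in flight => ~50 in-flight gRPC calls

        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int order = 0; order < 1_000; order++) {
                inFlightOrders.acquire();
                int o = order;
                CompletableFuture.runAsync(() -> {
                    // Inside one virtual thread: this order's ~50 batch calls run sequentially.
                    for (int batch = 0; batch < 50; batch++) {
                        resolveBatch(o, batch);   // blocking call; the virtual thread simply unmounts while waiting
                    }
                }, vts).whenComplete((v, err) -> inFlightOrders.release());
            }
        }
    }

    // Placeholder for the blocking gRPC batch call.
    static void resolveBatch(int order, int batch) {
        try { Thread.sleep(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```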
Key Takeaways
What are the key takeaways from this talk? The first is that you have to experiment and measure your solutions. You can't just throw parallelism and concurrency at a problem and hope and pray that it works and solves everything. You have to measure and do some benchmarking. Sometimes you don't even need load tests; just being able to run two versions of your application side by side and compare the metrics is enough, and that's actually what I did. The other thing, and this was a key learning for me, came from a conversation with a performance engineer at Netflix, who told me: look at what you want to protect.
Following the pattern that ran through this whole talk, I wanted to protect my dependency's service first; I don't want to flood them. I work backwards from that. What is the metric I want to keep steady and stable? In my case it was the number of in-flight connections and my RPS. After that, I work backwards to see how much I can extract from my own application without blowing up my memory, my CPU, or other resources. The last thing is that concurrency is hard.
If you can avoid doing this by hand and rely on your tools instead, and plus one to virtual threads coming along to help us stop juggling all these thread pools, I think that's for the best. Sometimes you will have to get your hands dirty. If you do, try to keep the solution as simple as possible, because it's easier to operate, easier to get your head around, and easier to explain to your teammates. Those were my learnings. All the references and material I researched are behind this QR code: a lot of good talks about performance, threads, and everything in between.
Questions and Answers
Participant 1: It seems like your application was mostly CPU bound and IO bound. Did you encounter memory-bound issues up front, or did memory only become a problem along the way? Because sometimes memory is quite constraining.
Hugo Marques: At some point I thought memory was going to be a problem, and I wondered, what if I increase the memory just a little bit? Maybe I'd get the performance I needed. It turned out that no matter how much memory I gave the virtual machine, it was not enough. My application was not memory bound at all. As soon as I got control over the flow of requests and how I was using my CPU, memory usage came down and stayed stable; I never went above 50% usage. I was actually able to reduce the memory on that virtual machine.
If I had to approach a memory-bound problem, I would do the same thing, work backwards. Now I have to protect this memory: how many tasks can I run on my machine and still fit in it? Instead of looking at the dependency I'm calling, I'm now looking at how to protect my machine from running out of memory. You set that limit, you know how many tasks you can run, and that becomes the resource you want to protect.
Participant 2: Each iteration, did you test it in production or was it just staging?
Hugo Marques: Some of the early ones that ran out of memory were actually in production, before I dialed up the experiments to start producing data for real. It was running on my production machines, consuming production data, but not serving data; nothing depended on me yet. Later, when I was experimenting with async IO and with virtual threads, those were not in production, but I was able to clone production: I had a shadow cluster running the same type of machines as production and consuming the same data, but writing its output somewhere with no dependencies. When I show the graphs comparing the bounded executor versus virtual threads, that's exactly what I had: one of them, the bounded one, is actually production.
The other one is shadow production. I don't have my virtual thread version deployed yet, because I thought, I'm going to a conference, better not deploy this. I have this ability to run these clusters side by side and compare the metrics, and that was pretty helpful.
Sean: In this test, you had a fairly constant number of requests going from one service to another. Did you ever look at what would happen at scale if that were to change and explode? At any point, did you consider distributing the load across multiple systems or multiple pieces of hardware?
Hugo Marques: Yes, actually. Maybe it's not clear on this graph, but you see these three lines: those are three machines. At some point I noticed that one machine was not going to be enough to process what I needed, so I designed my application so that I could shard the load across three machines. I designed the code in such a way that if I have more traffic, if I have to run more things, the only thing I need to do is add one more node to my cluster, and I get the scalability I need. That's another thing you have to think about: how can I scale my cluster horizontally, so I can keep adding nodes and scale my application not only at the node level, but at the cluster level.
Po: Did you ever have to take into account retries if the backend service starts erroring out or failing?
Hugo Marques: Yes, I did. There are some retries; there's another graph, not shown here, with the number of retries happening. I take that into account as a signal: if there are a lot of retries, maybe I am saturating the other service. That was one of the signals I was watching. In one of the iterations, at 30K RPS, I was getting a lot of retries, and also a lot of deadline-exceeded errors, because the service was basically dying, so I took that into account. The other thing is errors: if I retry and it still fails, for this particular job that's fine. The job can fail, and the next iteration will pick it up; I accept failure as part of this design. For retries specifically, you do have to be careful; this is a learning from my AWS days. You have to make sure the retries themselves don't end up flooding your downstream service.