Transcript
Startin: My talk is about the performance complexity curve. What is a performance complexity curve? It comes from the performance work phase diagram by Aleksey Shipilev. It’s an observation he made on how performance and complexity change as we engage in performance work. It’s quite common, before we do any kind of performance work, that we actually have a low-performance and high-complexity system. This often comes from the mantra, make it work, make it right, make it fast, in that order. What we do, in a lot of systems, at the application layer, is we’ll incur technical debt to meet deadlines, maybe to get an MVP out.
Then, later, people will start using the app, and we’ll need to improve the performance of it. The first phase of performance tuning is usually characterized by bug fixing. These are simplifying changes. Later on, if we still need to tune the system more, we might actually have to trade some complexity for performance. This is something that’s always really resonated with me, because it’s how I’ve seen things play out many times. What we can see, across the different phases, is that we get these really big gains at the beginning, as we move towards the simplest possible point. If you hear about 10x gains in performance, it’s probably that nobody’s tried to optimize the system before.
Later on, it gets much harder. We have to incur complexity for diminishing returns as we improve the performance. I split this into three main phases. We have the low-hanging fruit, which is from the initial state to the simplest possible point we can get to. We make lots of performance gains in this stage. After that, we might get into tuning. After that, there’s the competitive advantage stage, which is a stage we shouldn’t really go into unless we’re in real-time competition or we want to reduce costs to improve margins. Because at that point, we’re taking on a lot of complexity for marginal gains. We can use it as a framework for thinking, to compare problems.
If we have a problem which gives us more performance for the same level of complexity, then that’s a nicer problem. If we can simplify it further, then we’ve got an even nicer problem, because we should usually prefer simplicity in the application there. It also informs the tools we might use. Early on, when we’re doing performance optimization, we might want to use metrics. Something I work on, which is continuous profiling, can help you find bottlenecks. Later on, we might change tack and move into microbenchmarks, perf counters, instruction profiles, as we try to improve the code.
Who am I? My name is Richard Startin. I’m a software engineer at Datadog. I work on continuous profiling. I will talk about continuous profiling here, but we’re not the only vendor. It’s not a sales pitch. I’m splitting the rest of the talk into three phases based on the low-hanging fruit, the tuning, and the competitive advantage stage that I’ve shown you before. We’ll start in the green zone, where we’ll look at how we might start a performance tuning exercise, and some of the kinds of low-hanging fruit that we might encounter.
Decide on an Objective
Before you do anything, if you want to optimize, you need an objective. You need to really know what you’re trying to do. It could be reduced latency, or increased throughput, or maybe reduced cost. These are activities which might actually trade off against each other. You could imagine, to improve latency, you might take some processing off the critical path onto a background thread. That might cost you some throughput if you have more context switches, and you have the concurrency control that you need to take care of. It may even increase cost. You really need to know what you’re going for, what your objective is.
Most of the focus of this talk will be in terms of cost reduction, because that’s quite generic. The ultimate aim is to reduce hardware, so we want to focus on what’s using memory or CPU, so we can use less of it. For the sake of simplicity of the talk, I’m focusing on CPU bottlenecks, because this is a fairly ubiquitous problem. It applies nicely to whatever your objective is. I’m going to talk about continuous profiling later, but it’s a good idea to start with metrics, so you actually know that you have a CPU utilization problem before you start looking at profiles. Otherwise, going bottom-up, it can be too much information, and you’re not even sure you’re going to get anywhere useful.
Continuous Profiling
Something dear to my heart is continuous profiling. It’s really useful if you’re always running a profiler, whether you know that you have a problem to analyze or not, because when an unexpected problem comes about, then you have the profiles. You can just go and look at them, and you’ll get a breakdown of what’s using the resource that you’re concerned with. This is only actually possible because sampling CPU profilers are very low overhead. What they do is continuously sample threads in proportion to CPU time. The Linux kernel has tools which allow you to generate a signal whenever a period of CPU time, say 10 milliseconds, has passed, and then take a sample.
If you take a sample every 10 milliseconds of CPU time, and it takes less than 100 microseconds to record a sample, then you’re staying under 1% CPU overhead. It’s possible for a lot of applications to just profile, or at least do CPU profiling, continuously in production. Once you’ve profiled your entire fleet, you can aggregate all of the profiles together, and you can look at fleet-level bottlenecks. There are numerous technical options. There are two that I know more about through my work: JFR and async-profiler. These are JVM profilers, which run in-process. They have their merits and their drawbacks.
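To make that concrete, here is a minimal sketch of what an always-on recording could look like using the JFR API. It’s an illustration only: a real continuous profiler also rotates, uploads, and aggregates recordings, and, as discussed next, JFR’s execution samples are not scheduled against CPU time.

import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

// Minimal sketch of an always-on, low-overhead recording with the JFR API.
public class AlwaysOnProfiling {
    public static void main(String[] args) throws Exception {
        Recording recording = new Recording();
        // sample running threads roughly every 10 milliseconds
        recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
        recording.setMaxAge(Duration.ofMinutes(5)); // keep a rolling window
        recording.setToDisk(true);
        recording.start();
        // ... the application runs; dump the window whenever something needs investigating
        recording.dump(Path.of("profile.jfr"));
        recording.close();
    }
}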
JFR, for example, it doesn’t actually have a true CPU profiler. It’s got execution sample events. They’re not scheduled in response to CPU time. It doesn’t actually report any of its errors. When you’re taking profile samples, there can be a lot of error cases, and it’s important not to destroy the statistics of the dataset by misreporting errors. Async-profiler, which is written by Andrei Pangin at AWS, solves an awful lot of problems. It’s got two true CPU profilers. It uses perf events, and it has a timer-based solution. It has very rigorous error reporting.
On the downside, it’s using undocumented JVM internals, unsupported APIs, which in principle could be taken away at some point in the future. Those are the only two in-process JVM profilers which have low enough overhead to use in production. There’s eBPF and perf, which have their own pros and cons.
Low-Hanging Fruits
In this first phase of optimization, once we’ve got a profiling tool that can show us where problems are, we want to focus on low-hanging fruit. These are generally problems that are essentially bugs. I’m going to start with something that I call non-functional bugs. This is code which produces the correct result, but goes about it in the wrong way, in such a way that the error can only be detected by performance measurement. I’m a committer on Apache Pinot, which is a distributed OLAP database. It originated at LinkedIn. I used to work on Pinot when I was working at StarTree.
At some point, I wanted to speed up our test suite, because it took quite a long time, so I profiled it. I profiled the test suite rather than Pinot in production. I noticed that the profile was dominated by this library called Apache Helix, which is used by Pinot’s controller to schedule work to other participants in the Pinot cluster. Here we can see the scheduling of jobs to something called the minions, which are basically workers which do work on behalf of the rest of the cluster. The profile shows that most of this controller’s CPU time is spent scheduling this job list, which seems a bit unusual. The problem here is quite a common one in libraries. It’s an expectations mismatch on either side of a module boundary.
We can see at the top, this is on the Pinot side, it’s setting a parallelism level, and it’s saying, give me all you’ve got, because it sets Integer.MAX_VALUE. You can’t have more than that. On the other side, it’s assuming that there will be many more jobs than the level of parallelism. The loop termination condition only checks for the level of parallelism, and it will just keep on looping round until it gets to Integer.MAX_VALUE. That can easily take 20 seconds of a busy spin on CPU. It’s not a very difficult problem to fix, because all you have to do is check that all of the jobs have been consumed.
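The shape of the mismatch, as a hypothetical sketch rather than the actual Helix code, looks something like this:

import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the shape of the bug, not the actual Helix source.
public class SchedulingLoop {

    // Buggy shape: assumes there are always more jobs than the parallelism level,
    // so with parallelism == Integer.MAX_VALUE it keeps spinning on an empty queue.
    static int scheduleBuggy(Queue<Runnable> jobs, int parallelism) {
        int scheduled = 0;
        while (scheduled < parallelism) {        // only checks the parallelism bound
            Runnable job = jobs.poll();
            if (job != null) {
                job.run();
                scheduled++;
            }
        }
        return scheduled;
    }

    // Fixed shape: also terminates once all of the jobs have been consumed.
    static int scheduleFixed(Queue<Runnable> jobs, int parallelism) {
        int scheduled = 0;
        while (scheduled < parallelism && !jobs.isEmpty()) {
            Runnable job = jobs.poll();
            if (job != null) {
                job.run();
                scheduled++;
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        Queue<Runnable> jobs = new ArrayDeque<>();
        jobs.add(() -> System.out.println("job 1"));
        jobs.add(() -> System.out.println("job 2"));
        System.out.println(scheduleFixed(jobs, Integer.MAX_VALUE) + " jobs scheduled");
    }
}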
LinkedIn upgraded the Helix library after this fix had been made. They actually published a blog post about it, which is really great, because you can really dig into the details of the outcomes of what happened here. This led to a massive reduction in processing latency for jobs, by 7x. You can figure out whether you have a performance problem with things like back-of-the-envelope calculations, where you work out how long a task should reasonably take and how much resource it should use. It’s very difficult when you have the abstraction of libraries in the way. If the libraries are doing things for you that you don’t know how to do yourself, you don’t understand the problem. Abstraction can make it hard, whereas a profiler will just show you that there’s a problem, and you can go and fix it. It’s normally quite easy.
There are also poor programming practices. This is something that I dug out of a profile at some point. It’s basically programming by exception. To convert a Double to a Long, if the mantissa is zero, that is if it’s an integral value, it constructs a BigDecimal, which is expensive, and then gets the exact Long value, which throws an exception if the value isn’t integral. Then it catches the exception and returns accordingly. It’s obvious that this can be done better by doing some arithmetic. If you compare the two, it’s like a 150x improvement. These are just the kinds of problems that work their way into application code when you’re bootstrapping a product.
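Reconstructed from memory rather than copied from the profile, the pattern and its arithmetic replacement look roughly like this:

import java.math.BigDecimal;

public class DoubleToLong {

    // Reconstruction of the programming-by-exception pattern: decide whether a double
    // is an integral value by letting BigDecimal.longValueExact() throw if it isn't.
    static boolean isIntegralSlow(double value) {
        try {
            new BigDecimal(value).longValueExact(); // allocates, and throws on e.g. 1.5
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    // The same check with plain arithmetic (ignoring the extreme edge of the long range):
    // no allocation, no exception.
    static boolean isIntegralFast(double value) {
        return value == Math.floor(value)
                && value >= Long.MIN_VALUE && value <= Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isIntegralSlow(42.0) + " " + isIntegralFast(42.0)); // true true
        System.out.println(isIntegralSlow(1.5) + " " + isIntegralFast(1.5));   // false false
    }
}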
There are also algorithmic issues. This is an interesting one from Datadog. We can get algorithmic issues for a number of reasons, mostly because of abstraction. You can get accidentally high algorithmic complexity because of composition in ways that weren’t expected. Or you can have cases that you expected to be rare. In fact, you might have a comment saying, this never happens, but this is a fallback, and you don’t put too much effort into implementing the most optimal fallback. Then it turns out in production that that fallback is not so rare. This actually came up at Datadog in one of our backend processors, which processes JFR files from our customers. We used the JMC JFR parser to parse the JFR files. All of a sudden, they started timing out, but only for one customer.
Fortunately, we profiled that service, which is doing the parsing, so it’s like profiling inception. We quickly identified that all of the time was being spent in this method, DisjointBuilder.add. This is just a quick overview so we can understand the problem. This is trying to turn events which have a start and an end timestamp into lanes, so that the intervals within a lane are disjoint. If none of the events overlap, then we only have one lane. That was expected to be the common case. If all of the events overlap, then we have lots of lanes. To maintain this, the code was basically saying, I don’t expect it ever to have more than one lane.
If that ever happens, to maintain an invariant of sort order, I’ll just sort the entire set of lanes over again. Because of a bug in the JDK, where some tracing around file write events was recording the end timestamp rather than the duration as the duration, we would always get events which overlap if you were using that file API. We’d be taking these events in, and the broken assumption would be quite catastrophic: we’d end up with a cubic time algorithm. It’s very easy to fix this just by finding the insertion point. We still don’t get a really optimal solution, and it would be very difficult to change everything else, but this was enough to fix the problem here. Just something unexpected happened in production, basically.
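A simplified sketch of the shape of the fix follows; the real DisjointBuilder in JMC tracks the intervals in each lane, but the bookkeeping change is the same idea.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch of the lane bookkeeping, not the real JMC DisjointBuilder.
public class Lanes {
    // each lane is represented here only by the end timestamp of its last interval
    private final List<Long> laneEnds = new ArrayList<>();

    // Original approach: append, then re-sort the whole lane list to restore the invariant.
    // Fine if a second lane "never happens", painful when every event needs a new lane.
    void addBySorting(long newEnd) {
        laneEnds.add(newEnd);
        Collections.sort(laneEnds);
    }

    // The fix: binary-search the insertion point and insert in place.
    void addAtInsertionPoint(long newEnd) {
        int idx = Collections.binarySearch(laneEnds, newEnd);
        if (idx < 0) {
            idx = -(idx + 1); // binarySearch encodes the insertion point as -(point) - 1
        }
        laneEnds.add(idx, newEnd);
    }
}

Inserting into an ArrayList is still linear per event, which is why this still isn’t optimal, but it removes the re-sort on every add.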
There’s also going against the grain. There are lots of existing optimizations in frameworks which target idiomatic code. If you don’t write idiomatic code, then you won’t profit from these optimizations. A common one is composite lookups. You have a hash map keyed by two or more values. You could concatenate the values together and look them up in the map, or you could construct a record. If we compare the performance here, there’s basically a 3x to 4x performance difference. I called it stringly-typed and typesafe because there are prefix and suffix overlap bugs in the string version, which you just don’t need to think about with the typesafe one. You’re getting better performance, and you’re avoiding bugs.
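With hypothetical field names, the two variants look something like this:

import java.util.Map;

public class CompositeLookup {
    // Stringly-typed: allocates and hashes a brand new string on every lookup,
    // and ("ab", "c") produces the same key as ("a", "bc").
    static String stringlyTyped(Map<String, String> map, String tenant, String metric) {
        return map.get(tenant + metric);
    }

    // Typesafe: a record gives you equals and hashCode for free, with no overlap bugs,
    // and its hashCode reuses the cached hash codes of the two strings.
    record Key(String tenant, String metric) {}

    static String typesafe(Map<Key, String> map, String tenant, String metric) {
        return map.get(new Key(tenant, metric));
    }
}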
Ultimately, you’re just writing objectively better code. Why not just find all of the bad code in your codebase with static analysis? There might be quite a lot of it. Some of it might never run. Some of it might not run very much. The benefit of having a profiler is that you know it’s worth fixing if you see it in the profile because it definitely happened that way. People might have different thresholds on when to act here. Normally, these kinds of optimizations, they’re just win-win.
All of this assumes that the profiling data is accurate enough to act on. With profiling, you think that you’re just interrupting a thread and collecting a stack trace, and that should be quite simple. It’s actually not, because there are so many different kinds of code running on a JVM stack. We have code running in the interpreter, which hasn’t been compiled yet. Then it gets JIT compiled if it gets hot. We have generated native code, runtime stubs generated at runtime by the JVM’s C++ code based on the platform capabilities, which is slightly different to JIT compiled code because it didn’t come from bytecode in the first place.
Then, finally, we have native code that we might call through JNI, and that’s dangerous as well. There are complications that I won’t go into now in unwinding Java stacks outside of a safepoint, because some of the code in the JVM to do this just never expected to be called outside of a safepoint. It’s not always safe or possible to get a sample, and this can lead to errors. Let’s compare the stringly-typed lookup profiles taken by JFR, where we get these figures, notably 27% in String#hashCode, and 8% of samples in String#getBytes. Async-profiler reports something completely different.
It reports much more time in String#getBytes and a little less time in the hashCode method. This is because of this red frame here, which is a runtime stub: jbyte_disjoint_arraycopy. This is code which is generated at runtime. JFR can’t unwind these, because the unwinding code never expected to encounter them, since they can’t be encountered at a safepoint. Whereas async-profiler knows about them. It pops this frame, and then it gets into the Java code, where it can unwind from.
If you have two profilers, you’re not really sure where your bottlenecks are, because they report completely different things. They do agree on the typesafe lookup. We get some benefits here. We’re not constructing a new string. We’re not concatenating strings. Also, interestingly, there’s next to no time in String#hashCode. JFR does agree. It’s not like JFR’s a hopeless case. There are specific cases where it goes wrong, and it doesn’t tell you that it’s gone wrong. It can agree with async-profiler.
This is important for the rest of the talk, because we’re going to change tack as we get into the different phases of the curve. The important thing here is that a string caches its hash code. If you write code which hashes strings in a way that works with this caching, reusing the same instances, then you don’t need to recompute the hash code. If you don’t, then you’ll be computing the hash code over again, and that can get expensive.
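Paraphrased rather than copied from the JDK sources, the caching pattern looks roughly like this:

// Simplified sketch of the caching pattern String uses: the hash is computed at
// most once per instance and stored in a field (equals is omitted for brevity).
public final class CachedHashKey {
    private final char[] value;
    private int hash; // 0 means "not computed yet"

    CachedHashKey(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            for (char c : value) {
                h = 31 * h + c; // the same polynomial as String#hashCode
            }
            hash = h; // cache it, so subsequent calls are free
        }
        return h;
    }
}

Reusing the same String instances as keys means the loop runs at most once per string, whereas building a fresh string for every lookup pays for it every time.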
Tuning
We’re moving into the next zone in the talk now. This is the amber zone. This is tuning. Some of the stuff I’ll talk about here I don’t necessarily suggest that people try. It’s not necessarily helpful. In some cases, it’ll make a difference. What do we do at Datadog? We ingest a lot of observability data. A lot of our backend is written in Go, but we also have a lot of Java services. They’re large enough to be worth optimizing for the sake of cost, and also meeting SLAs. When you profile your entire fleet, you can do things like take all of the data, aggregate the data together, convert it into cores, and then multiply the number of cores in a unit time by the cost from your cloud provider in that unit of time.
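As a back-of-the-envelope sketch, with made-up numbers rather than Datadog’s real prices:

// Back-of-the-envelope attribution of cost to a single method, with made-up numbers.
public class MethodCost {
    public static void main(String[] args) {
        double cpuSecondsPerHour = 18_000;            // CPU time attributed to the method, fleet-wide, per hour
        double coresUsed = cpuSecondsPerHour / 3_600; // 3,600 CPU-seconds per hour is one core
        double dollarsPerCoreHour = 0.05;             // hypothetical cloud price
        double dollarsPerYear = coresUsed * dollarsPerCoreHour * 24 * 365;
        System.out.printf("%.1f cores, $%.0f per year%n", coresUsed, dollarsPerYear);
        // prints: 5.0 cores, $2190 per year for this one method
    }
}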
That puts a dollar figure on individual methods: how much you’re paying for them. It turns out that at Datadog, for some reason, we spend an awful lot of time in String#hashCode. This turned out to be the most expensive method. What we could do is find all of the uses of it and optimize them individually, but it’s death by a thousand cuts. If we went and found them all and optimized them to do something different, there’d be more of them in two weeks’ time. We actually have to optimize this method itself. We found this hot method. What can we actually do? We can stare really hard at the code, but the problem’s not actually jumping out at me just looking at it. What we could do is relax the requirements.
In the context of String#hashCode, we might want to compute the hash code in a completely different way, produce a different result. This is like moving to have a nicer problem. This is going to have a dual effect, because a lot of people in the Java community know that String#hashCode is not particularly good in terms of randomness. Why are we computing hash codes in the first place? It’s because we’re using hash maps, and so we don’t want collisions in the hash maps. If we had a better algorithm which produced a more random result, then we could solve two problems.
We’d have a faster algorithm to compute, and we’d have fewer collisions in the hash maps, which would have secondary benefits. The problem with changing the algorithm is that it’s specified in the JLS, but it’s hard to believe that this is really an immovable obstacle, because it was originally specified incorrectly in the ’90s. The specification was reverse engineered from the implementation, but it was reverse engineered incorrectly, so they changed the specification to make it correct. Then it was noticed that the original hash code algorithm only depended on the first 16 characters of the string, which trivially produces a lot of collisions.
For the sake of reducing the number of collisions, the algorithm was changed. I think this was resolved in 1999, so a long time ago now. The real reason this can’t be changed now is that Java 7 introduced switch statements on strings. This is a switch expression, which is a more modern way to write Java code, but it’s fundamentally very similar to the switch statement that arrived back in Java 7. If we actually look at the bytecode for how this is compiled, we’ll see that the invokevirtual up there is on String#hashCode, and then we basically have a lookupswitch on the hash code values.
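For example, a small switch over hypothetical values like this ends up with the case labels’ hash codes baked into the class file:

public class StringSwitch {
    // javac compiles the selection below as invokevirtual String.hashCode(), then a
    // lookupswitch over the hash codes of "trace", "profile" and "log", then
    // String.equals calls to guard against collisions.
    static int shardFor(String product) {
        return switch (product) { // hypothetical example values
            case "trace" -> 1;
            case "profile" -> 2;
            case "log" -> 3;
            default -> 0;
        };
    }
}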
The output of the function, as of Java 17, is hardcoded into class files. We don’t know how many of those there could be in the last 13 years or whatever it’s been since Java 7. To maintain backward compatibility, we just can’t change the algorithm. We can’t avoid solving this problem.
We have to change tack. We can’t really stare at the code. The line number information we might get from a profiler isn’t going to be very revealing because the code is so simple. We can produce a microbenchmark with JMH. What this microbenchmark here is doing is creating random byte arrays of parameterized range of sizes. Try not to choose multiples of eight, because that’s going to give the hardware an easy time. Try to choose something like seven. It’s constructing a new string to make sure that the hash code doesn’t get cached.
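A sketch of that kind of benchmark, not the exact one from the talk, might look like this:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

// Sketch of the benchmark described above, not the exact one from the talk.
@State(Scope.Benchmark)
public class StringHashCodeBenchmark {

    @Param({"7", "15", "31", "255", "1023"}) // deliberately not multiples of eight
    int length;

    byte[] bytes;

    @Setup(Level.Trial)
    public void setup() {
        bytes = new byte[length];
        ThreadLocalRandom.current().nextBytes(bytes);
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] &= 0x7F; // keep the bytes in the ASCII range for simplicity
        }
    }

    @Benchmark
    public int hashCode31() {
        // construct a new String each time so the cached hash field is never reused
        return new String(bytes, StandardCharsets.ISO_8859_1).hashCode();
    }

    @Benchmark
    public int baseline() {
        // the same allocation and copy without the hash, to account for the setup cost
        return new String(bytes, StandardCharsets.ISO_8859_1).length();
    }
}

Running it with JMH’s perfnorm and perfasm profilers attached is what produces the counters and disassembly discussed next.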
If we compare this to anything, then to have a fair test we’re going to have to do these same operations, so we have the same baseline cost. JMH gives you all of these microprofilers, very useful tooling written by Aleksey Shipilev, to help you explain the benchmark results, so you’re not just saying A is faster than B. You have things like perf counters from the perfnorm profiler, which basically normalizes the perf counters to the benchmark invocations. Then it will give you something like IPC, which is instructions per cycle. This is a good measure of how efficient the code is. It’s essentially dividing the number of instructions per invocation by the number of cycles per invocation.
We can see we’re starting at 4, and as the string gets longer, we’re going down under 2, which is not particularly good. There’s another profiler for explaining benchmark results in a different way, which is the perfasm profiler. That works by loading hsdis, which is the HotSpot disassembler, putting it in a location where the JVM can load it. That allows the JVM to disassemble code blobs wherever they land. You can build hsdis with binutils or Capstone or whatever disassembler you like. Then it goes further than that, and it links the program counters, which is what perf samples, with the JIT compiled code blobs.
It can produce a histogram like this, which is very useful because you can check whether what you think should be running is running. Here we can see that 80% is in the benchmark stub. A total of 8% is in constructing strings and copying the strings, which we know is coming from the benchmark stub. Then there’s another interesting thing, which is that, on my system, I haven’t set up debug symbols for the kernel. It’s saying that 9% of the time is spent in the kernel, and we don’t know where. That’s a really important thing to check before you interpret the data you get from here. It produces this output. It’s worth getting to understand this before going any further, because it’s important for understanding the rest of the talk.
On the left-hand side, you have percentages. The scary-looking addresses in the middle are program counters. The percentages are how many times, out of all of the program counter samples, that program counter was sampled by perf. Then we have the instructions on the right-hand side. We have an instruction profile. It’s very hard to interpret the percentages on the left-hand side, because there are lots of confounding factors. There are things like skid, where an instruction a few positions later gets blamed, so you have the wrong suspect. We have things like pipelining, but there’s only one instruction pointer. If you have multiple instructions in flight, which one do you blame? It does give you very useful output, which can help you to understand problems.
This is the output for String#hashCode, which should have a lot of multiplications in it, but we don’t see any. If you don’t read x86 assembly, you can see there’s an add, mov, shl, sub, but there’s no multiplication. This is actually intentional. Compilers do something called strength reduction, which is an optimization the compiler can apply when it has information about constants. In this particular case, we know that multiplying by 31 is the same as multiplying by 32 and then subtracting the original value. Multiplying by 32 is nice, because the compiler can replace that multiplication with a left shift. The value of 31 was chosen back in the ’90s to enable this optimization.
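In Java terms, the strength-reduced loop body is roughly equivalent to this sketch:

public class StrengthReduction {
    // 31 * h is strength-reduced to a shift and a subtract, because 31 * h == (h << 5) - h.
    static int hash(byte[] bytes) {
        int h = 0;
        for (byte b : bytes) {
            h = (h << 5) - h + (b & 0xFF); // same result as h = 31 * h + (b & 0xFF)
        }
        // every iteration needs the previous value of h, so the shift, subtract and add
        // per byte form one serial dependency chain, which is what the disassembly shows
        return h;
    }
}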
If we look at the structure of the code, we can see that this code has been unrolled. We can see the same code over again. In between each of these blocks, we don’t have any control flow, so we’re not checking the loop conditions. That’s good. That’s saving some overhead per byte. Each of these blocks is processing 1 byte. Unfortunately, we have what’s known as a dependency chain. First, we start by bumping the loop induction variable. Then we load a load of data, four at a time. These instructions don’t block each other. In principle, they can happen at the same time. Then we have this long sorry chain of instructions which can only happen one at a time. That makes us sad.
It may or may not seem like a problem to you, because the program’s sequential, and the execution is also sequential, so where were we expecting to get parallelism from? The instruction latencies are very low. It’s worth looking at how instructions are actually executed. First of all, they’re decoded into these things called micro-ops, or uOps, by the frontend. The uOps are then executed out of order and scheduled across a number of ports. Instructions have affinity with ports, so certain instructions can only execute on certain ports. The execution engine is designed to exploit any non-sequentialism in the program, and so it can do things out of order.
It’s designed to execute out of order, and this works so long as you don’t have a data dependency. We can think of IPC, which I showed you earlier, as almost like a utilization metric for the execution engine. If we get a lot of instruction-level parallelism, IPC is high, which means we’re executing lots of instructions over the execution ports. There’s a great simulator on uops.info where you can just load some assembly code, and it shows you basically a Gantt chart waterfall for how the code is actually executing. Here we can see this long diagonal waterfall of green retired blocks, and that’s something that we’d like to avoid.
The next thing to do, to think about how to improve the performance of the hash code, is to look at the latency of that chain of instructions for each byte, and go back to the Pentium, when the optimization of choosing 31 as the multiplier was applied. Back in those days, multiplying integers was really expensive. It cost nine cycles, whereas the add, mov, shl, sub sequence that we’ve seen in the disassembly only cost four. This was actually a pretty decent optimization back in the day.
Nowadays, multiplication is much faster, and there’s nothing between them. I got these numbers from Agner Fog’s instruction tables, which are really useful. I don’t actually have a Pentium to measure this on; that’s where I got those numbers from. If we want to optimize this, then we need to replace total order with partial order. The approach we’ll take is to set up a recurrence relation, and we’re going to substitute the last value into the next iteration, for each iteration. We’re substituting this in, and then keeping track of the multiples of 32 and the subtractions gets really unwieldy.
If we just cave in and multiply by 31 instead, we start to get this polynomial expression, and it gets very simple. If we go all the way down to the base case, we can see this is just a polynomial, so it’s quite easy to deal with. This leads to what I call the crazy unroll. This code looks pretty ugly, but there’s something to it. I don’t suggest people do this, but people tend to assume in high-level languages that it’s not possible to outdo the JIT compiler, that if you have two programs which produce the same output, the compiler will somehow magically optimize them both the same way. By understanding the gaps in the JIT compiler, you can find ways to beat it.
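A sketch of the idea, unrolled four ways here for brevity rather than reproducing the exact code from the talk: expanding the recurrence h = 31 * h + b[i] four steps at a time gives h' = 31^4 * h + 31^3 * b[i] + 31^2 * b[i+1] + 31 * b[i+2] + b[i+3], where the powers of 31 are constants and the four byte terms no longer depend on each other.

public class UnrolledHash {
    static int hash(byte[] b) {
        int h = 0;
        int i = 0;
        for (; i < (b.length & 3); i++) { // deal with the remainder first
            h = 31 * h + (b[i] & 0xFF);
        }
        for (; i < b.length; i += 4) { // then the unrolled loop
            h = 31 * 31 * 31 * 31 * h // javac folds these constants: 923521, 29791, 961
              + 31 * 31 * 31 * (b[i] & 0xFF)
              + 31 * 31 * (b[i + 1] & 0xFF)
              + 31 * (b[i + 2] & 0xFF)
              + (b[i + 3] & 0xFF);
        }
        return h;
    }
}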
I wouldn’t bother most of the time, but this was a case where we could do this. First of all, we need to deal with the remainder. Then we have this big unrolled loop which does all of these multiplications, and it’s basically looking ahead in the loop to pre-compute some of the multiplications. We’re not doing lots of extra multiplications because we have constant folding, and so we can see that those powers of 31 are pre-computed and loaded into the bytecode, so we don’t need to worry about that. If we actually measure the performance of this, we’ve improved IPC, so we know that we’re making better use of the execution engine with this code, and we managed to outsmart the JIT compiler.
If we actually inspect the JIT compiled code, we can see why this is. Here we are. In principle, we can load all of the data, 8 bytes at a time in parallel. They’re not going to block each other. Then we can do these at the same time in principle and so on. That gives us a faster implementation by about 2x. Why is that? It shouldn’t be faster, really, because the strength reduction sequence of instructions has the same latency. If you look at the reciprocal throughput of the instructions, reciprocal throughput of 1 means that once you’ve issued one instruction, you can wait one cycle to issue another.
If you have a reciprocal throughput of 0.25, you can issue four per cycle, and you don’t need to wait for the first instruction to complete. We have a Gantt chart like this, and we can see that after six cycles, we’ve actually processed 3 bytes, which gives us two cycles per byte versus four cycles per byte, so that’s twice as fast. That explains why we’re twice as fast to compute the result. We can also put the disassembly into the instruction scheduling simulator, and we can see that it’s much more parallel. If we put them side by side, we used to have this diagonal waterfall, and now we have something which is much steeper.
We do have a problem, though, because we have this data dependency, again, on this ECX register. We can see all of these add instructions are trying to add to the same register, which creates a data dependency. That is because we’re loading the hash at the start of each loop iteration, then we have to add all of these numbers together into the hash, so we’re doing this computation effectively in parallel and then merging back into the intermediate result for each iteration.
That’s basically limiting the amount of parallelism we can get. This is arguably an optimization because it’s twice as fast. It’s pure Java. It’s pretty easy to understand, even if it’s ugly. It does increase the bytecode size, which is a bit of a problem if we want to get this into OpenJDK. It may penalize shorter strings if we no longer get inlining because the bytecode size is too large.
Competitive Advantage
This was actually submitted as a proposal to OpenJDK, to solve our cost problem because we’d identified that the hash code was costing us so much money to compute. We get into the red zone now because straightaway, rightly, OpenJDK came back to us and said, “Don’t do it this way. Do it another way. Can you implement a vectorized intrinsic? Here’s an idea”. What is vectorization? What we’re trying to do is synonymous with SIMD, which is Single Instruction, Multiple Data, and so we’re trying to process more data per instruction. This can be really useful for increasing the throughput and reducing the number of instructions that we need to apply per byte by operating in chunks.
We’re basically applying the same operation to a vector value simultaneously in one go. It’s a bit like instruction-level parallelism, but there are explicit execution units, special instructions, and pipelining applies to vectorization too. Schematically what we’re doing is we would no longer have this reduction bottleneck. We could keep all of the data in separate lanes and then reduce horizontally at the end. There’s autovectorization applied by the JIT compiler in OpenJDK. Why can’t it do anything with this? We’ve already seen the disassembly for the improved version, and so we know that the disassembly doesn’t have any vectorized instructions because we’ve seen it.
The problem with the optimized solution that we’ve shown is basically that we’ve unrolled the loop. The JIT compiler basically competes with the application for CPU cycles, so it’s limited in the analysis it can really do. It’s not going to look for similarity in blocks of instructions. If we do manual unrolling, we’re basically switching off vectorization, because unrolling is actually the trigger for vectorization analysis. Why don’t we just try and make it easier by pre-computing the coefficients, and then we’ll evaluate the polynomial as a dot product, which should be pretty simple to optimize?
Unfortunately, we don’t get any SIMD. We get these register spills. Register spills are where there’s too much register pressure and integer values are saved into floating-point registers, and we still have this per iteration reduction anyway, so it’s not really a very attractive solution.
What we could do is pre-compute enough powers of 31 so that we can just keep on generating the next set of powers in a loop, and we’ll multiply the vector of values by the powers of 31, and compute a dot product one vector at a time. Then, finally we’ll reduce the vector horizontally at the end. There’s a blog post here about how to do that. That code looks like this using the vector API, which still hasn’t been released, and this code won’t compile anymore. This had very attractive performance back in 2018 using a prototype of the vector API. Here’s a diagram of how this algorithm is actually working. We’re going to load the data backwards. We’re going to load that into a small register.
Then we’re going to have to sign extend, so we’re going to put a load of zeros next to the data. Then we’ll multiply that data as an integer with the powers of 31. We’re going to add that to the accumulator. Then we’re going to update the coefficient so we have the next powers of 31. Then, finally, because we’re going backwards, we’re going to go and load the next vector. We’ll keep on going backwards until we’ve got nothing more.
Finally, in the end, we’ll reduce the accumulator horizontally. This works pretty well because we’re processing so much data per instruction. Adding up the latencies of these instructions, we have 24 cycles, but we’re doing 8 values per chain, so we’ve basically got three cycles per byte, which isn’t that much better than what we had before, so we should probably find out why. If we have wider vectors, so we have AVX-512 and can process 16 at a time, then we’re down to 1.5 cycles per byte, which is quite a big improvement. Vector multiplications are quite slow: we have a latency of 10.
The nice thing is that in AVX2 the multiplication has a throughput of 1, so we only need to wait one cycle until we can issue another one. If we unroll this four ways, then in 27 cycles we can process four times more data: three extra cycles for four times the data. Finally, we get under one cycle per byte with this unrolled approach.
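As an illustration of the algorithm, here is a sketch using the incubating Vector API (it needs --add-modules jdk.incubator.vector). It is not the intrinsic that eventually shipped, and it leaves out the backwards traversal and the four-way unroll just described.

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch of the vectorized polynomial hash, not the intrinsic that shipped in JDK 21.
public final class VectorisedHash {
    private static final VectorSpecies<Integer> INTS = IntVector.SPECIES_256; // 8 int lanes
    private static final VectorSpecies<Byte> BYTES = ByteVector.SPECIES_64;   // 8 byte lanes
    private static final IntVector POWERS; // {31^7, 31^6, ..., 31, 1}
    private static final int POW_8;        // 31^8, shifts the accumulator past one chunk

    static {
        int[] p = new int[8];
        p[7] = 1;
        for (int i = 6; i >= 0; i--) {
            p[i] = 31 * p[i + 1];
        }
        POWERS = IntVector.fromArray(INTS, p, 0);
        POW_8 = 31 * p[0];
    }

    // Produces the same result as the scalar 31 * h + (b & 0xFF) loop over Latin-1 bytes.
    static int hash(byte[] value) {
        IntVector acc = IntVector.zero(INTS);
        int i = 0;
        for (; i + 7 < value.length; i += 8) {
            IntVector widened = (IntVector) ByteVector.fromArray(BYTES, value, i)
                    .convertShape(VectorOperators.B2I, INTS, 0); // widen 8 bytes to 8 ints
            IntVector chunk = widened.lanewise(VectorOperators.AND, 0xFF); // undo sign extension
            // keep per-lane accumulators: shift what we have by 31^8, add the new chunk
            acc = acc.mul(POW_8).add(chunk.mul(POWERS));
        }
        int h = acc.reduceLanes(VectorOperators.ADD); // one horizontal reduction at the end
        for (; i < value.length; i++) {               // scalar tail
            h = 31 * h + (value[i] & 0xFF);
        }
        return h;
    }
}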
Ludovic Henry, who used to work at Datadog, went off and implemented this. Then he left Datadog before it could get finished. We were really lucky because Oracle and Intel saw a lot of value in this approach because it massively improved the performance of such a common JDK method that they came in, they helped, and they productionized it, and it went into JDK 21. If you’re running JDK 21 now on x64, then you’re benefiting from this optimization. We can see, here’s a kernel of the loop now. We can see that we have a vectorized loop. The instruction simulation is much more vertical.
That’s a nice comparison with what we started off with. The interpretation of these two charts is that we have much more parallelism. The crazy unroll was somewhere in the middle. With JDK 21, for a range of string lengths, computing hash codes is much faster. What kind of impact does this actually have? It’s a shame it’s had limited impact at Datadog, because this hasn’t been implemented for ARM. It took about 18 months for this change to get into OpenJDK.
In the meantime, we actually moved most of our backend processing to ARM. We need a new intrinsic now, so it’s going to be a little while more before we finally profit from this, although we have profited a lot from that very simple migration to a different architecture. Potentially, it’s got a huge impact outside of Datadog, if you’re running Java 21 on x64, that is. That’s the state of things now.
Key Takeaways
The key takeaways are that, early on in performance work, we have these simplifying optimizations, which are objective improvements. It’s hard to argue with them, because you’re probably improving your code quality at the same time. 10x really sounds amazing, but it probably means that the system hasn’t been optimized before, and you’re just the first one to get there. Don’t be too disappointed with 10% later on. I think that continuous profiling is a great tool for finding low-hanging fruit. It’s not so useful for the kind of fine-tuning work that Datadog did with String#hashCode.
Thomas said at QCon last year that all software is optimized for out-of-date hardware. This is an interesting example of that, because String#hashCode was optimized for the Pentium in the ’90s. The algorithm was chosen to be more efficient on the Pentium. Over time, that optimization became less relevant. Specifying the algorithm and relying on it put OpenJDK in a difficult place, because it couldn’t be changed when the hardware changed.
Questions and Answers
Participant 1: How long did it take from identifying the issue to actually implementing the solution?
Startin: Actually implementing the solution was quite quick. Obviously, OpenJDK has a massive impact. If you break something like the String#hashCode, it can really affect a lot of users. OpenJDK is a very cautious project. Actually, getting it merged and released took more like 18 months. Implementing the solution was relatively quick.
Participant 2: You said earlier that profilers will often report different things. Does this mean that we should be using different profilers in different circumstances, or is there a way for us to know when a profiler is best suited for a certain problem?
Startin: I think it’s worth having more than one observability tool to test the observability. I think in general for Java, anything based on async-profiler gives you a really accurate profile. JFR has some gaps. We’d like to actually fix those gaps to make it better. That’s something we’re working on at Datadog now. I wouldn’t suggest running two in production, because it’s going to increase the overhead. Profilers, especially if they use signals, don’t play too nicely together. I think it’s definitely worth not trusting your tools and testing them against each other.
Participant 3: The condition is to start using continuous profilers to do these jumps, [inaudible 00:48:50] for services under load. It’s a little bit risk averse to add a little bit more to it. You mentioned it’s below 1%. Are there cases where this isn’t the case or stuff you need to watch out for?
Startin: Anything that applies tracing is difficult to reason about, because when you apply tracing you don’t know all of the programs that the instrumentation will be applied to. The overhead depends on what’s being traced. With CPU sampling, it’s by construction. If you set the sampling frequency low enough, you can’t have more than a certain level of overhead. Allocation profiling, on the other hand, that can be quite high overhead. It can be more like 5% depending on the workload. Be careful with allocation profiling if you want to enable that in production. Often, it’s lower than that, but it can be higher depending on the workload.