Transcript
Madelyn Olson: My name is Madelyn Olson. I am a maintainer of Valkey. I’ll explain what that is. I’m also a principal engineer at AWS. I want to start off talking a little bit about a story. I assume basically everyone here reads Hacker News like we all do. It’s so much fun to read. In yesteryears, starting in 2018, I contributed a lot to Redis. I became a Redis maintainer. I had a bunch of notifications set up to see when people were building Redis clones to see what they were doing.
Building Redis clones is always one of the things that I really enjoy because you can take it so many different ways because it’s so simple. All Redis really is, it’s a map attached to a TCP server with a custom wire protocol. All those different things you can optimize. You can play around with multithreading. You can play around with memory efficiency. You can play around with SIMD instructions to do the encoding and decoding. I think it’s a great way to really build a lot of systems intuition on how to build fast software and good software. I was always a little sad looking at all this code, because this is all in like Python and Rust. I’m just a sad little C developer that can’t take advantage of any of this code.
On top of that, Redis was a very mature project. More specifically, almost none of these clones do one thing, which is implement everything Redis does. Redis brings a lot of baggage with it. One of the things that Redis is usually used for is a cache, and the cache is one of the most important parts of the application. The cache cannot go down. We can’t be trying to be really smart and say, maybe if we adjust this, it’ll be a little more memory efficient, when it might be worse in some cases. That can bring applications down.
Context
What I want to do in this talk is talk about how we tried to do that anyway. We didn’t want to stay stagnant. We wanted to build new things. I want to talk about how we took the core data structure that was in Redis and made it more memory efficient and faster, and how we tried to do that without introducing any regressions. Put another way, I’m hoping this talk will help build some mechanical intuition on how to think about system software, so that you can also take some of these practices and build and update and modernize software so that it’s better and does not break all your users.
Introduction to Valkey
I do want to make the switch to explain what Valkey is first. I was a maintainer of Redis. Redis decided they did not like me as a maintainer, and in 2024 they changed their license. Me and some of the contributors from the Redis project went and created Valkey. This was a great time for us to try to be more technologically on the forefront. We were trying really hard to be a drop-in replacement. We didn’t want to break any users. We still have a very strong commitment to be backwards compatible with Redis. Since we have a new governance, we were willing to try to be a little bit more on the forefront of development. That was our motivation to go down this pathway and start to try to build more cool things.
Project Start – Landscape
I assume people don’t know anything about Redis; we’ll explain everything as we go, talk about data structures, and how everything fits together. I started by saying that Redis is just a TCP server attached to a HashMap with a custom protocol. What does that look like? Here, we have a very simple key-value pair, FOO and BAR. You can interrogate that data out of Valkey with the command GET and the key name FOO. Some stuff to know about Valkey is that most of these keys and values are relatively small. If you have large values, you’re probably storing them in a different data structure, because memory is very expensive. Memory is byte addressable, so you can have very small allocations. I said that in my day job I work for AWS, specifically on AWS ElastiCache. We were able to bring some knowledge from there, which is that we know the average key size across all of the clusters in ElastiCache is about 16 bytes. The keys are very small. The long tail is much bigger. There are lots of people with 100-kilobyte values, but many values are also still quite small.
The P50 of P50 clusters, that is, the median value size of the median cluster in ElastiCache, is only 80 bytes. These values are quite small. This also isn’t a static map. Valkey is a dynamic system, so you can also add more data in. We need to be able to easily handle net new data that can be of any value and size. Some other caching systems are able to make optimizations by saying keys are only X size or values are only Y size. One of the important things about Valkey is that it can support any type of data of any size. The last fundamental you need to know about Valkey is that it’s primarily built to be a cache, which means that time to live is very important. In this example, you can set an expire on a key so that once that time has elapsed, you will no longer be able to access the key, and the memory will eventually be reclaimed because the key will be deleted.
Another nice assumption we would like to be able to make is that all keys have the same time to live, but that’s generally not true. A lot of people have multi-tenant caches; they like to stick a lot of stuff in there. We have to be able to support both what we call volatile items, items that have TTLs, and non-volatile items, ones that don’t. Those TTLs can be anything, so we need to support a very wide range of behaviors. Bringing another example back from ElastiCache: we saw only about 20% of clusters have very uniform TTLs. Another 20% have basically no TTLs; they’re using the cache entirely durably. The other 60% is this huge variance of people with multiple different types of caches with different TTLs.
This was the landscape that we started in. One of the things we noticed in 2022 was that a bunch of different caching systems were coming out. One that specifically comes to mind is Dragonfly. Dragonfly had this big claim about doing 4 million RPS, and we didn’t care too much, because we were horizontally scalable. You could do 200 million requests per second in a cluster, so we’re like, that’s not a very big number. One thing we did notice is they were much more memory efficient than us. We took a step back and realized the core data structure powering, at the time Redis, but also Valkey, was basically what looked like a HashMap from a 2000s CS textbook. It had a lot of pointer chasing.
In this diagram, you can see all the red blocks are basically pointer addresses, and all the green blocks are user data. There’s a lot more red than there is green, and that’s because we wanted it to be consistently fast and, in a way, simple. We had this general concept that we wanted to reduce the number of pointers we were chasing. We had a broad idea that we wanted to make it faster, but if it stayed the same, that was also ok; we just didn’t want to regress. As we talked about earlier, we don’t want the cache to get worse, but we’re ok if it gets faster. The other super important thing is we don’t want to break anything. Valkey and Redis are known for having many weird, esoteric commands. There’s the ability to incrementally scan through the keyspace, and we want to make sure that still works correctly. There are lots of esoteric data types, like sets and sorted sets, and they have esoteric commands like SPOP and ZRANGE that allow you to interrogate different parts of those data structures, and those all still need to work as expected. That was the start of our project. We said, we should make this more memory efficient.
The Data Structure
Let’s go one level deeper and actually look at what our data structure looked like at the start of this project. Like all hash tables, we basically have an array of buckets, and to determine which bucket we’re going to stick an entry in, we run a hash function on the key. For simplicity, we’ll just assume the hash function on FOO generates 001. Then we had a dictionary entry, which was our container object. This is a fixed-size struct: 24 bytes, 8 bytes for each of the three pointers in the object. Next we need to determine, is this the right object? We’re just pointing to a generic structure. We have a pointer to what we call in Valkey an SDS, a simple dynamic string, which basically says, here is the actual key name for this object. In this case, we can see it’s a match. The string is FOO. SDSs can be interpreted as C strings, which are just character arrays, so we can do a string compare and see they’re the same.
Then we can go fetch the actual value out of the dictionary entry, the container object. From there, we have a pointer to our actual Valkey object. This is the generic container for any type of object stored in Valkey. It includes metadata about the type. Another thing caches track is least recently used, to figure out, when we’re doing evictions, which items we should bias towards kicking out. We have a little bit of overhead, and then a pointer to the actual underlying value, in this case, BAR. One smart thing we did do here is that sometimes this BAR is actually in the same memory allocation as the server object, as long as it’s relatively small. We can put it in one block of memory. If this wasn’t our actual object, we would then follow the next pointer. That happens when two objects hash to the same value. This is what we call a hash collision, and we need to resolve it somehow. In the original implementation, it was just a linked list. The reason we used the linked list was that we wanted to avoid long chaining. I’ll talk a little bit more about that when we try to get rid of it. We follow these linked lists until there are no more objects.
Another thing we realized when we were building this out and trying to make it faster was that most of the time spent looking up objects in these dictionaries is in fetching the data from main memory into the local CPU caches. That takes on the order of hundreds of nanoseconds, whereas executing stuff within the CPU cache is on the order of tens of nanoseconds or less. What we started to do is, before actually executing commands, go ahead and prefetch the relevant memory into the CPU caches. We do that by first fetching the dictionary entry, and then we can fetch all of the surrounding memory as well. We can’t prefetch the second set of memory until the first address has been fetched, because we don’t know where any of it is yet. This was the big performance improvement we talked about in Valkey 8. When we think about trying to make this more memory efficient, we can’t degrade this behavior either. There’s one last important bit about our hash table that we have to make sure we keep when we modernize it. Some caches are able to say, my hash table only has 100 elements, so it’s a fixed-size array.
In Valkey, users really like to not have to pre-size anything. They change their workloads. We need to be able to dynamically grow and shrink this hash table. Some implementations do this. Once you exceed the number of items, you just resize the whole dictionary all at once. You go through, you reprocess all the items. That has a very heavy performance penalty. As we said, caches, we do not want to be doing anything that’s very expensive all at once. We amortize it out. For a period of time, we actually have two tables.
Then we slowly incrementally rehash the items from the old table to the new table. That’s a lot. Still not quite done. We talked about TTL in the beginning. We have a second hash table just for storing that information. It’s broadly the same. We follow almost the entire same flow, except that inside the Dict Entry, we point to the same key. We’re basically sharing that memory address. We also have time to live information embedded inside the Dict Entry. Here we have another 32 bytes of overhead, basically just to store an 8-byte unsigned long long value.
The one last thing I want to get into, before we recap and start thinking about what we actually changed, is the memory layout of these server objects. The 64-bit, or 8-byte, pointer is the one I mentioned. There’s also some metadata. Internally, we use jemalloc, which has arenas of fixed memory allocation sizes. If we were to allocate 11 bytes, we would still be given a 16-byte memory allocation back. Because we need 8 bytes for the actual pointer, we basically have 8 bytes of metadata we can use. We have stuff like type and encoding. Valkey only supports six data types, but we have 4 bits, which could give us 16 types. We have another 4 bits for encoding, even though no type has more than three encodings. We have 24 bits for LRU, which is probably reasonable. It gives us about 194 days of accuracy.
Then we have a refcount. The highest value the refcount can be today, I think, is 2 to the 20th, which is much less than what we have. We actually have a bunch of extra metadata here. We’re not currently using it, but we’re just preallocating it so that we have it for the future.
Hopes and Desires
We’ve now done a comprehensive overview of the current state of the world. I’m going to go back over our hopes and desires, or what our hopes and desires were for building this new thing. To reiterate, we want to minimize the number of memory lookups. We want to not regress on performance. We want to support all the weird esotericness that Valkey and Redis saddled us with. I’m going to inject one new one in here, which I didn’t talk about earlier, which is that we want this to be generic. I talked a little about the various data types that Valkey supports; we want to be able to support all of those. Then the two other ones we talked about: we want to support incremental rehashing, because we don’t want to have to rehash all at once, and we want to be somewhat aware of what memory is available in the CPU caches at a given time.
Idea 1: Dynamically Sized Structs
Let’s start talking about what we did. The first thing we wanted to do is get away from these fixed-size structs. Fixed-size structs are great, but they’re not very efficient. In the first version after Valkey launched, we embedded the key directly into the dictionary entry, which makes some intuitive amount of sense. At the time, the value in the dict was basically only ever a string. It could be a couple of other types, but by and large it was a string type, so we could just special-case that. The next thing we did is we embedded the rest of the metadata object into the Dict Entry. At that point, that was basically just the Valkey server object. Then we got rid of the Dict Entry as a whole and said, the pointer from the underlying table points directly into the object. We had to make a bunch of core changes around how you get a key out of a hash table. Once we had all that done, we were able to build a generic hash table that says: you’re given an object, and there’s a way to determine where the key is and where the value is. Now we’ve saved two of the three main pointers of our dictionary.
Then we also get this great benefit: if we look back and think about the TTL, the TTL was embedded into its dict entry as well, and we can still embed it here, in the object. There’s another nice benefit. When we looked up a key, we had to do a separate hash table lookup to check if it was expired. Now we have the TTL embedded in the same object, so we no longer need that separate lookup. Embedding everything together made a lot of the code a little bit simpler, but it made the memory layout quite a bit more complicated. Instead of having a nice fixed-size memory layout, we have a somewhat more complicated layout. We still kept everything the same for the type and encoding; we didn’t steal any of those bits yet. We did steal some bits from the refcount, which we knew we could steal.
Then we basically just said, trust me, there are pointers; there’s information after this. We did a bunch of pointer arithmetic inside the code itself to go find where things actually were. For the TTL: if the has-expire bit is set, we know the TTL is placed directly after the pointer, and since the TTL is fixed size, it’s pretty easy to compute the offsets. Then for the key, we have some extra information at the front of the key to know how long the key is. This is not necessarily very fun to do in C, but it’s fully possible, and it gives us some great memory density benefits.
Idea 2: Implement Open Addressing
The other thing we wanted to get rid of was the next pointer. We talked earlier about how, when we have a collision, we have to resolve it somehow to figure out which is the actual object we want. The alternative we considered was something more like open addressing. With open addressing, instead of resolving the conflict with a linked list, you just put the entry somewhere else; as long as that placement is deterministic, you can go look for it. We looked around a bunch, and we found similar prior art in what Google built, which is called a Swiss table. It uses open addressing. Its innovation was that it’s very cache-line aware. It builds everything into 64-byte buckets, and each of those buckets has seven entries inside it, plus some extra metadata to make the lookups really fast. Let’s take a look at what that memory layout looked like. As I mentioned, there are seven 8-byte entries. There are also seven 1-byte hashes.
The whole idea is that you can avoid some of the lookups to the actual entry by keeping part of the hash inside the bucket itself. You can use SIMD operations to check those seven prefixes together to see which of those seven items is likely a hit. Then there’s one more byte with a little bit of extra metadata. It holds the ever-full bit, which basically says: has this bucket ever been full? If so, the entry you’re looking for might be in later buckets. It also holds presence bits, because the hash could in fact be 0, so we need one extra bit per slot to rule out that case. That landed us on something similar to this. We now had our pointers to the buckets. When we go do our lookup, we check bucket 0: we perform the hash function, then the lookups on the 1-byte hashes we have in the prefix of the bucket.
Then, for all the hits, we go and actually check the real item. This works well. It gets rid of the next pointer, but it introduces some new challenges that we uncovered while building out the implementation. One of the problems, which is why we didn’t use probing in the original design, is that we can get very long probe chains. These long chains can really impact performance, because we might have to scan a lot of buckets to actually find the element we’re looking for. Also, when items get frequently deleted from the cache, which is quite common with varying TTLs, we may end up having to do a lot of checks on a lot of buckets, even though the buckets are relatively empty.
Over time, we either had to write complex garbage collection logic to clear that ever-full bit, or we had to rehash the whole table in place. As I mentioned earlier, we don’t like to do rehashing very often, because we pause it during our full sync operations. Valkey supports functionality to take a snapshot, which uses a system fork. While the fork is running, anytime you modify any memory address, even if you’re not writing new data, Linux has to copy the page. Rehashing would trigger a huge amount of copy-on-write, so we like to pause it while that process is ongoing.
Although this had a lot of benefits, what we ended up doing was a hybrid. We have these buckets, but we still chain the buckets: once a bucket gets completely full, we allocate a new bucket and start putting items in that one. I’ve affectionately called this the Swedish table. It’s similar to the Swiss table, but it was built by a Swede, along with some other folks. This is the end result of the table we have today inside Valkey. It’s by no means perfect, but it solves a lot of our core problems and actually makes the behavior very similar to the way it was in older versions of the engine.
Now that we had a design and an implementation, I’ll talk a little bit about what we did to make sure this table worked the way we were hoping it would. There are basically two main criteria we care about. First is correctness. We’re writing a lot of C, so obviously we’re doing a lot of code reviews, but we also want to make sure we’re not silently introducing any regressions that might fail in edge cases. We have a lot of integration tests and unit tests. Within the Valkey project, we believe a lot in defense in depth for validation of code. Unit testing can test a lot of things, but it’s also very useful to have integration tests covering a lot of those cases, to make sure we’re not missing stuff that only appears at larger scales. We also run all our tests with an address sanitizer, which is very helpful to catch those cases where we’re doing arbitrary memory lookups and arbitrary memory decisions.
Then, also, we built a lot of assertions into this specific hash table to make sure all of our core assumptions, while we were doing stuff like rehashing, were being maintained. Those are enabled both in the testing infrastructure and in prod, because we really are afraid of having any type of memory corruption in production. It’s the same type of thing that you might get with Rust. That’s the correctness perspective.
From the performance perspective, we also tried to do defense in depth. We did both microbenchmarks and macrobenchmarks to verify both the memory efficiency and the performance. For the macrobenchmarks, we did stuff like flame graphs, which I’ll talk about, and actual end-to-end tests to make sure we’re actually saving the memory we expect. I want to talk a little bit about the limitations of doing flame graph analysis for this. This is a flame graph. Ruth explained a little bit, in her talk, that the horizontal width is the number of samples. The way flame graphs work is you run perf, perf samples the program periodically and generates a bunch of traces, and then it stacks those traces on top of each other. The far right-hand side of this graph is the I/O side of Valkey; Valkey is extremely dominated by I/O. If we look at the left side, 13% of all the time Valkey spends is on command processing. There’s that section which has dictFind.
Most of the sampling that we would do for performance testing this specific functionality falls in that one dictFind block up at the top. That’s because all the samples hit the exact same function, and that’s because we compile with O3: all of this code gets completely inlined into one giant function. It’s actually very hard to use perf to verify the performance of this feature. That’s why we relied more on microbenchmarking. Rain was the person that did a lot of this work. We used microbenchmarks along the way to compare how the older implementation and the new one performed.
In caching, we care about two main workloads. First, you care about a cache hit, but you also care about a cache miss. A cache hit is the case where the item is actually in the hash table, so you can pull it out. In the miss case, it’s not there, but you still want that to be fast, because then you have to go rehydrate the cache. We had a balanced benchmark that has both hits and misses. We also wanted to make sure we had enough data in the benchmark to actually exercise the CPU caches, so it’s not entirely sitting in the L1 cache; we actually care about the L3 cache hit rate. That was the microbenchmark we used. Then we also had more generic macrobenchmarks doing throughput testing. One of these days, I want to get on my soapbox and complain about how everyone likes to talk about latency with Valkey, but really the only thing that matters is throughput. We care about throughput because basically every command in Valkey takes around 1 microsecond, so latency is very difficult to measure correctly. We primarily care about measuring throughput degradations.
As I said, we don’t necessarily care if it’s that much worse, but we generally like to be same or better. Some of these workloads were pretty unaffected, even though the memory efficiency was better because we did a bunch of that prefetching work I talked about in the beginning. Those are the cases where we are doing simple GET and SET requests with I/O threads. Valkey has a single-threaded mode and a multi-threaded mode. If you look on the graph, you can see each of the five tests. The left-hand side is the single-threaded case, and the right-hand graph is the multi-threaded case. We’re able to see generally better performance across our various benchmarks. At least from this perspective, the performance looked good.
Let’s talk a little bit about the main thing we cared about, which was the memory savings. We computed the theoretical memory savings to be about 40%. The main thing is we removed about 23 bytes of overhead: we removed those three pointers we talked about, the key pointer, the next pointer, and the value pointer, and added back a little bit of overhead in the Swedish table, which has one extra byte per entry for the hash information. That’s all great, but as I said, jemalloc uses bin allocators. Even though these might be the theoretical results, we might be over-allocating memory.
These were the percent changes in memory that we saw across different key and value sizes. I said up front that keys are pretty constant within Valkey, so this is across different value sizes. On the far left, we have 16-byte values, which is relatively small for caching. On the far right, we have 512 bytes, which is relatively large for most Valkey workloads. For the most part, you see about a 20% to 25% reduction in memory. Now that we had all this, we all lived happily ever after, because that’s how software products work.
Production Issues
Next, I’m going to talk a little bit about what we found, and what I wish we had done differently when we built all this software. The first bug is esoteric. I mentioned that Valkey has a lot of weird commands. One of them is the SPOP command. Valkey has a data type called a set. You can put a bunch of data into a set, which is an unordered structure, and then you can pop a random item out. We had to build random-sampling functionality into this table. The way we implemented it is we picked a random bucket, and then we sampled some of the items starting from that bucket. We wrote a bunch of microbenchmarks, and we proved that this is actually random.
If you call this random function a bunch, it picks all of the items relatively uniformly. Then someone in production started reporting that we were building these giant holes inside the hash table. The way the implementation worked is that if it sampled a bucket and there were no items in that bucket, it would keep searching forward through the buckets to find an item. What was happening is that if there was a hole, the first item at the end of the hole would be the one we were picking. If you repeatedly called this SPOP operation, you would keep making the hole bigger. Every time you hit the hole, you’d make it a little bigger, and that would increase the chance of hitting the hole again. Even though the size of this table was pretty consistent, we were getting these very long, empty runs of buckets.
That was made worse by the fact that we were chaining the other buckets, because we weren’t overflowing into these empty buckets again. This is one of the things we had actually solved in the previous implementation, but we tweaked it a little when we moved to the new implementation to try to make it a little more random. That ended up biting us, because we missed this edge case and had no performance testing on it. The reason we found it wasn’t that it wasn’t random; it was that we were spending a lot of CPU time doing these linear searches through empty buckets.
A second issue I want to talk about was related to performance. When we scan through these hash tables, we actually start memory prefetching for future buckets, especially those chained ones we talked about. What we were doing was we had a function which would get the next bucket, and we were calling that on the next bucket itself. So instead of prefetching the next bucket, we were prefetching two buckets away, which is not functionally incorrect; we have code guarding against overflowing the table. It was still faster than not prefetching, so it looked like it was working. It wasn’t until an engineer was looking at this code, trying to do something else, and said, this code doesn’t look like it does what the documentation says it does. We were like, “You’re right. We should fix that”. This is one of the reasons I really like open source. This bug could have just hung around for a long time, not causing any problems, but it would have been very hard to find. It’s nice that we have people from the open-source community to help us with this.
Should We Move the Project to Rust?
I want to use this, after we talked about these two bugs, to address a specific point that a lot of people asked me about Valkey, which is, should we move the project to Rust? I actually really like Rust. Most of the code I write is in Rust. We’ve built new projects on Valkey. We built this little wrapper around the core in Valkey with a Rust SDK that’s all safe. We built stuff like LDAP authentication with it. We built Bloom filter support with it. I think this is all great, but Rust doesn’t solve either of the two problems that I just talked about. One of the concerns I have with moving to Rust is we’re just going to introduce more of these little tiny bugs, because Rust is very opinionated. My general belief is if you’re starting a new project, you should probably write it in Rust, but if your project is in C or C++, don’t rewrite it in Rust, unless the codebase is really bad and it should be thrown away.
Takeaways
The thing I really hope you take away is: deeply understand the access patterns of your application when you’re doing modernization. I didn’t show a lot of C code. I talked a lot about data structures, how to think about the code, and how to make sure the way you’re accessing data is going to be performant, because that’s really what matters. I also want to encourage everyone to be very curious. We spent a lot of time evaluating stuff. We spent some time looking at Segcache for inspiration, which is another caching implementation. We’d have liked to look at Dragonfly, but it wasn’t open source. We looked at KeyDB a little bit, and we took a lot of inspiration to try to figure out what made the most sense for us.
Since we are a dependency-less C application, we could tune it to work exactly for our system. I also think it’s very important to have lots of testing when you’re doing low-level systems work. A lot of people are like, unit tests are enough. We wrote unit tests and higher-level integration tests, and they’re both really important. It’s important to run them with an address sanitizer, run them with Valgrind, and have fuzz testing to make sure you’re catching all the weird, esoteric edge cases. Also, make sure you’re benchmarking early and often. In our case, we don’t just mean performance, but also memory efficiency. We added a bunch of static asserts as part of this to make sure that we don’t silently regress memory efficiency. We have some performance benchmarking, but that’s something we’re also actively working on inside the Valkey project, to make sure we’re able to retain that performance moving forward.
Resources
I would lastly like to thank the engineers that did all the actual real work. I reviewed some of the code, but I didn’t write that much of it. Viktor, the Swede, he’s the one that wrote most of the code. He basically helped pioneer the implementation. Then, Rain, who is from AWS, also helped build all the data type-specific implementations that use the data structure and helped a lot with verifying the benchmarks. If you would like to learn more, these are three references I think are great. The first is the Swiss Table Deep Dive. It’s one of the best talks I’ve ever heard from CppCon, talking about how they built this and how it worked for them. The second one is a deep dive that’s a blog that was written by Viktor to explain some of his thought process in writing this. The last one was the recent presentation that Rain did at a local Seattle Valkey meetup.
