Transcript
Ramya Krishnamoorthy: My name is Ramya. I’m an engineer at Momento. I’m going to be telling you the story of how we rewrote our platform from Kotlin to Rust, and what we learned throughout that journey. We’ll start with a short introduction of what Momento is and what our product is. We are an early-stage startup. We offer a real-time data platform with various building blocks for serverless services like caching, a message bus, durable storage, and more to come in the future. We first launched in 2022 with the very first building block, which was Momento Cache. It’s a serverless key-value caching service. Think of your application cache and imagine it living in the cloud. At a very high level, our architecture looks like this. We have a control plane which is responsible for creating and managing the top-level entities.
In this case, our top-level entity was a cache. We have a data plane that offers the APIs that are used to modify the data inside the top-level entities, like the data inside your cache. Going one level deeper, our data plane consists of two major components. There’s a routing layer whose function is to take an incoming request and send it over to the place where the data lives. We have a storage layer. You can think of the storage layer as a collection of slices of data, and each of these slices is managed by a different partition. In the storage layer we run different types of storage engines based on what type of APIs or operations you want to do on your cache.
When we first launched, our highest priority was to just get the product out there as fast as we could, so that customers could start using it. We had a diverse team. We had engineers who were veterans, and we had engineers who were just starting out in their careers. The majority of us had a background in JVM-based languages like Java and Kotlin, and some of us had operated at-scale services in the cloud using those languages. We decided to just use Kotlin to get the product out there. As you can see, all the components that we wrote were in Kotlin. In our storage layer we used third-party or open-source storage engines, and our routing layer would just talk directly to those engines. Once we had something working, we could finally measure it. We put in the time to add metrics and set up automated performance tests.
Then we came up with a workload that would represent how most customers would use us. We started off with a 20K TPS workload of simple gets and sets with an equal split between the two. The gets and sets were for small items, under 4 kilobytes. The metric that we were most interested in at this time was the end-to-end latency as measured from the client side, so as measured from outside the service. We measured this latency with just a single routing node because we wanted to see what we could get from one node. We established our single node baseline as a p999 latency of 4.99 milliseconds and a p99 latency of 4.94 milliseconds. We decided, yes, this works, and we can just launch with this.
Product Evolution
We launched, but no product stays the same. We immediately had a new requirement. Now customers didn’t just want to set and get simple values. They wanted more complex data types like dictionaries, or lists, or sets. We had to make design changes. Now we needed to support a new storage engine that could handle these data structures. We also had to support several of these storage engines, based on what type of operations the cache was going to support. We needed a new abstraction, a layer that would present a unified interface for all the storage engines to the routing layer. We decided to build this layer in Rust because there was already a lot of interest in Rust in the company at the time. We were exploring some storage engines that were built in Rust. We thought this was a good candidate for an experiment because it’s an internal component that doesn’t talk to clients directly, and it was also a brand-new component.
The business logic was relatively simple. It acted as a manager for all the different storage engines, and it was responsible for just routing requests onto them. We ended up with a v1 version of our stack. We now had a Rust storage layer, still using different storage engines. Our routing layer continued to be in Kotlin. Because we had set up automated performance tests, as I mentioned, we found a regression when we tried to use a cache that was going to support data structures. We didn’t change the workload; we used the same one, still just doing simple gets and sets. We found that there was a huge increase in the p999 tail latency. Compared to the old baseline, it went from 5 milliseconds to 9.3 milliseconds. That raised a red flag for us.
What About Scaling?
Before I dive into what we did to solve this problem, I’m just going to take a brief detour into scaling. All these benchmarks so far have been focused on the performance of a single node. Scaling is definitely a tool that we want to use. I’m just going to run through the advantages and disadvantages of the two different types of scaling, starting with vertical scaling. This was an easy change to make, as long as the application could take advantage of the additional cores and memory. We saw good results from trying to vertically scale the routing node from a c6i.2xlarge to a c6i.4xlarge. We were able to cut down the latency to 3.4 milliseconds, of course at an increased cost. We didn’t want to use this tool right away, because we wanted to maximize what we could get out of a single node. We would end up using vertical scaling eventually, but we didn’t think it was worth it at this point for a relatively low TPS of 20K. The same applies to horizontal scaling: it increases costs.
In horizontal scaling, you’re just adding more instances of the same service. There is more effort required for horizontal scaling. It works great for stateless services, like our routing layer. If we wanted to do that for our storage fleet, we would have had to build in functionality to migrate data from one node to another, because the slices of data being managed change. The second problem with horizontal scaling was that it assumes you have good load balancing. This wasn’t exactly true for us, because our data plane interface was essentially a gRPC API. gRPC uses persistent connections. The way clients used us was that when the client SDK started up, it would establish these connections to the routing layer.
Then it would just continue sending requests on those connections. Adding new nodes would not immediately rebalance the existing traffic. Clients had control over how many connections they decided to establish, so they could choose to just send everything to a single node. Horizontal scaling would help but it wouldn’t really solve this problem completely. I want to reiterate why we have been focusing so much on the single node p999 latency.
For us, when we envisioned this product, we wanted to offer latency as a feature. By that, I mean we want customers to be able to rely on us in the worst case and not just the average case. Our first core goal was to have predictable low tail latencies for this product. Our second goal was to maximize cost efficiency. We wanted to stretch the infrastructure as far as we could, and get as much performance as we could out of a single node before we started to scale.
Top Bottlenecks, and Optimizations
With the core goals in mind, we went back and looked at what was causing the performance regression. Among the top bottlenecks, we found that garbage collection was causing unpredictable spikes. We also found that Kotlin was slower on Graviton. Graviton is Amazon’s ARM-based processor family. We wanted to move to Graviton because it can offer significant performance benefits as well as cost savings, up to 40%, which ties back to our goal of maximizing our cost efficiency. We also ran into Netty issues. Our Kotlin routing services used Netty as the server framework.
One thing that we discovered was that our epoll threads were very busy. We went back and tried some optimizations in the existing service, because you don’t want to throw away what’s working without doing a good enough investigation. One thing that really helped us, surprisingly, was to tune the number of threads: when we brought the number of threads down, we got better performance. Another optimization was to add a second epoll event loop for the client side in Netty, because the epoll threads were so busy. Third, we found that using a DirectExecutor for coroutine callbacks helped; again, this was a change at the Netty level.
Then, we tried tuning the garbage collector. We were using the G1 garbage collector, and what we found was that tuning it to have fewer, larger memory regions was actually better. There are garbage collectors that claim to minimize tail latency, and they’ve come a long way, but this was two years back. We tried one out for a short period of time, and what we found was that it could not really keep up with the pace of allocations that we were doing.
Why Rust?
In the meantime, of course, the product kept evolving, so we had another new requirement. We were going to add a second service, which was a serverless event bus. The idea of this service was to have lightweight topics that users could publish to and subscribe to without provisioning them in advance. This meant that now we had to add support for streaming gRPC APIs. Prior to this, we had a gRPC server, but it used only unary APIs. Now we had to add streaming APIs so that customers could just call subscribe, and then get notified as new messages were published. We came up with the idea of rewriting the routing service in Rust. The main goal was to avoid garbage collection. We built a small, very unoptimized prototype. It showed that Rust would help with the benchmark workloads. Even going up in TPS, we were still able to keep the tail latencies under 5.2 milliseconds. We also saw that the prototype had much better performance for topics. If we had launched the topics feature on the JVM-based routing service, it would have cost us a lot more.
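To make the streaming requirement concrete, here is a rough sketch of the shape a server-streaming handler takes with tonic. This is not Momento's Topics code; the request and item types are placeholders, and a real tonic service would expose this as a generated trait method with an associated stream type rather than a free function.

```rust
use tokio_stream::wrappers::ReceiverStream;
use tonic::{Request, Response, Status};

// Stand-ins for the generated protobuf types, which the talk does not show.
pub struct SubscribeRequest { pub topic: String }
pub struct TopicItem { pub payload: Vec<u8> }

// Rough shape of a tonic server-streaming handler: the client calls
// subscribe once, then keeps receiving items as they are published.
pub async fn subscribe(
    _request: Request<SubscribeRequest>,
) -> Result<Response<ReceiverStream<Result<TopicItem, Status>>>, Status> {
    let (tx, rx) = tokio::sync::mpsc::channel(128);

    // In a real service this task would be fed by the message bus; here it
    // just publishes one placeholder item and lets the stream end.
    tokio::spawn(async move {
        let _ = tx.send(Ok(TopicItem { payload: b"hello".to_vec() })).await;
    });

    Ok(Response::new(ReceiverStream::new(rx)))
}
```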
Some of you might be wondering, why Rust? We love the memory safety that Rust offers. I personally love that it flags errors at compile time versus runtime. The Momento-specific reasons for using Rust were that there is no garbage collection, and, among the non-garbage-collected languages, let’s just say that C++ was not really popular. There was strong internal advocacy for Rust among the team. Engineers were excited to work on it. Usually, it’s really hard to balance your business needs with what engineers want to work on.
Using a language like Rust gave us the opportunity to do that. It was also a good way to hire in a competitive market. We were at a fork in the road where we had to choose between continuing with Kotlin and optimizing it further, or moving to Rust. We had tried a bunch of optimizations and had reached a point where the benefits were only incremental for the amount of effort we were putting in. We could have delayed this decision, but that just meant we would have had to build topics in the old service, and if we eventually migrated, there would be a lot more surface area to migrate. We were aware that there were going to be risks in writing this service in Rust.
The very first one was that we would be spending developer months rewriting an existing service instead of building new features and attracting more customers. That’s a pretty big risk for an early-stage startup. We knew that we wouldn’t be able to have feature parity on day one. That meant we had to be able to run two systems at the same time until we had feature parity. We had existing customer workloads on the old system, and we had to migrate those as well. All of this just contributed to increasing the burden on operators. The last risk we wanted to call out is that the Rust ecosystem is not as stable as older languages. I’ll dive into this one a little bit more later.
When it came to making a decision, we decided to just invest in the future and go with Rust. One of the factors that helped was that there was interest from customers who wanted to measure our performance by running their own load tests against us. We felt that Rust would give us a quicker path to meeting those expectations. We decided to do an incremental migration to mitigate the risks. We started by implementing only the new service, the topics service, in Rust.
Then we slowly added the caching service and the APIs within that service one by one. That’s how we ended up with our v2 version of the tech stack. We had a Rust routing service and a Kotlin routing service coexisting in the routing layer, and we migrated traffic to one or the other based on the feature: whether it was topics or cache, and which API within cache. We took a Rust-first approach to any new features and started adding new APIs to the Rust service. This was a good way to minimize the blast radius when we were rolling out new changes.
Scaling Rust Expertise
The majority of our team did not have a background in Rust, so we also needed to build up Rust expertise in-house. A few things helped us do this. First of all, we switched all our operational tools to Rust as well. This gave engineers a good way to get experience writing Rust code without the risk of bringing down a service. The second thing that helped was having reusable patterns established in the code that could act as a template for developers trying to add a new API or support a new type of data structure. We also created a culture of sharing the learnings. This was achieved through different means, like Slack channels and ad hoc talks on different features of Rust, and we used code reviews as a teaching mechanism. A lot of effort went into making sure that every pull request was reviewed in detail, with comments showing the best way to achieve something in Rust.
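As a hedged illustration of what such a reusable pattern can look like (this is hypothetical, not Momento's actual template), one common shape is a small trait that every new API implements, with the shared concerns living in one dispatch path:

```rust
use tonic::Status;

// A hypothetical "handler template": every new cache API implements the same
// small trait, so adding an API is mostly one method plus request/response
// types, while validation, metrics, and routing stay shared.
#[async_trait::async_trait]
trait ApiHandler {
    type Request: Send;
    type Response: Send;

    const API_NAME: &'static str;

    async fn handle(&self, request: Self::Request) -> Result<Self::Response, Status>;
}

// Shared wrapper that every API goes through; real code would hang metrics,
// timeouts, and tracing here so the individual handlers stay small.
async fn dispatch<H: ApiHandler>(handler: &H, request: H::Request) -> Result<H::Response, Status> {
    let start = std::time::Instant::now();
    let result = handler.handle(request).await;
    println!("{} took {:?}", H::API_NAME, start.elapsed());
    result
}

// Example: a hypothetical "get" API plugging into the template.
struct GetHandler;

#[async_trait::async_trait]
impl ApiHandler for GetHandler {
    type Request = Vec<u8>;           // key
    type Response = Option<Vec<u8>>;  // value, if present
    const API_NAME: &'static str = "get";

    async fn handle(&self, _key: Vec<u8>) -> Result<Option<Vec<u8>>, Status> {
        Ok(None) // placeholder; real code would call the storage layer
    }
}
```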
Rust in Prod
With all of that, we finally had a Rust routing service in production. If I were to summarize that experience, I’d probably say it was mostly harmless. I’m also mostly kidding. It was a smooth experience, with a few callouts. When I mentioned that the Rust ecosystem was not stable, what I wanted to say was that a lot of the popular server-framework crates, like tonic, the crate used for the gRPC server and client implementation, are still on unstable pre-1.0 versions, so you can’t really rely on their APIs staying put. Hyper, the crate for the HTTP client and server implementation, made breaking changes when it went to v1, which required us to refactor our code and write our own TCP accept loop in 2024.
The third point I want to call out is that it’s not just the changes you’re aware of because they break your build at compile time. You can also pick up bugs just by picking up a new minor version of an unstable crate. This happened to us once, pretty recently. There was a change in the accept loop for TLS servers in tonic, which, as I mentioned, is the gRPC server framework we use, and that brought down our service in production. The loop was changed to treat any I/O error while establishing a TLS connection as a fatal error, and that shut down the server. You can imagine, this is a pretty big change. It took a while for us to debug the issue because, again, this was not obvious in a code review. It was a minor version change.
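As a sketch of the behavior you generally want instead, assuming tokio and tokio-rustls (this is neither tonic's internal code nor Momento's fix), an accept loop can log and drop a failed TLS handshake rather than treating it as fatal to the whole listener:

```rust
use std::sync::Arc;
use tokio::net::TcpListener;
use tokio_rustls::TlsAcceptor;

// Hypothetical handler; in a real service this would hand the stream to
// the HTTP/2 + gRPC machinery.
async fn serve_connection<S>(_stream: S) { /* ... */ }

async fn accept_loop(listener: TcpListener, tls_config: Arc<tokio_rustls::rustls::ServerConfig>) {
    let acceptor = TlsAcceptor::from(tls_config);
    loop {
        // An error accepting one TCP connection should not kill the server.
        let (tcp, peer) = match listener.accept().await {
            Ok(conn) => conn,
            Err(e) => {
                eprintln!("accept error: {e}");
                continue;
            }
        };
        let acceptor = acceptor.clone();
        tokio::spawn(async move {
            // Likewise, a failed TLS handshake from one client (bad cert,
            // reset, health-check probe) is logged and dropped, not fatal.
            match acceptor.accept(tcp).await {
                Ok(tls) => serve_connection(tls).await,
                Err(e) => eprintln!("TLS handshake with {peer} failed: {e}"),
            }
        });
    }
}
```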
Performance – Rust Edition
Let’s talk about performance now, since performance was one of the main reasons we switched to Rust. It’s an unwelcome truth, but you can write slow code in Rust too. Rust gave us a great initial boost to performance, but as we tried to push the limits of what a single node could do, we ran into more performance issues. These were actually Rust specific, or, to be more accurate, they were related to the way the Rust gRPC ecosystem works and to our specific use case. I’m going to dive into one such example. Initially, we were not able to scale beyond 20K TPS on a single node using Kotlin. With Rust, we were able to go to 32K TPS for the same benchmark workload. What we found was that at 32K, our tail latencies would just spike to 100 milliseconds. There was no graceful degradation: it was just under 5 milliseconds, and then a sudden spike to 100 milliseconds. We started digging into that. We had a few clues to go on, because we had metrics on both the routing service and the storage service.
The routing service metrics were reporting that they were seeing a lot of delay talking to the storage service. The storage service metrics said, no, we were fast; the time it took to send a response was very low. We tried vertical scaling, which doesn’t always help, and it did not help in this case. We added more observability. We used Tokio as the async runtime, and Tokio provides metrics on the tasks that are scheduled in the runtime. We got one hint from them: we saw that the scheduling delays for tasks were increasing as we increased the TPS. This was not really sufficient to find the root cause. Then we went and got some flamegraphs, and what we found was mutex contention. You’ll see that there’s a contended lock call on a sync mutex there. It’s being called from the h2 crate, from the send response function. We identified the root cause to be a longstanding issue with the h2 crate. h2 is the crate that provides the HTTP/2 protocol implementation used by Hyper. This particular issue had been open since 2021.
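The talk doesn't say exactly how those task metrics were collected. One way to get scheduling-delay numbers for your own tasks on a stable toolchain is the tokio-metrics crate; a minimal sketch, with a made-up sampling interval, might look like this:

```rust
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    // One monitor per logical task type, e.g. "handle one client request".
    let monitor = TaskMonitor::new();

    // Sample periodically; each reported interval includes, among other
    // things, how long tasks sat in the scheduler before being polled,
    // which is the kind of signal that pointed at the contention here.
    {
        let monitor = monitor.clone();
        tokio::spawn(async move {
            for metrics in monitor.intervals() {
                println!("task metrics: {:?}", metrics);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        });
    }

    // Wrap each request-handling future so the monitor tracks it.
    monitor.instrument(async {
        // ... handle one request ...
    })
    .await;
}
```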
The crux of the issue was that HTTP/2 uses multiplexing: on a single TCP connection, it has this concept of a stream, which is an independent sequence of messages. What h2 was doing was using a mutex to coordinate between different streams on the same connection. Why did it impact us so much? What caused the latencies to suddenly spike to 100 milliseconds? As I mentioned, we use gRPC for both the routing service and the storage service. Our routing service is essentially a reverse proxy: it takes in requests, sends them off to the storage service, gets the response, and sends it back. The client talks to the routing service via gRPC channels, and the routing service talks to the storage service over gRPC channels.
By design, we don’t want to be establishing new channels internally from the routing service to the storage service, so we use a small number of persistent channels there. On the client interface side, there are a lot more of these channels. I’ll explain what a gRPC channel looks like at a very high level. A gRPC channel manages one or more HTTP connections. In our case, we only have one HTTP connection per channel; this is very language dependent and depends on the SDK. The gRPC channel schedules RPCs. For unary APIs, you can think of an RPC as just a request-response pair. The gRPC channel multiplexes RPCs onto the underlying HTTP/2 connection by mapping each RPC to a different stream. By design, since we have fewer channels between the routing service and the storage service, we have more streams per connection on that side. That meant there was a lot more contention, because, as you remember, there’s a sync mutex being used to coordinate access to these streams.
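To make the channel-to-stream relationship concrete, here is a hedged sketch using tonic (the pool type is hypothetical): each Channel below owns its own HTTP/2 connection, so spreading RPCs across more channels means fewer streams share any one connection and its internal locks. On the internal hop, Momento deliberately kept this number small, which is why the contention concentrated there.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tonic::transport::{Channel, Endpoint};

// A hypothetical pool that spreads RPCs across several gRPC channels so
// that fewer HTTP/2 streams share any single connection (and its locks).
struct ChannelPool {
    channels: Vec<Channel>,
    next: AtomicUsize,
}

impl ChannelPool {
    async fn connect(uri: &'static str, size: usize) -> Result<Self, tonic::transport::Error> {
        let mut channels = Vec::with_capacity(size);
        for _ in 0..size {
            // Each connect() call gets its own TCP + HTTP/2 connection.
            channels.push(Endpoint::from_static(uri).connect().await?);
        }
        Ok(Self { channels, next: AtomicUsize::new(0) })
    }

    // Round-robin: each RPC picks the next channel in the pool.
    fn pick(&self) -> Channel {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.channels.len();
        self.channels[i].clone()
    }
}
```

A gRPC client stub would then be built over `pool.pick()` per call, so consecutive RPCs land on different connections and therefore on different per-connection stream state.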
Then, we used a Tokio multithreaded runtime. You can imagine each request coming into the service as a task that gets scheduled onto this runtime. The way the Tokio multithreaded runtime works is that it has one or more workers, each of which is mapped to a thread. Each of these workers has its own task queue, and then there’s a global task queue. There’s no way for us to control which tasks go to which worker. What was happening was that multiple RPCs from the same connection could get picked up by different workers.
Now suddenly we have two workers running on two different threads competing for the same mutex, which, again, is a sync mutex. Just to give a brief overview, a sync mutex essentially means that while a thread is waiting for it, it’s not going to be doing anything else. Every time we hit this contention, say we had only two workers and one of them was waiting for the mutex, we lost 50% of our processing power in the server for that duration. We started looking at solutions. The first was to use a different lock implementation, not the standard sync mutex. We tried that out and saw improvements, but the h2 crate is pretty far down the infrastructure stack, and patching it means entering dependency hell. There was no way we could have managed that in a sustainable way while continuing to upgrade the versions of the crates we are using, so we chose not to ship that change in the service.
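The talk doesn't name the alternative lock implementation that was tried in the patched h2, so as a generic illustration only (the types here are made up): a standard sync mutex parks the whole worker thread, while an async-aware lock merely suspends the waiting task and lets the worker keep polling others.

```rust
use std::sync::Arc;
use tokio::sync::Mutex as AsyncMutex;

// Shared per-connection state, analogous to what h2 guards internally.
// (The names are illustrative, not h2's actual types.)
struct ConnState { in_flight: u64 }

// With std::sync::Mutex, a worker thread that blocks on lock() is parked:
// it cannot poll any other task until the lock is released, which is how
// one contended lock can stall a large share of a small worker pool.
fn record_sync(state: &std::sync::Mutex<ConnState>) {
    let mut guard = state.lock().unwrap(); // blocks the whole worker thread
    guard.in_flight += 1;
}

// With tokio::sync::Mutex, lock() is a future: a task that cannot get the
// lock is suspended, and the worker thread moves on to poll other tasks.
async fn record_async(state: Arc<AsyncMutex<ConnState>>) {
    let mut guard = state.lock().await; // yields instead of blocking the thread
    guard.in_flight += 1;
}
```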
The second solution we tried came from the observation that most of the contention was on the communication between the routing service and the storage service, so we decided, why not just split that work onto a different runtime? Now, yes, that communication could still lock up threads, but it wasn’t locking up the same threads that were being used to serve the requests coming from the client. That helped, although it did have a drawback: when you spawn tasks onto a runtime from outside it, they are a lot more expensive to schedule. When we measured the performance of this approach, we found that the benefits outweighed the drawback. With it, we were able to scale the single node throughput from 32K to 40K TPS. The p99s stayed under 10 milliseconds, and we could actually have gone a little further without hitting the 100 milliseconds here as well.
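A minimal sketch of the separate-runtime idea, assuming Tokio (worker counts and names are placeholders, not Momento's configuration): router-to-storage calls are spawned onto a dedicated runtime, so a worker stalled on the contended lock there doesn't take a client-facing worker with it.

```rust
use std::sync::OnceLock;
use tokio::runtime::{Builder, Runtime};

// A dedicated runtime for router -> storage calls, kept separate from the
// runtime that serves client-facing requests. Worker count is made up.
fn storage_runtime() -> &'static Runtime {
    static RT: OnceLock<Runtime> = OnceLock::new();
    RT.get_or_init(|| {
        Builder::new_multi_thread()
            .worker_threads(4)
            .thread_name("storage-io")
            .enable_all()
            .build()
            .expect("failed to build storage runtime")
    })
}

// Called from a task running on the main (client-facing) runtime.
async fn forward_to_storage(request: Vec<u8>) -> Vec<u8> {
    // Spawn the upstream call onto the other runtime; even if its workers
    // stall on a contended lock, the client-facing workers keep running.
    storage_runtime()
        .spawn(async move {
            // ... issue the storage call here (placeholder echo) ...
            request
        })
        .await
        .expect("storage task panicked")
}
```

The cross-runtime handoff is what makes these spawns more expensive to schedule, which is the drawback mentioned above.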
This was not really enough for us, so we decided to go back and find another solution. This slide shows a very simplified representation of what a gRPC service in Rust looks like. There are multiple layers of abstraction. The stack starts with Mio at the bottom, which is used for low-level I/O and interfaces with what the operating system offers. On top of it we have Tokio, the async runtime, which schedules the tasks and makes sure they’re executed. Then we have Hyper, the HTTP client and server implementation, followed by tonic, the gRPC client and server implementation. Hyper is the one that uses h2, where we discovered the bottleneck. We decided that it might be better to just get rid of a few of these layers. For the communication between the router and the storage service, which was internal to the system and under our control, we decided to get rid of tonic and Hyper. We built a new library.
We built something called protosockets. protosockets is a way to exchange messages over raw TCP connections, and those messages can be encoded in any protocol you choose. We used Protobuf, since both of these services were already using gRPC and already speaking Protobuf. We got rid of the top two layers, and the business logic in the application just used protosockets to talk to the storage service. This is what it looked like after that change: the client interface remains the same, but for the inter-service communication, we are using protosockets and bypassing h2 completely. That gave us pretty good gains. We were able to go from 40K to 65K TPS on a single c7g.2xlarge instance, with the p999 still under 10 milliseconds. We stopped at this point with the Rust-specific optimizations. We might go back to them in the future, but for now, we stopped there.
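protosockets is open source, but its API isn't shown in the talk, so here is only a sketch of the underlying idea: length-prefixed Protobuf messages over a plain TCP connection, with no HTTP/2 layer in between. The request type and helper below are made up; real code would reuse the already-generated Protobuf types shared with the gRPC services (dependencies assumed: tokio, tokio-util, prost, futures, bytes, anyhow).

```rust
use futures::{SinkExt, StreamExt};
use prost::Message;
use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LengthDelimitedCodec};

// Hypothetical request type; real code would use the generated protobuf
// messages that the routing and storage services already share.
#[derive(Clone, PartialEq, prost::Message)]
struct GetRequest {
    #[prost(string, tag = "1")]
    cache_name: String,
    #[prost(bytes = "vec", tag = "2")]
    key: Vec<u8>,
}

async fn send_get(addr: &str, req: GetRequest) -> anyhow::Result<bytes::Bytes> {
    // One plain TCP connection: no HTTP/2, and therefore no shared
    // per-connection stream state to contend on.
    let stream = TcpStream::connect(addr).await?;
    let mut framed = Framed::new(stream, LengthDelimitedCodec::new());

    // Length-prefix each encoded protobuf message so the peer knows where it ends.
    framed.send(req.encode_to_vec().into()).await?;

    // Read one length-delimited frame back; decoding into a response type
    // is omitted because the schema here is made up.
    let frame = framed
        .next()
        .await
        .ok_or_else(|| anyhow::anyhow!("connection closed"))?;
    Ok(frame?.freeze())
}
```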
While talking about all these Rust-specific optimizations, it’s important not to lose track of simpler things you can try. One such change was just upgrading from Amazon Linux 2 to Amazon Linux 2023. Without any other changes, that alone let us get from 65K to 75K TPS, and our latencies decreased at every percentile. Earlier, the highest we were seeing was around 10 milliseconds at p999; it went down to 7.5 milliseconds. There are other changes as well, like choosing the right instance type, that can offer significant performance benefits. This is where we are right now. We have gotten rid of the old JVM-based routing service. Our storage layer and our routing layer are both in Rust, and we have been slowly migrating parts of the control plane as well.
The reason we are doing this is that the control plane does not really need the scale the data plane has, but having a unified language for the backend improves code reuse, makes the system easier to maintain, and puts less burden on the engineers. We have been building new control plane components in Rust as well, and we are slowly migrating the existing ones.
Looking Back – Would We Do It Again?
Looking back, would we do it again? My answer is a yes, but it’s a qualified yes. The main reason for that asterisk out there is that the hit we took to feature velocity was a real business risk, and it might not pay off again, because external factors don’t always remain the same. If we were sure that that risk would pay off, then I would say, unconditionally, yes, it was a good decision to move to Rust, because it let us offer the latency guarantees we wanted. It let us gain more customers because we were able to meet the load test expectations they had. We were able to launch new features cost-effectively, rather than overprovisioning and spending a ton on infrastructure.
On the con side, of course, like I said, the hit to feature velocity was a risk. The unstable ecosystem and the constant changes in the crates we were using caused outages and could potentially cause more. And when we discovered problems in the low-level crates, the effort needed to work around them was substantial.
Is Rust the Right Choice for You?
With that, I’ll go to the question of whether Rust is the right choice for you, even though it was the right choice for us. The best solution is not the one recommended by most people; it’s the one tailored to your specific problem. When choosing the solution for your problem, the things you should consider are: What are the SLAs you want to offer? If your product does not need to serve millions or hundreds of thousands of TPS, and you do not have tight latency constraints, you might be better off with a different language.
The second aspect is the business risk. What stage are you in? What’s the business goal? If you are trying to just attract more customers and grow revenue, it might be better to throw money at the problem and go with an existing, simpler solution. Are costs a concern? Again, overprovisioning works; it can solve the problem if you can throw enough money at it. You also need to consider how long it takes developers to ramp up on Rust, and how frequently you see churn in your engineering teams. How much does your application rely on third-party tools and libraries? Since the Rust ecosystem is fairly new, you might not find all the libraries you’re looking for, compared to JavaScript or Java.
Questions and Answers
Participant 1: Do you find that developers are as productive in Rust as they are in Kotlin?
Ramya Krishnamoorthy: It’s been a while. Our productivity has increased over the last year. When we first started, they were not as productive because Rust does have a steep learning curve.
Participant 1: You think it’s just the learning curve?
Ramya Krishnamoorthy: Now I think we are more productive because we have established these patterns in the code. There is a template to follow when you are adding a new API. That makes it faster and improves the productivity.
Participant 2: Did you actually measure that?
Ramya Krishnamoorthy: The developer productivity?
Participant 2: Yes.
Ramya Krishnamoorthy: No, we did not.
Participant 2: The patterns with Kotlin, would you have seen the same productivity gains in Kotlin?
Ramya Krishnamoorthy: Probably, yes, I would say that. Yes.
Participant 3: Can you say a little bit about how long that learning curve is and what are some of the most challenging things to adapt to when you’re moving from a language like Kotlin to a language like Rust?
Ramya Krishnamoorthy: I would say one of the most challenging things that I personally saw with Rust was the Rust compile time [inaudible 00:41:50]. Then, you can get tempted to just bypass the problem by using an unsafe feature. That was one of the things that I personally found hard. Actually, I’m not comparing it with Kotlin; I’m more comparing it with C++ in my mind.
Ramya Krishnamoorthy: How long does it take a developer who’s familiar with Kotlin to become familiar with Rust?
That’s going to depend a lot on the individual developer, but if I were to just throw out a number there, I would say probably a month. I would recommend that they start off with writing non-production code in Rust to just gain experience.
Participant 4: You’re showing tooling around your performance metrics and obviously doing that through load testing. Could you comment on maybe tools that you’re fond of or things that were useful in doing that on a regular basis?
Ramya Krishnamoorthy: For the load generator, we use a tool called rpc-perf, and one of the authors is Brian from IOP Systems. We work closely with them to develop new features onto rpc-perf as well. It’s a load generator that’s also written in Rust. It can send a high TPS of requests to your service. They also develop a tool called Rezolus, which is a telemetry agent designed for very low-level system metrics at high resolutions. Things like how many syscalls are being called, or how much lock contention are you seeing at the OS level? We use Rezolus to gather those metrics. For flamegraphs, we just use the cargo flamegraph crate.
Participant 5: I was wondering if garbage collection was the main bottleneck or one of the bottlenecks, have you tried to run without a garbage collection on the JVM side?
Ramya Krishnamoorthy: Do you mean just turning it off completely?
Participant 5: Yes.
Ramya Krishnamoorthy: We did, and we ran out of memory very quickly.
Participant 5: No, you can. There’s the Epsilon garbage collector, or really rewriting parts of it and disabling it.
Ramya Krishnamoorthy: I don’t think we tried completely disabling the garbage collector, but we did try making it run less frequently. That did not help us because we are doing a lot of allocations. Every time a new request comes in, we are allocating new memory for that request based on the item size that’s being operated on, and it did not help us. We ran out of memory.
Participant 6: It sounded like you went to Rust just because your developers thought it was cool and it was hip. Is that correct?
Ramya Krishnamoorthy: That was one of the factors.
Participant 6: Did you do any analysis on other languages before you jumped into Rust? Did you do any analysis on Kotlin before you picked Kotlin?
Ramya Krishnamoorthy: We did not do an analysis on Kotlin because we had developers who had used Kotlin. I personally worked on a Kotlin service at AWS that handled hundreds of thousands of TPS, so we did not do a lot of analysis on using Kotlin. At that time, our priority was to minimize the time it took to launch. As I said, we did not explore other garbage-collected languages because we decided we just didn’t want garbage collection, and C++ was not a popular choice. We were also exploring storage engines that were written in Rust, so it would have helped to have a unified language for everything. That’s another reason we settled on Rust.
Participant 6: How long were you using the Kotlin system before you transferred over to Rust?
Ramya Krishnamoorthy: Around a year.
Participant 7: You mentioned the tooling that you use for Rust, rpc-perf and Rezolus, licensing on those, or maybe, if you know off the top of your head, or Brian? What’s the license model?
Brian: They’re Apache 2.0, they’re open source.
Miao: I do think that for us, as an early-stage startup, a lot of the decisions aren’t purely technological. In fact, most of the decisions are human driven, meaning that you have to have something that gets the team excited, unifies the team, and moves us forward as a unified whole. That’s why I think it ended up being not purely a programming-language analysis, but more like, what does the team feel excited to go and quickly implement and get out to market? We ended up picking Rust for that reason.