Transcript
Simpson: My talk is, Fearless Programming with Rust. When I started off learning Rust, I was reading the Rust book, the official book, and it really resonated with me when they started speaking about option types, and result types, algebraic data types, and all those things. It occurred to me at that time, or it was the first time it clicked, that you could move this burden of discipline from yourself to the language.
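The option and result types being described can be sketched in a few lines. These functions are purely illustrative (not from any real codebase): the point is that absence and failure live in the type system, so the compiler, not team discipline, forces every caller to handle both cases.

```rust
// Option encodes absence, Result encodes failure, both in the types.
fn find_user(id: u32) -> Option<&'static str> {
    match id {
        1 => Some("senyo"),
        _ => None, // absence is part of the signature, not a runtime surprise
    }
}

fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse::<u16>() // failure is part of the signature too
}

fn main() {
    // The compiler will not let you use these values without covering both arms.
    match find_user(1) {
        Some(name) => println!("found {name}"),
        None => println!("no such user"),
    }
    match parse_port("8080") {
        Ok(port) => println!("port {port}"),
        Err(e) => println!("bad port: {e}"),
    }
}
```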
Previously, I wrote a lot of code in dynamic programming languages, and that required you to have quite a lot of discipline within your team, specifically, in order to ensure that your code stays maintainable over a long period of time. With Rust, the compiler and the language really enforce this level of quality on you. You knew that your production code was well-behaved no matter who wrote it. If Senyo Simpson wrote it, you might not trust it before Rust, but after Rust, for sure. It has a lot of this tooling to just ensure that the quality is maintained at a good standard. The result of that is this confidence.
Rust promotes a sense of fearlessness, and you can approach your code with some recklessness and work with the compiler to bring it up to a certain standard and quality, as opposed to a lot of other languages where you have to be a lot more meticulous with your code in order to ensure that it works really well. This is the point of my talk, and I'll just carry this theme through. Really, what I want to focus on is the feeling and the emotion that you get, and how Rust promotes that.
Background
Who am I? I’m Senyo Simpson, a software engineer at Fly, and interested in systems programming, mainly networking. A lot of my talk revolves around writing a proxy in Rust. Then, I write occasionally at senyosimpson.com. Tweet, @senyeezus, and then also have Bluesky @senyeezus.bsky.social.
On the menu today, first I'll talk about the values of Rust and software values generally. That's inspired by Bryan Cantrill, a very wise software engineer. Just going to talk about how that's formed what Rust is today. Then talk about where Rust shines and some lessons from the front lines, which is just my experience writing Rust. Then, the harder parts of Rust and its approach to embracing complexity that's just inherent in our domain.
Values in Software
Values in software, to understand how Rust ended up having this feeling of confidence, you have to understand the values of software. Software choices are based on its values, and those values are born from a context and then furthered on by the community. One of my favorite examples of this is DHH with Ruby on Rails, and he really cares about your flexibility, your terseness, and your ability to be fast with using your programming language. You can see that he really cares about that over what some of us would call type gymnastics and having this code that's really robust and can live over long periods of time, if you prefer that way of programming. He prefers the former. Then, those values are taken on by the community. You'll see a lot of choices that they've made work with that, and similarly with Rust: a lot of the choices that Rust has made reflect the values it cares about, which I'll talk about. Those values are important as they direct the evolution of that software.
If you're deciding between path A and path B, you consult your values and say, we care about performance, we care about flexibility, and that evolves or sets the direction for your software. A mismatch in values creates an ever-growing chasm between your value system and the software value system. An important thing that I want to note is that it doesn't make it wrong, it's just different. If you have software that requires a lot of performance and safety, and you use a language that doesn't have those values, over time it'll make certain decisions that are counter to what you care about, and it'll drift. That's not really a problem, it's just that, again, it's different. Some ideas of what I mean by values: you can think about velocity, correctness, safety, approachability, performance. These are all different types of values that you can embed in that software as you make choices and as it evolves.
The Values of Rust
Rust values, these are things that we all know and love, so things like performance. I think that's pretty much one of the biggest stories of Rust, the performance and the safety coupled together. It really cares about performance, and it's made a lot of choices focusing on zero-cost abstractions. It had green threads prior to 1.0, and they removed them because they imposed a global overhead cost. You can really see that they care about that. Safety, similarly, in the borrow checker and having all the aliasing rules.
Similarly, it cares a lot about safety and correctness. Here I mean that your program operates correctly under different states, and so it enforces that you handle all your errors or you handle null values and such things. Then, finally, ergonomics. We have the try macro, we have iterators, and a lot of this tooling that gives you high-level ergonomics in a low-level language. All these values basically contribute to Rust's story around confidence, and it's really become a core narrative. If you look in the Rust book, in the foreword, there's this quote there that says, Rust empowers you to reach farther, to program with confidence in a wider variety of domains than you did before. It goes on to say that programmers who are already working with low-level code can use Rust to raise their ambitions.
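A small sketch of the ergonomics mentioned here (the functions and inputs are illustrative): the `?` operator, which grew out of the `try!` macro, propagates errors without boilerplate, and iterator adapters express the same logic as an explicit loop while compiling down to comparably tight code.

```rust
use std::num::ParseIntError;

// ? returns early with the Err if a parse fails, otherwise unwraps the Ok.
fn sum_ports(input: &str) -> Result<u32, ParseIntError> {
    let mut total = 0;
    for field in input.split(',') {
        total += field.trim().parse::<u32>()?;
    }
    Ok(total)
}

// The same thing with iterator adapters; sum() over Results short-circuits
// on the first error, so no explicit error handling is needed in the body.
fn sum_ports_iter(input: &str) -> Result<u32, ParseIntError> {
    input.split(',').map(|f| f.trim().parse::<u32>()).sum()
}

fn main() {
    assert_eq!(sum_ports("80, 443, 8080"), Ok(8603));
    assert_eq!(sum_ports_iter("80, 443, 8080"), Ok(8603));
    assert!(sum_ports("80, oops").is_err());
    println!("ok");
}
```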
Further on it says, can tackle more aggressive optimizations in your code with confidence that you won’t accidentally introduce crashes or vulnerabilities. You can see this story emerging where Rust has made a lot of decisions about performance, about safety, and all these things really come together to give you this feeling of confidence and fearlessness when you approach your program.
Lessons From the Front Lines
From my experience, I’ll just go over some of the stuff that I’ve experienced in other companies and what they’ve put out in the public domain. For me specifically, the things that have really resonated are these four ingredients, which I want to say is performance, predictability, correctness, and maintainability, which I’ll talk to. Before that, I just want to talk about my experience using Rust in production. Like I said, I work at Fly.io, which is a public cloud built for developers who ship, and another way to think about it is it’s a developer-focused cloud. We really care about the developer experience of you rolling out your applications on our cloud. There, I’ve worked on what we call Fly Proxy, which is our internal proxy written in Rust. It’s built on top of Tokio, Hyper, and Tower. It drives our internal Anycast network.
Every single request to an app on our platform goes through Fly Proxy. It’s responsible for routing requests from the edge to the nearest instance of your application. We have an edge network in over 30 regions. If you’re in South Africa and you make a request, it will get routed to somewhere in South Africa. If you’re here, it’ll get routed to the nearest edge somewhere in America. A really important part of Fly Proxy and a constraint that we have is what we call this 100-millisecond rule. That’s just a general rule, which says that interactions feel fast when they happen in about 100 milliseconds.
For our end users or people that are using our platform to run their own apps, they would care specifically about this 100-millisecond rule. We don’t want Fly Proxy eating up a lot of that latency budget. If Fly Proxy took 98 milliseconds, it would basically blow everyone’s budget for having responsive applications. We really have this constraint that we want to be as quick or as fast as possible. We want it to be predictable at the same time so that we don’t have cases where for 95% of the time, it’s fast, and for the other 5%, it’s pretty bad.
To give an overview of how Fly Proxy works, so imagine we have clients in France, Mexico, and Rwanda, and then we have edge hosts. This forms part of our edge network. In this case, we have one in the UK, in Canada, and South Africa. On each host, we run Fly Proxy and Corrosion. Corrosion is our state dissemination tool. That basically broadcasts changes to our platform globally. Then, we have workers, which is where your app instances actually run. Again, those will still have Fly Proxy and Corrosion, and then it will have your app instances there. Imagine you’re sitting in Rwanda. When you make a request, it’ll land on an edge proxy or an edge host in South Africa. That happens through Anycast routing. That essentially means that every edge host announces our total range of IP addresses, and then through internet routing, essentially, it gets sent to most likely the nearest geographical location to where you are.
If you’re in Rwanda, it’s in Africa, and we only have South Africa on the continent as a region. It’ll get routed to South Africa. From there, it’ll go to the nearest app instance that you’re running. From South Africa, that would be the UK. From there, it goes from one Fly Proxy to the other Fly Proxy, and then it’ll finally reach your application. One reason why we run Fly Proxy on both edges and workers, because in theory, it could go from the edge straight to your application, is that if you have an organization with multiple applications, we allow, or it’s possible for your app to send a request to other apps in your organization.
You might want that to happen over Fly Proxy. We make it really easy for you to have communication with instances directly, but you might want it to happen over Fly Proxy because that offers things like TLS termination and load balancing. Fly Proxy and Rust, we like to say that they're happily married. They have a big overlap in terms of values. For Fly Proxy, we care about performance, predictability, reliability, and safety. Rust really meets all of those goals. We expect over time that, as Fly Proxy matures and Rust matures, Fly Proxy will stay in alignment with Rust over a long period of time. Just speaking to those values aligning really well.
Going back to those ingredients. The first one I want to talk about is performance. I’ll speak about it in terms of Fly Proxy in a moment, but one thing that’s really important and one part that really contributes to this idea of confidence is that Rust is fast. It’s really fast. Oftentimes, it’s fast without you having to do anything specific. You get this high-level language or what feels like a high-level language and you get to implement whatever you want. You don’t really have to go out and tweak everything specifically and tune it for a lot of performance. You pretty much get that out of the box. In cases where obviously you’re pushing the edge of your performance, of course, you want to go out and put in a lot of fine-tuning into your algorithms and such.
For the most part, it'll work pretty well. Speaking to Fly Proxy, so asynchronous Rust is at its foundation and it's powered by Tokio. Fly Proxy is composed of these admin and processing tasks. The processing tasks are just the proxying, so from client A to instance C or whatever. The admin tasks are things like updating state. We actually run downscaling out of Fly Proxy: we scale down instances of your application if the load has tapered off. Other things like exporting metrics run as admin tasks in the proxy too. This has worked super well for us. There's been a lot of talk in the Rust ecosystem about async/await not being that great, or that the wrong decisions were made and now you have to care about Send traits and Sync traits and whatever. Those concerns are valid, but for us, it's worked really well and we've been happy with it.
As a result, performance issues are pretty rare. We basically don’t run into them for reasons that are the language or the libraries that we use. However, historically, we’ve run into them because of locks and long-running tasks. I do want to reiterate again and say that we haven’t run into performance issues because of Rust specifically or because of the underlying libraries like your Tokio that does async/await, or the scheduler underneath it. For long-running tasks, we have these admin tasks, as I said, and sometimes, unfortunately, they’ve taken up to 700 milliseconds.
Remember, as I said before, you have about 100-millisecond budget for this thing to feel responsive, so 700 milliseconds is unacceptable, and blows that out 7 times over. That has cascading effects. One problem there is that when these tasks run on your runtime, if they don’t yield the CPU, then it blocks other tasks from running. You’d have this one task taking 700 milliseconds but causing like a cascade of performance issues down the line.
Some remedies. One way you can do it is to run an entire task on a separate thread. That won’t block the other tasks from running on the runtime. It’s like separate the two worlds, essentially. Or you can use the Tokio machinery and spawn a blocking task. Tokio has a mechanism for running tasks that will block or basically are CPU bound. It essentially does the same thing as running the entire task on a separate thread. It’s just that it’s embedded in the Tokio machinery, essentially. For locks, we have locks for fetching fresh state when proxying. When you’re making a request, obviously we want our requests to land at the right instance, especially as things are always changing on our platform. These locks are read and write, but we get a lot of write traffic.
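The first remedy described here, moving the whole task onto a separate thread, can be sketched with only the standard library (the workload is a stand-in for a real admin task). Within Tokio itself, `tokio::task::spawn_blocking` plays the equivalent role while staying inside the runtime's machinery.

```rust
use std::thread;

// Push long-running, CPU-bound "admin" work onto its own OS thread so it
// cannot starve tasks sharing the async runtime's worker threads. The
// summation here is just a placeholder for expensive work.
fn run_admin_task_off_runtime() -> u64 {
    let handle = thread::spawn(|| (0..5_000_000u64).sum::<u64>());
    // ...the latency-sensitive proxying path keeps running here, unblocked...
    handle.join().expect("admin task panicked")
}

fn main() {
    let result = run_admin_task_off_runtime();
    assert_eq!(result, 12_499_997_500_000);
    println!("admin task finished: {result}");
}
```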
The reason why is that we have, I think, over 3 million applications running on our platform. These are getting updates all the time. Each proxy needs to be able to route requests to any instance of the application running anywhere. If a request comes from the USA and your app is in Hong Kong, we've got to be able to handle that request and vice versa, pretty much globally. Every proxy has to have a complete picture of the whole world of our platform. We get a lot of this write traffic. When we do that, we get these hotly contended locks, which of course cause performance degradations.
Remedies there. One option that we've used quite a bit is finer-grained locking. You take locks that are held for long periods of time, or that do a lot of work while held, and you shorten the amount of time each lock is held for. That's one option. There's another bigger option that's in theory available, which is shared-nothing.
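The lock-hold-time idea can be sketched like this. The names and data here are illustrative, not Fly Proxy code: the shape of the fix is to copy out just what you need and drop the guard before doing any slow per-request work.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Routes = Arc<RwLock<HashMap<String, String>>>;

// Coarse shape: the read guard lives for the whole operation.
fn route_coarse(routes: &Routes, app: &str) -> Option<String> {
    let table = routes.read().unwrap();
    let backend = table.get(app)?.clone();
    // ...imagine slow per-request work here, with the lock still held...
    Some(backend)
} // guard dropped only here

// Finer shape: take the lock, copy the entry, drop the guard immediately.
fn route_fine(routes: &Routes, app: &str) -> Option<String> {
    let backend = {
        let table = routes.read().unwrap();
        table.get(app)?.clone()
    }; // guard dropped here, before any slow work
    // ...slow per-request work happens lock-free...
    Some(backend)
}

fn main() {
    let routes: Routes = Arc::new(RwLock::new(HashMap::from([(
        "my-app".to_string(),
        "worker-jnb-1".to_string(),
    )])));
    assert_eq!(route_fine(&routes, "my-app").as_deref(), Some("worker-jnb-1"));
    assert_eq!(route_coarse(&routes, "missing"), None);
    println!("ok");
}
```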
The idea with shared-nothing is that different processes don’t share any state. You wouldn’t have to acquire any locks because they run completely independently. That is again, in theory, an option, but it would be a big lift and shift for us. You have to think about that from first principles and design your application around it. That has been done, like Envoy proxy pretty much works like that, where all these separate processes run basically as single-threaded processes. They have a mechanism for sharing state, but that’s the only lock that they have in their whole system. For us, that’s not an option available to us right now. It could be in the future. Then, other remedies. Longer deployment rollouts, that’s not obviously a direct remedy, but from experience, one thing we’ve noticed, or one thing that’s happened a few times is that we’ve deployed something into production, give it two or three hours to marinate.
Then, all of a sudden, you’re getting this huge performance degradation and that can come from just certain locks being held over time and the amount of state that needs to get updated changes over some period of time. Having slower rollouts can work. In the case of Fly Proxy, that’s possible because we can deploy it regionally. Like I said, we have over 30 regions. We can do a not follow the sun methodology where we can deploy it in regions that aren’t super active at a point in time. If it’s daytime in the U.S., it’s probably nighttime in South Africa, so you can deploy it in South Africa and let it run there for a few hours, or days, or whatever, and see how that works. Obviously, that’s not foolproof, things can still go wrong, but it does allow for the possibility of finding those out before doing a global deployment. Then, we have an ongoing effort to reduce the amount of state each proxy needs.
As I said before, currently every proxy needs global state for our whole network across everything. If we're able to reduce that to maybe just needing regional data, we'll have less write traffic, which means that that lock would be less highly contended. Some of these remedies we've implemented. You can see, previously, we had these huge spikes in latencies up to like 1 second. Again, for a proxy, you have a 100-millisecond budget, and we don't want to be blowing that for our end users. Before, yes, we had these huge spikes in latency when some of these admin tasks ran or when locks got held for too long. Then, over time, after some fixes, you see a much smoother profile, better performance. We still have some latency headaches there, but it's obviously still getting better.
That brings me to my point on predictability. Predictability, again, is a really important part of the proxy and we really care about it, again, for this 100-millisecond latency budget. This is from Discord where they have a program written in Golang. The Golang one is the one in purple. You can see every few minutes you get these spikes. That’s actually the garbage collector running.
Then, in the blue, they have a Rust program that they converted from Go into Rust. You can see that that has a much smoother profile. They were able to take out the spikes from the garbage collector by using Rust, which doesn't have a garbage collector. It runs as expected, or to the requirements that they needed. Correctness, again, as I was saying, really means that your program behaves well under multiple states.
If it returns an error or returns none types or null values, you want your program to work in the correct way. From OneSignal, writing about OnePush: "OnePush needs very little attention. We were able to leave it running without any issues through the holiday break". They go on to say that regressions are very infrequent. "There's a huge class of bugs in languages like Ruby that just aren't possible in Rust. When combined with good test coverage, it becomes difficult to break things, all thanks to Rust's fantastic type system". Again, reiterating this idea of the confidence that Rust can give you because the language forces a certain level of quality and some standards. Maintainability, and this ties in to the previous point.
Again, from OneSignal, they were saying that the compiler and type system make refactoring basically foolproof. We like to say that Rust enables this belligerent refactoring, which is basically what I want to say is the same as the reckless approach that you can take to writing Rust. They were able to make these dramatic changes, and then working with the compiler to bring your project back to a working state. In this one, they say, “Compiler-driven development in Rust is so amazing. Cannot ever imagine going back to a language that doesn’t have the ability to let me make a root change to something and then guide me through all the different parts that are affected and need to be changed”.
Tales of Complexity
Rust is not all sunshine and rainbows. We don’t live in a perfect world. Engineering, as you all know, is a game of tradeoffs. Rust, in some senses, has made this tradeoff of embracing some level of complexity to give you all of this performance and safety and stuff like that. Rust does not shy away from complexity. It embraces it. This is not to say that it tries to make things complex, but there’s some level of complexity that is there in our domain. Examples of this is like the borrow checker, pinning, which happens because of self-referential structs when designing futures. My favorites, generics and trait bounds, which I’ll go into.
This person says, "I don't get the hate Rust gets for being complicated. This is perfectly readable", which is extremely debatable. I've been writing Rust for some time, so I can read it, but it still takes my brain some time to parse and understand what exactly is going on here. Here's another example: when someone says Rust is a very low-level language, I'll show them this. Here, there's a lot of things going on. We've got const generics, we've got lifetimes, we've got trait bounds, phantom data types. There are tons of concepts here. It's not as simple as we might like it to be. This is my favorite example, and what I say is, when my Rust stops Rusting. Even still now, making sense of this is actually quite difficult. There are some things and some features in the language that make it difficult. As funny as it may be, it actually does have some impact in production for many cases.
We have a very esoteric case that we ran into at Fly. We had this proxy logic and we separated it into Tower services. Tower is a crate that basically has a trait called the service trait, which models an asynchronous function that takes a request and turns it into a response. A Tower service typically looks like this. There you can see, we're implementing the service trait for this timeout. It's got three associated types: response, error, and future.
Then, this timeout struct has an inner service, you can see it being called, like self.inner.call. This allows you to create middleware fairly easily using this. Again, it has quite a heavy usage of generics, and so that can get complicated fast. This is just to show how you can make a middleware stack. You can see you have all these layers, and then you have a service at the bottom. Then, that's the network path that it would take, for instance, if you had an API. In this specific case, like I'm saying, it's an esoteric case, but there's a point to it at the end. The one downside in our case was that these compile times got much worse. I think it was taking around 4 minutes, and after a fix, it took a few seconds. This happened to be a rustc bug. No one's to blame for that. That bug came from this thing called higher-ranked trait bounds.
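The wrapping pattern being described can be sketched with a deliberately simplified, synchronous stand-in for the trait. To be clear, this is not Tower's actual `Service` trait, which is async, has `poll_ready`, and a `Future` associated type; the sketch only shows the middleware shape: a struct holding an inner service and delegating via `self.inner.call(...)`.

```rust
// Simplified, synchronous stand-in for the service trait pattern.
trait Service<Request> {
    type Response;
    type Error;
    fn call(&mut self, req: Request) -> Result<Self::Response, Self::Error>;
}

// The innermost service: turns a request string into a response string.
struct Echo;

impl Service<String> for Echo {
    type Response = String;
    type Error = String;
    fn call(&mut self, req: String) -> Result<String, String> {
        Ok(format!("echo: {req}"))
    }
}

// A middleware: rejects oversized requests, otherwise delegates to the
// wrapped service, the same way a timeout middleware wraps its inner one.
struct LengthLimit<S> {
    inner: S,
    max: usize,
}

impl<S> Service<String> for LengthLimit<S>
where
    S: Service<String, Response = String, Error = String>,
{
    type Response = String;
    type Error = String;
    fn call(&mut self, req: String) -> Result<String, String> {
        if req.len() > self.max {
            Err("request too large".to_string())
        } else {
            self.inner.call(req) // delegate to the inner service
        }
    }
}

fn main() {
    let mut svc = LengthLimit { inner: Echo, max: 8 };
    assert_eq!(svc.call("hi".to_string()), Ok("echo: hi".to_string()));
    assert!(svc.call("way too long request".to_string()).is_err());
    println!("ok");
}
```

Stacking more middleware is just more wrapping, which is where the heavy generics (and the compile-time trouble described next) come from.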
Essentially, in a trait bound, so you can see that where for F, the part of the code after F is the trait bound. In trait bounds, there's no easy way of specifying lifetimes. We have this higher-ranked trait bound feature or syntax where you have this where for<'a>. You can specify the lifetime with that 'a there. For services, you can have something that looks similar to this. You can see we have two trait bounds here, one for the service and then one for the future associated type for that specific service. This is what caused the 4-minute compile times where it should have been a few seconds. Some remedies for that. The point of why I'm bringing this up is that you can redesign the types to avoid this. That's a big lift and shift. We would have had to go back through a lot of our code and figure out how to make these types work better, to either not use lifetimes or find some other scheme to make that work.
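A minimal illustration of the `for<'a>` syntax being described (this is a toy example, nothing like the Fly Proxy bounds): it reads "for every lifetime 'a", and it's how you write a bound on a function that must accept borrows of any lifetime the caller chooses.

```rust
// first_token borrows from its input, so its signature ties the output
// lifetime to the input lifetime.
fn first_token(s: &str) -> &str {
    s.split(' ').next().unwrap()
}

// Without for<'a>, there would be no place to name the lifetime that
// links F's argument to its return value: the borrow only exists inside
// this function's body.
fn apply_to_greeting<F>(f: F) -> String
where
    for<'a> F: Fn(&'a str) -> &'a str,
{
    let text = String::from("fearless programming");
    f(&text).to_string()
}

fn main() {
    assert_eq!(apply_to_greeting(first_token), "fearless");
    println!("ok");
}
```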
Another option that's available to us and what we ended up using, which I call our escape hatch, is we could use fewer services and more functions. Instead of having each service written out and implementing the service trait, we could just have one service and then call a bunch of functions underneath it. That's what we did. It took it from about 1 minute to 4 seconds. You can see that I got a callout for saying it has a lot fewer scary bounds because at the time, which was two years ago, I complained that the bounds are scary. They are still scary now.
One thing I like to say, or one question I like to ask about Rust part of the time, is: are you solving the problem or are you solving a puzzle? In Rust, sometimes you already know how to solve your problem. If you want to write some code that you've written before, you already know what to do. Then you end up doing all these kinds of type gymnastics to make it work the way that you want it to work, especially when you're trying to work with other crates or other traits that aren't in your control. You can really get caught up in this. It's not extremely common, but it is something to think about in the language. This user says, yes, Rust essentially requires a lot of knowledge of the language before you can be productive with it, which is a problem. They go on to say that a language like Go, however, only requires the fundamentals. In the past, I would have said, you should just go and learn Rust and you'll be fine. After using Go for the last year, I do understand the sentiment behind this.
Then, to Rust or not to Rust is a question that I get a lot of the time: when should I use Rust, or is it worth adopting for certain types of applications? The most important thing, I think, when it comes to making this decision is whether or not you need the performance. Of course, Rust can be used without needing a certain level of performance, but that's really the thing the question hinges on. If you do need the performance, Rust is oftentimes the best or easiest option to use. It's fast. It's safe. Of course, those two coming together is a really powerful combination. If you don't need that, then you have pretty much all the languages available to you.
One reason why I would still suggest it and/or in a case where you think it would still work is if you have a long-lived software project and you wanted to have the language enforce a certain level of quality on people or on your teams instead of having to embed that in the actual discipline of each individual contributor. Obviously, code passes through a lot of hands over a lot of time. Having the language enforce that quality is really good. That’s one case where I would use Rust despite whatever performance issues, or even if you don’t have any performance related concerns.
For the most part, any other language, you can use them. It’s pretty much on the table. At Fly, we use a lot of Golang in cases where we don’t have performance necessities. Another part that’s really been interesting from experience is that we’ve had situations where people have come onto teams and we don’t prescribe exactly which team you need to work on at a specific point in time. A lot of people feel quite comfortable with Go. It’s a lot easier to learn. You feel productive very early on. Over time, you’ll find that some people will just end up gravitating towards a lot of the projects that already use Golang because they don’t have to go through this headache of learning Rust in particular. In our case, on balance, it was worth it every time. It’s been really good, as I said, for Fly Proxy and the needs that we have. I think it’s paid its dues for us, despite whatever complexity there is in the language.
Rust and Confidence
Then, just to tie up a few last quotes on Rust and confidence. "Rust as a language is different, not because of its fancy syntax or welcoming community, but because of the confidence one gains when writing a program in it". A friend of mine, Evance, says, "I don't feel safe or confident in any language as long as I'm the one producing the code", which maybe you could consider a skill issue, but I'm the same. I think there's a certain level of complexity in the Rust language we're all happy to deal with in exchange for eliminating certain types of bugs, which just echoes my sentiments from earlier. Then, yes, this long paragraph, the two blue quotes: "So far, developing in Rust has given us a slight confidence edge that we may not otherwise have had. That's given us incredible peace of mind and freed up mental bandwidth". Who doesn't want that? Fearlessness: Rust affords us the ability to code with confidence, to code fearlessly.
Questions and Answers
Participant: I’m brand new to Rust. I’ve literally never used it before. What would you recommend as learning resources for someone getting started in it?
Simpson: It pretty much depends on how you learn. There are multiple ways to go about it. For me, I read the Rust book. If you just type in, Rust official book, it’ll come up. That talks through a lot of the concepts there. Some people like to use Rustlings. That’s like, you get these small programming challenges, and you can work through that. It teaches the language through that. Otherwise, there’s a few video courses on Rust as well. I’m sure if you Google them, they’ll pretty much come up. There’s one, Crust of Rust. If you’re doing web development, there’s, “Zero to Production”. That’s a really good book. Lots of people have enjoyed learning from that. That’s a good way to learn if you’re doing it through web development and you already know how to do a lot of web development. That’s my recommendation so far.