Latency: The Race to Zero…Are We There Yet?

News Room | Published 10 April 2026

Transcript

Amir Langer: I’ll be talking about latency: low latency, zero latency. Physics suggests that zero latency is impossible, but we are still in this race, still trying to get as close as possible to zero. We need to start by asking why. Why is it so important? Why do we invest so much into reducing latency? Latency matters. In the fintech industry we can link latency directly to profit and money. If I have lower latency than the competition, I can get to the better deals and make the better deals.

In previous companies where we had a trading system, we could see exactly which market makers had the lower latency: they were the market makers with the smaller spreads. They could afford that because they knew they could get in faster and change their prices faster when the market changes. Low latency definitely matters. It’s not just low latency, though; predictable latency also matters. If I send an order to a trading system, I really don’t care what the average latency of that trading system is. I only care about the latency of my order. Low latency also matters in recovery, because while we recover, we are unresponsive.

Unresponsive means unavailable, and we want to minimize that. When I talk about latency in this talk, I mean the time it takes to do an operation in a mission-critical system like a trading system. This is a highly distributed system, but it also has other quality attributes like scalability and resilience. Most of the time will be spent on communication between the different components in the system, so we need to send those messages, and send them fast. We must also not lose any message.

Background

In this talk, I will cover some famous examples from the past where latency was reduced and what we can learn from them. We’ll talk about the challenges of the present day. We’ll also think about what we can do in the future to reduce latency further. Latency is also a metric, and we are used to seeing numbers, so we will see numbers: we will go over a single scenario and see how its latency can be reduced by quite a few orders of magnitude. Then, again, we’ll think about what’s next. How can we reduce latency even further?

My name is Amir Langer. I’ve been a software developer for quite a long time. I joined the fintech industry in 2007, when a new startup called Tradefair was forming. Later on, it changed its name to LMAX and was made famous by the LMAX open-source Disruptor project. We will go over that project as well. Today I’m at Adaptive, the home of Aeron. A lot of the ideas in this talk are ideas that we are developing, together with Martin Thompson, the father of the Disruptor and Aeron, and others, in a future project called Aeron Sequencer.

The Past – Latency Reduction

Let’s go to the past, all the way back to the Roman Empire. The Romans had something they called the cursus publicus, built on the famous Roman roads. It was more than just the roads: there was a system of horses and wagons, with points along the way where Roman officials could replace the horse, the wagon, or the rider, so officials could send messages and goods at much lower latency than private individuals. We move forward in time to the invention of the electrical telegraph.

The interesting thing is that this invention is directly related to latency. The inventor, Samuel Morse, was away on a business trip when his wife died unexpectedly. He didn’t get the message in time, and he was so devastated that he decided to invent something so future generations would have a better chance of getting the message in time. We move forward a few years to the Pony Express. This was a great success story: they managed to reduce the latency of sending messages from Missouri to California from 3 to 5 weeks down to only 10 days. How did they do it? They had a system of horses and riders, with points along the route where they could replace the horse or the rider. Really a huge success story, but the solution had been known for more than a thousand years. Nothing was invented here. That would be the first interesting thing about the Pony Express. The second interesting thing is that the company went bankrupt in 1861, just 18 months after this great success. Anybody want to guess why? The invention of the telegraph. The telegraph made its way to the west coast of the U.S., and suddenly there was a much better solution.

The Present – Challenges

In the software industry, until about two decades ago we had an easy, simple solution to reduce latency: we just needed to replace the hardware, and that was it. We immediately got a latency boost. This is no longer the case. The world has changed, and we need to rethink how we build our software. Modern processor designs are far more complex than they ever were. We have many more caching layers, and they take more physical space. Throughput is still increasing, but latency is getting higher. We have a lot more shared memory across our cores today, but we need to take advantage of that shared memory; it’s not enough that it’s there. We have the cloud, and the cloud is everywhere. The cloud hides a lot of complexity for us, but nothing is free; there’s a cost. We have many more layers of abstraction that we need to go through. The scale of distributed systems is bigger than ever before. Until a decade or so ago we could still talk about transaction locks or CPU clocks; now the communication is the bottleneck. That’s the real problem for latency.

If your system architecture looks like this, then we have a problem. If you need to fight through layers of abstraction, that’s another problem. What do we developers like to do? What’s the naive solution? A lot of the time, I see developers just turn the volume all the way up to 11 and throw more resources at the problem. If we have an out-of-memory error, we give it more memory and try to make it stop. If we have a system with high latency, we might give it more threads, more CPUs, and we hope and pray that the underlying framework and the underlying operating system will somehow magically find the optimal way to run our scenario and our system. That doesn’t work.

What does work? What is the best way to reduce latency? The best way is to not think about any other quality attribute: just strap yourself to the rocket and go as fast as possible, without caring whether you blow up in the middle or where you land. It is simple to send messages fast if you don’t care whether you lose any of them. For us, that doesn’t work. What can we do? Can we design a low-latency system that doesn’t compromise on other quality attributes? As we said, nothing comes for free, so what are the tradeoffs? In 2010, we open-sourced the LMAX Disruptor project. The Disruptor was a very efficient way of sending objects between threads in Java, so really a very efficient queue. We had all the tricks in the world: no memory allocation, and we knew exactly what the system did with memory barriers. We had a ring buffer at the heart of it, something hardware designers had known since the 1970s.

Again, nothing was invented here. It all contributed, but really the one big thing about this project that gave us a huge latency boost was separation of concerns. What we got was a way to separate the concerns of the work streams in our system. The journaling of an incoming message was decoupled from the decoding of the message, and the decoding was decoupled from the business logic. That gave us threads that were doing just one thing and one thing only; they were not being interrupted by anything and they were not waiting for anything. That’s where the real latency improvement came from.
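
To make that concrete, here is a minimal sketch of such a pipeline using the open-source Disruptor DSL. The event type and handler bodies are illustrative placeholders, not LMAX’s actual exchange code:

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public final class PipelineSketch {
    // Mutable event slots are pre-allocated in the ring buffer,
    // so the hot path allocates no memory at all.
    static final class MessageEvent {
        final byte[] payload = new byte[256];
        int length;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of two.
        Disruptor<MessageEvent> disruptor = new Disruptor<>(
                MessageEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        EventHandler<MessageEvent> journaller =
                (event, sequence, endOfBatch) -> { /* write raw bytes to disk */ };
        EventHandler<MessageEvent> decoder =
                (event, sequence, endOfBatch) -> { /* decode into a domain object */ };
        EventHandler<MessageEvent> businessLogic =
                (event, sequence, endOfBatch) -> { /* apply business logic */ };

        // Separation of concerns: each handler runs on its own thread,
        // journalling before decoding, decoding before business logic,
        // none of them interrupting or waiting on each other's work.
        disruptor.handleEventsWith(journaller).then(decoder).then(businessLogic);

        RingBuffer<MessageEvent> ringBuffer = disruptor.start();

        // Publishing fills a pre-allocated slot in place, then publishes it.
        ringBuffer.publishEvent((event, sequence) -> event.length = 0);
    }
}
```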

In 2014, Martin Thompson and Todd Montgomery open-sourced the Aeron project. Aeron started as a really efficient, low-latency, and reliable way of sending messages between processes. Aeron can do UDP, unicast or multicast, and it can also use IPC, inter-process communication. We will get to that. We can look at numbers to see what latency means in Aeron. This is a simple echo test: you send an echo from a source to a target, and the target sends it back. We measure how long it takes to send this message round trip at a very specific throughput rate. In the cloud, this is our Java version, and these are the latency numbers in microseconds. These numbers are from GCP; there are other numbers for other cloud providers, but the actual numbers don’t matter so much, it’s more the difference. The C version, what do you think, higher or lower? It’s both higher and lower. It’s about the same. We can do more than that.
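
For flavor, the shape of such an echo test can be sketched with Aeron’s Java API. This is a self-contained illustration (the single-process setup, channel, and stream id are mine), not the benchmark harness behind the numbers above:

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public final class EchoSketch {
    public static void main(String[] args) {
        // Embedded media driver so the sketch runs as a single process.
        try (MediaDriver driver = MediaDriver.launchEmbedded();
             Aeron aeron = Aeron.connect(new Aeron.Context()
                     .aeronDirectoryName(driver.aeronDirectoryName()))) {
            String channel = "aeron:udp?endpoint=localhost:40123";
            int streamId = 1001;

            try (Subscription sub = aeron.addSubscription(channel, streamId);
                 Publication pub = aeron.addPublication(channel, streamId)) {
                UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(64));
                buffer.putLong(0, System.nanoTime()); // timestamp the message

                // offer() returns a negative value on back pressure or while
                // the publication is not yet connected; retry until accepted.
                while (pub.offer(buffer, 0, Long.BYTES) < 0) {
                    Thread.onSpinWait();
                }

                // Poll until the message arrives and report the elapsed time.
                while (sub.poll((msg, offset, length, header) -> {
                    long elapsedNs = System.nanoTime() - msg.getLong(offset);
                    System.out.println("latency ~" + elapsedNs + " ns");
                }, 1) == 0) {
                    Thread.onSpinWait();
                }
            }
        }
    }
}
```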

Aeron can integrate with another project called DPDK, which gives us kernel bypass. Basically, it gives us the ability to bypass all those layers of abstraction, including the operating system and the sockets, and write Ethernet packets directly into the NIC, the network interface card. DPDK works with GCP, and it works with AWS. When we use DPDK, we get to these numbers. When we move from the cloud to our own hardware, we get numbers that are better than the cloud, but not as good as the cloud with DPDK. We can play the same trick again: for the network interface card in our PerfLab, we can use another kernel-bypass project, ef_vi, which Aeron also integrates with, and we can get to single-digit microseconds. I mentioned IPC, so we can use inter-process communication; we can use that shared memory that we now have. There’s a tradeoff here that we need to mention: both sides, source and target, need access to the same shared memory, so really they need to be on the same host. If we use IPC, we get to these numbers.
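
In Aeron’s Java API, the transport is selected by the channel URI rather than by different code paths; as I understand the design, the kernel-bypass drivers mentioned above slot in beneath the media driver, so they are not visible at this level. A sketch with made-up endpoints and stream ids:

```java
import io.aeron.Aeron;
import io.aeron.Publication;

final class TransportChoices {
    static void examples(Aeron aeron) {
        // UDP unicast to a remote host.
        Publication udp = aeron.addPublication(
                "aeron:udp?endpoint=host2:40123", 1001);

        // UDP multicast to a group, via a specific interface.
        Publication multicast = aeron.addPublication(
                "aeron:udp?endpoint=224.0.1.1:40456|interface=192.168.0.1", 1001);

        // IPC over shared memory: both sides must attach to the same
        // media driver on the same host.
        Publication ipc = aeron.addPublication("aeron:ipc", 1001);
    }
}
```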

Can we do better? Can we do better than IPC, and what is better than IPC? If you think about it, IPC is still sending a message: we need to encode that message, put it on shared memory, and decode it on the other side. We already have the tradeoff that we are on the same host. If we increase the tradeoff, or the cost, and we are in the same process, then we can use a function call, and a function call will be better. A virtual function call is tens of nanoseconds, and if the compiler inlines the function, this really is zero. Can we design a distributed system where we have separation of concerns and control over the communication, in a way where we can change the communication channel: when we decide something is worth a function call, we use a function call, and when we decide to send it over the network, we do that? For that, we need to go back to the academic research on distributed systems.
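
A hypothetical sketch of that idea, with names of my own invention rather than any existing API: one channel abstraction, with the implementation chosen per link. The decision between a function call and the network then becomes a deployment choice, not a rewrite.

```java
import java.util.function.Consumer;

// Hypothetical interface: nothing here is an existing Aeron API. It only
// illustrates a channel abstraction whose implementation can be swapped.
interface EventChannel {
    void send(byte[] event);
}

// In-process delivery: effectively a function call, which the JIT may
// inline, making the channel itself close to free.
final class InProcessChannel implements EventChannel {
    private final Consumer<byte[]> target;

    InProcessChannel(Consumer<byte[]> target) {
        this.target = target;
    }

    public void send(byte[] event) {
        target.accept(event); // direct call, no encoding, no copying
    }
}

// Remote delivery: encode the event and hand it to a network transport.
final class NetworkChannel implements EventChannel {
    public void send(byte[] event) {
        // encode and offer to the transport (e.g., an Aeron publication)
    }
}
```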

Nothing is invented here, so we go back and learn from what was already researched. An incomplete list of projects and names: Virtual Synchrony, 1987, Ken Birman. Viewstamped Replication, 1988, Barbara Liskov. In 1989, Leslie Lamport attempts to publish the Paxos consensus protocol. It is too vague for anybody to understand, but slowly it gathers pace and is finally published in 1998. In 2001, people realize it’s still vague, and Paxos Made Simple is published. In 2013, Diego Ongaro publishes the Raft consensus protocol, which attempts to simplify even further, and basically gives us a raft to take us from the island of Paxos to a much more understandable consensus protocol.

All those projects start from the very basic computation model of a replicated state machine, and that is the essence. That is the basic computation unit we can work with. It’s a really simple one, and that’s the most powerful thing about it. We get input events in, and for every input event the state machine deterministically modifies its state and then generates one or more output events. It has to be deterministic, and as you can see, it only knows events, and it’s asynchronous. It’s a state machine. If we model a matching engine, for example, you can think of orders coming in; the state is the order book; the order book gets modified; and execution reports are the output events. Virtual synchrony already had the two key ingredients that we believe we need for a distributed system built on this model of a replicated state machine. The first is a totally ordered sequence of messages; we call it the log.
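
A sketch of this computation model in Java; the interface and the matching-engine types are illustrative, not a published API:

```java
import java.util.List;

// Illustrative interface: a deterministic state machine that consumes
// ordered input events and emits output events.
interface StateMachine<I, O> {
    // Must be deterministic: same state + same event => same outputs.
    // The event's timestamp is the only notion of "now".
    List<O> onEvent(I event, long sequence, long timestampNs);
}

final class Order {}
final class ExecutionReport {}

final class OrderBook {
    List<ExecutionReport> match(Order order, long timestampNs) {
        return List.of(); // placeholder for deterministic matching
    }
}

// Matching-engine flavour: orders in, order book as state, executions out.
final class MatchingEngine implements StateMachine<Order, ExecutionReport> {
    private final OrderBook book = new OrderBook(); // the replicated state

    public List<ExecutionReport> onEvent(Order order, long sequence, long timestampNs) {
        return book.match(order, timestampNs); // deterministic state change
    }
}
```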

The idea is really powerful. If you have this totally ordered sequence of messages, you can give it to different instances of your state machine, and you get replication, because those state machines are deterministic and this is a totally ordered sequence of messages. There are two more really important things about this basic idea. The first is the idea of checkpoints. If I’m a state machine, I can say that I’ve processed message number 2 and haven’t reached message number 3. I can say that to any other component in the system; all components will know exactly where I am, and they can compare where they are relative to me. The second is the concept of time. The log gives us synchronization of time, because the timestamp on those messages is the time of the system. I don’t need any other mechanism to synchronize time, and I don’t need to worry about clock drift.
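
A small illustrative sketch of both ideas, checkpoints and log-derived time (names are mine):

```java
// Illustrative sketch: a replica consuming the totally ordered log.
// Its checkpoint is just the last sequence applied, and the log's
// timestamps are the only clock it ever consults.
final class Replica {
    private long lastApplied = -1;

    void apply(long sequence, long logTimestampNs, byte[] event) {
        assert sequence == lastApplied + 1; // total order, no gaps
        // ... deterministically update state, using logTimestampNs as "now" ...
        lastApplied = sequence; // advance the checkpoint
    }

    // Any component can read this and compare positions in the same log.
    long checkpoint() {
        return lastApplied;
    }
}
```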

The second basic idea that was already there in virtual synchrony is the concept of logical groups. I can logically group my state machines and then manage them. The group membership is also replicated in the log, so every component in the system can know who the members are at this point in time, which in other words means at this point in the log: who has left and who has joined. When we put it together, what we get is that all replicas in a group have exactly the same state at the same checkpoint. We get replication. What can we do with it from a latency point of view? First of all, we can divide background tasks between different instances. We can assign one instance to just do backups, because it has the same state, while the others keep on processing. We can assign one instance to serve queries while the others keep on processing. We get high availability: we can have one member assigned to be the active member, and it will publish its output events.

The other members are still consuming the log, so they are at the same state, ready to take over if the active one dies. We get much faster delivery of the messages if we have multiple active members, because it really doesn’t matter which instance generates the message: the messages will always be the same, since we are talking about a deterministic state machine. The fastest will simply win, and we can discard all the duplicates. There’s a tradeoff here, a cost as well: we need more bandwidth, but we get the fastest possible delivery. Another thing we get with this is that the mean time to recovery of an instance is really zero in this case, because when an instance recovers, the other active members have already generated that same message.
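
As a hedged illustration of the fastest-wins idea, a receiver might de-duplicate on the output sequence number like this (names are mine):

```java
// Illustrative sketch: consumer-side de-duplication with multiple active
// replicas. Because the replicas are deterministic, output N is identical
// from every one of them, so the first arrival wins and the rest are dropped.
final class FastestWinsReceiver {
    private long nextExpected = 0;

    // Returns true if the message is new, false if a faster replica beat it.
    boolean onOutput(long outputSequence, byte[] payload) {
        if (outputSequence < nextExpected) {
            return false; // duplicate: already delivered by a faster replica
        }
        nextExpected = outputSequence + 1;
        return true;
    }
}
```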

We get separation of concerns in two different and interesting ways. One is separation of concerns between different state machines or state machine groups. We get it because they communicate via the log: they need to send messages and define a very clear protocol between them. That’s a good thing. Can we avoid communicating only via the log? That might be slow. We can think about state machine composition: we can have more than one state machine inside an instance, a member of the group. There are two ways in which we can compose state machines. One is when there’s a dependency between two state machines. Both consume the log, and both will be at the same checkpoint. When they are, one state machine can query the other. It’s safe to do so because they are at the same checkpoint; they are at the same point in time. That’s one way.

The second way is by pipelining state machines. This is really useful when the output messages of one state machine are the input events of the other. The second separation of concerns that we get in such a system is between the business logic, our state machine implementations, and the rest of the system’s concerns. The state machines don’t care how the log is persisted or how it is distributed across all instances. State machines only care about the events they get in, the events they send out, and the state. That’s it. That’s powerful, and that’s something we can use.
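
Building on the earlier StateMachine sketch, pipelined composition inside one instance might look like this (again, illustrative names only):

```java
import java.util.List;

final class Position {}

final class PositionKeeper implements StateMachine<ExecutionReport, Position> {
    public List<Position> onEvent(ExecutionReport report, long sequence, long timestampNs) {
        return List.of(); // placeholder for deterministic position updates
    }
}

// Two state machines composed in one instance: both driven by the same log,
// with the outputs of the first pipelined straight into the second, all
// within the same checkpoint, so no cross-machine synchronization is needed.
final class ComposedInstance {
    private final MatchingEngine matching = new MatchingEngine();
    private final PositionKeeper positions = new PositionKeeper();

    void onLogEvent(Order order, long sequence, long timestampNs) {
        for (ExecutionReport report : matching.onEvent(order, sequence, timestampNs)) {
            positions.onEvent(report, sequence, timestampNs);
        }
    }
}
```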

We now need to talk about fault tolerance as well, because it’s not enough to say that we have all those wonderful state machines. We do need to worry about the log; we need to think about how we make that log fault tolerant. Distributed systems research told us how fault tolerance works: we need a quorum. We can have three or five or seven nodes, they run a consensus protocol between them, and they reach agreement on what the log is. This has a history of being considered very slow. We think we can implement it pretty fast, but a lot of solutions in the past decided not to go that route because it looked slow, and so they ended up with, say, a primary and a secondary.

Then they go and need all kinds of mitigations, and try to do things like one-in-flight, where you send just one message in flight to the primary, so that if the primary suddenly fails and you move to the secondary, you have only one message to worry about. The Aeron open-source project also has a cluster implementation. We implemented the Raft protocol, because Raft has the concept of a strong leader in the cluster, and that gives us more predictable latency: the leader is not swapped around. It’s strong until it dies, but it is strong. If we look at a very similar scenario, again an echo scenario, but this time it goes into the cluster: we run our consensus protocol and send back the response after we have reached consensus on that message. We get to these numbers in microseconds. As you might expect, because the underlying transport is the same Aeron transport, it is all proportionally the same: C and Java are about the same, and DPDK gives us a huge boost, not only in much lower latency but also in predictability.
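
For flavor, a minimal echo service against Aeron Cluster’s ClusteredService interface might look as follows; this is a bare-bones sketch, not the code behind the benchmark numbers:

```java
import io.aeron.ExclusivePublication;
import io.aeron.Image;
import io.aeron.cluster.codecs.CloseReason;
import io.aeron.cluster.service.ClientSession;
import io.aeron.cluster.service.Cluster;
import io.aeron.cluster.service.ClusteredService;
import io.aeron.logbuffer.Header;
import org.agrona.DirectBuffer;

// Every node runs the same service over the same consensus-ordered log;
// only the leader's responses actually reach the client.
public final class EchoClusteredService implements ClusteredService {
    public void onSessionMessage(ClientSession session, long timestamp,
                                 DirectBuffer buffer, int offset, int length,
                                 Header header) {
        // Echo the agreed message back; offer() is negative on back pressure.
        while (session.offer(buffer, offset, length) < 0) {
            Thread.onSpinWait();
        }
    }

    // Remaining lifecycle callbacks left empty for brevity.
    public void onStart(Cluster cluster, Image snapshotImage) {}
    public void onSessionOpen(ClientSession session, long timestamp) {}
    public void onSessionClose(ClientSession session, long timestamp, CloseReason reason) {}
    public void onTimerEvent(long correlationId, long timestamp) {}
    public void onTakeSnapshot(ExclusivePublication snapshotPublication) {}
    public void onRoleChange(Cluster.Role newRole) {}
    public void onTerminate(Cluster cluster) {}
}
```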

If we go to our PerfLab, these are the numbers. We again play the same trick of kernel bypass, and we get these numbers in microseconds: a P999 of less than 29 microseconds for sending a message in, reaching consensus, and sending back the response. What about IPC? IPC doesn’t make any sense here. The whole point of running a consensus protocol is to gain fault tolerance; if we’re on the same host, there’s no point, because we’re not going to get fault tolerance anyway. We don’t run a consensus protocol over IPC. We can now think of a system like this: we have our applications sending messages in; inside we have the cluster; the cluster agrees on the log and gives the log to the state machines as input.

Then the output events turn into messages and go on an egress channel back to the applications. We have lots of customers with those kinds of systems running. There are a few problems. Number one is the problem of fan-out. In the fintech domain, one message in always results, on average, in a lot more than one message out. We have a bottleneck, and that bottleneck is the egress channel, because just managing the volume of output events from this cluster becomes a problem. That creates a problem of scale: we can’t keep adding state machines and handling more data, because everything is funneled through this egress channel. The third problem is a problem of recovery. If we want to upgrade a state machine, we need to take one of those cluster nodes down, and that has a cost for the consensus protocol. It’s either that, or you need to run a hot standby or find some other solution. There’s a cost here.

The Future – Sequencer

This brings us to the sequencer. The sequencer is an architecture that has been around in the fintech industry for some time. The first company to publicly talk about it was Island ECN in 1996, and their sequencer ended up being the sequencer in NASDAQ. The idea is very simple: let’s have a component that doesn’t run any business logic. All it cares about is deciding on the log, sequencing the messages, and putting the timestamp on them; the state machines live inside the applications. If we look at the fault-tolerant view of the same architecture, we have a cluster, and inside the cluster we only run a sequencer. The cluster that runs the sequencer sends out the log, and the state machines are in the applications. There’s no fan-out problem anymore, because we are not sending out the output messages; we’re sending out the log, which is really a condensed version of all the input events.

That means we don’t have the scaling problem we had previously. The recovery cost is also not as big as it was, because if you want to upgrade a state machine, all you need to do is upgrade an application: take the application down, bring it back up, and it can consume the log from the same checkpoint. We can imagine a system where you have a sequencer running inside a fault-tolerant cluster, distributing the log to many groups of composed state machines. That gives you a distributed system with separation of concerns that allows a much better fit for your system and your needs, and allows you to reduce latency in a way that really meets your particular problem. The race to zero latency still goes on, and will go on, I’m sure, but with a distributed system like that, which is virtually synchronous and uses state machine composition, we can get much closer.
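
Since, as the Q&A below notes, the sequencer is a future project, the following is purely a conceptual sketch of the idea, with invented names and no relation to any shipped API:

```java
// Purely conceptual: the sequencer runs no business logic; it just
// totally orders and timestamps the input.
interface SequencedLog {
    // Appended entries are replicated and distributed to every
    // state-machine group, which consume the log directly.
    void append(long sequence, long timestampNs, byte[] message);
}

final class Sequencer {
    private final SequencedLog log;
    private long nextSequence = 0;

    Sequencer(SequencedLog log) {
        this.log = log;
    }

    // One job: stamp each input with its position in the total order and
    // with the system's single notion of time, then append it to the log.
    void onInput(byte[] message) {
        log.append(nextSequence++, System.nanoTime(), message);
    }
}
```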

Ellis: I was going to ask about the sequencer, because you didn’t give any metrics for that. Is there any thinking about that?

Amir Langer: Yes, for good reason. One is that it’s a future product, meaning it doesn’t exist yet; it exists only on our laptops, in a very crude version. Also, if you think about it, it’s all about the tighter fit. If I had an echo version of the sequencer, I imagine it would look very similar to the numbers of the cluster, because I’m not doing much. The power of the sequencer is in looking at a real scenario, a real distributed system, and improving the latency of such a system. It’s more than just an echo test. When we have real scenarios, we should be able to see the difference. That’s definitely the intention.

Questions and Answers

Participant 1: I just wondered, when you mentioned the kernel bypass for the networking, what are the downsides of this? Are there any downsides?

Amir Langer: It’s very low level and very tightly coupled to the network interface card. If you suddenly want to replace your hardware, you might find that you need a different integration, or that the new network interface card has no support at all. It really depends. This is going very low level, integrating with a very specific kind of hardware.

Participant 2: It seems like we’re using Java and we have very fast IPC. The IPC is fast, under 10 microseconds, but then I need to load a class into memory and that takes 50 microseconds. That’s why it’s very hard for me to understand why we would use Java for this use case in general.

Amir Langer: Java is fast, really. You have the JIT compiler that does a lot of heavy lifting, and its optimizations really change the game. If you put an interpreted language up against C, yes, there’s no battle there. But the fact that Java and C are comparable and get to the same numbers means that Java is fast; it has its own ways of being fast. There are things you need to be mindful of in Java, but that’s true of any language. You shouldn’t allocate memory like mad, because you have the garbage collector, and at some point it will start collecting garbage and you lose all your predictable latency. If you allocate your memory in advance and don’t allocate any memory on the hot path, then suddenly you’re fine. Again, it’s true of any language; you just need to know how to work with Java. This fight has ended. Java is not slow by any means. We can see it in the numbers, really.
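
A minimal sketch of the pre-allocation discipline described here; the pool type and its fields are illustrative:

```java
// Illustrative sketch: pre-allocate everything up front and reuse it,
// ring-buffer style, so the hot path never calls 'new' and the garbage
// collector never has anything to collect on that path.
final class OrderPool {
    static final class PooledOrder { // mutable, reused in place
        long price;
        long quantity;
    }

    private final PooledOrder[] pool;
    private int next;

    OrderPool(int size) {
        pool = new PooledOrder[size];
        for (int i = 0; i < size; i++) {
            pool[i] = new PooledOrder(); // all allocation happens at startup
        }
    }

    // Hot path: hand out the next pre-allocated slot. Assumes, as a ring
    // buffer does, that a slot is finished with before it comes around again.
    PooledOrder claim() {
        PooledOrder order = pool[next];
        next = (next + 1) % pool.length;
        return order;
    }
}
```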

Participant 3: I’m wondering if latencies in particular parts of the application may introduce latencies in broader parts of the application. If state machines in different components all have to sync on the same sequences, and the sequencer is aware of all of them, then if I have, for example, a component that does slow queries and takes its time, the sequencer still has to sync on basically the whole system, and even unrelated components may be affected by that.

Amir Langer: There are dependencies in a distributed system. The different components don’t just work on their own without communicating; they communicate with each other at the end of the day, so there is a dependency if one component sends a message to another component and then needs to get a response, or something like that. That creates a dependency, and there’s no way around it. If that’s the business problem, then that’s the business problem.

If you are just sending a message directly, then you might need to wait until you get back a response, and that’s bad. That’s really bad. What we have learned over many years of research is that the best way is to decouple, separate the concerns, and just process messages and work with queues. We want to work asynchronously: you send the message, and at some point you will get a response and react to it, but in the meantime you can do other stuff. We cannot resolve dependencies that are part of your business problem; that’s just the business problem. But the other dependencies no longer exist.

Ellis: Obviously, in the history of latency, when we think about hardware, we’ve seen a lot of the stuff that was in low latency make its way into mainstream hardware over time. Now you’re engineering around software issues, in the cloud specifically. What are the one or two things from here that you think should make their way into more mainstream cloud engineering over time?

Amir Langer: To the cloud, or?

Ellis: Just into software over time, so into more mainstream.

Amir Langer: The short answer is I don’t know. The maybe longer answer is that there are low-latency, reliable message protocols that could make their way into much lower levels and be much more widely used in the cloud. The cloud comes with its own set of challenges. You’ve got all those availability zones, and a lot of the time you don’t know what the lower level is; you don’t really know what hardware is underneath. Doing low latency in the cloud is tough. We have customers who care about low latency so much that they will still buy their own hardware. Perhaps something like more customized, low-latency, reliable messaging offered by cloud providers.

 
