Optimizing Software With Zero-Copy And Other Techniques

An important aspect of software engineering is the ability to distinguish between premature, unnecessary, and necessary optimizations. A strong case can be made that the initial design benefits massively from optimizations that prevent well-known issues later on, while unnecessary optimizations are those that simply do not make any significant difference either way. Meanwhile ‘premature’ optimizations are harder to define, with Knuth’s statement about these being ‘the root of all evil’, often quoted out of context, causing significant confusion.

We can find Donald Knuth’s full quote deep in the 1974 article Structured Programming with go to Statements, which at the time was a contentious optimization topic. On page 268, along with the cited quote, we see that it’s a reference to making presumed optimizations without understanding their effect, and without a clear picture of which parts of the program really take up most processing time. Definitely sound advice.

And unlike back in the 1970s, today we have many easy ways to analyze application performance and to quantify bottlenecks. This makes it rather inexcusable to spend more time today vilifying the goto statement than optimizing one’s code with simple techniques like zero-copy and binary message formats.

Got To Go Fast

The cache hierarchy of the 2008 Intel Nehalem x86 microarchitecture. (Source: Intel)

There’s a big difference between having a conceptual picture of how one’s code interacts with the hardware and having an in-depth understanding. While the basic concept of more lines of code (LoC) translating into more RAM, CPU, and disk resources used is technically true much of the time, the real challenge lies in understanding how individual CPU cores are scheduled by the OS, how core cache synchronization works, and the impact that the L2 and L3 caches have.

Another major challenge is that of simply moving data around between system RAM, caches and registers. This seems obvious at face value, but certain decisions can have big implications. For example, passing a pointer to a memory address instead of copying the entire string, or performing aligned memory accesses instead of unaligned ones, can make a real difference in how long an operation takes. This latter topic is especially relevant when moving between architectures: x86 allows unaligned memory access, at a performance cost, while ARM, depending on the core and the instruction used, will hard fault the application at the merest misaligned twitch.
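To make the pointer-versus-copy point concrete, here is a minimal sketch. Both function names are made up for this illustration and have nothing to do with NymphRPC. The first takes its argument by value, so the characters get copied into a fresh std::string; the second only receives an address and a length.

```cpp
#include <cstddef>
#include <string>

// By value: the caller's characters are copied into a new std::string,
// which for longer strings also means a heap allocation.
std::size_t countSpacesCopy(std::string text) {
    std::size_t count = 0;
    for (char c : text) {
        if (c == ' ') { ++count; }
    }
    return count;
}

// By pointer and length: only an address and a size are passed along,
// the characters themselves stay where they are.
std::size_t countSpacesInPlace(const char* text, std::size_t length) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (text[i] == ' ') { ++count; }
    }
    return count;
}
```

Both return the same answer. The difference only shows up in a profiler once the strings get large or the calls get frequent.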

I came across a range of these issues while implementing my remote procedure call library NymphRPC. Initially I used a simple and easy to parse binary message format, but saddled it with a naïve parser implementation that involved massive copying of strings, as this was the zero-planning-needed, smooth-brained, ‘safe’ choice. In hindsight this was a design failure with a major necessary optimization omitted that would require major refactoring later.

In this article I’d like to highlight both the benefits of simple binary formats and how simple it is to implement a zero-copy parser that omits copying of message data during parsing, while also avoiding memory alignment issues when message data is requested and copied to a return value.

KISS

Perhaps the biggest advantage of binary message formats is that they’re very simple, very small, and extremely low in calories. In the case of NymphRPC its message format features a standard header, a message-specific body, and a terminator. For a simple NymphRPC method call, for example, we would see something like:

uint32  Signature: DRGN (0x4452474e).
uint32  Total message bytes following this field.
uint8   Protocol version (0x00).
uint32  Method ID: identifier of the remote function.
uint32  Flags (see _Flags_ section).
uint64  Message ID. Simple incrementing global counter.
<..>    Serialised values.
uint8   Message end. None type (0x01).

The very first value is a 32-bit unsigned integer that when interpreted as characters identifies this as a valid NymphRPC message. (‘DRGN’, because dragonfly nymph.) This is followed by another uint32 that contains the number of bytes that follow in the message. We’re now eight bytes in and we have already done basic validation and know what size buffer to allocate.
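As a rough sketch of that eight-byte check, something like the following would do. The function and constant names are hypothetical and this is not NymphRPC’s actual code. It also assumes that both fields arrive in the host’s byte order, which is an assumption on my part for the sake of a short example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Signature value taken from the format description; names are hypothetical.
constexpr uint32_t kSignature = 0x4452474e;  // 'DRGN'
constexpr std::size_t kPrefixSize = 8;       // signature + length field

// Returns true and stores the announced length when the first eight bytes
// look like the start of a message. Assumes host byte order for both fields.
bool readPrefix(const uint8_t* buf, std::size_t available, uint32_t& bodyLength) {
    if (buf == nullptr || available < kPrefixSize) {
        return false;  // not enough bytes to judge yet
    }

    uint32_t signature = 0;
    std::memcpy(&signature, buf, sizeof(signature));
    if (signature != kSignature) {
        return false;  // not one of our messages
    }

    // Length of everything that follows this field.
    std::memcpy(&bodyLength, buf + sizeof(signature), sizeof(bodyLength));
    return true;
}
```

With bodyLength in hand, the receiver knows exactly how large a buffer to allocate before reading the rest of the message.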

Serializing the values is done similarly, with an 8-bit type code followed by the byte(s) that contain the value. Unlike XML or JSON, this is easy to parse without complex validation, and about as light-weight as one can make a format without adding something like compression.
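A minimal sketch of the type-code-plus-bytes idea follows. Note that the 0x42 type code for a uint32 is invented for this example; only the None terminator (0x01) comes from the format listing above. The helper names are hypothetical as well.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint8_t kTypeNone = 0x01;    // terminator, from the format listing
constexpr uint8_t kTypeUint32 = 0x42;  // made-up code, for illustration only

// Appends one uint32 value: a single type byte, then the four value bytes
// in host byte order.
void appendUint32(std::vector<uint8_t>& out, uint32_t value) {
    out.push_back(kTypeUint32);
    uint8_t raw[sizeof(value)];
    std::memcpy(raw, &value, sizeof(value));
    out.insert(out.end(), raw, raw + sizeof(value));
}

// Closes the list of serialised values.
void appendTerminator(std::vector<uint8_t>& out) {
    out.push_back(kTypeNone);
}
```

A single uint32 plus terminator comes to six bytes on the wire, with nothing to escape, quote or tokenize.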

Only If Needed

When we receive the message bytes on the network socket, we read them into a buffer. Because the second 32-bit value, which we read earlier, contains the message size, we can make sure to allocate a buffer that’s large enough to fit the rest of the message’s bytes. The big change with zero-copy parsing commences after this, where the naïve approach is to copy the entire byte buffer into e.g. a std::string for subsequent substring parsing.

Instead of such a blunt method, the byte buffer is parsed in-place with the use of a moving index pointer into the buffer. The two key methods involved with the parsing can be found in nymph_message.cpp and nymph_types.cpp, with the former providing the NymphMessage constructor and the basic message parser. After parsing the header, the NymphType class provides a parseValue() function that takes a value type code, a reference to the byte buffer and the current index. This function is called until the terminating NYMPH_TYPE_NONE is found, or some error occurs.
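The general shape of such a loop can be sketched as follows. This is not the actual NymphRPC parser: parseOne() and parseBody() are hypothetical stand-ins, the 0x42 type code is invented, and only uint32 values are handled to keep things short.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint8_t kTypeNone = 0x01;    // terminator, from the format listing
constexpr uint8_t kTypeUint32 = 0x42;  // made-up code, for illustration only

// Parses one value whose type code was already consumed and moves the
// index past its payload. Only knows uint32 in this sketch.
bool parseOne(uint8_t type, const uint8_t* buf, std::size_t size,
              std::size_t& index, std::vector<uint32_t>& values) {
    if (type != kTypeUint32) { return false; }                  // unknown type
    if (size - index < sizeof(uint32_t)) { return false; }      // truncated

    uint32_t value = 0;
    std::memcpy(&value, buf + index, sizeof(value));
    index += sizeof(value);
    values.push_back(value);
    return true;
}

// Walks the buffer in place until the terminator shows up or an error occurs.
bool parseBody(const uint8_t* buf, std::size_t size, std::vector<uint32_t>& values) {
    std::size_t index = 0;
    while (index < size) {
        uint8_t type = buf[index++];
        if (type == kTypeNone) { return true; }                 // clean end
        if (!parseOne(type, buf, size, index, values)) { return false; }
    }
    return false;  // ran out of bytes without seeing a terminator
}
```

The only state that travels between calls is the index, and the buffer itself is never duplicated.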

Looking at parseValue() in more detail, we can see two things of note: the first is that we are absolutely copying certain data despite the ‘zero-copy’ claim, and the second is the liberal use of memcpy() instead of basic assignment statements. The first item is easy to explain: the difference between either copying the memory address or the value of a simple integer/floating point type is so minimal that we trip head-first into the same ‘premature optimization’ thing that Mr. Knuth complained about back in 1974.

Ergo we just copy the value and don’t break our pretty little heads about whether doing the same thing in a more convoluted way would net us a few percent performance improvement or loss. This is different with non-trivial types, such as strings. These are simply a char* pointer into the byte buffer, leaving the string’s bytes in peace and quiet until the application demands either that same character pointer via the API or calls the convenience function that assembles a readily-packaged std::string.
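A stripped-down illustration of the string case is shown below. The StringRef type, the parseString() function and the assumed layout (a uint32 length followed by the raw bytes) are all invented for this sketch and may well differ from what NymphRPC does internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// A non-owning reference to string bytes that live in the message buffer.
// Only valid for as long as that buffer is.
struct StringRef {
    const char* data = nullptr;
    uint32_t length = 0;

    // Convenience: this is the one place where the bytes get copied.
    std::string toString() const { return std::string(data, length); }
};

// Assumed layout: uint32 length (host byte order), then that many bytes.
bool parseString(const uint8_t* buf, std::size_t size, std::size_t& index, StringRef& out) {
    uint32_t len = 0;
    if (index > size || size - index < sizeof(len)) { return false; }
    std::memcpy(&len, buf + index, sizeof(len));

    if (size - index - sizeof(len) < len) { return false; }  // truncated string

    // No copy here: just remember where the bytes are.
    out.data = reinterpret_cast<const char*>(buf + index + sizeof(len));
    out.length = len;
    index += sizeof(len) + len;
    return true;
}
```

The price for this is a lifetime rule: the reference must not outlive the buffer it points into, which is why a copying convenience function is still worth having.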

Memcpy Is Love

Although demonizing ‘doing things the C way’ appears to be a popular pastime, if you want to write code that works with the hardware instead of against it, you really want to be able to write some highly performant C code and fully understand it. When I had written the first zero-copy implementation of NymphRPC and had also written up what I thought was a solid article on how well optimized the project now was, I had no idea that I had a “fun” surprise waiting for me.

As I happily tried running the new code on a Raspberry Pi SBC after doing the benchmarking for the article on an x86 system, the first thing it did was give me a hard fault message in the shell along with a strongly disapproving glare from the ARM CPU. As it turns out, doing a direct assignment like this is bound to get you into trouble:

methodId = *((uint32_t*) (binmsg + index));

This line casts the address at the current index in the byte buffer to a uint32_t pointer before dereferencing it and assigning the value to the variable. When you’re using e.g. std::string the alignment issues sort themselves out somewhere within the depths of the STL, but with direct memory access like this you’re at the mercy of the underlying platform. Which is a shame, because platforms like ARM do not know the word ‘mercy’.

Fortunately this is easy to fix:

memcpy(&methodId, (binmsg + index), 4);

Instead of juggling pointers ourselves, we simply tell memcpy what the target address is, where it should copy from and how many bytes are to be copied. Among all the other complex scenarios that this function has to cope with, doing aligned memory address access for reading and writing is probably among its least complex requirements.
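If you find yourself writing this a lot, the pattern wraps up nicely in a small helper. The readUnaligned() template below is a hypothetical convenience of my own and not part of NymphRPC. Beyond dodging hard faults, it also avoids dereferencing a misaligned pointer, which the C++ standard treats as undefined behaviour regardless of platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reads a T from any byte offset in the buffer, aligned or not.
// The caller is responsible for making sure offset + sizeof(T) is in bounds.
template <typename T>
T readUnaligned(const uint8_t* buf, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain data types can be read this way");
    T value;
    std::memcpy(&value, buf + offset, sizeof(T));
    return value;
}
```

The parsing line from earlier then becomes methodId = readUnaligned<uint32_t>(binmsg, index); and compilers will typically turn the fixed-size memcpy into a plain load on platforms where that is safe.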

Hindsight

Looking back on the NymphRPC project so far, it’s clear that some necessary optimizations that ought to have been there from the very beginning weren’t there. At least as far as unnecessary and premature optimizations go, I do feel that I have successfully dodged these, but since these days we’re still having annual flamewars about the merits of using goto I very much doubt that we will reach consensus here.

What is clear from the benchmarking that I have done on NymphRPC before and after this major refactoring is that zero-copy makes a massive difference. Operations involving larger data (string) chunks in particular became multiple times faster, with many milliseconds shaved off, and the Callgrind tool of Valgrind no longer lists __memcpy_avx_unaligned_erms as the biggest headache due to std::string abuse.

Perhaps the most important lesson from optimizing a library like NymphRPC is that aside from it being both frustrating and fun, it’s also a humbling experience that makes it clear that even as a purported senior developer there’s always more to learn. Even if putting yourself out there with a new experience like porting a lock-free ring buffer to a language like Ada and getting corrected by others stings a little.

After all, we are here to write performant software that’s easy to maintain and have fun while doing it, with sharing optimization tips and other tricks just being part of the experience.