We design and benchmark a cross‑platform echo & chat server that scales from laptops to low‑latency Linux boxes. Starting with a Boost.Asio baseline, we add UDP and finally an io_uring
implementation that closes the gap with DPDK‑style kernel‑bypass—all while preserving a single, readable codebase.
Full code is available here: https://github.com/hariharanragothaman/nimbus-echo
Motivation
Real‑time collaboration tools, multiplayer games, and HFT gateways all live or die by tail latency. Traditional blocking sockets waste cycles on context switches; bespoke bypass stacks (XDP, DPDK) deliver superb latency at the cost of portability.
NimbusNet shows you can split the difference:
- Run anywhere with Boost.Asio (macOS, Windows, CI containers).
- Drop latency ~2× with UDP by eliminating TCP’s ordering tax.
- Unlock sub‑25 µs RTT on Linux via io_uring—no kernel patches, no CAP_NET_RAW.
Build Environment:

Host | Toolchain | Runtime Variant(s)
---|---|---
macOS 14.5 (M2 Pro) | Apple clang 15, Homebrew Boost 1.85 | Boost.Asio / TCP & UDP
Ubuntu 24.04 (x86‑64) | GCC 13, liburing | Boost.Asio / TCP & UDP, io_uring
GitHub Actions | macos‑14, ubuntu‑24.04 | CI build + tests
Phase 1 – Establishing the Baseline (Boost.Asio, TCP)
We begin with a minimal asynchronous echo service that compiles natively on macOS.
Boost.Asio’s proactor‑style async_read_some / async_write gives us a platform‑agnostic way to experiment before introducing kernel‑bypass techniques.
#include <boost/asio.hpp>
#include <array>
#include <functional>
#include <iostream>

using boost::asio::ip::tcp;

class EchoSession : public std::enable_shared_from_this<EchoSession> {
    tcp::socket socket_;
    std::array<char, 4096> buf_{};
public:
    explicit EchoSession(tcp::socket s) : socket_(std::move(s)) {}
    void start() { read(); }
private:
    void read() {
        auto self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(buf_),
            [this, self](auto ec, std::size_t n) { if (!ec) write(n); });
    }
    void write(std::size_t n) {
        auto self = shared_from_this();
        boost::asio::async_write(socket_, boost::asio::buffer(buf_, n),
            [this, self](auto ec, std::size_t) { if (!ec) read(); });
    }
};

int main() {
    boost::asio::io_context io;
    tcp::acceptor acc(io, {tcp::v4(), 9000});
    std::function<void()> do_accept = [&]() {
        acc.async_accept([&](auto ec, tcp::socket s) {
            if (!ec) std::make_shared<EchoSession>(std::move(s))->start();
            do_accept();
        });
    };
    do_accept();
    std::cout << "⚡ NimbusNet echo listening on 0.0.0.0:9000\n";
    io.run();
}
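To poke the server, any blocking client will do. The sketch below is illustrative rather than part of NimbusNet: it uses plain POSIX sockets so it stays dependency‑free, and pairs the client with a one‑shot in‑process echo thread (standing in for the Asio server above) so the round trip is self‑contained. Port 9100 is an arbitrary choice.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <string>
#include <thread>

// Spawn a one-shot echo acceptor on loopback; a stand-in for the
// Asio server above so the round trip can run in a single process.
std::thread start_one_shot_echo(uint16_t port) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(srv, 1);
    return std::thread([srv] {
        int conn = accept(srv, nullptr, nullptr);
        char buf[4096];
        ssize_t n = recv(conn, buf, sizeof(buf), 0);
        if (n > 0) send(conn, buf, static_cast<size_t>(n), 0);
        close(conn);
        close(srv);
    });
}

// Minimal blocking client: connect, send, read the echo back.
std::string blocking_echo_client(uint16_t port, const std::string& msg) {
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    connect(cli, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    send(cli, msg.data(), msg.size(), 0);
    char buf[4096];
    ssize_t n = recv(cli, buf, sizeof(buf), 0);
    close(cli);
    return n > 0 ? std::string(buf, static_cast<size_t>(n)) : std::string{};
}
```

Pointing the same client at port 9000 exercises the Asio server above instead.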
Phase 2 – UDP vs. TCP: When Reliability Becomes a Tax
TCP’s 3‑way handshake, retransmit queues, and head‑of‑line blocking are lifesavers for file transfers—and millstones for chats that can drop an occasional emoji. TCP bakes in ordering, retransmission, and congestion avoidance; these guarantees cost extra context switches and kernel bookkeeping. Swapping in udp::socket drops all of that: for chat or market‑data fan‑out, “best‑effort but immediate” sometimes wins.
#include <boost/asio.hpp>
#include <array>
#include <iostream>

using boost::asio::ip::udp;

class UdpEchoServer {
    udp::socket socket_;
    std::array<char, 4096> buf_{};
    udp::endpoint remote_;
public:
    explicit UdpEchoServer(boost::asio::io_context& io, unsigned short port)
        : socket_(io, udp::endpoint{udp::v4(), port}) { receive(); }
private:
    void receive() {
        socket_.async_receive_from(
            boost::asio::buffer(buf_), remote_,
            [this](auto ec, std::size_t n) {
                if (!ec) send(n);
                else receive();   // keep serving after a bad datagram
            });
    }
    void send(std::size_t n) {
        socket_.async_send_to(
            boost::asio::buffer(buf_, n), remote_,
            [this](auto /*ec*/, std::size_t /*n*/) { receive(); });
    }
};

int main() {
    try {
        boost::asio::io_context io;
        UdpEchoServer srv(io, 9001);
        std::cout << "⚡ UDP echo on 0.0.0.0:9001\n";
        io.run();
    } catch (const std::exception& ex) {
        std::cerr << ex.what() << '\n';
        return 1;
    }
}
Latency table (localhost, 64‑byte payload):

Layer | TCP | UDP
---|---|---
Conn setup | 3‑way handshake | 0
HOL blocking | Yes | No
Kernel buffer | per‑socket | shared
RTT (median) | ≈ 85 µs | ≈ 45 µs
Here we replaced tcp::socket with udp::socket and removed the per‑session heap allocation; the code path is ~40 % shorter in perf traces. If your application can tolerate an occasional drop (or do its own acks), UDP is the gateway to sub‑50 µs median latencies—even before kernel‑bypass.
Takeaway: if you can tolerate packet loss (or roll your own ACK/NACK), UDP buys you ~40 µs on the spot.
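What “roll your own ACK/NACK” can look like in miniature: the AckTracker below is a hypothetical sketch (not NimbusNet code) that stamps each outgoing datagram with a sequence number and tracks which ones are still unacknowledged, i.e. candidates for retransmission.

```cpp
#include <cstdint>
#include <set>

// Tracks in-flight datagrams by sequence number so the sender knows
// what to retransmit. Hypothetical sketch, not part of NimbusNet.
class AckTracker {
    uint64_t next_seq_ = 0;
    std::set<uint64_t> in_flight_;
public:
    // Call when sending: returns the sequence number to stamp on the packet.
    uint64_t on_send() {
        in_flight_.insert(next_seq_);
        return next_seq_++;
    }
    // Call when an ACK arrives; returns false for duplicate/unknown ACKs.
    bool on_ack(uint64_t seq) { return in_flight_.erase(seq) == 1; }
    // Everything still unacked is a retransmission candidate.
    const std::set<uint64_t>& pending() const { return in_flight_; }
};
```

In a real stack you would also timestamp each entry and retransmit anything pending past an RTO.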
Phase 3 – io_uring: The Lowest‑Friction Doorway to Zero‑Copy

Linux 5.1 introduced io_uring; by 5.19 it rivals DPDK‑style bypass while staying in‑kernel.
- Avoids per‑syscall overhead by batching accept/recv/send in a single submission queue.
- Reuses a pre‑allocated ConnData buffer—no heap churn on the fast path.
- Achieves ~20 µs RTT on Apple M2 → QEMU → Ubuntu, a ~4× improvement over Boost.Asio/TCP (~85 µs).
// Extremely small io_uring TCP echo server
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <iostream>

constexpr uint16_t PORT        = 9002;
constexpr unsigned QUEUE_DEPTH = 256;
constexpr unsigned BUF_SZ      = 4096;

// Each SQE carries a tagged ConnData so the completion loop can tell
// accepts, reads, and writes apart; cqe->res alone cannot.
enum class Op : uint8_t { Accept, Recv, Send };

struct ConnData {
    Op   op;
    int  fd;
    char buf[BUF_SZ];
};

static io_uring ring{};

// helper: submit an accept SQE (peer address not needed, so pass nullptrs)
static void prep_accept(int listen_fd) {
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    auto* cd = new ConnData{Op::Accept, listen_fd, {}};
    io_uring_prep_accept(sqe, listen_fd, nullptr, nullptr, 0);
    io_uring_sqe_set_data(sqe, cd);
}

// helper: submit a recv SQE into the connection's pre-allocated buffer
static void prep_recv(ConnData* cd) {
    cd->op = Op::Recv;
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, cd->fd, cd->buf, BUF_SZ, 0);
    io_uring_sqe_set_data(sqe, cd);
}

int main() {
    // 1. Classic BSD socket setup
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(PORT);
    addr.sin_addr.s_addr = INADDR_ANY;
    if (bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(listen_fd, SOMAXCONN);

    // 2. uring setup
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    prep_accept(listen_fd);
    io_uring_submit(&ring);
    std::cout << "⚡ io_uring TCP echo on 0.0.0.0:" << PORT << '\n';

    // 3. Main completion loop
    while (true) {
        io_uring_cqe* cqe;
        int ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0) { std::cerr << "wait_cqe: " << strerror(-ret) << '\n'; break; }
        auto* cd  = static_cast<ConnData*>(io_uring_cqe_get_data(cqe));
        int   res = cqe->res;            // client fd, byte count, or -errno
        io_uring_cqe_seen(&ring, cqe);

        switch (cd->op) {
        case Op::Accept:
            prep_accept(listen_fd);      // re-arm the acceptor right away
            if (res >= 0) {
                cd->fd = res;            // reuse this ConnData for the new client
                prep_recv(cd);
            } else {
                delete cd;               // accept failed
            }
            break;
        case Op::Recv:
            if (res > 0) {               // echo the bytes back
                cd->op = Op::Send;
                io_uring_sqe* sqe = io_uring_get_sqe(&ring);
                io_uring_prep_send(sqe, cd->fd, cd->buf, res, 0);
                io_uring_sqe_set_data(sqe, cd);   // reuse struct, no heap churn
            } else {                     // client closed (or error)
                close(cd->fd);
                delete cd;
            }
            break;
        case Op::Send:
            prep_recv(cd);               // wait for the next message
            break;
        }
        io_uring_submit(&ring);
    }
    close(listen_fd);
    io_uring_queue_exit(&ring);
    return 0;
}
Even without privileged NIC drivers, io_uring brings sub‑50 µs latency to laptop‑class hardware—ideal for prototyping HFT engines before deploying SO_REUSEPORT + XDP in production.
Phase 4 – Running Benchmarks: Quantifying the Wins

We wrap each variant in a Google Benchmark harness.
#include <benchmark/benchmark.h>
#include <boost/asio.hpp>
#include <array>
#include <string>

using boost::asio::ip::tcp;
using boost::asio::ip::udp;

/* ---------- Helpers ------------------------------------------------------ */
// blocking Boost.Asio TCP echo client (loop‑back)
static void tcp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    tcp::socket c(io);
    c.connect({boost::asio::ip::make_address("127.0.0.1"), 9000});
    std::string msg(payload, 'x');
    boost::asio::write(c, boost::asio::buffer(msg));          // send the whole payload
    std::array<char, 8192> buf{};
    boost::asio::read(c, boost::asio::buffer(buf, payload));  // block until fully echoed
}

// blocking Boost.Asio UDP echo client
static void udp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    udp::socket s(io, udp::v4());
    udp::endpoint server(boost::asio::ip::make_address("127.0.0.1"), 9001);
    std::string msg(payload, 'x');
    s.send_to(boost::asio::buffer(msg), server);
    std::array<char, 8192> buf{};
    s.receive_from(boost::asio::buffer(buf, payload), server);
}

#if defined(__linux__)
// tiny wrapper for the io_uring server (assumes it’s already running on 9002)
static void uring_tcp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    tcp::socket c(io);
    c.connect({boost::asio::ip::make_address("127.0.0.1"), 9002});
    std::string msg(payload, 'x');
    boost::asio::write(c, boost::asio::buffer(msg));
    std::array<char, 8192> buf{};
    boost::asio::read(c, boost::asio::buffer(buf, payload));
}
#endif

/* ---------- Benchmarks --------------------------------------------------- */
static void BM_AsioTCP_64B(benchmark::State& s) {
    for (auto _ : s) tcp_roundtrip(64);
}
BENCHMARK(BM_AsioTCP_64B)->Unit(benchmark::kMicrosecond);

static void BM_AsioUDP_64B(benchmark::State& s) {
    for (auto _ : s) udp_roundtrip(64);
}
BENCHMARK(BM_AsioUDP_64B)->Unit(benchmark::kMicrosecond);

#if defined(__linux__)
static void BM_IouringTCP_64B(benchmark::State& s) {
    for (auto _ : s) uring_tcp_roundtrip(64);
}
BENCHMARK(BM_IouringTCP_64B)->Unit(benchmark::kMicrosecond);
#endif

BENCHMARK_MAIN();
With Google Benchmark we measured 10 K in‑process round trips per transport on an M2‑Pro MBP (macOS 14.5, Docker Desktop 4.30):
Table 1 – Median RTT (64 B payload, 10 K iterations)

Transport | Median RTT (µs)
---|---
Boost.Asio / TCP | 82
Boost.Asio / UDP | 38
io_uring / TCP | 21
Even on consumer hardware, io_uring halves UDP’s latency and crushes traditional TCP by nearly 4×. This validates the architectural decision to build NimbusNet’s high‑fan‑out chat tier on io_uring primitives while retaining a pure‑userspace codebase.
Takeaways & Future Work
- Portability first, performance second pays dividends—macOS for the dev loop, Linux for the production wins.
- UDP is “good enough” for most chats; sprinkle FEC / acks on mission‑critical flows.
- io_uring slashes latency without root privileges, making kernel‑bypass‑class performance approachable.
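For the FEC half of that bullet, the simplest scheme is XOR parity: for every group of equal‑length datagrams, send one extra packet holding their byte‑wise XOR, and the receiver can rebuild any single lost packet from the survivors. A hypothetical sketch, not NimbusNet code:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte-wise XOR parity over a group of equal-length packets.
std::string xor_parity(const std::vector<std::string>& group) {
    std::string parity(group.at(0).size(), '\0');
    for (const auto& pkt : group)
        for (size_t i = 0; i < parity.size(); ++i)
            parity[i] ^= pkt[i];
    return parity;
}

// Recover one lost packet: XOR the survivors back into the parity.
std::string recover(const std::vector<std::string>& survivors,
                    const std::string& parity) {
    std::string lost = parity;
    for (const auto& pkt : survivors)
        for (size_t i = 0; i < lost.size(); ++i)
            lost[i] ^= pkt[i];
    return lost;
}
```

One parity packet per k datagrams costs 1/k extra bandwidth and tolerates a single loss per group; losing two packets in one group still requires retransmission.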
Next steps
- SO_REUSEPORT + sharded accept rings → horizontal scaling on 64‑core EPYC processors.
- TLS off‑loading via kTLS with io_uring::splice.
- eBPF tracing to pinpoint queue depth vs. tail latency.
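The first of those next steps is easy to preview: with SO_REUSEPORT, each worker owns its own listening socket on the shared port and the kernel load‑balances incoming connections across them. A minimal sketch (hypothetical, not NimbusNet code; port 9103 is arbitrary):

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Bind two SO_REUSEPORT listeners to the same loopback port; in a real
// server each would live in its own worker thread with its own ring.
// Returns true if both binds succeed.
bool bind_two_reuseport_listeners(uint16_t port) {
    int fds[2];
    bool ok = true;
    for (int& fd : fds) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        sockaddr_in addr{};
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
            listen(fd, 16) != 0)
            ok = false;
    }
    for (int fd : fds) close(fd);
    return ok;
}
```

Pairing one such listener with one io_uring ring per core removes the single shared acceptor as a bottleneck.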