DuckDB Co-creator: "It Was Clear That A New Architecture Was Necessary"

Hannes Mühleisen is co-creator of DuckDB and CEO of DuckDB Labs. Together with Mark Raasveldt, he originally launched DuckDB as a research project at the Centrum Wiskunde & Informatica (CWI) Amsterdam.

Golo Roden is the founder and CTO of the native web GmbH. He works on the design and development of web and cloud applications and APIs, with a focus on event-driven and service-based distributed architectures. His guiding principle is that software development is not an end in itself but must always serve the underlying business domain.

Golo: Hannes, you are one of the co-creators of DuckDB and co-founder of DuckDB Labs. When DuckDB version 1.0 was released in the summer of 2024, I reported on it for heise – and a lot has happened since then. Before we go into the details, I would like to start at the beginning: DuckDB has its roots in your research at the CWI in Amsterdam, where you and Mark Raasveldt worked on database internals for years. What was the moment (or gap) when you both decided that the world actually needed another database, and what did you originally want it to be?

Hannes: Back then, we worked quite closely with statisticians who had to analyze large survey data. It was clear to us that they needed database technology! But when we suggested this, they said that they didn’t really want a database in the classic sense. For example, before Docker, it wasn’t easy to install a database locally without being an expert. In addition, you couldn’t easily share the state of the database with someone else.

It was clear that a new architecture was needed: an embedded analytical database system. Nothing like that existed back then. It quickly became clear that we had to build it from scratch – a clean design tailored to the embedded deployment model, with a modern system architecture.

In the summer of 2018, we decided to make this a reality and started implementing DuckDB.

Golo: The term “SQLite for Analytics” has been attached to DuckDB for years. It gets to the heart of a lot in just three words, but can also seem reductive. How accurate do you find this framing from your current perspective, and where does it fall short?

Beyond Big Data

Golo: You have argued for some time that distributed systems are simply oversized for the vast majority of analytical workloads – and that a single modern machine can do significantly more than the industry usually assumes. I also took up this argument in a detailed iX test, in which I positioned DuckDB as a slim alternative to Apache Spark. Would you like to lay out this thesis in your own words? And how do you respond to people who immediately accuse you of underestimating their problem?

Hannes: My argument rests on three pillars. First, hardware development has made great strides, and modern computers are amazingly powerful. Today, a powerful laptop ships with a dozen fast CPU cores, tens of gigabytes of memory, and a fast SSD with terabytes of storage. A server can easily offer ten times that amount or more.

Second, the field of database architecture has evolved significantly since 2010, when big data emerged. We were able to build on results in columnar storage, vectorized query processing, parallelism, and concurrency control. We have also conducted our own research on topics such as compression and operators for data volumes that exceed RAM.

Third, what most people don’t consider is that even if an organization is sitting on petabytes of data, a single query never needs to process all of that data. There is now robust evidence of this: In recent years, both Snowflake and Redshift have published samples and statistics of their user queries – veritable treasure troves for understanding real workloads. George Fraser at Fivetran has an excellent analysis of this, showing that even among queries on Snowflake and Redshift, the 99.9th percentile scans about 300 GB, so it could easily run on a single node.

Golo: Performance is one of the most striking aspects of DuckDB – many early adopters describe their first experience with the words “that can’t be right, let me check the result again”. Which architectural decisions do you think are most important, and which of them are not obvious to outsiders?

Hannes: We have already talked about opting for a single-node architecture, which eliminates various types of overhead in implementation, operation, and performance. But there are also some non-trivial architectural decisions.

We chose vectorized execution over JIT compilation because it’s perfect for analytical workloads and much easier to maintain in the long term. We didn’t use GPUs or exotic hardware like AI accelerators, but rather put all our energy into writing the most efficient algorithms for the CPU. And finally, we deliberately avoided using SIMD intrinsics (hand-written vector instructions) when implementing these algorithms. Instead, we wrote scalar code and let the compiler handle the auto-vectorization. The result is highly portable yet powerful code.
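To make the idea of vector-at-a-time processing concrete, here is a minimal sketch; it is not DuckDB's actual code. It assumes a batch size of 2048 values, and the helper names (`chunks`, `filter_greater_than`, `sum_selected`, `sum_where_greater`) are invented for illustration. Each operator is a plain scalar loop over one batch. In a compiled language, loops of this shape are the kind a compiler can auto-vectorize; Python is used here only to show the structure and gains no SIMD speedup.

```python
from array import array

VECTOR_SIZE = 2048  # assumed batch size for this sketch


def chunks(column, size=VECTOR_SIZE):
    """Yield consecutive slices ("vectors") of a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def filter_greater_than(values, threshold):
    """Scalar loop over one vector: collect the positions that qualify."""
    return [i for i, v in enumerate(values) if v > threshold]


def sum_selected(values, selection):
    """Scalar loop over one vector: sum only the selected positions."""
    total = 0
    for i in selection:
        total += values[i]
    return total


def sum_where_greater(column, threshold):
    """Run filter -> sum one vector at a time instead of one row at a time."""
    total = 0
    for vector in chunks(column):
        selection = filter_greater_than(vector, threshold)
        total += sum_selected(vector, selection)
    return total


prices = array("q", range(10_000))  # one integer column with values 0..9999
result = sum_where_greater(prices, 9_000)
print(result)  # 9490500, the sum of 9001..9999
```

The point of the design is that per-row interpretation overhead is paid once per batch rather than once per value, while each inner loop stays simple enough for a compiler to optimize.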

Additionally – as discussed in the previous question – a lot of current research has been incorporated into DuckDB. Processing data volumes that exceed RAM by offloading them to disk is a key contributor to DuckDB’s performance. Most modern database systems can spill to disk, but when they do, their performance falls off a cliff. DuckDB uses modern flash-based storage to handle this much more elegantly – users often barely notice that their queries have been offloaded to disk.
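As a rough illustration of processing more data than fits in memory, the following sketch implements a textbook external merge sort: sorted runs are spilled to temporary files and then merged in a streaming fashion. It is a generic technique under simplified assumptions, not a description of how DuckDB implements its operators. The function names and the `max_in_memory` parameter are invented for this example.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for item in sorted_run:
        pickle.dump(item, f)
    f.seek(0)
    return f


def _read_run(f):
    """Stream items back from a spilled run, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(items, max_in_memory=1000):
    """Sort an iterable while buffering at most max_in_memory items."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and move it to disk.
            runs.append(_spill(sorted(buffer)))
            buffer = []
    # k-way merge of the spilled runs plus whatever is still in memory.
    streams = [_read_run(f) for f in runs] + [iter(sorted(buffer))]
    try:
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()


values = [5, 3, 9, 1, 7, 2, 8]
print(list(external_sort(values, max_in_memory=3)))  # [1, 2, 3, 5, 7, 8, 9]
```

With `max_in_memory=3`, the seven input values produce two spilled runs and one in-memory remainder; the merge step then reads only one item per run at a time.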

The ecosystem

Golo: DuckDB’s reach into the Python and R communities, into Node.js, into all sorts of tools and notebooks is remarkable. Was this ecosystem strategy a conscious choice from the start, or did it come about because people pulled DuckDB into their workflows?

Hannes: Of course you have to meet the users where they are. Initially, we envisioned that DuckDB would be used for data science workloads, and that determined the initial selection of clients. We obviously needed a command line client. On the language side, Python was already very strong, and we had strong connections with the R community, so we decided to implement these clients first.

Node.js followed soon after. As DuckDB grew, the community began developing clients independently. This allowed us to monitor their adoption before investing the core team’s work into fifteen different drivers. For example, the DuckDB Go driver was initially implemented by Marc Boeker, who later gave the code to the DuckDB Foundation.

Golo: The extension mechanism seems like a rather quiet but very consequential design decision. It allows DuckDB to read formats it wasn’t built for, work with object stores, and even talk to other databases. How do you think about the line between what belongs in the core and what is better off in an extension?

Hannes: We see DuckDB being used in resource-constrained environments – single-board computers, browser tabs, memory-limited containers. To enable this use, we want to keep the core of DuckDB small and only include the essentials: the SQL parser, the database engine, the storage engine, the CSV reader – and the extension mechanism. Most other features such as the Parquet reader or even HTTPS support are available as extensions.

A nice side effect of this powerful extension mechanism is that our community can build its own extensions. There are currently more than 180 community extensions for DuckDB, each of which brings new features to the system and can be installed with a single line.
