Scaling an Embedded Database for the Cloud – Challenges and Trade-Offs

News Room | Published 22 September 2025

Transcript

Stephanie Wang: This talk is going to be about building a database. How many people here have experience building databases or working with database internals? Basically, we're going to talk about some of the lessons learned and the tradeoffs we had to consider when we built a cloud-native data warehouse from the ground up. I think the interesting part about this project is that we tried to scale an embedded database. If you think about it, scaling an embedded database for the cloud is a contradiction in terms, because an embedded database, by definition, runs in-process inside your application, locally on your computer, but we want to make it work in the cloud. How do we make that work? What's the motivation for that?

Currently, I'm a staff software engineer at MongoDB. Previously, I was a founding engineer at MotherDuck, and that's mostly what we're going to discuss: my experience building this data warehouse at MotherDuck. Before that, I was working at Google on BigQuery, which is Google's cloud data warehouse offering. This is not really about scaling to hundreds of millions of users just yet. It's about building something from nothing: the scaling lessons from something that's in-process, in-memory, and fully local to something that has cloud-native capabilities. I consider this going from 0 to 1. I'm hoping that as you come along for the ride with me, you could also think about: what would I do if I were the founding engineer and needed to make these decisions? How would I go about doing that?

What is DuckDB?

Before we jump into talking about scaling something for the cloud, it's a good idea to first talk about what DuckDB is, because understanding its key properties and what it does well is key to understanding why we faced certain challenges while trying to scale it for the cloud. Who knows about DuckDB or has used it before? I don't think it's completely new to folks, though only a minority have actually heard of it or used it. A quick intro on what DuckDB is: in general, it's a very fast, in-process database system. You can think of it as SQLite, but for analytical queries; they're very similar in terms of their goals and architecture. A few key properties of DuckDB. First, it's in-process. That means the database lives inside your application. This is quite different from traditional databases that live in distributed systems.

Some examples you may know of are MySQL and PostgreSQL. DuckDB is very different. It does not deal with networking, so there's no network latency. The data you query is right alongside your application. It's like a library that you link in. Everything takes place in the process. It's very fast, because it doesn't have to deal with the coordination overhead of a distributed system. It is focused on the single-user experience, which means there's no coordination amongst multiple users that you have to worry about.

Most traditional data warehouses or databases want to support multiple people connecting to a database at the same time. That's not really the goal of DuckDB. It's made more for a solo data scientist or data engineer who is interested in doing ad hoc analytics on their laptop and wants to get to results and insights really quickly. That's the target audience for the DuckDB open-source project. It's embedded. In other words, the database comes to your data, not the other way around.

Traditionally, most databases or data warehouses expect you to upload your data to the warehouse before you can run a query. There is that big data ingestion path you have to get through first. DuckDB flips that model: if you have a local file on your computer, whether CSV, Parquet, JSON, or whatever format is popular, the database is right next to your data and you can just query it directly. That, combined with the fact that it's in-process, makes DuckDB very powerful for exploratory data analytics. If you want to do something quickly, like I said, it's super fast to get started. There's no signing up for a cloud account. There's no moving your data around and incurring ingestion burden. You can just get started really quickly.
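To make that local-first flow concrete, here is a minimal sketch using the DuckDB Python API. The file name is hypothetical; the point is simply that the query runs in-process against a local file, with no server and no ingestion step.

```python
# pip install duckdb
import duckdb

# In-process: no server, no signup -- the engine runs inside this Python process.
con = duckdb.connect()  # in-memory database

# The database comes to your data: query a local Parquet file directly.
# "events.parquet" is a made-up file name used for illustration.
rows = con.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()

print(rows)
```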

That's probably why DuckDB has gained massive popularity in the open-source community over the last few years. It was started in 2019 by a few folks from the Dutch research institute CWI, the same institution where the Python programming language was created. Just a few days ago, DuckDB's GitHub stars surpassed 30,000. GitHub stars are obviously not an absolute metric of whether something is good, but they are a proxy for whether people are embracing a technology.

In DuckDB's case, it's really getting popular amongst ML practitioners, data scientists, and data engineers who need to do quick query analytics. This is the trajectory of that popularity since its inception in 2019. You can see it's almost a hockey-stick trend. This shows that DuckDB is no longer a niche research tool or a research project; quite the opposite, it's becoming something of a standard in the modern data analytics world. Here is the trend in the DB-Engines Ranking, a popular metric in the database community for figuring out which database technologies are popular. It combines different signals, like discussions, job postings, and other things on the internet. DuckDB is, I think, one of the fastest rising stars on the DB-Engines Ranking, which really puts it on the map alongside some of the very established analytics databases. As you can see, people love DuckDB. They love using it for local-first analytics.

MotherDuck (Serverless DuckDB)

As people embraced the technology, which is really meant for a lone duck sitting at their computer analyzing some files, they discovered the need for cloud capabilities: what if I'm working on a set of data, discover something interesting, and want to share it with my colleagues, for example? And if I were to do that, how do I do it securely? How do I manage all of the complexity that comes with provisioning compute, orchestration, and so on? That's the motivation for us to start MotherDuck.

At a very basic level, MotherDuck was started as the serverless DuckDB, where we want to provide these cloud capabilities to DuckDB for very easy and cost-effective data analytics. We wanted to make sure that fundamentally we do not trade off DuckDB's ease of use and simplicity, but on top of that, we wanted to give DuckDB users more power. They're already doing things locally, enjoying it, and loving it; so how do we give them even more capabilities on top of that? That was the motivation for starting MotherDuck.

Why is it Hard?

You may ask, why is that hard? Why is it hard to build a serverless DuckDB? Can't you just toss DuckDB on a massive EC2 instance, call it a cloud database, and have it gain magical capabilities to fly in the cloud? It's very intuitive, and a lot of people have in fact tried to do it. The core challenge, ironically, is that DuckDB is not a client-server database. Just putting wings onto something that's not built to fly is not going to make it fly.

Although in this case the elephant is flying, that's not what happens with DuckDB. Some of the interesting things that we'll delve into relate to the core properties of DuckDB itself. Let's first think about what a client-server database execution path looks like. This is probably familiar to most people who've used database technologies. You have an application which is responsible for sending a SQL query to some type of server. Your server then reaches out to some remote data source, processes the query, and sends the results back to the client. This model is inherently multi-user and supports networked query execution.

On the other hand, if you look at DuckDB's execution model, DuckDB lives right within your host application. It gets the SQL query from the host application, and it gets the data directly from it as well, whether from in-memory data or from your disk. Sometimes it can also reach external data sources if you use DuckDB's HTTP support. For the most part, though, it executes everything within your host application. This innate difference between a traditional client-server execution model and DuckDB's in-process execution model is what makes it difficult to make it fly in the cloud, because it does not support any type of networking. There's no socket. There are no RPC calls. There's no need for any of that as far as DuckDB is concerned.

Very ironically, the things that make DuckDB special, fast, and easy to use are the very things that make it hard to cloudify. It's in-process, which means it doesn't support networked query execution. It doesn't have external metadata management; it doesn't track anything that lives outside your host application, because that's not its concern. It's single-user focused, which means it doesn't support multiple people interacting with a database, a database file, or a dataset at the same time. Traditionally, if you have a data warehouse, you would expect some type of data publisher making a change to the data, and then other people should be able to pick up the changes so they can continue to produce more analytics or find insights on top of that data. That is not the way DuckDB is built. It's also embedded, which means it doesn't have persistent or long-lived compute. In other words, it lives and dies with your host application.

For example, if you spin up DuckDB in your local CLI, the command line tool, you can download the binary easily, open it, run some queries, and everything works great. Then you exit that process, and all of the state from that process, for example a credential you created to access a file living in cloud object storage, is gone. You can persist things into a local file instead, but then it's unencrypted and you have to deal with the question of how to keep that local file safe. There are a bunch of things that get annoying when it comes to making this data processing or data analytics collaborative, and also easy to manage and easy to scale up and down.
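The embedded lifecycle is easy to see from the API. A minimal sketch, assuming the standard DuckDB Python client; the file name is made up:

```python
import duckdb

# Ephemeral: all tables, temp state, and session settings vanish
# when this process exits.
ephemeral = duckdb.connect()  # in-memory database
ephemeral.execute("CREATE TABLE t AS SELECT 42 AS answer")

# Persistent, but only as an unencrypted local file that you now have to
# manage, back up, and share yourself. "analytics.duckdb" is a made-up name.
persistent = duckdb.connect("analytics.duckdb")
persistent.execute("CREATE TABLE IF NOT EXISTS t AS SELECT 42 AS answer")
persistent.close()  # the .duckdb file stays on disk; nothing else survives
```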

Challenges of Cloudifying DuckDB

To summarize, the challenges of cloudifying DuckDB, in my mind, boil down to four main categories, at least for the 0 to 1 journey. Of course, there will be more challenges later on as we scale out to millions of users, but these apply just to getting started.

First, we have the coupled compute and application problem, because it's embedded. We have coupled compute and storage as well, because, like I said, DuckDB reads from and writes to your database file entirely within your local environment, accessing whatever is there. Everything is coupled. There's limited support for concurrent reads and writes; a top use case would be multiple people trying to read a set of data while that data gets updated. And finally, distribution. The cloud data warehouse space is very crowded; now we also have data lakehouses, like Databricks, and there are lots of different options out there. When we started, of course, we were thinking: DuckDB is super popular, it's got all these stars on GitHub.

So how do we leverage the popularity of the open-source project while we build the cloud version of it? We'll go through each of these challenges and analyze a bit what they mean for working in the cloud. Thinking as a founding engineer, what are some of the options we considered, the option we went with, and the tradeoffs we had to contend with because of that choice?

1. Coupled Compute and Application

Let's first take a look at coupled compute and application. Being embedded means that the lifecycle of your DuckDB instance is completely coupled with your host application. Your host application can be anything from a Python Jupyter Notebook to a command line tool, an API server, or even a web browser via the WebAssembly build, which is actually one of the most popular DuckDB builds. You can have a DuckDB instance running in your browser using the browser's memory.

In other words, when your app starts, DuckDB starts, and when your app exits, DuckDB disappears, along with all its in-memory state and open resources. DuckDB is a library, which you can use to interact with and build things on top of. It is not a service. There is no socket, no clients, no authentication, no orchestration layer, no control plane, none of that. It's just function calls to DuckDB and getting results back, and you can do more things with whatever you get out of it. What does this mean for the cloud? In the cloud, users expect things like concurrent sessions. You need to be able to connect to your database instance from different places, potentially.

For example, you can have one client connection coming from your command line, another coming from the web browser UI, and yet another coming from a Python Jupyter Notebook. You should be able to support something like that even for a single user. We also want authentication and access control so that everybody has an isolated environment that is secure and safe to operate in. Sometimes people also want the ability to send SQL over HTTP or a REST protocol, which also involves building out this networking layer.

Thinking like a founding engineer, what kind of end-user experience do we want to achieve, primarily for existing DuckDB users who already love DuckDB and may eventually want cloud capabilities on top of it? First and foremost, we want to preserve DuckDB's embedded, fast, local-first model. We don't want to change that. We don't want people to have to choose: ok, I love DuckDB, but if I go to the cloud version of it, I will no longer be able to use DuckDB the way I always have, or I will have to pay for it. People are not going to like that.

Then we also want to support cloud-scale workloads: much larger files, big joins, things that your local computer cannot support. My computer, for example, is a MacBook M4 with 48 gigs of memory. A lot of people have powerful machines on their hands; however, that's still limited. If you run a huge hash join, it may consume more memory than you have locally. That's a good use case for offloading to the cloud. We want to avoid rewriting DuckDB into a full distributed system. This is probably still somewhat contentious today, but hardware has advanced dramatically over the last 10 to 20 years.

What you got for 50 bucks 20 years ago is very different from what you get for it today, and because of that you can have very powerful hardware that scales up tremendously without having to scale out. We want to stretch the limits of a single node as far as we can before even considering distributed-system options, because once you go distributed, your problems also go distributed, and it gets really difficult to manage. We believe that for most ad hoc, interactive query analytics, there's no need for it. That's the goal we set out with. We also wanted to make it feel like it just works from the user's laptop or notebook.

We want this experience to be completely seamless. Users should only have to change something very small in order to connect to their cloud backend. We don't want them to do more work than they have to, because these people are already used to the seamless, easy experience they get with DuckDB, and when they come to use something built on top of DuckDB, they bring that expectation. You're not Databricks, you're not Spark, you're not Snowflake. You are different, and the difference comes from the simplicity and the ease of use. These were our guidelines when we started building out the networking layer that enables cloud capability.

The option we went with is to have a local instance of DuckDB running alongside a remote instance of DuckDB running in the cloud. In other words, users keep whatever they've already been using on the client side. On the left-hand side of the diagram is the client world. DuckDB offers a set of client APIs and already supports popular languages like Python, R, Java, Julia, Node.js, even Wasm and ODBC. We want to support these client connectivity options, but allow users to simply change part of their connection string; doing so triggers an authentication flow that connects them to their cloud-hosted version of DuckDB, which is managed by the MotherDuck service.
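As a rough sketch, that connection-string change looks something like the following from Python. The md: prefix is the mechanism described in the talk; the database name is hypothetical, and the exact authentication flow (browser-based sign-in or a service token) may differ from what is shown here.

```python
import duckdb

# Plain local DuckDB, exactly as before.
local = duckdb.connect("analytics.duckdb")

# Same client API, but the "md:" prefix routes the session to MotherDuck.
# On first use this triggers an authentication flow (or you can supply a
# token out of band). "my_db" is a hypothetical database name.
cloud = duckdb.connect("md:my_db")

# Queries now run against the cloud-hosted DuckDB instance where appropriate.
print(cloud.execute("SELECT 1").fetchall())
```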

On the client side, we do that by injecting a client extension. DuckDB is built on a similar model to Postgres, in the sense that there is a very extensive extension ecosystem. The core of DuckDB is kept very small, because not everybody needs geospatial, Iceberg, or Delta Lake capabilities; those only come into the picture if you need to scan geo-related data or access lakehouse formats. The DuckDB core is very small, and developers other than MotherDuck, anyone, one of you for example, can build a DuckDB extension to do something specialized. That's what we piggybacked on: we built MotherDuck as a service whose client-side part is an extension that injects into DuckDB and gives it the capability of making RPC calls over the wire to access the cloud.

In the cloud we have another, managed instance of DuckDB, which has a server extension that talks to the client extension. This lives within Kubernetes processes and is managed by the control plane, which MotherDuck is responsible for. We also have a persistent storage backend that stores all of the database files people write to, so all of these files become persistent; they no longer live on your local computer, they live in the cloud. We'll talk a bit later about the storage engine and storage layer as well, because you will see that DuckDB is not actually designed to work very well with cloud object storage. In general, this is the main architecture: simply put, a local instance of DuckDB interacting with a cloud-hosted instance of DuckDB, and the cloud-hosted instance is the one that scales according to your workload.

Multi-tenancy in MotherDuck works in such a way that everybody has their own personal data warehouse, in a sense. DuckDB is not multi-tenant by design; every single instance of DuckDB is process-local and single-user. In the case of MotherDuck, when you connect, it spawns a dedicated Duckling, which is what we call the Kubernetes process that runs per user session or query. This naturally ensures very strong isolation across users. The MotherDuck control plane handles authentication, authorization, and data access control centrally. DuckDB itself, the database core, is completely separated from the control plane.

This separation of concerns is important so that you can manage the service without messing with the database internals themselves. It naturally eliminates cross-user interference by never sharing a DuckDB instance amongst different users. Each Duckling, again a Kubernetes process, runs in a single-tenant fashion, so there's no shared memory or execution state amongst Ducklings. From the connection and session management standpoint, the client embeds DuckDB and initiates remote RPC calls when necessary. Remember the diagram we saw earlier: the client is the one that always initiates these RPC calls. Each query is executed locally and, only when applicable, in the cloud. We have a rather complicated execution framework that allows DuckDB to determine whether a query should be executed fully locally or in the cloud; I'll talk more about that later in the talk. Each query is session-scoped, but stateless.

A session might have its own temp tables, authorization context, and so on, but the underlying process can be short-lived. A Kubernetes pod can be spun up for just that query and then die; it's completely ephemeral. There's no persistent connection, like a TCP connection, or server-held session state, like you see in PostgreSQL or some other data warehouses. It's not stateful and it's not always on. You can think of it like this diagram, where you have different users, A, B, C, different ducks. They connect to the MotherDuck service. We have a MotherDuck router that routes them to their dedicated Duckling, so each of them has a personal data warehouse that scales up and down depending on their workload.

User A, for example, has a ton of data to process and runs large joins, so they get a Jumbo instance of a Duckling. Users B and C are just running standard workloads, so they get a standard instance. User D has a Pulse instance of a Duckling, which is a tiny one; I think it may be free as well. There's a generous free plan for developers who are just playing around with the service. That's the compute layer. Then you have the storage layer, which is fully separated from the compute layer. This allows us to scale compute separately from storage, again depending on your needs.

Thinking like a founding engineer, what about resilience and fault tolerance? You must be wondering about this by now. There is a famous statement made by Randy Bias in the early 2010s: "Treat your servers like cattle, not pets". Oftentimes, single-node systems are considered not resilient and not fault tolerant because each server is so special; you just have the one. If you look at this architecture, it looks like user A only has one instance of a Jumbo Duckling, and so do users B and C. That doesn't look so good. How do you make sure that things are actually resilient? We don't want our servers to be cute, special Ducklings that can never be allowed to die.

In the MotherDuck case, we have rubber ducks instead. These servers, like I mentioned, are first of all ephemeral; they live and die depending on whether a query is coming in or not. If there is nothing to work on, the process just terminates. They're stateless; they don't save any user configuration or contextual information for a particular session. And they're completely isolated from each other: user A's Jumbo instance has nothing to do with user B's or user C's instances. Everybody has their own. That means that if something bad happens to your rubber Duckling, it just terminates. We have a bunch of sitting ducks waiting in Kubernetes that get spun up immediately as soon as that particular instance fails and the router kicks in. All it requires is a client-side retry, which re-runs your entire query. This model allows us to adhere to cattle-style infrastructure, but it's not without tradeoffs either.
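Since recovery leans entirely on client-side retries against stateless, ephemeral backends, a retry wrapper is essentially all the client needs. The sketch below is purely illustrative and is not MotherDuck's actual client code; the connection string, error handling, and backoff policy are assumptions.

```python
import time
import duckdb

def run_with_retry(connect_str: str, sql: str, attempts: int = 3, backoff_s: float = 0.5):
    """Re-run the whole query if the ephemeral backend instance dies mid-flight.

    Because the backend is stateless and session-scoped, there is no partial
    state to resume: the safe recovery is simply to reconnect and replay.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            con = duckdb.connect(connect_str)
            try:
                return con.execute(sql).fetchall()
            finally:
                con.close()
        except duckdb.Error as err:  # real code would inspect whether the error is retryable
            last_error = err
            time.sleep(backoff_s * (2 ** attempt))
    raise last_error

# Example (hypothetical database and table names):
# rows = run_with_retry("md:my_db", "SELECT COUNT(*) FROM trips")
```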

Comparing the tradeoffs of this per-user scale-up model versus a traditional cloud data warehouse model, I'll first talk about fault tolerance, which is at the bottom of the slide. One thing we do give away with the initial version of this implementation is that you can't easily resume a query from a midpoint, like you can with cloud data warehouses that have tons of state tracking. It's not an unsolvable problem either. The use cases we want to focus on are ad hoc, interactive query analytics.

Most of the queries that we, or our customers and users, deal with are not long-running jobs that run over periods of days. They're rather quick; they're powering some type of BI dashboard, for example, where you want a super-fast experience. So this is not really a huge concern for us. However, if you want to support tons of concurrent long-running jobs and be able to resume from somewhere in the middle, because you don't want to run something for two days, have it fail, and have to kick off the entire thing again from day zero, then that's different. A traditional data warehouse probably has better operational guarantees there, but it also trades off on the simplicity side.

On the user experience side, especially if you're a DuckDB user, it's really easy to connect to this cloud data warehouse and get cloud capabilities without having to install a new package or binary on the client side just to interact with the server-side stuff. That's one thing we wanted to support.

On the multi-tenancy aspect, we get natural isolation amongst all of these users, and everybody gets a personalized version of their data warehouse. That's something we wanted to offer. It allows us to scale up instead of scaling out, which is a far more complicated form of scaling, because then you have to worry about shuffling and a ton of different state management. We felt it was a reasonable tradeoff to make. A traditional cloud data warehouse can obviously still offer multi-tenancy, but that usually demands advanced workload management, and it's going to be more complicated.

In terms of startup cost and latency, it's very easy for us to spin up a Duckling, which is a very lightweight unit of compute. I think connecting to MotherDuck and running your first query takes less than 200 milliseconds or so, which is really fast. With a more traditional data warehouse, because of the scale-out model, it takes some time to get the cluster ready for you, so the startup cost and latency are usually higher. In terms of cost, George mentioned that a lot of data warehouses are always on, so you have the reservation pricing model. These days, we do see a lot of other serverless data warehouses offering a pay-for-use pricing model.

In our case, because we can match the user's needs with different instance sizes, we're able to price much more cheaply for those who don't demand a lot of resources. Also, because we can execute queries locally on your computer as well as offload queries to the cloud, you only ever pay for what you use in the cloud. In the traditional model, you pay for everything in the cloud, because you always have to move your data to the cloud in order to run the query at all. Those were the important considerations for us when we got started, and those are the tradeoffs of adding the server and orchestration layers to DuckDB to enable the serverless experience in MotherDuck.

2. Coupled Compute and Storage

Now coming to the second challenge: coupled compute and storage. DuckDB is powerful locally, but limiting in the cloud. What I mean by that is that DuckDB is an in-process, single-file analytics engine. When you use DuckDB, you will notice that it creates a .duckdb file on your local computer; it uses one such file per database and keeps reusing it. All of the compute happens on the same machine as the storage. Like I said, it's usually a .duckdb file, but it can also be a local Parquet file if you explicitly export your data to Parquet, or CSV or JSON formats; either way, it all happens locally on your machine. It assumes exclusive ownership of that single database file.

Most importantly, it writes directly to local disk using random-access I/O. This is a big problem for cloud-backed object storage. In other words, there is no native separation of compute from storage. Going back to the earlier thought: why don't you just throw DuckDB on a massive EC2 instance and call it a day? The reason is that compute and storage are completely, tightly coupled. You can't just throw more compute at DuckDB without worrying about the storage implications.

This model is difficult for the cloud, because you cannot scale compute independently of storage, like we said. It's also difficult to use cloud object storage like S3 directly, because objects in cloud object storage are immutable: you write a file into S3 and then you move on. You don't go back and modify your files in S3. That doesn't work out of the box with DuckDB's model of random-access I/O to the database file. It's also not built for multi-user, or even multi-writer, scenarios.

By multi-writer scenarios, I mean you have one single database and multiple people can write to it at the same time; this can happen, but it's not a super common use case in the data warehouse scenarios we've seen. More commonly, you have someone writing to a database, changing the database's files, tables, views, and so on, and then many other people need access to view the state of that database. This is not something you can just do with your local DuckDB. You'd have to send the .duckdb database file to your co-worker if you want them to look at it, and even then it's very difficult to get back to the state it was in when you shared that file. It's not easy to do.

Now, thinking like a founding engineer, what are some viable options to support the separation of storage and compute in DuckDB? Like most problems, this has been solved before. The very common way to support something like this is log shipping backed by cloud object storage, a model used by Postgres and Redshift. The challenge with DuckDB is that because it's optimized for analytical rather than transactional or operational workloads, it works differently. The assumption behind log shipping holds only if writes to the database file always go through the write-ahead log.

In DuckDB's case, when it performs bulk updates, it often skips the write-ahead log entirely: it chunks everything together and writes it out later. This type of operation breaks the assumption log shipping makes, which makes it naturally incompatible with DuckDB's I/O model. Log shipping is not something we could easily do without changing how DuckDB's storage engine works internally. Given these constraints, the real questions are: how can we safely run DuckDB in a multi-user environment?

Meaning we want to support use cases where, say, a team can collaborate on a single set of data: someone makes changes, and other people can read them very easily. How can we put data in the cloud without rewriting DuckDB itself? Because that's going to take a lot of work. We also don't want to break the local-first design, because we always have to deal with the fact that DuckDB is an open-source project. There are people who exclusively want to use DuckDB for local scenarios. We're trying to add cloud capabilities to it, and we don't want to break any existing promises DuckDB makes to its current local-only users.

Then, how do we also handle things like snapshots, rollback, or time travel easily? It's very common for people to want to say: I want to look at my dataset as it was 30 days ago. How did it look? What were the numbers back then? How do we support that easily without keeping tons of copies of different variations of the data around? And how do we reconcile cloud-scale durability with DuckDB's file-based model, which, like we said, is a single file with random-access I/O?

The idea we came up with is that we don't actually need to rewrite DuckDB. What we need to do is virtualize the file system it talks to. What do I mean by that? Thinking like a founding engineer, we want to keep DuckDB the way it is but make it work with cloud-backed object storage, because cloud object storage is durable and very cost-effective. So we introduced a storage engine that we call differential storage.

The differential storage engine is a snapshot-based, append-only system that makes cloud object storage look like a local file. That's done through the middle oval in the diagram, FUSE, which stands for Filesystem in Userspace. DuckDB continues to do whatever it does, but FUSE translates its file operations, reads and writes, into calls to differential storage. Differential storage is what keeps snapshots of these data files whenever a write comes in. This layering approach lets people keep reading the latest snapshot while writes continue to happen under the hood without interfering with those reads.

The magic here is really done by the FUSE file system in the middle. It acts as a middleman: DuckDB still thinks it's writing to a local file, but in reality it's writing to this versioned, layered differential storage under the hood. Looking at differential storage in a bit more detail, here's what I meant earlier. I have these layers of snapshots. You start from the bottom layer; when a write comes in, you add another layer on top, and you keep going up. These layers are point-in-time snapshots of your database file. They are DuckDB database files, and they're immutable. Reads never interfere with writes, so you naturally get isolation between read and write workloads.

Then we have a metadata service that keeps track of these layers and can formulate a logical file view of your database files. That's the logical layer of the database. When you read, you're reading from the logical view of the file, but the physical file is stored in fragments, snapshots layered under the hood. All of these layers are stored in and backed by cloud object storage, which is very durable and very cheap. That's how differential storage works at a high level.
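To make the layering idea concrete, here is a purely illustrative sketch, not MotherDuck's implementation, of how a metadata service could resolve a logical file view from immutable snapshot layers. The class names, block granularity, and object-store keys are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotLayer:
    """One immutable, append-only layer: maps file block numbers to objects
    in cloud object storage. Names and structure are hypothetical."""
    layer_id: int
    blocks: dict[int, str]  # block number -> object-store key

@dataclass
class LogicalFileView:
    """What DuckDB (via FUSE) 'sees' as a single local database file."""
    layers: list[SnapshotLayer] = field(default_factory=list)

    def write_snapshot(self, blocks: dict[int, str]) -> None:
        # Writes never modify earlier layers; they just add a new one on top.
        self.layers.append(SnapshotLayer(len(self.layers), blocks))

    def resolve_block(self, block: int) -> str | None:
        # Reads walk from the newest layer down, so readers pinned to an older
        # snapshot are never disturbed by concurrent writes.
        for layer in reversed(self.layers):
            if block in layer.blocks:
                return layer.blocks[block]
        return None

view = LogicalFileView()
view.write_snapshot({0: "s3://bucket/db/v1/block0", 1: "s3://bucket/db/v1/block1"})
view.write_snapshot({1: "s3://bucket/db/v2/block1"})  # only block 1 changed
assert view.resolve_block(0) == "s3://bucket/db/v1/block0"
assert view.resolve_block(1) == "s3://bucket/db/v2/block1"
```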

3. Limited Concurrent Reads and Writes

What's interesting is that this architecture naturally supports better concurrent reads and writes. That's why we came up with the idea of database shares. Like I said, say you're on a finance team, you're analyzing some data, and you want your team to work with you on it. How do you do that? You can share your database, and you can have that share updated automatically. We've introduced options where you can say: update my database share instantly and automatically, which publishes the latest database view to everyone subscribing to your share. Or you can configure manually when you want to update it. This is all powered by differential storage. A view of what actually happens looks like this: the people accessing your database share are actually accessing the logical database file view. Underneath, you have the active snapshot layers.

Then you have previous snapshot layers. You can imagine that this lets you support time travel and database sharing very easily, because it's not copying data again and again. Sharing data is a simple metadata operation, which is very fast. It doesn't modify physical bytes under the hood; it just reconfigures the logical database file using the metadata we store on the side. The metadata is stored in a PostgreSQL-compatible database; we use CockroachDB because we want fast writes to it and fast reads from it. This allows you to have one database that someone can keep writing changes to, while lots of other people access it in a read-only fashion.
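For a sense of what the sharing workflow looks like from the client, here is a hedged sketch. The statement names follow the general shape of MotherDuck's share commands, but the exact syntax, the share URL, and the database and table names are assumptions; consult the current documentation rather than treating this as authoritative.

```python
import duckdb

# Publisher side ("finance" is a hypothetical cloud database name).
pub = duckdb.connect("md:finance")
# Expose the database as a share. Exact SQL syntax is an assumption here.
pub.execute("CREATE SHARE finance_share FROM finance")

# Consumer side (a teammate). Attaching a share is a metadata operation over
# differential storage, so no physical bytes are copied.
sub = duckdb.connect("md:")
sub.execute("ATTACH 'md:_share/finance_share/<share-id>' AS finance_ro")  # placeholder URL
sub.execute("SELECT COUNT(*) FROM finance_ro.main.invoices")  # hypothetical table
```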

This is a common use case for people who want to collaborate on a set of data: sharing their database and database views easily, doing time travel later on, and even branching. Those capabilities are easy to add with this type of architecture. Of course, you may be wondering: what if you want to support multiple writers to the same database? That is a different challenge, and one that I think needs to be solved at the DuckDB level itself. It's not something the service layer can easily solve, because deep down you need to resolve the locking and coordination issues that come with multiple writers. That's a much harder and more complicated problem. Again, it comes back to your use case and your workload. In our case, it's not something we needed to confront going from 0 to 1, so we deferred it as a problem to solve later, which is important to do when you're a startup.

Comparing the architectural tradeoffs of the layering storage approach we went with versus rewriting the storage engine, which you could also do to achieve something similar: with the layering approach, we completely preserve DuckDB's design. Nothing changes on the DuckDB side. It works as-is; it thinks it's writing to a local file, and that's it. We also enable cloud-native durability, sharing, and elasticity.

If you have more files to write, you can just scale the cloud object storage naturally, and it's very durable. You can share, which is a simple metadata operation, as opposed to making multiple copies of your actual physical bytes. And we avoid major engine rewrites on the DuckDB side. Of course, this doesn't come for free. We do have to manage that metadata layer, the metadata database we keep in Postgres, because you need to keep track of things like: what is my latest snapshot layer? Which data chunks are active? All of that has to be tracked somewhere. That's the metadata service. Also, when data becomes stale, you want to do garbage collection over time. You have to manage the complexity that comes with this metadata over time.

Another thing is that we do not get storage-aware query optimization, because by design DuckDB doesn't know that, deep down, we have this new way of storing data. Snowflake and BigQuery, for example, can rearrange partitions across datasets to make scans more efficient at query time; when you ingest data, it may already be stored differently. That's all possible in a traditional data warehouse, but it's not going to happen here, because DuckDB is fully unaware that the underlying storage has changed. Again, tradeoffs. What we wanted was to keep DuckDB the way it is, and we're ok dealing with the metadata overhead; it's been working pretty well. We really wanted zero-copy clones, easy snapshotting, time travel, and all of that. This approach gives us those things with better developer ergonomics, and we're ok deferring more advanced optimizations to a later time, or to never.

I also want to mention that there are open table formats these days that support some type of multi-user experience as well, such as Iceberg and Delta Lake. Iceberg is a very open format: you basically store all of your data in Parquet files and manage your metadata using JSON and Avro files. Every time you write something into your cloud object storage, you also produce a few other files that keep track of the state of your files.

The point of that is to organize your files as if they were tables, without locking yourself into any particular cloud data warehouse vendor. You keep all of your data in your own object storage bucket. Commonly we have Iceberg and Delta Lake, which was released by Databricks. Now we also have DuckLake, which is DuckDB's take on an open format. The main difference between DuckLake and the existing open table formats is that DuckLake uses a DuckDB database, or any SQL database, to manage the metadata and the catalog. People have realized that it's not enough to know that you have these Parquet files organized in a bucket; they also want to know exactly how the files are versioned, what's in them, and where to look up the data related to each Parquet file. That's why you have catalogs. DuckLake just uses a database for this, whereas Iceberg and Delta Lake continue to use files to manage it.
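As a purely illustrative sketch of that catalog-in-a-database idea, the following keeps track of which Parquet files make up which table version using ordinary SQL tables. The schema is made up and far simpler than what DuckLake or Iceberg actually track (schemas, statistics, partitions, deletes), but it shows why versioning and time travel become plain SQL queries.

```python
import duckdb

catalog = duckdb.connect("catalog.duckdb")  # hypothetical catalog database

# Minimal, made-up catalog schema: which Parquet files belong to which
# snapshot of which table.
catalog.execute("""
    CREATE TABLE IF NOT EXISTS table_files (
        table_name   VARCHAR,
        snapshot_id  BIGINT,
        file_path    VARCHAR
    )
""")
catalog.execute("""
    INSERT INTO table_files VALUES
        ('trips', 1, 's3://lake/trips/part-000.parquet'),
        ('trips', 2, 's3://lake/trips/part-000.parquet'),
        ('trips', 2, 's3://lake/trips/part-001.parquet')
""")

# "Time travel" is just a query over the catalog: which files made up
# snapshot 1 of the 'trips' table?
files_at_v1 = catalog.execute(
    "SELECT file_path FROM table_files WHERE table_name = 'trips' AND snapshot_id = 1"
).fetchall()
print(files_at_v1)
```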

This architecture does allow some type of multiplayer experience, in the sense that multiple people can put data into your lake, into your S3 bucket, just like that. But without a server, it's still very difficult to manage exactly how everything is organized. You still have to handle coordination, compaction, authentication, and so on. It doesn't come for free. Your data may be free of vendor lock-in, and you can write to it all you want, but in the end this experience is not serverless; you still have to manage the compute layer yourself.

4. Distribution Challenge

Coming to the distribution challenge. DuckDB is super popular, and we want to build a cloud data warehouse that's also popular, taking advantage of what DuckDB already offers. One option is to go full server mode. This is well understood; Snowflake and BigQuery work this way. They ask you to sign up for an account, you set things up, and then you start querying your data. If we did that, though, we would give up the convenience of the existing DuckDB client APIs: users would have to install a separate package altogether just to interact with our data warehouse. We didn't want that.

Instead, we went with the approach of a local DuckDB instance interacting with a remote DuckDB instance. That way, your local DuckDB can continue to process data as-is. This is an internal look at the client extension and server extension we saw earlier in the presentation. The client extension hooks into DuckDB at the parsing, binding, optimizing, and query execution levels.

The client extension is responsible for determining how your query is going to be executed: locally, remotely, or remotely to what extent. It sends the remote fragments to the remote server, the server extension, initiates the RPC calls, manages the bridges in between, and handles data chunk uploading and downloading and all of that complexity. It can also talk to the catalog server on the remote side to reconcile catalog metadata, so the catalog on the client side gets refreshed. Imagine new tables have been added remotely; you want that reflected locally as well. That capability is also achieved through the client and server extension interactions. We actually wrote a CIDR paper on how this dual query execution model works under the hood; if you're interested, feel free to look it up.
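The routing decision at the heart of that dual execution model can be sketched very roughly as: run each scan fragment next to the data it reads, and bridge between the two sides only when a query touches both. The toy code below is an illustration of that idea under those assumptions, not the actual extension logic or the CIDR paper's planner.

```python
from dataclasses import dataclass

@dataclass
class TableRef:
    name: str
    location: str  # "local" or "remote" -- where the data physically lives

def plan_fragments(tables: list[TableRef]) -> dict[str, list[str]]:
    """Toy planner: scan fragments run next to their data. If both sides are
    involved, intermediate results would flow through the bridge operators
    mentioned in the talk (not modeled here)."""
    plan: dict[str, list[str]] = {"local": [], "remote": []}
    for t in tables:
        plan[t.location].append(f"SCAN {t.name}")
    return plan

query_tables = [
    TableRef("local_events.parquet", "local"),  # sits on the laptop
    TableRef("warehouse.orders", "remote"),     # lives in cloud storage
]
print(plan_fragments(query_tables))
# {'local': ['SCAN local_events.parquet'], 'remote': ['SCAN warehouse.orders']}
```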

Essentially, the tradeoff of having a very fat client, versus a full server with no client or a very thin client, is that we get a convenient and wide distribution channel. We can reach the entire DuckDB community easily: all they need to do to access MotherDuck is modify their connection string, prepending md:, and that immediately connects them to our service and backend. But it does make query planning and optimization more complex; we wrote a whole paper about this. We also have to make assumptions about serialization.

If serialization is lossy, things get lost in translation, and if your server-side optimization is not idempotent, that introduces issues. Some complex queries that involve cross-operator joins and the like can break in this model, because we introduce distributed-system challenges when we split query optimization and planning into these two parts, with multiple clients interacting with the server backend.

I've shared with you some of the challenges we faced while trying to bring an embedded database into the cloud, going from 0 to 1, so that it has some flying capabilities. Here's a little comic strip for you.

 
