Transcript
Marzoev: I’m Alana Marzoev. I’m the co-founder and CEO of Readyset. In this talk, I’m going to be telling you about how you can use a computational model called partially stateful streaming dataflow to build a high-performance SQL cache. Partially stateful streaming dataflow is a relatively new concept that was introduced by the computer systems research community a few years ago.
The main idea is that you can represent SQL queries as these long running computational graphs, and stream data change updates through the graph to incrementally update the result that the graph emits. At Readyset, we’ve been using this computational model to build a transparent database cache that has a performance profile that’s on par with a hand-optimized, custom caching system, but that helps to alleviate a lot of the burden that you typically see with caching, so specifically around having to come up with a cache invalidation policy, rewrite your application, all of that stuff.
Roadmap
I’ll start off by talking about the problem domain that we’re going to be working within in greater depth, so you can get a sense for both the use cases and the workloads that we’re ultimately solving for. Then from there, I’m going to do a deep dive into what I call dataflow-based caching. The way I’ll approach this is by starting off by showing you a demo of Readyset in particular. You can think of it as a representative example of a dataflow-based caching system.
From there, I will describe what you just saw. I’ll start off by going into the interface and the specific design decisions that we made there, and how they affect the usability and programmability of this caching system. From there, I’m going to walk through the life of a query, and this will give you a better understanding of the system architecture and how requests flow through that system. Finally, I’m going to do a deep dive into dataflow. I’ll introduce what dataflow is.
Then from there, I’ll show how we can specifically apply these concepts to the land of relational databases, and how we can specifically solve cache staleness issues in an automated way using dataflow. Finally, I’m going to more explicitly walk through some of the alternative approaches that people use today to address database scaling. I’ll discuss how we see dataflow-based caching relative to these alternatives, so you can get a sense for when it’s appropriate to use and when it might not be.
Problem Domain
Without further ado, problem domain. Imagine you’re building a new web application, and at the moment, it doesn’t really have any users. It’s just you, so your goal is to get users. When you think about how you approach development in this context, the thing that you’re going to be optimizing for is reducing iteration time. If your goal is just to try out a lot of features, see what sticks, you’re not going to do anything particularly fancy with your tech stack to enable that. You’re probably going to pick a programming language that you’re already intimately familiar with. You’ll use the web framework that’s associated with that language.
Then for your database, you’re not going to do anything fancy. You’re probably going to just pick a vanilla, standard relational database hosted by a hyperscaler like AWS. Let’s say that some time passes and all of your hard work and efforts on feature development start to pay off, and suddenly you get this influx of users that you were looking for. Although this is objectively great, given your goals, it actually leads to a new set of technical challenges, most of which are at the database layer.
Concretely what tends to happen in situations like these is you haven’t spent a lot of upfront time optimizing your queries, adding indices, anything like that. When you start to see this influx of traffic, sometimes this mass of users will lead to a large number of queries being run simultaneously, so there’s a lot of contention in your database. That could lead to both slow queries, aka slow page load times, which could lead to churn, and in the worst case, database outages, just because your database has not been provisioned appropriately to solve this problem.
What specifically is happening here that makes it hard for your database to keep up? User facing applications as a broad group are really quite latency sensitive. Because there’s humans on the other end of the screen that are waiting for results, and because modern day ORMs tend to autogenerate large numbers of queries, and also inefficient queries, then a single page load could require running sometimes dozens of SQL queries behind the scenes, and page load times will be bottlenecked by the slowest of the batch. In this context, it’s really important that your tail latencies are always quite low, which is challenging with off-the-shelf databases. The other dynamic at play here is there tends to be a lot of variance in traffic.
The canonical example of this in the U.S., at least, is Black Friday. It’s the day after Thanksgiving. There’s all of these big sales, and truly companies will spend the other 11 months of the year just capacity planning for this day because it’s such a big revenue generation day. You see the same dynamic at play at smaller scale. Let’s say you’re Airbnb, you could totally imagine it being the case that in certain seasons and specific regions, there’s an uptick in bookings because it’s summer vacation season, or something like that. Same could be the case with a B2B SaaS type of situation where you’re going to see the most user sign-ins on Monday through Friday during business hours. There’s a lot of fluctuations, and you have to plan accordingly.
The last is that these types of workloads that are associated with user facing applications are both read-heavy and have quite skewed data access distributions, and this is something that we could actually leverage. Pay extra attention to this one. On the read-heavy side, that’s exactly what it sounds like. There’s more reads than writes. That makes sense with our mental model of the world. When you’re going on a website, the vast majority of the time you’re not going to be actively changing the data, you’ll just be consuming it. That tends to be the case across this broader category. Similarly, the data distribution, or the data that we’re accessing is not uniformly distributed.
Normally, popularity is a dimension at play, and also how recently the content was created. If you think about a forum, for example, the front page of Hacker News shows the most upvoted forum entries. The vast majority of people that visit Hacker News are going to go just to the front page, not to page 103. Then, moreover, the vast majority of things that are posted to Hacker News never see the light of day. They briefly show up on the new page and then no one ever sees them again.
The way that people in this world like to describe that distribution is the Zipfian distribution, which is governed by Zipf’s Law, which is that, generally speaking, the nth most common term in the set is 1 over n times as popular as the most common term. You can see a visualization of that here. In the context of Hacker News, it could be the case that the second most popular entry is half as popular as the top thing on the front page of Hacker News, and so forth.
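To make the shape of that distribution concrete, here is a minimal Python sketch, with made-up view counts, of what Zipf's Law predicts for the top few entries:

```python
# Minimal sketch of Zipf's Law: the nth most common item is roughly 1/n times
# as popular as the most common one. The view counts here are illustrative only.
top_item_views = 100_000  # hypothetical views for the #1 front-page story

for rank in range(1, 6):
    print(f"rank {rank}: ~{top_item_views / rank:,.0f} views")

# rank 1: ~100,000 views
# rank 2: ~50,000 views   (half as popular as the top entry, as in the example)
# rank 3: ~33,333 views
# rank 4: ~25,000 views
# rank 5: ~20,000 views
```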
Let’s say that with all of this context in mind, you’re now starting to solution. You’re like, my database is falling over. This is an urgent problem. How am I going to approach scaling things? The first thing you’re going to try is probably just vertically scaling. You’re going to get a bigger box to run your database on, and hopefully that gets you at least something, because there is some contention from a lot of queries being run simultaneously. It’s a reasonable first step, but it tends to not fully solve the problem, because, as I mentioned, nowadays, the vast majority of companies are using ORMs, which specifically abstract away the database for you. That’s kind of the point. It’s really nice from a usability perspective. A lot of the queries that they generate tend to be quite unnatural, quite inefficient, and there tend to be more of them than necessary.
Given that, oftentimes just adding more cores won’t actually get you the latencies that you really need, given that page load times are bottlenecked by the slowest query. From there, you might think, now let’s take a look at what the queries I’m actually running are, and try to optimize them. That is also a perfectly reasonable solution. What we find is that a lot of people aren’t even aware of the specific queries that their application is generating, due to this ORM obfuscating things, and then those queries tend to be more challenging to optimize even if you have that skill set. It’s not a given that every application engineer is going to know how to optimize SQL queries. That’s a pretty domain specific thing.
Given that, one of the next solutions that you’d go to is caching, because given that it’s a read-heavy, skewed data access distribution, that has caching written all over it. Typically, what this looks like is you have a standalone in-memory key-value store off to the side of your database. You essentially will run the queries against your database and then store the results in this in-memory key-value store, such that the next time a user comes along and requests that same entry, it’ll be there in memory. You just do a key-value lookup, and it’s really fast.
There’s a lot of different strategies for caching, but one of the most common ones, and the one that I’m going to talk primarily about is read-through caching, which is quite simple. The main idea is that, you’re always going to check the cache first to see if the thing that you’re looking for is there. If it’s not, the cache is going to trigger the query that would populate that result against the database, and then it would cache it locally before returning it to the end user, or the application that requested it. Let’s think through what our code looks like before and after introducing read-through caching. For context, this is just a random code snippet I found from a Python, SQLAlchemy, and Redis caching tutorial, but it’ll illustrate my point.
This function here is just fetching a user from the database: it’s opening up a session, using SQLAlchemy to generate the query to pull the user info, and then just returning the user data. It’s as simple as simple can be. This is what your code starts to look like after caching. As you can see, there’s twice as much of it for the same underlying function. The concrete new thing that’s here is this get_user_from_cache function, which is implementing the read-through logic that I previously just described. You take a look at this, and you’re like, there is another helper function. Obviously, this is just one of the places where I might want to start caching. If there’s a lot of them in my app, it’s annoying to add these helper functions. I don’t want to do that. Maybe at the end of the day, it’s not the end of the world. We can write helper functions.
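The slide's code isn't reproduced in this transcript, but a read-through helper along the lines described, assuming a Redis client, a SQLAlchemy session, and an illustrative users table, might look roughly like this:

```python
# Hedged sketch of the read-through pattern described above; the function names,
# schema, and connection strings are illustrative, not the slide's actual code.
import json
import redis
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

cache = redis.Redis(host="localhost", port=6379, db=0)
engine = create_engine("postgresql://app:secret@localhost/app")  # hypothetical DSN
Session = sessionmaker(bind=engine)

def get_user_from_db(user_id):
    # The "before" version: open a session, run the query, return the row.
    with Session() as session:
        row = session.execute(
            text("SELECT id, name, email FROM users WHERE id = :id"),
            {"id": user_id},
        ).mappings().first()
        return dict(row) if row else None

def get_user_from_cache(user_id):
    # Read-through: check the cache first, fall back to the database on a miss,
    # then populate the cache before returning the result.
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    user = get_user_from_db(user_id)
    if user is not None:
        cache.setex(key, 300, json.dumps(user))  # a 5-minute TTL is the whole invalidation policy
    return user
```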
The question is like, is this it? Is your caching problem solved? Are you done with this code addition? The answer to that question is, unfortunately, no, simply because cache invalidation tends to get really messy. Concretely, at every point in the application where you want to introduce caching, you have to really sit down and think through what the user expects from you for that functionality, essentially. Then from there, you have to design a cache invalidation policy that aligns with those user expectations, and implement it. To make that a little bit more concrete, let’s say you have these two settings.
The first is a Facebook News Feed, and the other is a shopping cart in an e-commerce application. In the Facebook example, let’s say that you are a user of Facebook, and so are your friends, and your friend goes to brunch and posts a photo from their brunch at some time point t. Let’s say you don’t see that until t plus 5. Five minutes have gone by, there’s no update from your friends. In the vast majority of cases, you’re going to be none the wiser. Unless you’re truly sitting next to your friend and they asked you, do you see my Facebook post? You’re not going to know that there’s a 5-minute delay between when your friend posted the photo and when you saw it.
In this case, it’s easier to cache, because you don’t have to think too hard about making sure the cache is up to date and returning fresh results to the user. Contrast that to the e-commerce example. Let’s say that you are just trying to buy a t-shirt on the internet, and you’re on an e-commerce site. You add it to your cart, and it takes 5 minutes for it to show up in your cart, and you’re ready to check out. You’re not going to wait 5 minutes under the hopes of like, I want this shirt to show up in the shopping cart. You’re going to go on Amazon, or just buy it elsewhere. These are two examples where the expectations from the user are really quite different in terms of how fresh the data needs to be, and that is going to inform your cache invalidation policy.
Right now, with caching, you have to write a lot of bespoke code for cache invalidation. There’s no automated way of doing this in the general case. You could choose something as simple as a TTL. Maybe every 5 minutes your cache entries get purged and you have to re-request all the data. Or you could do something extremely complicated and sophisticated, where you could be tracking the specific tables that are referenced in the query that’s being cached and trying to gauge, this one was updated, and therefore this cache entry likely needs to be recomputed and so forth. The world is your oyster there. There’s this fundamental tradeoff that I think it’s really important to highlight, where on one end of the spectrum, it’s like, you’re rarely ever evicting.
This is good for your cache hit rate, because when your application checks to see if an item is in the cache, it’s always going to be there because you’re rarely evicting, but in all likelihood, your data is going to be extremely stale, and so it’s not going to be appropriate. It’s not going to align with expectations of your users. On the other hand, let’s say that you’re evicting all the time because you’re like, I really need the data to be fresh. I want to provide the best user experience. Then, at that point you might as well not be using a cache at all, because you’re never really leveraging it. Figuring out exactly where on this tradeoff curve you want to live is a hard problem.
I’m sure many of you have heard this quote, “There are only two hard things in computer science, cache invalidation and naming things”. This quote is referencing this general dynamic of, caching at a high level, it’s super simple. It’s like you’re just precomputing a result set, storing it in memory. What is there to know about that? In reality, it’s actually quite messy. It’s really easy to introduce bugs. Let’s say that you have sat down, you’ve thought through the user expectations. You’ve designed an invalidation policy, and now you just have to write the code to do that. That code itself could actually be quite nuanced.
Concretely, you could accidentally introduce a distributed systems bug, like a thundering herd, where you are evicting a popular cache entry unintentionally, and that’s triggering a storm of database queries against your database, which wasn’t provisioned accordingly. Then that takes it down, which, again, at that point, why are you caching? You’re making your own problem worse and shooting yourself in the foot. When I take a step back and think through a cost-benefit analysis of caching, this is what I would broadly come up with. On the pros side, you get the best-case query latencies when you get a cache hit. Because you are doing all of the work beforehand, all you’re doing when you’re doing a read is an O(1) lookup into an in-memory, lock-free data structure. Caching asymptotically doesn’t get better than that.
Similarly, there’s a much more straightforward path to scaling, as opposed to sharding your database. Making your cache scale horizontally is a lot easier than making your source of truth scale horizontally. There are a lot of drawbacks as well. You have to rewrite your application, which is already not ideal. Caching is quite error prone. There are at least three different places where you can get it wrong: the user expectations level, the cache invalidation policy level, and the actual implementation. When you do make a mistake, bugs are quite obvious to the end user, or otherwise they could take down your database. The stakes are really high, and you don’t have any isolation between the cache and the database. That would lead you to potentially pause and think, do I really need a cache? Only if you really needed it, would you proceed with this plan.
If we were to wave a magic wand and elucidate specifically what we would want from an ideal caching system for this purpose, I think it would look something like this. One is, we wouldn’t have to rewrite the application. Ideally, we’re able to start caching in a way that is non-intrusive, doesn’t take a lot of engineering time, and so forth. Two is that we wouldn’t have to think so hard about cache invalidation and worry about staleness concerns, because, as I just described, it’s a pretty fraught process, and it’s really easy to make mistakes. Then, finally, the cache should absolutely never make things worse, because presumably you’re adding this to your stack, because you’re already in a bit of a pickle when it comes to databases falling over and so forth, and you really just don’t want to be making it worse by trying to make it better.
Dataflow-Based Caching: Demo
With all of that in mind, I’m going to introduce dataflow-based caching. I’m going to start off with a demo, so you can get a sense for what it actually looks like and feels like, and we can go from there. For context, we are working with an IMDb dataset, which is a real dataset that’s available on the internet, that’s scraped from the IMDb website, the movie ratings website. We are going to try to cache a query whose natural language interpretation is: how many movies filmed after the year 2000 had over a 5-star rating on IMDb? Behind the scenes here we have a Postgres database in a Docker container, and we have a Readyset instance. Readyset is an example of a dataflow-based cache. Right now, we are going to connect to the cache, and we’re going to connect to it using psql. You can see this connection string here, says Readyset in it. I’m going to go ahead and do that.
Then from there, I’m going to turn on timing so that we can see how long queries take to run. The query that I just told you about, where we’re trying to see how many movies had a rating of over 5 after 2000 is right here. I’m going to start off by just running that against the underlying database. We can see that that took 173 milliseconds, and it returned the answer, 2418. There are 2418 movies that fit that criteria. You can see, I just typed the statement, show caches. There’s currently no caches that are set up in the system. We’re going to go ahead and create one by typing in, create cache from, and then I’m going to copy and paste this query string. We can see that the cache was created. Now we’re going to run the same query again, and the first time we run it, it’s going to be a cache miss.
The cache started off empty, we’re warming it up. We run it a few times, and then we can see, now we’re in business. Now it’s taking 0.75 milliseconds as opposed to 50. That’s fairly consistent. That’s pretty cool. Now what we’re going to do is essentially issue a write that should change the result that’s being returned. In this case, I’m going to say, there’s this one movie that I liked from 2003 that everyone else hated, so I think that one deserves a rating over 5. The public doesn’t agree, but I’m going to change it such that that’s the case.
Then that should presumably change the query result that we see. Let’s see if I can find that update. Right now, the rating is 2.5. I’m going to change it to 5.1. Now I’m going to run the query again. We can see, all of a sudden, the result has been refreshed. Before, it was 2418; now it’s 2419. Notably, the time that it took to get the state from the cache didn’t change. We didn’t just evict and recompute, because if we evicted and recomputed, it would take 50 milliseconds again. In this case, it’s just the new result, but it’s still a cache hit.
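For readers following along without the video, here is a rough reconstruction of the demo's steps, driven from Python with psycopg2 instead of psql; the connection string, table names, and column names are placeholders rather than the exact ones used in the recording:

```python
# Rough reconstruction of the demo flow; the Readyset DSN and the schema
# (titles/ratings, release_year, rating) are placeholders, not the real dataset's names.
import time
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@readyset-host:5433/imdb")
conn.autocommit = True
cur = conn.cursor()

QUERY = """
    SELECT count(*) FROM titles
    JOIN ratings ON ratings.title_id = titles.id
    WHERE titles.release_year > 2000 AND ratings.rating > 5
"""

def timed(sql):
    start = time.perf_counter()
    cur.execute(sql)
    print(f"{(time.perf_counter() - start) * 1000:.2f} ms -> {cur.fetchall()}")

timed(QUERY)                           # proxied to Postgres: ~170 ms in the demo
cur.execute("SHOW CACHES")             # no caches yet
cur.execute("CREATE CACHE FROM " + QUERY)
timed(QUERY)                           # first run after creation is a cache miss (warm-up)
timed(QUERY)                           # subsequent runs: sub-millisecond cache hits
```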
Dataflow-Based Caching: Interface
How does all of that work? That’s going to be the rest of the talk. Let’s start by talking through the interface, because that’s the most tangible part of what we just saw. We saw how to interact with this. Let’s make the decisions that we made there a tad more explicit. The first thing to know is this cache is set up as a read replica. When you start up Readyset, it is going to snapshot the tables in your database that are relevant for caching. Concretely, if you’re trying to cache two queries and they reference tables A, B, and C between them, but you have 500 tables in your database that aren’t A, B, and C, then it will only snapshot the tables that you actually need for caching, so A, B, and C. Then from there, it’s going to maintain a replication slot on your primary database so that it can get the row changes that are being replicated from your database to automatically and incrementally update the cache state.
The specific way that this works will vary depending on the database. Readyset in particular supports Postgres and MySQL. In Postgres, this is logical replication; in MySQL, it’s row-based replication. The next thing to know is that it’s wire compatible with Postgres and MySQL. That’s what we saw in the demo. I was using the psql client to connect to a Readyset connection string. The goal of doing that is to prevent you from having to make large changes to your application to start using a cache. This is an ergonomics related feature.
Of course, under the hood, Readyset does not yet support all of the features of the underlying database, it just looks like one at the surface level, and it has a subtly different SQL dialect. By and large, you shouldn’t have to worry about that, because for most of the things that you’re caching, which are going to be SQL-92 type queries, like relational algebra type computations, that shouldn’t be an issue. If in your application you’re already using SQLAlchemy, Django, or Rails, you shouldn’t have to change your application code in any meaningful way.
The next thing that I will point out, which you also saw in the demo, is that caches are specified explicitly. We have the DDL equivalents for cache management, so creation, deletion, and so forth. This wasn’t obvious from the demo, but specifically, we’re caching prepared statements, which you can think of as being parameterized SQL queries. If you have two queries that are structurally the same, they’re operating over the same data, they’re running the same computations, and the only difference is a literal, let’s say the user ID you’re asking for is 3 in one and 5 in the other, then those are actually considered to be the same query within Readyset, and it’s going to be handled by the same cache and the same dataflow graph, which I’ll be explaining in a later section.
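As a small illustration of that, reusing the psycopg2 cursor from the earlier demo sketch (the users table and the placeholder syntax are assumptions, not taken from the talk):

```python
# These two executions differ only in the literal, so Readyset treats them as the
# same parameterized query, served by the same cache and dataflow graph.
cur.execute("SELECT * FROM users WHERE id = %s", (3,))
cur.execute("SELECT * FROM users WHERE id = %s", (5,))

# Creating the cache once for the prepared statement covers every parameter value.
cur.execute("CREATE CACHE FROM SELECT * FROM users WHERE id = $1")
```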
The next thing and last thing is that, any queries that aren’t explicitly cached will be, by default, proxied to the primary database. The same goes for writes. Readyset never tries to apply a write that you send it within the cache first, it’ll just forward it along to your database and wait for that write to show up in the replication stream later on, and then from there, it will use that to incrementally update the cache state. The reason that we chose to do this is that it gives you fine-grained control over what’s cached or not, which is desirable, because sometimes you don’t want to cache something. Sometimes you’re running a bank transaction, and we know that we don’t want to cache bank transactions.
Sometimes it’s just not economical to cache because maybe you’re just not running that query so much, but to get a high cache hit rate, you have to use a lot of memory. Maybe you just don’t want to. At the same time, it’s also not always desirable to have to manage multiple read-write connection strings. If you already have the infrastructure set up for read replicas, and you have reads and writes split out, then you can totally adopt it in that way. If you don’t, then you don’t have to. You can still just use a singular database connection and have the functionality be unchanged. Your app will still work, but you can still have this fine-grained control over what’s cached or not.
Dataflow-Based Caching: Life of a Query
Now I’m going to walk through the life of a query. The way I’m going to do this is by starting off with a bird’s eye view of the architecture. I’ll quickly talk through what’s happening here. Then we’ll step through a bit more carefully all the different components for both setting up a cache and for actually serving traffic from the cache that we just set up. Here’s the architecture. Broadly speaking, there’s the application server up on top, which is essentially unchanged. It’s as though you’re not using a cache at all. You have the business logic, which is presumably utilizing an ORM. The ORM, in turn, is using a database client to actually connect to the database and issue the queries against it. In this case, instead of connecting to the relational database directly, it’s connecting to the Readyset adapter. Within the Readyset adapter, there are different components.
The first one that it speaks with is like the SQL shim, and it’s responsible for decoding from the binary protocol of that database into just the text that we need to essentially create an internal representation of that query. Then once we have that internal representation, the Readyset client is going to be deciding whether or not that should be sent to the primary database or if it should be sent to the Readyset server. If it is sent to the Readyset server, then the query will be resolved within a component called the reader.
Then, behind the scenes of all of this, your database is getting writes. Those data changes are being replicated to the Readyset server and being handled by a component called the replicator, which is updating the local copy of the base tables that I mentioned that we snapshotted before. Then emitting those row changes through the dataflow graph, which will be ultimately responsible for keeping the cache state in the readers up to date.
I’m going to talk through that in greater depth. We’ll start by describing how we set up the cache. At this point in time, I’m assuming that we’ve already created a Readyset cluster or instance, and it gave us the database connection string that we can use. The first thing that we want to do in our application is swap out the database URL that was previously pointing at your primary database to now point to this Readyset instance that is connected to your primary database. Again, this is describing the approach of using Readyset as a full proxy as opposed to a read replica. Both are options. It’s just a matter of what works best in your setting. The specific place in the code that you’re going to actually change this URL is going to depend on the programming language you use, the framework you use, but typically it is just swapping a connection string.
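In a SQLAlchemy-based application, for example, that change might be as small as this (hostnames, ports, and credentials are placeholders):

```python
from sqlalchemy import create_engine

# Before: the ORM connects straight to the primary database.
# engine = create_engine("postgresql://app:secret@postgres-primary:5432/app")

# After: the ORM connects to the Readyset adapter, which proxies anything it doesn't cache.
engine = create_engine("postgresql://app:secret@readyset-host:5433/app")
```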
From there, you as developer need to decide which queries you want to cache. There are a lot of different ways that you can imagine doing this, either by looking at your database’s slow query logs. If you’re using an APM, like Datadog, or whatever, you can check that as well. In Readyset itself, we offer some basic heuristics to help you decide as well, so things like total CPU time that that query is taking up in your database, the count, how bad the tail latencies are, that sort of thing. You have to take a look at that, and then from there decide which queries you want to cache. Once you have a list in mind, you have to go ahead and actually create those caches. You do that via the DDL-esque SQL extensions that I showed you before, like the create cache from, and then the query string or query ID. Then that will trigger a migration within Readyset to construct the dataflow graph, which is going to be responsible for keeping the cache up to date.
That process of creating the cache, it’s not going to be instantaneous. It could be really fast. It could take a few minutes, depending on how big your tables are, what indices we have to create behind the scenes, and so forth. It will let you know once the cache is ready to start serving traffic. Now let’s talk through the actual life of a query part of this. Let’s say a user requests some data, which translates to a query being generated in the application server, or in-app logic.
Again, the application is connected directly to the Readyset adapter. In this context, we’re just continuing to use the ORM and proxy everything through Readyset. That’s going to get sent to first the SQL shim, which is going to be responsible for decoding the binary representation and then converting that into this generic internal representation of the query that is the same across all database variants within Readyset. Think of this as the relational algebra representation of the query. Then that internal representation is going to be passed to a component called the Readyset client. From there, the Readyset client is going to be pattern matching this string or representation against the caches that it knows are stored and are created in Readyset server. If there’s a match, and remember, we’re doing this at the level of prepared statements, so we’re ignoring any differences in literals. We’re just looking at the queries in a structural way.
Then that’s going to be sent to the Readyset server, to the reader nodes, and if not, it’s going to be proxied to the database. The database will compute the result and return it to the application. Then, in the background, the cache is continuously being refreshed in an eager way. Concretely, let’s say you’re sending writes to your database, those are being reflected in the replication stream. The replication stream is being sent to the component within the Readyset server called the replicator, and it’s going to receive those data changes. It’s going to update the local copy of the base tables to reflect those changes. Then it’s going to propagate those changes through the dataflow graph. By doing that, the dataflow graph is able to compute an incremental update over the old cache state to reflect this new data change. Then it’s going to swap out the old version in the reader, which is the cache, for all intents and purposes, with this new, updated version.
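Putting the read side of that together, here is a conceptual sketch of the routing decision in Python. It is not Readyset's actual implementation (which is written in Rust), and in the real system a miss on a cached query is filled by an upquery through the dataflow graph rather than a proxied query; the normalize helper is only a stand-in for the SQL shim's internal representation:

```python
import re

created_caches = {}   # normalized query text -> reader node (a dict from params to cached rows)

def normalize(query):
    # Stand-in for the SQL shim: ignore literals so structurally identical
    # queries map to the same prepared statement.
    return re.sub(r"\b\d+\b|'[^']*'", "?", query.strip().lower())

def handle_read(query, params, proxy_to_db):
    reader = created_caches.get(normalize(query))
    if reader is None:
        # No cache was created for this query: proxy it to the primary database.
        return proxy_to_db(query, params)
    if params not in reader:
        # Miss on a cached query: fill the missing entry (Readyset does this with
        # an upquery through the dataflow graph, not by querying the primary).
        reader[params] = proxy_to_db(query, params)
    return reader[params]     # hit: an O(1) in-memory key-value lookup
```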
Dataflow-Based Caching: Solving Staleness with Dataflow
In this walkthrough, dataflow was a black box. I’m going to spend the rest of the time talking through how this fancy incremental update mechanism actually works, because, in my mind, that’s the cool part. Concretely, as I just alluded to, we’re going to be figuring out how to solve cache staleness issues in an automated way using this mechanism. I’ll start off by just explaining what dataflow computing is. Dataflow is just generally a very overloaded term in computer science, but the version that I’m referring to is dataflow computing, which is also known as stream processing in a lot of cases.
The main idea is that you can represent computations as directed graphs where the nodes of the graph are considered to be operators that perform those computations over incoming data. Then data is flowing through the edges into these operators. I have this little figure here, it’s an arithmetic example. Let’s say that our goal was to compute x times y plus z. What this would look like is, you have this graph where the nodes of the graph are the computations. There’s the multiplication operator, and then there’s the addition operator. Then the inputs of the graph are the data that we’re inputting, so x, y, and z. You can see that x and y are inputs into the multiplication operator. They’ll go to the multiplication operator. The multiplication operator will multiply those two together, and then it’ll emit the result x times y out of its outgoing edge, which is leading into the addition operator.
Then the addition operator will add x times y, which it got from its parent node, to z, which is the other input into the graph, and then emit the result. One way that I like to explain this idea is in contrast to batch processing. Where, typically, when at least I think about running a computation over a dataset, essentially, you have a dataset, and then you run some compute over it.
Presumably, you’re ingesting data for some time period. The data is accumulating. Then at some point in time you’re like, let’s run this computation. There might be some delay between when the last data point was updated, or the delay between the most recent version of the dataset and the response that you’re giving. If you contrast that to stream processing, in stream processing or dataflow computing, you are continuously ingesting data, and you’re continuously running some, in many cases, incremental computation over that data to compute a response that is eagerly kept up to date.
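To make the x times y plus z example concrete, here is a tiny sketch with each operator written as a plain Python function and data flowing along the edges as return values:

```python
# Nodes of the graph are operators; data flows in along the incoming edges
# and the result is emitted on the outgoing edge.
def multiply_node(x, y):
    return x * y                      # emits x*y toward the addition operator

def add_node(product, z):
    return product + z                # combines its two incoming edges

def run_graph(x, y, z):
    # (x, y) -> multiply -> (x*y, z) -> add -> result
    return add_node(multiply_node(x, y), z)

print(run_graph(2, 3, 4))             # 10
```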
Now let’s talk about how we can represent SQL queries as dataflow, to bring this back to the domain that we were just talking about. In this case, the nodes of the graph which are again representing the computations, are just relational algebra operators. To name a few, like SELECTs, PROJECTs, JOINs, and so forth, it almost looks like a query plan, if you’ve seen those. Then, the data that’s flowing through the edges are these row updates coming from your primary database through the replication stream. Those updates are propagating through the graph. The relational operators are computing the changes over those and emitting the results.
By the time you have the final output, that’s the query result that we are looking for. I’m going to make this more explicit with an example. I already talked a little bit about Hacker News. It’s a forum for developers. People will post things, and the most upvoted ones make it to the top page. Let’s say that, just for the sake of this example, we’re only dealing with two tables. There’s a stories table, which has the story ID, the author of the story, the title, and URL.
Then there’s the votes table, which has just a mapping of which users voted for which story IDs. Let’s say, in an attempt to write a query that generates the content on the front page, we’ll have this, and the natural language interpretation of this is essentially return the story content and metadata for that story, along with the vote count for the story with ID x. It’s just doing a join and joining the story info with the vote count for that story, and it’s parameterized on the story ID, so it’s a prepared statement. The way that we would represent this as dataflow would look something like this, where you have the base tables which are stored locally on disk.
Then, in the actual dataflow graph itself, you have a count group by node which is receiving input from the votes base table. Then there’s the stories base table node which is feeding into the join node. You see the join node has another input coming in from the count group by. Finally, that’s storing the cache results in the reader.
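The exact SQL from the slide isn't in the transcript, but a prepared statement matching that description, story content plus vote count, parameterized on the story ID, might look like this; the column names are assumptions:

```python
# Hypothetical reconstruction of the front-page query described above.
STORY_WITH_VOTE_COUNT = """
    SELECT stories.id, stories.author, stories.title, stories.url,
           count(votes.user_id) AS vote_count
    FROM stories
    JOIN votes ON votes.story_id = stories.id
    WHERE stories.id = %s
    GROUP BY stories.id, stories.author, stories.title, stories.url
"""

# Dataflow shape described above:
#   votes (base table)   -> count group-by --\
#   stories (base table) -------------------- join -> reader (cached results)
```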
Let’s talk through how writes look in this world to make this a little bit more dynamic and complete. Let’s say that Justin upvotes story with ID 2. That’s going to be reflected as a row insert into the votes base table. Justin votes for story 2, that’s going to propagate through the dataflow graph. As you can see, the outgoing edge from this base table is into the count node. The count node is going to receive that update, and it’s going to take the prior vote count, which was 580, and increment it by 1 to reflect the fact that there’s now one more person that voted for this story. Then it’s going to emit that new vote count into the join node. The join node is going to be like, I have this new vote count. Let me update the result of the join for story ID 2 to have this new vote count, 581.
Then, finally, let’s make that available for consumption by users and update it in their reader node, which is where we’re serving all of our traffic from. Now to walk through a read. Let’s say that now we’re trying to get all the story data and vote count for story with ID 2. That’s going to just look like the query that we were just looking at before. Now, instead of being parameterized, we’re passing in the parameter, 2, and executing the prepared statement. To perform this read, all you have to do is a key-value lookup for ID 2 in this reader node, which is just an in-memory data structure that’s storing all of the cached query results. It’s pretty on par with normal caches in that regard.
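Here is a toy sketch of that write path and read path in Python, with plain dicts standing in for the count node, the join output, and the reader; the story metadata is made up, and the vote counts follow the example:

```python
# Toy model of the incremental update walked through above (Justin upvotes story 2).
vote_counts = {2: 580}                                   # state held by the count node
stories = {2: {"author": "alice", "title": "Show HN"}}   # stories base table (illustrative)
reader = {2: {**stories[2], "vote_count": 580}}          # cached join result served to reads

def on_vote_insert(story_id):
    # Count node: increment the old count instead of recounting every vote row.
    vote_counts[story_id] = vote_counts.get(story_id, 0) + 1
    # Join node: patch the existing joined row with the new count.
    updated = {**stories[story_id], "vote_count": vote_counts[story_id]}
    # Reader: swap in the new result; reads keep being cache hits throughout.
    reader[story_id] = updated

on_vote_insert(2)
print(reader[2]["vote_count"])   # 581 -- no eviction, no recomputation from scratch
```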
Let’s talk about how efficient this approach is as currently described, because that’s going to be really important. I’ll start by comparing read efficiency. I just alluded to this. Essentially, in both types of caches, whether it’s the one where you just have an in-memory key-value store and you’re evicting and recomputing and all of that, or this version, the dataflow-based cache, all you’re doing is this very fast lookup into an in-memory, lock-free data structure. You get the same performance, slight code differences aside, in both contexts. Now let’s compare cache update efficiency. Again, with “traditional caches”, like this in-memory, key-value store-based model where you’re running queries against the database, storing it in the cache, and so forth. Whenever you want to invalidate the cache and update it, you are going to evict and then recompute the missing state. This recomputation could be pretty expensive. It won’t necessarily be.
Presumably, the reason you’re introducing the cache is because you’re running this computation all the time, or perhaps it’s a complex query, so the fact that you have to invalidate by removing the state from the cache and rerun the query against your database could take quite a while, and that’s going to reduce your cache hit rate, because those are all going to be cache misses. With dataflow-based caches, things are different, because you aren’t ever evicting and recomputing to update the cache. You’re incrementally updating it, which tends to be a lot less compute intensive, because you’re not throwing away the old value entirely. You’re saying, I have the old value.
Then, some small amount of the data has changed since we last looked at the old value, so let’s just reuse that old computation and then just figure out how to update the old computation to reflect just these new data points, as opposed to starting from scratch. You’re doing this all within the context of the cache. You’re never running a query against your database to update the state and make it fresher, because we have the local copy of the base tables, all of that is just being resolved within the cache itself.
Now let’s talk about memory overhead. With traditional caches, the heuristic that people like to use is that you need to make sure that you have at least enough memory for your working set allocated to the cache. Because if you don’t, you’re going to have this thing called thrashing, where you’re continuously swapping items in and out of the cache, because the application needs them but there’s not enough space to hold all of them. The same dynamic is absolutely true with dataflow-based caches. You need to absolutely make sure you have enough memory allocated for the working set.
There’s this whole dataflow graph around that as well, because the working set, that’s just the reader nodes. We have all these other nodes that are also stored in memory, like the dataflow graph ones. That could actually be pretty heavy, because it’s going to depend, of course, on the base dataset sizes. It’ll depend on how complex the query is, because the more complex the query, the more graph you have. The more computations you’re going to be doing in the graph, and the more memory you’re going to have to allocate to it. It also has to do with the data distribution in this interesting way. Because some of the nodes are responsible for filtering out data, you could have instances where you have the largest base tables of all time, but you’re very quickly filtering away most of that data.
Really, there’s a question of like, how much are you filtering out? How heavy are any of these nodes going to be along the way? It begs the question, is this memory overhead practical? The answer is, not always, at least as described. This is the main obstacle, as I see it, to really using this in a real-world setting.
Now I’m going to talk about how we approach solving this memory overhead blowup problem in Readyset, at least. The way we do this is via a mechanism called partial state, also known as partial materialization. The main insight here is coming back to the beginning of this talk, like we have a skewed data access distribution. It’s not the case that everything is going to be equally as popular, and we have to make sure we’re preemptively putting everything in the cache.
Rather, it’s very likely that there’s going to be a relatively small subset of items that are really popular, really recent, or whatever. It’s distributed via Zipf’s Law, so it’s going to be highly skewed. You can get a really high cache hit rate just by caching the bulk of those, as opposed to caching the full long tail of potential query results. In the context of these prepared statements, think of it this way. You don’t have to preemptively compute the results of this query for any possible input parameter, like any possible story ID, because that would be a waste of time, because, as we discussed, most people don’t go on page 103 of Hacker News.
With partial materialization or state, the main idea is that when you first create this dataflow graph, it’s going to start off entirely empty, and you’re going to fill it in an entirely lazy way on-demand, via a mechanism called an upquery. The way upqueries work is that, again, cache is starting off empty. A user requests a value from the cache, it’s going to recursively traverse up the dataflow graph to find the closest node that has the missing state and then run it through the rest of the graph to perform the update.
I will just walk through a quick visual of this. Again, let’s say we’re running this query for story ID 2, the graph is starting off completely empty. The request is going to go into the reader node. The reader node is going to be like, “I don’t have it. Let’s ask my parent node”, which is the join node. The join node is also empty because we just got started. It’s going to be like, let me ask my two parent nodes, see if we can find the missing state. This is actually happening in parallel. In this diagram, it’s happening sequentially. It’s going to ask the count node, count node is also empty. The count node is going to ask the votes base table. The votes base table is, again, this local copy of the tables on disk. It’s going to necessarily have the state that we’re looking for, and it’s going to replay all of those row entries into the count node.
Concretely, all of the rows that indicated somebody voted for story 2, that’s going to be counted up by the count node, and it’s going to then emit that to the join node. The join node had asked its stories base table parent to pull the story info for story 2, and it’s going to join the results together and store it in the reader, and then return a response. The way we deal with memory pressure is the same as with normal caches: once we hit some memory threshold, maybe just the memory on the machine, we’ll evict state based on LRU, LFU, whatever. It’s pretty standard.
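Here is a toy sketch of partial state and upqueries, again with dicts standing in for the nodes; the data is illustrative:

```python
# The reader and count node start empty and are filled lazily, per key, by
# upqueries that bottom out at the base tables kept on disk.
votes_base = {2: ["justin", "alice", "bob"]}             # local copy of the votes table
stories_base = {2: {"title": "Show HN", "url": "..."}}   # local copy of the stories table

count_node = {}    # partially materialized: empty until someone asks
reader = {}        # the cache itself, also empty at first

def upquery_count(story_id):
    if story_id not in count_node:
        # Missing state: replay the relevant rows from the base table and count them.
        count_node[story_id] = len(votes_base.get(story_id, []))
    return count_node[story_id]

def read(story_id):
    if story_id not in reader:
        # Cache miss: recursively fill from the parents (count node and stories table).
        reader[story_id] = {**stories_base[story_id], "vote_count": upquery_count(story_id)}
    return reader[story_id]

print(read(2))   # first read fills the state; later reads are plain key-value lookups
```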
Dataflow-Based Caching: Comparison to Alternatives
If I think through dataflow-based caching relative to traditional caching, you can avoid a lot of the usability challenges because it’s wire compatible with the database. It has the same interface. You don’t have to rewrite the app. You don’t have to worry as much about staleness, because it’s eagerly updating the cache entries behind the scenes. Then, the cache is never going to make anything worse, because we have a local copy of the base tables. We use that to deal with misses. You’re not going to unintentionally introduce a distributed systems bug. The important thing to note is that this is effectively expanding the space of supported workloads. Because you’re not doing this evict and recompute, you can tolerate comparatively write-heavy workloads.
If you had a lot of writes in a normal caching setting, then you would have to continuously evict, and then you never get a high cache hit rate. Here, because we’re just updating the cache directly in-memory, and it’s not a cache miss, then you can tolerate a much larger number of writes while still getting a good cache hit rate and having the cache be worth it. Of course, you can more easily cache complex queries without having to track data provenance when it comes to doing invalidation. When I compare this to query optimization, obviously the barrier to entry is lower because you don’t need to know anything about databases to be able to do this.
Sometimes, even if you optimize a query to the max extent possible, it’s still too slow, so caching still has a role there. When I compare this to read replicas, you can essentially set up this type of cache to have the same interface as a read replica, but you don’t have to optimize queries, and because you’re doing caching, you just get lower latencies.
Conclusion, and Resources
With dataflow-based caching, you can get on par performance to hand-optimized homegrown caching systems. You don’t have to design a cache invalidation policy. That’s all handled for you. It’s useful in a wider range of settings in the database world, for example, when you have more writes. We have a GitHub. If you want to see the source code, it’s available there, just look up Readyset. Then this is based on the Noria research project from MIT, so if you want to geek out further, then there’s a whole paper on it that you can take a look at. If you just Google that, you can find it.
Questions and Answers
Participant 1: Do you have read-after-write consistency?
Marzoev: We do not. It’s actually technically possible in this model, but we haven’t productionized it. By default, it’s eventually consistent. The way that I like to think about this is, when you deal with normal caches, you have no guarantees whatsoever, and you don’t expect any guarantees. We like to compare ourselves to the best alternatives, which are regular caches, as opposed to databases. Yes, CAP theorem.
Participant 2: How big a table can it handle in memory? For example, there are lots of tables we are querying. If my query involves lots of large tables, and it is going to replicate those tables into memory, would it be efficient to keep copies of those tables in memory?
Marzoev: If we have really large base tables, will it be efficient to store them in memory? We never store the full base tables in memory. We store them on disk in RocksDB. The only thing that’s in memory is the computational nodes. It’s possible that you might unintentionally bring those into memory if you’re running a really hefty join, but that’s what the whole partial materialization scheme is meant to defend against, where you don’t have to materialize the whole join, you just materialize the result for queries that you’re asking for.
Participant 2: What about isolation, when reads and writes happen at the same time on the tables, how is that being handled? There could be lots of cases where a dirty read would be possible, in some scenario where one user is updating the data and another one is reading the data.
Marzoev: Is there isolation between reads and writes? Yes. Essentially, the way that works is like, we are waiting to see the writes come in from your primary database, so they’re already ordered. The reader node is separate from the dataflow graph, because we want to be able to both continuously write updates to it, while supporting reading this eventually consistent state. Then there’s a point in time where we just do a pointer swap and we’re like, we’ve written to the new version of the cache for a while, so let’s flip that so it’s what we serve to the user. They’re completely separate, like threads.
Participant 2: There is a possibility of a dirty read in between, then.
Marzoev: It depends on how you define that dirty read.
Participant 3: Can you share a little bit about expanding support to other databases, beyond Postgres and MySQL?
Marzoev: There’s nothing fundamental about this that makes it specific to Postgres or MySQL. Obviously, there’s a big assumption we’re making that we’re dealing with relational datasets. To support a new database, we just need to write a new replicator for that database and then write a new SQL shim for that database. It’s possible. It’s just engineering work. We’re a startup so we are probably not going to expand past these two for quite some time.