Transcript
Liu: My name is Wenshuo. I’m from Squarespace. We do web hosting, and we help people build their online presence, whether they want to sell online, or they want to build a portfolio. Besides that, we also probably sponsor your favorite podcast. We are going to talk about a common problem and an ancient idea, how we built our asset library. I want to explain the title first. The common problem here is we often have a very large amount of data, but at any given moment only a tiny portion is actively being used. How do we store everything durably and cost effectively and also serve that tiny portion of active data really fast? The ancient idea is cache.
Cache is a common solution for our common problem. It is a very old idea. In software engineering, not a lot of ancient ideas are still relevant, but cache is, and that's because cache addresses the problem that storage technologies are not cheap, fast, and durable at the same time; we often have to pick two. This problem will still be relevant for a while, so cache is a common solution that has really stood the test of time. We're going to talk about what the asset library is, what Alexandria is, the problems we had, how we solved them, and we'll look at some results.
What is Asset Library?
Since we do web hosting and we help people build their online presence, media assets like images and videos are very important for our customers to show what they have to offer to their customers. The process of adding and iterating on media assets should be easy and fast. For example, if someone wants to reuse an image on their website, they shouldn't have to upload it again. We provide an asset library for our customers to manage their media assets. Here's a little video. You can see someone is uploading an image to the asset library, searching for assets that have 2024 in their name, and adding the assets to a folder named 2024.
Now you can click into the folder and see your 2024 assets, and when we want to use one of them, we can go to a web page, add an image section, which comes with a demo gallery, and replace the demo images with something from our asset library. Now you have your image on your website. As you can imagine, we have a very large amount of library data, but the library data is only accessed when our users are editing the site. When a visitor goes to a Squarespace website and their browser loads media assets, that's served by our origin servers and CDN. The asset library is not involved there. At any given moment, the amount of active library data is a very tiny portion compared to all the library data we have to store. That is the common problem part of this talk. We want to store all the library data cost effectively and durably, and we also want to serve this tiny portion of active libraries very fast.
What is Alexandria?
Behind the scenes, the asset library is backed by a service called Alexandria. We named it after the ancient Egyptian library. This talk is about ancient stuff, so we're taking a break from the shiny, new, and modern. That Alexandria stood around for hundreds of years and went down in a big fire. This is also what we hope for our Alexandria, a very long life and an ending that's not boring at all. Inside Alexandria, we have two important data models. The first one is the library header. The library header has library level information, like library ID, owner, folder structures, and a protection level to indicate whether it's a public library or a private one.
For every asset our user uploads to their asset library, Alexandria creates an asset record for it. The asset record has asset level information like ID, file name, file size, color information for images, or format for videos. Everything is stored on Google Cloud Storage, or GCS. It is a service for storing objects in Google Cloud. A particular library can have three types of GCS objects. The library header is stored as its own object. Asset records are stored in segment files. One library can have many segment files, and each segment file can hold many asset records. We'll come back to that later. The trash can has assets that have been deleted by the user but are still within the retention period and can be restored back to the library. It's also a collection of asset records, so it's quite similar to segment files.
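To make the data model concrete, here is a minimal sketch of the two records as Python dataclasses. Only the fields mentioned in the talk are taken from it; the field names, types, and defaults are illustrative assumptions, not the actual Alexandria schema.

```python
from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    # Asset-level information; stored inside segment files (and the trash can) on GCS.
    asset_id: str
    file_name: str
    file_size: int
    color_info: str | None = None    # images only
    video_format: str | None = None  # videos only

@dataclass
class LibraryHeader:
    # Library-level information; stored as its own GCS object.
    library_id: str
    owner: str
    folders: dict[str, list[str]] = field(default_factory=dict)  # folder name -> asset IDs
    protection_level: str = "private"                            # public or private
```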
At this point, you might be wondering, why GCS? Why not a database? We did explore many database options, and we decided against them for one of three reasons. The first is scalability. The data we need to store is not only a large amount, it’s also constantly growing, because people are always coming to Squarespace, creating a website, uploading media assets. We don’t want to worry about scale every so often, we prefer something that provides near infinite scalability. This requirement ruled out databases like Cloud SQL, which have hard limits on storage. The second one is cost. Again, it’s a lot of data.
If we were going to put all of it in a database like Cloud Spanner, depending on the configuration, the cost would be about 5 to 10 times higher. The third reason is strong global consistency. Consistency is unfortunately a terribly overloaded term in software engineering. It can mean completely different things in different contexts. Even in the same context, it can still be used differently. For example, depending on what documentation you're reading, strong consistency and external consistency can be equivalent, or external consistency can be a higher level of consistency guarantee.
I want to spend a couple minutes clarifying in our context what we talk about when we talk about consistency. Distributed systems are hard. First, there are multiple nodes, machines, processes, whatever you call it, that might lead to concurrency issues. Operations take time. Multiple operations might overlap. We call two operations concurrent, if there is some time during which they’re both executing. Concurrency is hard to reason about. Bugs are hard to test and reproduce, because it only happens when you get unlucky with timing. We also have multiple copies of data that we call replicas. They can be out of sync. Depending on which replica you read from you may be reading stale data.
There are a variety of consistency models that provide different levels of guarantees around these two complications. You see big mouthful words like serializability, linearizability, but at the highest level of consistency guarantee a system works as if a single process is working on a single copy of data. It’s a single process, so we don’t have to worry about concurrency and race conditions. It’s a single copy of data, so we don’t have to worry about reading from a stale replica. For the weaker consistency guarantees, they are either relaxing the single process part or the single copy of data part, or both.
For example, you often see that something is eventually consistent. It means that when an update is made in the distributed system, eventually all nodes will reflect that update. That does not work like a single copy of data. For us, we care about consistency, because if a user uploads an image but they don't see it in their asset library, they refresh a few times and it shows up, and they refresh again and it's gone, that's just not a very good experience. This ruled out options like Bigtable, which only guarantees eventual consistency. But there's always a catch: GCS is object storage. It's good at what it's supposed to do, but it's not designed to be fast. With GCS, we solved the problem of storing that large amount of data durably and cost effectively, and now we can move on to serving that tiny portion of active libraries fast.
The first thing we did was cache. Every Alexandria instance comes with an in-memory cache. When a request comes in, it first checks the in-memory cache; if the library is not there, it goes out to GCS and loads it into the cache, so later requests for the same library do not have to make that trip again and will be much faster. To better utilize the cache, we also introduced consistent hashing at the load balancer layer. Consistent hashing is a distributed hashing technique. We use it to split data or distribute requests across multiple servers. When we have multiple servers and are given a piece of data or a request, we need to decide where it goes. The simplest way is to calculate some hash key and take the hash key modulo the number of servers.
The problem is, when the number of servers changes, for example, a server is removed, all key locations change. That's the problem consistent hashing is trying to solve. Instead of using modulo, it places the hash key on a virtual ring. It also places servers on the ring by randomly assigning them angles. Now if a server is removed, only the keys that fall in that server's section of the ring need to be reassigned; everything else is untouched. Here's an example. We have four servers on the ring. Three requests come in and are also placed on the ring.
According to their positions on the ring, you can see request 1 goes to server 1, request 2 goes to server 3, and request 3 goes to server 4. Now server 4 is removed, only request 3 needs to reassign, request 1 and request 2 stay the same. In our use case, we use consistent hashing at our load balancer to route all requests for the same library to the same Alexandria instance. In our context, the hash key is the library ID, and request 1 will be all the requests for library 1, and request 2 will be all the requests for library 2, and request 3 is all the requests for library 3.
We do this for two reasons. First, it's faster, because when a library is active, if its requests can get routed to multiple servers, then each server, on its first request, will need to go to GCS and load it. Second, if multiple servers could be updating the same library, then for a given Alexandria instance, since it doesn't know what other instances have done, its local copy of the library could be out of date. For every request, it would need to go out to GCS to see if its local copy is the most recent, and that would simply make the in-memory cache useless.
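To make the routing concrete, here is a minimal, illustrative sketch of consistent hashing keyed by library ID. The class and function names are hypothetical, not Squarespace's actual load balancer code; a production ring would typically also use virtual nodes for smoother balancing.

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    # A stable hash gives each key and each server an "angle" on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_position(s), s) for s in servers)

    def route(self, library_id: str) -> str:
        # Walk clockwise from the key's position to the next server on the ring.
        point = ring_position(library_id)
        index = bisect.bisect(self.ring, (point,)) % len(self.ring)
        return self.ring[index][1]

    def remove(self, server: str) -> None:
        # Only keys that mapped to the removed server get reassigned.
        self.ring = [(pos, s) for pos, s in self.ring if s != server]

ring = ConsistentHashRing(["alexandria-1", "alexandria-2", "alexandria-3", "alexandria-4"])
print(ring.route("library-1"))   # every request for library-1 lands on the same instance
ring.remove("alexandria-4")
print(ring.route("library-1"))   # unchanged unless library-1 was on alexandria-4
```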
The third piece is a Postgres sidecar. Sidecar containers are secondary containers that run along with the main application container within the same pod. Every Alexandria instance is deployed with a Postgres sidecar to build indexes for active libraries. Earlier we saw in the demo that people can search in their library and they can also sort their library. We build the index so these features will be fast, but we don't need to keep the index around: once the library becomes inactive and gets evicted from the in-memory cache, the index also gets deleted.
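As a rough illustration of the throwaway index idea, here is a sketch that builds a per-library table in the sidecar Postgres and drops it on eviction. The table layout, column names, and use of psycopg2 are assumptions for this sketch, not the actual Alexandria schema.

```python
import psycopg2  # assumed client for the sketch; the real service may use a different driver

def index_library(conn, library_id: str, asset_records: list[dict]) -> None:
    # Build a disposable, per-library table so search and sort are fast
    # while the library is active in this instance's cache.
    table = f"assets_{library_id}"
    with conn, conn.cursor() as cur:
        cur.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                    "(asset_id TEXT PRIMARY KEY, file_name TEXT, file_size BIGINT)")
        cur.execute(f"CREATE INDEX IF NOT EXISTS idx_{table}_name ON {table} (file_name)")
        for record in asset_records:
            cur.execute(f"INSERT INTO {table} VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
                        (record["asset_id"], record["file_name"], record["file_size"]))

def drop_library_index(conn, library_id: str) -> None:
    # The index is throwaway: delete it when the library leaves the in-memory cache.
    with conn, conn.cursor() as cur:
        cur.execute(f"DROP TABLE IF EXISTS assets_{library_id}")
```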
Earlier, I mentioned two shapes of our data. There is a large amount of data, and at any given moment, only a tiny portion is active. Here I want to point out a third shape of our data. Each library is logically its own database. All transactions are scoped within a library, and indexing is only necessary within one library. We never have to cross the boundary of a library. This characteristic is the reason that we can build this throwaway index. At this point, with the in-memory cache, the consistent hashing at the load balancer layer, and the Postgres sidecar index, we have pretty good read performance, and we moved on to write performance.
Problems
Earlier, we mentioned that each library has three types of GCS objects, a library header, one or more segment files, and the trash can. In its original state, every write operation in Alexandria needed to update one or more of the GCS objects synchronously. For example, when the user deletes an asset, first we delete the asset record from its segment file, and then we need to add it to the trash can. Alexandria can only confirm that the delete was successful after everything was persisted in GCS. The performance of both writes to GCS was reflected in the response time the user saw for the delete operation. The first problem we had was with large libraries. Earlier, we mentioned that each media asset has an asset record, and asset records are written to segment files on GCS. Some of our libraries are really large.
For example, libraries on photographer studio sites or Instagram influencer sites, they can have hundreds of thousands of media assets in the library. If we write all the asset records in one segment file, it could be hundreds of megabytes. Having the latency of writing hundreds of megabytes to GCS as part of user experience latency simply will not work. Most requests will just time out. We started having multi-segment libraries to parallelize the writing. We put a hard limit on the number of asset records each segment can have. We always start with one segment. Once it reaches the size limit, it will split into two. Segments are organized as a prefix tree.
For a certain asset, we take the binary representation of the asset ID and search down a prefix tree to determine which segment it belongs to. Of all the fancier data structures that I learned in college, this is the only time I got to use one. For example, this particular library has four splits. First it split into segment 0 and segment 1. Segment 0 split into 00 and 01. Segment 1 split into 10 and 11. Then, segment 10 split into 100 and 101. Now an asset record is created, and its asset ID is asset1. We take asset1 and get its binary representation. It starts with 01001101, and we use that to go down the prefix tree. It should go into segment 01. This limits each segment to a very reasonable size. Once we parallelized the writes, the latency was acceptable again. We chose to implement multi-segments this way for two reasons.
First, every time a segment splits, about half of the assets go into one segment and the other half go into the other. This prefix tree is always well balanced without extra effort. That's because we use the binary representation of asset IDs. The 0s and 1s at a certain bit are pretty uniformly distributed. Second, we don't actually have to keep a tree structure anywhere. All we have is the little segments, and the segments are all named with a bunch of 0s and 1s. For any given asset, if there are multiple segments, there's only one whose name is a prefix of the asset's binary representation. It's pretty easy to determine which segment any given asset belongs to. The code is pretty simple and maintainable as well. There's actually no tree stuff in the code at all. It's conceptually a prefix tree.
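Here is a minimal sketch of that lookup, assuming segments are named by their bit prefixes and that asset IDs are turned into bit strings by hashing. The hashing scheme and function names are assumptions for illustration, not the actual Alexandria code.

```python
import hashlib

def bits_of(asset_id: str, length: int = 64) -> str:
    # A stable binary representation of the asset ID (assumed hashing scheme).
    digest = int(hashlib.md5(asset_id.encode()).hexdigest(), 16)
    return format(digest, "0128b")[:length]

def segment_for(asset_id: str, segment_names: set[str]) -> str:
    # Segments are named by bit prefixes ("00", "01", "100", ...). Exactly one
    # segment name is a prefix of the asset's bit string, so there is no tree
    # structure in the code, just a prefix match.
    bits = bits_of(asset_id)
    for name in segment_names:
        if bits.startswith(name):
            return name
    raise ValueError("no segment covers this asset")

def split(segment_names: set[str], full_segment: str) -> set[str]:
    # When a segment hits its size limit, it splits into two children; roughly
    # half the assets land on each side because the bits are uniformly distributed.
    return (segment_names - {full_segment}) | {full_segment + "0", full_segment + "1"}

# The library from the example: four splits produce these five segments.
segments = {"00", "01", "100", "101", "11"}
print(segment_for("asset1", segments))  # exactly one prefix matches, e.g. "01"
```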
The second problem with GCS was that there's a write rate limit. A given object can only be written once per second. If we update a library and write its header or a segment, we cannot do that again within a second. This rate limit presents a problem for any rapid updates against one library, for example, when a user is doing a bulk upload or when we run migration jobs to import data into Alexandria. To work around this limitation, we implemented logic to combine multiple changes into a single write, a process sometimes called coalescing. This is a simplified request flow. A request comes in. If the library is not in the cache, we load it from GCS. Then we update the library in memory, add the update to a batch, and check if the library has been written to GCS within the last second. If the answer is yes, we wait until the end of that second. If no, we can write it to GCS.
During the wait, if there are other updates coming in for the same library, those updates will be batched together as well. The coalescing logic solved the rate limiting problem, but it was difficult to read and test. We did a lot of load testing to make sure it worked. Bugs were very subtle, and they were hard to replicate outside those load tests. Once it worked, nobody wanted to touch it anymore. It made changes to this part of the code very risky and time consuming. The comment on that part of the code was, "MAGIC, DON'T TOUCH".
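For illustration only, here is a heavily simplified sketch of what per-library coalescing can look like. The class, names, and locking scheme are hypothetical; the production logic handled far more edge cases, which is exactly what made it hard to maintain.

```python
import threading
import time

WRITE_INTERVAL = 1.0  # GCS allows roughly one write per second per object

class CoalescingWriter:
    """Coalesce rapid updates to one library into at most one GCS write per second."""

    def __init__(self, gcs_write):
        self.gcs_write = gcs_write   # callable that persists a batch of changes to GCS
        self.lock = threading.Lock()
        self.last_write = {}         # library_id -> time of the last GCS write
        self.pending = {}            # library_id -> (batched changes, "written" event)

    def update(self, library_id, change):
        with self.lock:
            batch, written = self.pending.setdefault(library_id, ([], threading.Event()))
            batch.append(change)
            leader = len(batch) == 1  # the first change in a batch does the writing
            wait = self.last_write.get(library_id, 0.0) + WRITE_INTERVAL - time.time()

        if not leader:
            written.wait()            # our change rides along with the leader's write
            return
        if wait > 0:
            time.sleep(wait)          # changes arriving during the wait join the batch
        with self.lock:
            batch, written = self.pending.pop(library_id)
            self.last_write[library_id] = time.time()
        self.gcs_write(library_id, batch)  # one write carries the whole batch
        written.set()                      # everyone in the batch can now respond
```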
The third issue was the long tail of the write latency. Most GCS writes are fast enough, but a very small percentage are not. Users occasionally had to wait a long time, or the request might even time out. This is a graph of our old p50 and p99. The blue line on the bottom is the p50, and the crazy green part is the p99. I want to point out that almost every distributed system has a p99 that's much worse than its p50, but we don't always care about it, and we don't always have to improve it. In our use case, we care about the p99 because it is affecting real users in front of a screen somewhere. This is a fast-paced world, and web hosting is a competitive area. One bad upload experience might be enough to lead a user to abandon their trial.
At this point, we have two problems. We have the latency long tail for users, and we have that complicated request coalescing logic for developers, and we decided to introduce a write-back cache to solve those two issues effectively. Before we go deeper into the write-back cache, I want to do a quick cache strategy refresher. These are, again, ancient ideas, but because of how old they are, we don't think about them very often. I think a quick refresher will make it easy to see how we made our decisions. For reading, the first common strategy is cache aside. The application is responsible for loading data into the cache.
When it needs a piece of data, it always checks the cache first, and on a cache miss, the application loads the data from the database into the cache. It's simple and flexible, but it needs some actual work to make sure the cache is up to date. The second common reading strategy is read through. The cache sits between the application and the database. On a cache miss, the cache is responsible for retrieving the data. It simplifies the application, since the cache handles loading data, but it often requires the data type in the database and in the cache to be the same. The in-memory cache that we talked about earlier is a cache aside, because we don't simply put the GCS objects in the cache, we translate them into a data type that can be easily used.
For writing strategies, the first one is write through. It's very similar to read through. The cache sits between the database and the application. The application writes the data to the cache, and the cache immediately writes it back to the database. This strategy makes sure that the cache is always up to date. The downside is performance, because now your latency includes two writes. The last one is the write-back cache. Write-back cache means the application writes the data to the cache, but the cache doesn't immediately write it to the database. It writes after some delay. Writes to the write-back cache are usually a lot faster than writes to the long-term storage. Having a write-back cache layer can improve latency and throughput, because users don't have to wait on the slower writes, but it does introduce data consistency concerns. Since our goal was to remove the slower write from the user experience latency, we went with the write-back cache.
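A minimal sketch of the difference between the two writing strategies, assuming a backing store object with a write method; the class names are illustrative only.

```python
class WriteThroughCache:
    """Write to the cache and to the backing store in the same request."""

    def __init__(self, store):
        self.store, self.cache = store, {}

    def put(self, key, value):
        self.cache[key] = value
        self.store.write(key, value)   # the user's latency includes the slow write

class WriteBackCache:
    """Write to the cache now; persist to the backing store later."""

    def __init__(self, store):
        self.store, self.cache, self.dirty = store, {}, set()

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)            # the user returns without waiting on the store

    def flush(self):
        for key in list(self.dirty):
            self.store.write(key, self.cache[key])  # only the latest state is written
            self.dirty.discard(key)
```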
Solutions
Using a write-back cache means, essentially, we solved our write performance issue the same way we solved our read performance issue. We have a fast but expensive and small cache on top of our cheap and reliable long-term storage. We chose Cloud Spanner to implement our write-back cache. Cloud Spanner is a distributed relational database management and storage service developed by Google. We made this decision for a few reasons. Spanner is fast, and there's no write rate limit, so this would help us solve the latency issue and also let us get rid of our request coalescing logic. Spanner is also highly available. It guarantees up to 99.999% availability. The third reason is that it provides external consistency.
External consistency is what Cloud Spanner calls the highest level of consistency guarantee. As we talked about earlier, let's just say that it works as a single process working on a single copy of data. It is a single process, so we don't have to worry about concurrency and race conditions. It is a single copy of data, so we don't have to worry about reading stale data from an out of sync replica.
This is what we did. The thing that's crossed out is the request coalescing logic. On the right, the first change we introduced is, when we load a library from GCS, we check if it's also in the write-back cache, and if it is, if there are pending changes that haven't been flushed out, we apply the pending changes from the write-back cache on top of what we got from GCS and put that in the in-memory cache, so this library is up to date. If the target library is already in the in-memory cache, we also do that check. We see if there are any pending changes in the write-back cache that are not in the in-memory cache, because it's possible that, for example, during a deployment, requests for the same library might be routed to multiple instances. If there are any pending changes in the write-back cache that our local copy doesn't have, we apply those changes.
The third change is, we write updates to the write-back cache and respond to the user immediately, instead of adding library updates to a batch. The last change is, we introduced a flush service that periodically reads pending changes from the write-back cache, flushes them to GCS, and then deletes them from the cache.
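Putting the pieces together, here is a minimal sketch of that read and write path plus the flush service. Every class and method name is an assumption for illustration; the write-back cache stands in for Spanner, and the store interfaces are abstract.

```python
class Alexandria:
    """Simplified read/write path with a write-back cache (all names are illustrative)."""

    def __init__(self, gcs, write_back, in_memory):
        self.gcs = gcs                # durable but slower object storage
        self.write_back = write_back  # fast cache of pending, unflushed changes (Spanner)
        self.in_memory = in_memory    # this instance's cache of assembled libraries

    def load(self, library_id):
        library = self.in_memory.get(library_id)
        if library is None:
            library = self.gcs.read(library_id)
        # Apply any pending changes our local copy may not have seen yet.
        for change in self.write_back.pending(library_id):
            library.apply(change)
        self.in_memory[library_id] = library
        return library

    def update(self, library_id, change):
        library = self.load(library_id)
        library.apply(change)
        self.write_back.append(library_id, change)  # fast write, respond immediately

class FlushService:
    """Periodically persists pending changes to GCS and clears them from the cache."""

    def __init__(self, gcs, write_back):
        self.gcs, self.write_back = gcs, write_back

    def flush(self, library_id):
        changes = self.write_back.pending(library_id)
        if changes:
            self.gcs.write(library_id, changes)         # durable write off the user path
            self.write_back.clear(library_id, changes)  # only remove what was flushed
```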
The Results
To release this change safely, we created a canary fleet of Alexandria deployed with the write-back cache branch. A canary deployment is a way to roll out changes incrementally to a subset of users by splitting traffic between an established version and a new version. We wanted to do a canary deployment because, first, if something goes wrong, the blast radius will be small. Second, when something goes wrong, it's much easier to switch traffic back to the regular fleet than to redeploy the whole application. Shawna Martell talked about the Strangler Fig pattern, making incremental changes and making them easier to reverse. This is along the exact same lines. Once we set up the canary fleet, we first did load testing to make sure everything worked as expected.
Then we gradually moved production traffic to it. Both the canary fleet and the regular fleet sit behind the load balancer that we talked about when we were talking about consistent hashing. In the load balancer, we route the traffic based on library ID. For a given library, its requests either all go to the canary fleet or all go to the regular fleet. The following is a graph of the write endpoints' request latency as we were ramping up to 100%. The top graph is the p99, and you can see the p99 is about 10% of what it used to be once everything moved to the write-back cache version. The bottom line is the average latency, which is also about 30% of what it used to be. At this point, we had achieved our goals of introducing the write-back cache. We got rid of the request coalescing logic, so the developer experience is better. We also got rid of the long tail, so the user experience is also better.
Why Not?
I want to cover some why-nots that you might be wondering about. The first one is, why not a write-ahead log? A write-ahead log is a very similar technique to a write-back cache. A write-ahead log is an append-only file of records. It's a history of all updates, while a write-back cache only has the latest state. For example, consider a user updating an asset record twice. With a write-ahead log, you would have two records, one for each update, but with a write-back cache, you only have the end state; the second update overrides the first update. We decided to use a write-back cache because for whatever we need to read or write, for example, when we need to read pending changes from the write-back cache to apply on top of a library, or when we flush changes back to GCS, we only need the end state. We don't really care about what happens in between.
Another thing you may be wondering is, why not a queue? Asynchronously committing changes might sound like a classic queue problem, but in our use case, each library is logically its own database. Transactions are scoped per library. We always read and write updates for a single library at a time. When we load a library from GCS, we want to apply any unflushed updates. It's not ideal if we have to scan a queue with updates for every other library as well, or we'd need a queue for each library, and that'd be millions of queues, which is also not practical.
Mistakes and Lessons Learned
Sounds good so far. What went wrong? If we know one thing about engineering in real life, it's that something is going to go wrong. Now let's talk about mistakes and lessons learned. First major mistake: when we were doing research on Cloud Spanner, before we started implementing the write-back cache, we didn't read the limits and quotas page carefully. I think maybe we were subconsciously thinking that being in the cloud means we don't have to worry about that, but cloud does not equal unlimited. There are lots of limits. We found out about two of Spanner's limits in production. The first limit we hit was the size of the data per cell, which is 10 megabytes. Spanner is a relational database, so the intersections of columns and rows are called cells. When we delete and restore assets, we need to update the library's trash can, and we put the whole trash can into a cell. It turned out that we have some really large libraries that also have really large trash cans that go above 10 megabytes. We saw some delete and restore failures.
Then we had to add a check to say, if this trash can is larger than 10 megabytes, just flush it out immediately and skip the write-back cache. The second limit we hit was mutations per commit. Spanner considers each data cell change a mutation. After we flush changes to GCS for a given library, we clean them up from Spanner. Initially, we were cleaning up by doing one delete query that deleted everything with a certain library ID. If a library happened to have a large amount of pending changes, like from a large import or a batch delete, that one query might be touching more cells than the limit allows. We can't clean up that library in one query. We saw some flush failures, and we added logic to do partial cleanup in that situation.
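A minimal sketch of those two guards, with the cleanup done in chunks. The 10-megabyte figure is the per-cell limit mentioned in the talk; the mutation budget, function names, and store interfaces are assumptions for illustration only.

```python
MAX_CELL_BYTES = 10 * 1024 * 1024   # Spanner's per-cell size limit
MUTATION_BUDGET = 10_000            # assumed chunk size, kept below the per-commit limit

def persist_trash_can(library_id, trash_can_bytes, write_back, gcs):
    # A trash can bigger than one Spanner cell can hold is flushed to GCS
    # immediately, skipping the write-back cache entirely.
    if len(trash_can_bytes) > MAX_CELL_BYTES:
        gcs.write(library_id, trash_can_bytes)
    else:
        write_back.put(library_id, trash_can_bytes)

def cleanup_flushed_changes(library_id, change_keys, delete_rows):
    # Deleting every row for a library in a single commit can exceed the
    # mutations-per-commit limit, so clean up in smaller chunks instead.
    for start in range(0, len(change_keys), MUTATION_BUDGET):
        delete_rows(library_id, change_keys[start:start + MUTATION_BUDGET])
```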
Second major mistake. When we were designing the write-back cache, we were thinking that it's ok to delay flushing changes to GCS. Somehow, we all forgot that there was actually one kind of request, for a special kind of data, whose updates need to be flushed out immediately. It caused an outage, and that was not fun. In our flush service, we added an endpoint that Alexandria can call to immediately flush a library. Then we started thinking about how that happened. How come we all missed it? We realized that we didn't have an up-to-date diagram of the entire platform. If we had had a diagram to refer to in our design process, that would have been an error we could not have missed.
At that point, everyone had been on the team for at least two years, and we all knew the system pretty well, and we were just all holding it in our heads. After that, we did diagramming exercises, and we made sure to explicitly include updating diagrams as part of our design process. Why do I want to bring up diagramming? Let's take a one-minute break from engineering, and please indulge me, and let me put on my brain science enthusiast hat. Our brain is not designed to read or write; we have to learn those skills. Written language, or reading and writing, has been around for maybe a few thousand years, and in terms of human evolution, that's like 30 minutes ago. Our brain is designed to process visual information. It has been doing that for as long as we have had eyes. We are amazingly good at it. We respond to visual data better than any other type of data. In fact, according to some research, we process visual information 60,000 times faster than reading text.
Our brain can process image elements simultaneously, while when we're reading, the brain sees words as individual images that we have to recognize first, and we do that sequentially. Pictures are just easier for our brains to comprehend than words. Since we have also evolved to look for structures and patterns, when talking about things like a hierarchy or a system, diagrams communicate more information in less time, and also more accurately. To make my point, here on the left is the diagram that we just saw, and on the right is text that describes the same thing. At this very moment, you might be realizing that you are drawn to the diagram first, and that's because it's just more natural for our brain, so diagram, diagram, diagram. It may sound like a cliché, but it's so true.
Know The Shape of Your Data
Back to the write-back cache. I want to share one last parting thought: know the shape of your data. When making decisions around data, I think the shape of the dataset is very important. I've mentioned three shapes of our dataset. It's large. Only a tiny portion is active at any given moment. Each library is its own database. Most of the decisions we talked about today have something to do with one of those shapes. It's like choosing clothes. If you're only thinking about your size, you can find something that works. If you also consider the shape of your body, then you can probably find something that looks even better on you. I think we all think about size a lot, but thinking about the shape would also be very helpful. If you have one takeaway from this talk, I hope this is it. Think about the shape of your data.
Questions and Answers
Participant 1: Since you embraced eventual consistency between your write-back cache and your long-term storage, how did you track your service level objectives, and what kind of service level objectives did you have for the time it took between writing to the cache and actually persisting the data?
Liu: Service level objective, meaning the latency?
We have graphs for all the request endpoints' latency. As a team, we have an on-call schedule, and every week we do an on-call handover, and then we look at the graphs together. We have certain expectations, so if the latency goes outside what we expect, the number will be in a different color.
Participant 1: It sounds like you didn’t have a very strict objective about how long it would take to write from the cache to the long-term storage.
Liu: It’s more like across all the write endpoints, we have an idea of the average latency we’re expecting.
Participant 2: If you had to rewrite it again, after what you learned from building the asset library, what would be different this time? Or did you have any surprises during the development?
Liu: I think the two mistakes that I mentioned are surprises. Another surprise was about load testing. I think I would take load testing more seriously. We did a lot of load testing, and we found a surprising amount of problems. Some of them have obviously been there for a while, but for some reason, it didn’t really affect anyone. For this kind of system, I would take load testing more seriously.
Participant 3: The question is about knowing the shape of your data. You do have a lot of different libraries. You also said that you have some libraries which are very large. When you say large, I assume it's large in the number of objects, the number of assets there, not that each asset is big. Did you find any problems where you have to load a library, but the library is giant? Can you do partial loads, or load just a slice of it?
Liu: I mentioned multi-segment library. We did slice the library. It’s like data sharding. With these large libraries, we did slice them into many segments so we can load the large library in parallel, and that essentially solved the problem. That’s a real problem.