Transcript
Karumanchi: We are going to be talking about write-ahead log, which is a system that we have built over the past several months to enhance the durability of our existing databases. I’m Prudhvi. I’m an engineering lead on the caching infrastructure at Netflix. I’ve been with Netflix for about six years. I’m joined by Vidhya, who is an engineering lead on the key-value abstractions team. We both have a common interest in solving really challenging problems, so we got together and came up with this.
The Day We Got Lucky
How many of you have dealt with data corruption in your production environments? How many of you have dealt with data loss incidents, or thought you had a data loss? It makes me happy to see that you’re also running into some of these issues. What you see here is a developer who has just logged into their laptop when a pager goes off at 9 a.m. The page says, a customer has reported data corruption.
Immediately the engineer starts an incident channel and a war room, and a bunch of engineers get together to understand what is happening. We see that one application team issued an ALTER TABLE command to add a new column to an existing table, and that corrupted some of the rows in their database. The result was data corruption. We got a handle on the situation. We all have backups. It’s not a big deal. We can restore from a recent backup, reconstruct the database, and life can go on. Is it really that simple? What happens to the reads that are happening during this time? Are you still serving corrupted data? Even if you decide to do a restore, what happens to the mutations that happened between the time you took the snapshot and the time you restored? We got lucky because we had caches for this particular use case, with an extended TTL of a few hours.
The caching team was engaged to extend the TTLs while the engineers on the database side were still trying to restore the database. We also got lucky that the application team was doing dual writes to Kafka, so they had the full state of all the mutations landing on this table. We were able to go back in time, replay the data, except for the offending ALTER TABLE commands, and we were back in business.
That was a lucky save. It led to a retrospective for the entire team to see how we could handle these issues better if they ever come up again. We wanted to do a reality check. What if we had not been lucky on that particular day? What if this app team did not have caches with a longer TTL? What if they did not have Kafka? We are pretty sure we would have had a data loss. How can we guarantee protection not just for this app, but for all the critical applications that we have at Netflix? There could be a lot of unknown failure modes. How can we prepare ourselves for those? I’m pretty sure we won’t get away with the next incident as luckily as we did with this one.
Scale Amplifies Every Challenge
Netflix scale basically amplifies every challenge. To give you a high-level perspective, in this diagram, on the far left, what you’re seeing is our edge or cloud gateway. All the middle circles that you see are a bunch of middle-tier services.
On the far right, what you see is either our caches or our databases, basically our persistence layer. There is a humongous amount of interaction happening. Even a single failure in the persistence layer could have an outsized impact on the rest of the system. We want to minimize failures in the persistence domain so that the middle-tier services don’t feel that pain. The data reliability challenges we have faced at Netflix have either caused production incidents or cost a ton of engineering time and money to get out of. We have also had cases where the same data is persisted to two different databases, like Cassandra and Elasticsearch, and we had to come up with a system to ensure the integrity of the data across those two systems. We ended up building bespoke solutions for some of these problems.
Outline
I want to get into the agenda. We’ll be talking about the Netflix architecture at a high level. Then we’ll be talking at length about the data reliability challenges that we face. We will be presenting the WAL (Write-Ahead Log) architecture and how it solves those data reliability challenges. We’ll leave you with some of the failures that the WAL system itself could run into.
Netflix Architecture – 10,000 Foot View
On the left-hand side, what you’re seeing is a bunch of clients, your typical mobile devices, browsers, and television sets, which issue requests. We are deployed completely in AWS, at least on the streaming side. There is the Open Connect component, which is our CDN appliance layer. This talk is not going to cover anything to do with CDNs; it is all in the purview of AWS, or the control plane side of Netflix. When we receive a request, we have Zuul, which is our internal cloud gateway, responsible for routing requests to different API layers. Once Zuul routes a request to a particular API, that API takes the request on its own journey.
The most important one here is playback, which is what happens when you click the play button after choosing a title. It hits a lot of middle-tier services, and eventually all the middle-tier services hit the stateful layer, which has a bunch of caches and databases. These are some of the technologies that power the stateful layer: Cassandra and EVCache, with Data Gateway being the abstraction sitting on top of them. This is by no means an exhaustive list of our stateful layer, just a sample set that we could fit on the slide. Let’s take a closer look at the two components that interact with the stateful layer. We have the middle-tier service, which either talks to the databases directly or talks via the Data Gateway.
Netflix has made a significant effort over the past few years to migrate most of the database workloads to go via Data Gateway, which is our key-value abstraction. We did that deliberately because most of the time application developers don’t know how to use the databases correctly. We wanted to take on that problem, solve it behind Data Gateway, and expose simple key-value semantics to the application. Life is good. There is nothing more to discuss here.
Data Reliability Challenges
Vidhya: Data reliability challenges are hard, especially at Netflix scale. We wanted to walk you through some of those challenges, walk you through the WAL architecture, and then come back to these challenges and see how we solve them. The first challenge we want to talk about is data loss. When you write anything to a database and the database is 100% available, that is a great thing to have. We don’t have any challenges. The data streams through to the database. Life is happy. What the user sees is that the data becomes durable in the database, and when you read it, the data is readable and visible to the client. Life is great. The problem starts when the database is unavailable, either partially or fully.
For example, an operator goes and truncates a table, which, I’m sure you know, is as easy to do in a database as an rm -rf. When that happens and you fully lose the data, then no amount of retries is going to help the situation. You really have a data loss. The same goes for what Prudhvi talked about earlier with the ALTER TABLE statement. When the ALTER TABLE statement corrupted the database, some of the writes that were still coming in were further corrupting the data. What we are really looking for in that case is how to root-cause it.
If we root-cause it, how do we stop further corruption from happening? If we can stop the corruption, we can get into a better state sooner. If we can’t even identify the problem, that’s a bigger problem to fix. That’s the data corruption problem. Data entropy is another common problem. It’s not enough to just do point queries. If you only have a primary store that supports primary-ID or partition queries, that’s not enough in all cases. Sometimes you want to add an indexer, like Elasticsearch, and do secondary index queries. When that happens, you want to write to multiple databases and read from one or the other.
When the secondary is down, the data is durable because you wrote to the primary, but it’s not really visible because some of the queries don’t work. Any amount of retries might not get you into a synchronized state because we lost some of the data being synced to the secondary. We can deal with that problem by doing an asynchronous repair: what you’re really doing is copying the data from the primary and synchronizing it to the secondary. That’s an easy fix. Visibility to the customer is unclear, though. They’re sitting there wondering, what is really happening? Where is my data? When will my data be in sync with the secondary? The other problem you have is when the primary itself is down, which is very close to the data loss problem we talked about earlier. We can’t sync from the secondary back to the primary here, because we don’t know if the secondary is really in a good enough state to sync the data from.
The next problem I want to talk about is the multi-partition problem. It is very similar to the entropy problem, except it happens in one database instead of two. It is one database because you take a mutation, but that mutation is a batch that mutates two different IDs in the system. Those two IDs can go to two different nodes, and one mutation can go through, whereas the other mutation, which has to happen on another node, does not. The database sits there retrying that mutation. All this while the customer is wondering what is happening, where is my data. It got an acknowledgment that the write is durable, it was written to the commit logs, but beyond that, there is nothing the customer has visibility over. You have to read logs or connect with the operator to find out more. Data replication in our systems is also something we want to deal with alongside these data reliability problems.
Some of the systems that we support, like EVCache, which uses Memcached, and RocksDB, which we use internally, do not support replication by default. When that is the case, we need a systematic approach for replication. Think about it: we have in-region replication as well as cross-region replication. Netflix often does region failovers, and when a region failover happens, we want the cache to be up-to-date and warm so that queries can be served from the cache. EVCache does cross-region replication.
Some of the challenges we face are: how do we make this fast enough for the customer? How do we make sure the traffic coming in does not affect the throughput of the consumer itself? Those are some of the data reliability challenges we want to talk about. There is much more we could talk about, but these are the main ones. Taking these challenges, what are their effects? Accidental data loss caused production incidents. System entropy cost some teams time and money: now you have to sit and write automation to sync from the primary to the secondary. Multi-partition mutations really call data integrity into question; the customer is asking you, where is my data? When will it get synced to the secondary? How do I deal with timeouts? Data replication needs a systematic solution.
WAL Architecture
With those challenges in mind, how do we really solve this problem at scale? Especially at Netflix scale, these challenges amplify themselves. Write-ahead log is one of the approaches that we took. What does the write-ahead log give us? It ensures that whatever data we modify is traceable, verifiable, restorable, and durable. Those are the challenges we are trying to solve. We want to talk about the internals of how we built the write-ahead log.
Karumanchi: Is everyone excited to solve these challenges? Just a quick recap. We have the client applications interacting with databases and caches. I just extended that notion: the client app could also be interacting with a queue or with another upstream application. We took some inspiration from David Wheeler, who said that all problems in computer science can be solved by adding another level of indirection. We inserted the write-ahead log in between and said, we can solve this. Let’s zoom in a bit on the write-ahead log. We have what we call the message processor, which receives all the incoming messages from the client application. You have the message consumer, which processes all the messages. We also maintain a durable queue. We also have the control plane. We’ll talk more about the control plane and how it fits into the write-ahead log in later slides.
The control plane, in simplest terms, is the configuration which the write-ahead log uses to take on a particular persona. We’ll be talking about the different personas that the write-ahead log can take. The request would contain something like a namespace, a payload, and a few other fields; we will get into the API itself in later slides. The most important thing in this slide is the namespace, which says, playback. The message processor asks the control plane, give me the configuration that is relevant to this namespace called playback. The message processor then knows which queues or other components it needs to work with, and adds the message to the queue. The message consumer consumes from the queue and sends it to the destination.
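To make that flow concrete, here is a minimal, hypothetical sketch of the processor path in Java. The type names (ControlPlane, NamespaceConfig, DurableQueue, QueueFactory) are illustrative stand-ins, not Netflix’s actual classes; this is only a sketch of the described behavior, not the real implementation.

```java
// Hypothetical sketch of the message processor flow described above.
public final class WalMessageProcessor {

    public record WalRequest(String namespace, byte[] payload, String target) {}

    public record NamespaceConfig(String queueCluster, String queueTopic) {}

    public interface ControlPlane {
        NamespaceConfig configFor(String namespace);
    }

    public interface DurableQueue {
        void enqueue(byte[] payload, String target);
    }

    public interface QueueFactory {
        DurableQueue forConfig(NamespaceConfig config);
    }

    private final ControlPlane controlPlane;
    private final QueueFactory queues;

    public WalMessageProcessor(ControlPlane controlPlane, QueueFactory queues) {
        this.controlPlane = controlPlane;
        this.queues = queues;
    }

    public void process(WalRequest request) {
        // Ask the control plane which persona this namespace takes, e.g. "playback".
        NamespaceConfig config = controlPlane.configFor(request.namespace());

        // The configuration tells the processor which queue (Kafka, SQS, ...) to use.
        DurableQueue queue = queues.forConfig(config);

        // Park the message durably; an independent consumer drains the queue
        // and delivers it to the final destination.
        queue.enqueue(request.payload(), request.target());
    }
}
```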
This is the same slide that we saw before, but we also wanted to make sure that we maintain separation of concerns, with the processor and the consumer deployed on two independent machines, and the queues themselves could be either Kafka or SQS, just to put a name to the queue. Some of you here might be wondering, this pretty much looks like Kafka. I have a producer. I have a consumer. What is the big deal? Why do you need to build a system like write-ahead log? What is the value add? If I refresh your memory on some of the problems that Vidhya alluded to earlier, this architecture might solve some of them, but not the multi-partition mutation or system entropy problems, where a single mutation from the client application could land in different partitions of the same database, or where a single mutation could land on two distinct databases. Those problems will not be solved by this, and we’ll see why.
You have the client application, and imagine we put a name to the individual mutations: you have a transaction with multiple chunks that need to be committed, and then a final marker message, a typical two-phase commit. In the architecture that we just saw, because we are dealing with a single queue, we start to see problems right from this layer. The chunks can arrive out of order, and the consumer has to be responsible for buffering all these chunks before it sends them to the database or whatever the final target is. If the final marker message arrives before the other chunks, and you’re dealing with multiple transactions, you can create a head-of-line blocking situation.
Memory pressure is real, and we use the JVM, so we can easily see our nodes getting into out-of-memory situations very fast. What if the database itself is unavailable, or is under stress for some amount of time? Even then, we need to do retries. Within the architecture we just saw, you can do that, but it’s very hard. Another thing I wanted to call out was that we are dealing with database mutations, and if you end up putting the entire payload into Kafka or whatever queue you pick, you are adding a significantly high data volume to the underlying Kafka clusters. The size of your Kafka clusters can grow pretty fast. You’re dealing with complex ordering logic, there is the memory pressure we just spoke about, and retries are really hard to handle in this architecture.
Let’s refine it a little more. What we see here are two new components, a cache and a database. Why do we need them? The thought process was, especially for multi-partition mutations, we want to just park the chunks we saw earlier in the cache and the database, and only add the final marker message to the queue once all the chunks that belong to that transaction have been committed. The only message that goes into the queue is the metadata indicating that all the chunks relevant to this particular marker have been committed. The job on the consumer side becomes extremely easy, because all the consumer needs to do at this point is get all the rows or chunks stored in the cache or the database, and then send them to the final destination, which is any of the components that we see.
The reason we use a cache here is that all of this is immutable data, and it significantly boosts read performance. Let’s replay the same architecture we saw. With the WAL processor, whenever these chunks come in, all of them get stored in the durable store first, and only the marker message ends up going into the queue, so you’re not putting a ton of pressure on the queue itself. The WAL consumer receives that final marker message, fetches all the chunks that belong to that transaction, and then sends them to the final destination. We clean up afterwards, because all of the data maintained within WAL is only needed for that duration of time.
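As a rough sketch of that marker-based flow, under the assumption of hypothetical ChunkStore, MarkerQueue, and Target interfaces (the real system sits behind DGWKV and Kafka/SQS, so none of these names are Netflix’s):

```java
import java.util.List;

// Illustrative sketch of the marker-based flow for multi-partition mutations:
// chunks are parked in a durable key-value store, and only a small marker goes
// onto the queue.
public final class MultiPartitionWalFlow {

    public record Chunk(String transactionId, int index, byte[] mutation) {}

    public record Marker(String transactionId, int chunkCount) {}

    public interface ChunkStore {          // e.g. backed by a cache plus a database
        void put(Chunk chunk);
        List<Chunk> getAll(String transactionId);
        void delete(String transactionId);
    }

    public interface MarkerQueue {         // e.g. Kafka or SQS
        void enqueue(Marker marker);
    }

    public interface Target {              // the final destination for the mutations
        void apply(List<Chunk> chunks);
    }

    private final ChunkStore store;
    private final MarkerQueue queue;
    private final Target target;

    public MultiPartitionWalFlow(ChunkStore store, MarkerQueue queue, Target target) {
        this.store = store;
        this.queue = queue;
        this.target = target;
    }

    // Processor side: park each chunk durably as it arrives; nothing on the queue yet.
    public void onChunk(Chunk chunk) {
        store.put(chunk);
    }

    // Processor side: once every chunk is committed, enqueue only the marker metadata.
    public void onAllChunksCommitted(Marker marker) {
        queue.enqueue(marker);
    }

    // Consumer side: on the marker, fetch the parked chunks, deliver them, then clean up.
    public void onMarker(Marker marker) {
        List<Chunk> chunks = store.getAll(marker.transactionId());
        target.apply(chunks);
        store.delete(marker.transactionId());
    }
}
```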
One of the things we kept talking about is that we built this as an abstraction. All we do is expose one simple API, and that masks or hides all the implementation details of what is happening underneath the covers. It also allows us to scale these components independently, because we have split the processor, consumer, caches, and databases, and each can scale up or down on its own. We spoke about the control plane, which is responsible for the dynamic configuration and allows WAL to change its behavior.
We’ll also run through some examples of control plane configurations here. We have not built everything from scratch; the control plane and some of these abstraction notions are something the teams have been working on over several years, making sure you can come up with any new abstraction very fast, because we have something called the Data Gateway agent, or Data Gateway framework. There is a blog post on how it is done if you want to go read about it. We are basically building on top of that. Finally, the target is where the payload that you have issued is supposed to eventually land. We also have backpressure signals baked into the message consumer side; in case any of these targets are not performing fast enough, we have a way to back off and retry.
Let’s look into the API itself. The API looks pretty simple. We have the namespace, which again is an index into the configuration. We have a lifecycle, which indicates the time at which the message was written, and if somebody wants delayed queue semantics, where I want this message delivered after 5 seconds or 10 seconds, you can dictate that in the proto itself. The payload is opaque to WAL; we really don’t want to know what the payload is. Finally, we also have the target information, which indicates the destination: is it a Cassandra database, a Memcached cache, another Kafka queue, or another upstream application? This is what a typical control plane configuration looks like. Here we are looking at a namespace called pds, and it is backed by just a Kafka queue. That’s the cluster and topic where all the messages end up going.
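As a rough Java-flavored sketch of that write request: the real API is a protobuf definition, and the field names below are assumptions chosen to mirror the fields just described.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Rough sketch of the WAL write request; field names are illustrative.
public record WalWriteRequest(
        String namespace,                // index into the control plane configuration, e.g. "playback"
        Instant writtenAt,               // lifecycle: when the message was written
        Optional<Duration> deliverAfter, // optional delayed-queue semantics, e.g. 5 or 10 seconds
        byte[] payload,                  // opaque to WAL; never inspected
        String target                    // free-form target info: Cassandra, Memcached, Kafka, an app, ...
) {}
```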
Every WAL abstraction that we have is backed by a dead letter queue by default, because, as Vidhya mentioned, there can be transient errors but there can be hard errors as well. None of our application teams need to know anything about this. All they need to do is toggle a flag saying, I want to enable WAL for my application, and they get all of this functionality. On the left-hand side, what I’m showing is a WAL that is backed by an SQS queue. We use SQS to support delayed queue semantics. So far, we haven’t seen situations where SQS couldn’t keep up with our delayed queue needs, but if the time or the need comes, we would probably build something of our own.
On the right-hand side, what we see is what helps with multi-partition mutations, where we need the notion of a queue and we also need a database and a cache. In there, what we put in is DGWKV, which stands for Data Gateway Key-Value and abstracts all the cache and database interactions for us. We talked about the target. If you look at the proto definition again, we did not leak the target details into the API itself. We kept it as a string value, so you can put in any arbitrary target information, but obviously you need to work with the WAL team to make sure the code relevant to that target is written in the message consumer.
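Putting those configuration ideas together, a hypothetical model might look like the following. This mirrors the concepts described (a queue backing, an optional DGWKV backing for multi-partition mutations, and a DLQ enabled by default), not Netflix’s actual control plane schema; the cluster and topic values are made up.

```java
import java.util.Optional;

// Hypothetical model of a WAL namespace configuration.
public record WalNamespaceConfig(
        String namespace,                // e.g. "pds"
        QueueBacking queue,              // where markers or messages are enqueued
        Optional<String> dgwkvNamespace, // present when chunks are parked in DGWKV
        boolean dlqEnabled               // every WAL namespace gets a DLQ by default
) {

    public enum QueueKind { KAFKA, SQS }

    public record QueueBacking(QueueKind kind, String cluster, String topicOrQueue) {}

    // Example: a Kafka-backed namespace with the default DLQ (illustrative values only).
    public static WalNamespaceConfig kafkaExample() {
        return new WalNamespaceConfig(
                "pds",
                new QueueBacking(QueueKind.KAFKA, "wal-kafka-cluster", "wal-pds-topic"),
                Optional.empty(),
                true);
    }
}
```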
How WAL Addresses Challenges
Vidhya: That was a very good deep dive into our WAL architecture. With this, can we solve the challenges that we talked about earlier? That is the question. Let’s look at it, starting with data loss. When the database is down, all you need to do is enable writing to WAL. As soon as you start writing to WAL and the database becomes available again, you can replay those mutations back to the database. This way, we are not losing any writes. We are not corrupting the database. We are in a very safe space. That does come with tradeoffs: you are now saying that data writes are eventually consistent, not immediately consistent. The database has to be available for you to do read-your-writes. There are also challenges around transient versus non-transient errors. Transient errors can be retried and the mutation applied later, but non-transient errors, for example a timestamp that is invalid for the system you’re mutating, are a problem. You have to fix the timestamp before you mutate again.
For those cases we have the DLQ: we write the mutation to the DLQ and apply it after fixing or massaging the data. Corruption. We corrupted the data, now what do we do? We add WAL. We start writing to WAL. That way we are protecting the writes that are coming in from some other system or application. We create a new database and restore it using point-in-time backups. That’s what we talked about earlier: we had backups. Now we point our microservice to the new database. That’s great, but we still lost some data, and that data is in WAL now. We replay those mutations. This way we replayed the mutations after fixing the system and we did not lose the data.
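A minimal sketch of that transient versus non-transient handling on the consumer side, assuming hypothetical Target and DeadLetterQueue interfaces and exception types (the real WAL consumer is not structured exactly like this):

```java
// Transient errors leave the mutation parked in WAL for a later retry, while
// non-transient errors (for example, an invalid timestamp) are routed to the
// dead letter queue to be fixed first.
public final class WalDeliveryHandler {

    public static class TransientException extends Exception {}

    public static class NonTransientException extends Exception {
        public NonTransientException(String reason) { super(reason); }
    }

    public interface Target {
        void apply(byte[] mutation) throws TransientException, NonTransientException;
    }

    public interface DeadLetterQueue {
        void send(byte[] mutation, String reason);
    }

    private final Target target;
    private final DeadLetterQueue dlq;

    public WalDeliveryHandler(Target target, DeadLetterQueue dlq) {
        this.target = target;
        this.dlq = dlq;
    }

    // Returns true when the mutation is finished with (applied or dead-lettered).
    public boolean deliver(byte[] mutation) {
        try {
            target.apply(mutation);
            return true;                        // applied; safe to clean up
        } catch (TransientException e) {
            return false;                       // stays parked in WAL; retry later with backoff
        } catch (NonTransientException e) {
            dlq.send(mutation, e.getMessage()); // needs fixing or massaging before replay
            return true;
        }
    }
}
```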
Again, there is some manual work that needs to be done. There are tradeoffs: it’s eventually consistent, we need point-in-time recovery, and it takes time if you don’t have automation. That all adds to the time you’re parking the data in WAL, and to your eventual consistency window. Sometimes you don’t really require point-in-time recovery. An example would be TTL’d data: if the data has already TTL’d out, you don’t need to recover it. You just point to a new database, or clean off the database, and restart replaying the data. The next two problems I talked about, system entropy and multi-partition mutations, I want to club into one solution. They are very similar: either it is one database with two mutations in place, or two databases. The problem looks the same.
You first write to WAL, no matter what, and then mutate your databases using the WAL consumer. This way you don’t need to do asynchronous repairs; WAL is prepared to do those repairs and retries. It will back off when it gets the signal from the database that the database is not available. You need to provide those signals to WAL so that it can back off; that’s something you might want to think about. Those also have tradeoffs. You have an automated way of repairing the data now. You don’t have to manually intervene and do those repairs; the system itself takes care of that. It also helps in developing features like secondary indexes or multi-partition mutations. Eventual consistency is the consistency model you get; that’s the tradeoff there. Data replication: we talked extensively about how some systems do not support data replication by default.
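The backoff behavior might look something like this simple exponential policy. This is a sketch only, with made-up names; the real consumer ties its pacing to the actual backpressure signals it receives from the target.

```java
import java.time.Duration;

// The consumer watches the backpressure signal from the target and slows down
// instead of hammering an unavailable or stressed database.
public final class BackoffPolicy {

    private final Duration initial;
    private final Duration max;
    private Duration current;

    public BackoffPolicy(Duration initial, Duration max) {
        this.initial = initial;
        this.max = max;
        this.current = initial;
    }

    // Call when the target signals pressure (timeouts, high latency, explicit signal).
    // Returns how long to pause consumption before the next attempt.
    public Duration onBackpressure() {
        current = current.multipliedBy(2);
        if (current.compareTo(max) > 0) {
            current = max;
        }
        return current;
    }

    // Call after a successful delivery: the target has recovered, so reset.
    public void onSuccess() {
        current = initial;
    }
}
```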
You can write to WAL and let WAL replicate the data to another application. We made WAL so pluggable that the WAL target can be another application, another WAL, or another queue. It is totally agnostic of what the target is. It’s all in the configuration. We can manipulate the configuration of namespaces to write to different upstream services.
Here, I’m writing to an upstream app which itself mutates the database. You can also choose to write to another WAL, which will mutate the database. When a region failover happens, your data is already synced to the WAL or the database in the other region. That gives you uninterrupted service. It also helps with data invalidations, in cases where the writes are in one region and your cache is stale in the other region. Cross-region replication is expensive, so you really want to think about whether you need it. It’s a pluggable architecture, as I told you. The target can be anything you choose. It could be an HTTP call. It could be a database. It could be another WAL. That pluggability gives us more knobs to turn.
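A minimal illustration of that pluggable-target dispatch follows. The Target interface and the target names are assumptions; as mentioned above, in practice the WAL team adds consumer-side support for each new target type.

```java
import java.util.Map;

// The consumer delivers to whatever the namespace configuration names: a
// database, a cache, an HTTP upstream, another queue, or another WAL.
public final class TargetDispatcher {

    public interface Target {
        void deliver(byte[] payload);
    }

    // Registered implementations, keyed by the target string from the request,
    // e.g. "cassandra", "evcache", "http", "kafka", "wal".
    private final Map<String, Target> targets;

    public TargetDispatcher(Map<String, Target> targets) {
        this.targets = targets;
    }

    public void dispatch(String targetName, byte[] payload) {
        Target target = targets.get(targetName);
        if (target == null) {
            // Unknown target: consumer-side support has to be added first.
            throw new IllegalArgumentException("No consumer support for target: " + targetName);
        }
        target.deliver(payload);
    }
}
```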
WAL is not cheap. It adds cost to your system. You really have to think about whether WAL is necessary for your use case. If durability is very important, that cost can be justified. If durability is not important for some non-critical use case and you can take some data loss, that’s totally fine. At Netflix we use this for very critical use cases where data loss is not acceptable. It adds latency, and consistency is loose. If you want strict consistency, WAL is not something you want to use as your primary. WAL does not help with reads; that’s very obvious. It only helps when you have data loss during writes.
The other part of the problem we saw earlier, when Prudhvi talked about data loss, is that we had caches, and we had to extend the TTL on those caches so we could hold the data through the data loss. So think about it: if you really need WAL, do add it. Adding an extra layer of indirection is great, but it’s not needed in every case. Too many levels of indirection is also not good.
Failure Domains
WAL also has failure modes, and we want to talk about that. WAL is not immune to failures. It’s not fail-safe. Netflix uses abstractions to deal with these problems, and I’m going to talk about how. The few failure scenarios I want to talk about are traffic surges, slow consumers, and non-transient errors. When you provision a cluster and you know it’s 1000 RPS, you only expect 1000 RPS. You don’t want to over-provision and cost your org more money. You provision for the requests you’ve been asked for. It does happen that sometimes we get a surge of traffic. When you get that surge, you want an easily operable datastore. You don’t want to scramble at that point: how do I fix this problem? How do I expand the cluster? Expanding the cluster is one way of mitigating the issue.
For example, if you have a queue, you want to increase the resources in the queue. That takes time. You might have to add more brokers; that might take 5 or 10 minutes. You might also have to move some of the data in that datastore around so that it is spread evenly. All of that takes time.
When that happens, the easiest thing we can do is add another separate processor and queue and split the data 50-50. Once the data is split, it is easy for us to expand. That is one of the mitigation strategies we employ for traffic surges. When traffic surges happen, we can also end up with a slow consumer: one consumer consuming from both queues and dealing with a slow database, or something like that. You’ve split the data, and now you have a slow consumer; how do we deal with that problem? Either you can add a consumer layer that consumes from both queues, or, if you only have a single layer of queue and proxy, you can add more nodes to it and deal with the problem by consuming more.
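As an illustration of that 50-50 split, here is a small, hypothetical sharding sketch; the class and interface names are made up, and the real split happens inside the WAL processor tier rather than in client code.

```java
// Incoming writes are sharded across two independent processor-plus-queue
// pairs so that each pair can be scaled out on its own during a surge.
public final class TrafficSplitter {

    public interface Processor {
        void accept(String messageKey, byte[] payload);
    }

    private final Processor first;
    private final Processor second;

    public TrafficSplitter(Processor first, Processor second) {
        this.first = first;
        this.second = second;
    }

    public void route(String messageKey, byte[] payload) {
        // A stable hash keeps a given key on the same processor/queue pair.
        Processor chosen = (Math.floorMod(messageKey.hashCode(), 2) == 0) ? first : second;
        chosen.accept(messageKey, payload);
    }
}
```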
The caveat here is that your database, or whatever the target is, has to be provisioned well enough to deal with that surge of messages coming in. That’s why we use backpressure signals, watch them, and carefully tune our knobs. The system automatically tunes to the CPU and the disk that are in place. Non-transient errors: we talked a little bit about them before. This time, we want to talk about how we really deal with them. They can cause head-of-line blocking when you retry the data multiple times and sit there waiting for the data to get better. One approach is, when the database is down, to park the data and not pause consumption.
The second approach is to add a DLQ. When a non-transient error happens, the DLQ takes care of the head-of-line blocking for you. It sits there retrying, or the data gets massaged before retrying. All of that costs time, but by handling the non-transient errors in a separate system, the data that can be applied sooner doesn’t suffer those latencies. Again, it sometimes requires manual intervention: when mutations with non-transient errors cannot easily be applied, the user can look at them and deal with the problem.
Key Takeaways
With all of these problems, we have shown how WAL helps us, but we also wanted to leave you with the key things we considered while developing the system. We talked about pluggable targets. We have namespaces as well, which help you with the configuration; we can expand the scope of a namespace to multiple queues. We can use abstractions to deal with a lot of failure modes. That pluggable architecture was core to our design considerations. We already had a lot of the building blocks in place, like the control plane and the key-value abstraction, and that helped us build this pluggable architecture. Separation of concerns: the incoming throughput should not affect the throughput the consumer has. From the beginning, we thought about how to grow this complex system, with each component scaling individually, without any component blocking the others. Systems fail, so please consider your tradeoffs carefully.
Questions and Answers
Participant 1: When you commit a write to the log, and it finally gets to your database or cache or wherever it is going, there must be a delay. How are you handling that? Is your application ok with that kind of delay? How do you make sure the reads are not dirty when they go to the primary before your message consumer has actually pushed that data to the database?
Karumanchi: Obviously, this is an asynchronous system, how do you manage delays?
I think, as in one of the slides Vidhya shared, this is definitely not going to give applications read-your-writes consistency guarantees. What we are promising is that, with this, your data will eventually be there. The key-value abstraction that we have built essentially has two flags that indicate whether the data was durable and whether it is visible. If the response says the data is visible, that means it is visible on the read path. If the user gets an explicit response from that write API saying durable is true but visible is false, that data may or may not be visible yet.
Vidhya: Eventual consistency is what we are talking about.
Karumanchi: The write API explicitly states that.
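As a rough illustration of that durable/visible distinction: the real key-value abstraction responses are protos, and these field names are assumptions for the sketch.

```java
// Rough sketch of the write acknowledgment flags described above.
public record WriteAck(boolean durable, boolean visible) {

    // durable = true, visible = true  -> the write is already on the read path
    // durable = true, visible = false -> parked in WAL; it will become visible eventually
    public boolean readYourWritesGuaranteed() {
        return durable && visible;
    }
}
```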
Participant 2: If the client application can be a consumer of the queue and query the metadata itself, would that not solve the tradeoff of eventual consistency? You’re just reading the metadata, not even going to the cache or the DB within the WAL, just querying the metadata. If it exists, you can pull it. If not, you can go back to the DB.
Karumanchi: I think it’s basically a decision between, do we want to abstract the details of the cache and database away from the application, or do we want to make it a leaky abstraction? That was the choice. The approach you mentioned would definitely help increase performance, no question about that. The downside is that you would be leaking the details of which underlying caches and databases this WAL abstraction is using. Those are the tradeoffs we had to make. Maybe we will lean toward the approach you mentioned for some use cases down the line. For now, we are pretty firm that the application should not know anything about what is happening under the hood of WAL.
Vidhya: That is a great idea. The systems we’re dealing with are mostly in a failed state; we’re adding an extra layer of durability for us to maintain critical applications. Read-your-writes is important. What you are mentioning is an option where you can wait for all the data to be mutated before reading it. That might add latency. If that latency is ok, then you can do that. One of our systems, Hollow, does that right now.
Participant 3: From the architecture, it looks like you, or any consumer, will only enable WAL when there is a problem. Then, when the problem is resolved, the application keeps updating the database as-is. When there’s downtime, you have certain transactions pending in WAL, waiting to be applied to the database. When the problem is resolved, there are competing transactions on the same row, for example. Does this architecture just solve for eventual consistency? Or, if you have incremental updates and you want to apply each update to the database, do you care about sequencing or ordering?
Vidhya: I don’t want to use transaction as the word there. Transaction is not the word; we don’t want to support transactions. That’s a big word. Sequencing is great. We wanted to use the idempotency token that the client provides as a mechanism for deduplicating the data. If you’re looking at something like, if X is A, then set X to Y, that kind of system cannot be supported without the sequencing you are mentioning and all the bells and whistles that a transactional system supports. WAL is not for that. WAL is for data reliability at scale, and eventual consistency.
Participant 4: Have you just moved the failure domain from your persistent DB to the WAL DB? What happens if the WAL DB fails?
Karumanchi: Those are some of the failure domains Vidhya alluded to. Definitely, yes, it can fail, especially if the WAL’s own database or caching layer runs into the same issues we described. The hope, or the idea, is that the probability of both systems failing concurrently is extremely low. We are basically hedging on that.