Transcript
Burmistrov: I’ll start with a short story that happened to me. It was a normal working day. I came home from work to join my wife and daughter for dinner. I’m really excited because the model our team has been working on has finally started showing promising results, and I’m really eager to share this with my family. My wife is busy though, because she’s trying to feed our daughter, who is not a fast eater. She doesn’t want to talk about models or anything else at that point. But I’m so excited I cannot wait, so I start telling her, “We had this model we built, and it wasn’t working. We’ve been debugging it for more than a month.
Finally, it turns out they were just stupid bugs, like typos, like a wrong feature name, this kind of stuff. We finally hunted them all down, so the model is now performing as we expected”. I shared this with excitement and expected some reaction, but she was busy. She barely listened, but she doesn’t want to be rude, and she understands that she needs to say something, so she offers her comment: “Nice, this model behaves just like you do at times”. I’m like, “What do you mean, how is that even relevant?” She said, “When you’re mad at me, you don’t simply tell me what’s wrong. I have to spend so much time figuring out what’s going on, just like you guys with this model”. She’s actually spot on.
The models are really hard to debug. If you build a model and it’s underperforming, debugging it can be a nightmare. It can cost days, weeks, or months, like in our case. The least we can do to ease the pain is to ensure that the data we feed to the model is ok, so that the data doesn’t contain these bugs. This is what feature platforms aim to do: they aim to deliver good data to machine learning models. Feature platforms are what we will be discussing.
Background and Outline
My name is Ivan. I’m a staff engineer at ShareChat. ShareChat is an Indian company that builds a couple of social networks in India, among the largest domestic social networks. I primarily focus on data for ML, and in particular on this feature platform. Before ShareChat, I had experience working at Meta and ScyllaDB. One of the social networks that ShareChat builds is called Moj. It’s a short video app, pretty much like TikTok. It has a fully personalized feed, around 20 million daily active users, and 100 million monthly active users. The model in the story is actually the main model for this app, the one that defines the ranking, that defines which videos we show to users. We will be talking about how we built the feature platform for this app, or any app like this. We will have some background about what a feature platform is and what features are, and the high-level architecture of any feature platform. The majority of the time will be spent on challenges and how to overcome them, based on our examples. Finally, some takeaways.
Introduction to Feature Platform
Let’s refresh our memory of what a feature is in ML. A feature is pretty much anything we can derive from data. Typically, it’s attached to some entity in the system. For instance, the entity can be the user, or the post, or the creator, which is also a user, but the one who posts the content. There are multiple types of features one may be interested in. For instance, window counters: a feature like, give me the number of likes for this post in the last 30 minutes, or give me the number of engagements from this user in the last one day. These are window counters, because they have a time window. There can be lifetime counters. Lifetime counters don’t have windows: the total number of likes for the given post, or the total number of engagements from the given user. There can be properties, like date of birth for a user, or post language, this kind of stuff. Or something like last N: give me the last 100 interactions with this given post, or give me the last 1,000 engagements for the given user.
In this talk, we’ll primarily focus on window features, which are perhaps the most interesting. To summarize what a feature platform is: it’s a set of services or tools that allows defining features, so it gives some feature API where you can list features and read what they mean. It allows launching feature pipelines to compute them. Finally, there is a feature store to read the data when it’s needed. One important aspect I want to stress is that it’s really important that a feature platform helps engineering velocity, meaning that it doesn’t stand in the way. Instead, it allows for fast iteration. The people who build the model can easily define features, update them if needed, or read what they mean. Because when it comes to some kind of investigation, especially a model underperforming, let’s say, it’s really important that people can go and easily read: these are the features that we feed, and this is what they mean. It’s pretty clear. It is not hidden somewhere in tens of files, in cryptic language or whatever.
Architecture – High-Level Overview of Feature Platform Components
A typical architecture for any feature platform starts with some streams of data. In a social network, it may be streams of likes, streams of views, video plays, this kind of stuff. There is some processing engine that takes these streams of data, computes something, and writes to some kind of database. Finally, in front of this database, there is some service that gets a request for features, transforms it into whatever queries the database requires, probably performs some final aggregations, and returns the result. Speaking of final aggregations, in particular for window counters: how do we serve a query like, give me the number of likes in the last 30 minutes or one day? Typically, the timeline is split into pre-aggregated buckets, which we call tiles, of different sizes.
For instance, we can split the timeline into buckets of 1 minute, 30 minutes, and 1 day. When a request comes, let’s say at 3:37 p.m., and we want the number of likes in the last two hours, we can take this window of two hours, cover it with the relevant tiles, then request these tiles from the database, get the response back, aggregate across the tiles, and return the result. It’s a typical technique to power these window counters.
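To make the tiling idea concrete, here is a minimal sketch in Java (not the production code, and using a single 1-minute tile size for simplicity, where the real system mixes 1-minute, 30-minute, and 1-day tiles) of covering a query window with pre-aggregated tiles and summing them:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch: 1-minute pre-aggregated tiles keyed by their start time
// (in epoch minutes), summed to answer a window-counter query.
public final class TileAggregation {

    // likeTiles: tileStartMinute -> number of likes counted in that tile
    static long likesInWindow(NavigableMap<Long, Long> likeTiles,
                              long windowStartMinute, long windowEndMinute) {
        long total = 0;
        // Take every tile whose start falls inside [windowStart, windowEnd)
        for (long count : likeTiles.subMap(windowStartMinute, true,
                                           windowEndMinute, false).values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Long> tiles = new TreeMap<>();
        tiles.put(100L, 3L); // 3 likes during minute 100
        tiles.put(101L, 5L);
        tiles.put(103L, 2L);
        // "likes in the last 2 minutes" asked at minute 104 -> tiles 102 and 103
        System.out.println(likesInWindow(tiles, 102L, 104L)); // prints 2
    }
}
```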
Challenges and Solutions: Story 1 (The Boring One)
That was it about architecture. Now let’s go to the challenges and solutions, which I will be talking about in the form of stories. The first story is a boring one. It’s about the choice of streaming platform and database. For streaming, we picked Redpanda. For the database, we picked ScyllaDB. ScyllaDB and Redpanda are like siblings. They share a lot of similarities.
First, they were born with API compatibility with existing systems: Scylla was born with the Cassandra API, and later added a DynamoDB API, and Redpanda is Kafka compatible. They’re both obsessed with performance. They do this via so-called shard-per-core architecture. Shard-per-core is when we split the data into shards, and each shard is processed by a given core in the system. This architecture allows eliminating synchronization overhead, locks, this kind of stuff. This helps them get to the performance numbers they desire. It’s not easy to build applications using this shard-per-core technique. They both leverage the Seastar framework. It’s an open-source framework built by the Scylla team that allows building these kinds of applications. They both don’t have autoscaling. That’s actually not entirely true anymore for Redpanda: they just announced that they launched Redpanda serverless in one of their clouds. But if you install them in your own stack, they don’t have autoscaling out of the box.
Despite that, they’re extremely cost efficient. In our company, we use them not only in this feature platform, but actually migrated a lot of workloads to them: to Scylla we migrated from Bigtable, and to Redpanda we migrated from GCP Pub/Sub and in-house Kafka, and achieved really nice cost savings. They both have really cute mascots. If you work with these systems, you get to work with some swag, like this one. That’s it about the story, because these systems are tragically boring. They just work. They really seem to deliver on the promise of operational easiness, because when you run them, you don’t need to tune them; they tend to get the best out of the given hardware. My view on this is a bit biased, because we use the managed versions. We tested the non-managed versions as well, and the experience was the same. We tested both and picked managed, just because we didn’t want to have a team behind this.
Story 2 (Being Naive)
Let’s go to the less boring part. When it came to streaming and the database, there were a bunch of options. When it comes to the processing engine, especially at the moment when we made the decision, there were not so many options. In fact, there was only one option if you want real-time stream processing. What’s important: we want real-time stream processing, and we want some expressive language for the stream processing.
Apache Flink is the framework that can process a stream of data in real time and has some powerful capabilities. What’s important is that it has Flink SQL. One can define the job in SQL, and that’s really important for a feature platform, because of that property we want to deliver: it should be easy to write features, and most importantly, easy to read features. We picked Flink. We can build features using this streaming SQL. What this query does is basically work on top of streaming data, select from it, and GROUP BY some entity ID and some identifier of the tile, because we need to aggregate into these tiles. This query forms a so-called virtual table. Virtual, because it’s not really materialized; it keeps updating as the data comes. For instance, some event comes, and now we’ve updated the shares and likes. Some new event comes, and we update it again. This process continues. Super nice, we built this proof of concept pretty quickly, and we were happy. From the developer experience perspective, Flink is really nice. It’s easy to work with using the usual developer workflow.
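The transcript doesn’t include the actual query, but a tiled counter feature in Flink SQL could look roughly like the sketch below. The table and column names are illustrative, and the 'datagen' connector stands in for the real Redpanda/Kafka source:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Sketch of a tiled counter feature expressed in Flink SQL (illustrative names).
public final class TiledCounterJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source of engagement events; 'datagen' stands in for the real stream
        // (Redpanda via the Kafka connector in production).
        tEnv.executeSql(
            "CREATE TABLE engagements (" +
            "  post_id  BIGINT," +
            "  action   STRING," +
            "  event_ts TIMESTAMP(3)" +
            ") WITH ('connector' = 'datagen', 'rows-per-second' = '10')");

        // One row per (post, 1-minute tile); the result is an updating
        // "virtual table" that changes as new events arrive. print() emits
        // the changelog rows (+I, -U, +U) continuously.
        tEnv.executeSql(
            "SELECT post_id, FLOOR(event_ts TO MINUTE) AS tile_start, " +
            "       SUM(CASE WHEN action = 'like'  THEN 1 ELSE 0 END) AS likes, " +
            "       SUM(CASE WHEN action = 'share' THEN 1 ELSE 0 END) AS shares " +
            "FROM engagements " +
            "GROUP BY post_id, FLOOR(event_ts TO MINUTE)").print();
    }
}
```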
Then the question comes: we built this query, we launched the job, it’s running, it’s updating this virtual table, all good. Now we want to update the query. Flink has this concept of savepoints and checkpoints. Basically, because it’s stateful processing, the job has some state. What a savepoint does is let us stop the job, take a snapshot of the state, then update the job and start it again from this snapshot. It continues working without a huge backlog or anything. This was the expectation. Nice, we have Flink, and Flink has savepoints. We have Flink SQL, so we can update this SQL and restore from a savepoint. Unfortunately, it doesn’t work like this.
In fact, it’s impossible to upgrade a Flink SQL job. For us, at that moment kind of inexperienced Flink engineers, it came as a big shock. Because, come on guys, are you kidding? What are we supposed to do? Launch the job and expect it never fails, works forever, or what? When the first shock faded and we thought about this a little bit more, it’s not that surprising. This query actually gets translated into a set of stateful operators. When we make even a slight change in the query, these stateful operators may be completely different. Of course, mapping one set of stateful operators to another is a super-difficult task. It’s not yet implemented, and it’s not Flink’s fault that it’s not implemented. But knowing this doesn’t make our life easier, because we want to provide a platform where users can go, update the SQL, and just relaunch the job. It should just pick up and continue working. The typical recommendation is to always backfill.
If we want to update Flink SQL, we compose the job in such a way that it first runs in batch mode over already-processed data, and then continues on the new data. It’s really inconvenient, and it’s also costly. If we do this every single time we update these features, it will cost us a lot of money. It’s just super inconvenient, because the backfill also takes time. We want the experience where we update the job, just relaunch it, and it continues working.
One thing where Flink shines is the so-called DataStream API. The DataStream API, in comparison to the SQL API, is where we express the job in some JVM language, like Java, Kotlin, or Scala. There is a really nice interop between SQL and DataStream, and in particular it’s called the Changelog. When we want to get from SQL to DataStream, the SQL side will send so-called Changelog rows. What is it? It’s a row of data with a marker, like on this slide, +I. +I means that it’s a new row in our virtual table. There can also be -U and +U. They come in pairs: -U means this was the row before the update, and +U means this is the row after the update. While the SQL is running, it keeps issuing these Changelog entries. What’s interesting about this Changelog is that, if you look at it, you can notice that the final values of our columns, our features, are an aggregation over this Changelog.
Consider the + operation as plus, and the – operation as minus. Here we see three rows. If we want to count shares, we do 3 – 3 + 5, and the final result is 5. The same continues. We can keep treating these Changelog entries like commands, either a + command or a – command. If we want to express this in the form of a DataStream, there will be a function like this. Super simple. We have the row update and the current state. We get the state, we see if it’s a positive update or a negative update, and we perform this update. Super simple. So far, it’s super obvious. What’s interesting about this model is that it survives job upgrades. When we upgrade the job, we lose the SQL state, fine. The SQL will continue issuing these Changelog entries, this set of +I, -U, +U.
Then a job upgrade happens, and now this row, from the SQL engine’s perspective, is a new row. It will start sending +I entries. That’s fine: this +I represents exactly the change relative to the previous data. From the perspective of the logic that performs aggregation over the Changelog, nothing happened. It just keeps treating these entries as commands, a + command or a – command. We had this aggregation, then a job upgrade happened, and we don’t care. We keep aggregating these Changelog entries. So we can compose the job in this way: there is the SQL part, and there is the Changelog aggregation part. In the SQL part, we don’t control the state, because of the magic and complexities of this SQL and so on. The Changelog aggregation part is expressed in DataStream, and this is where we control the state. We can build the state in such a way that it survives job upgrades. This part of the job can be restored from a savepoint and continue. The flow will look like this: we update the query, relaunch the job from a savepoint, and the computation continues.
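A minimal sketch of what such a Changelog-aggregation operator could look like in the DataStream API (not the actual ShareChat code; the 'likes' column name matches the illustrative SQL above). It treats +I/+U rows as plus commands and -U/-D rows as minus commands, keeping the running counter in keyed state that we control:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

// Sketch of the Changelog-aggregation operator: +I/+U rows are "plus" commands,
// -U/-D rows are "minus" commands. The running total lives in keyed state that
// we fully control and that can be restored from savepoints across job upgrades.
public class ChangelogAggregator extends KeyedProcessFunction<String, Row, Row> {

    private transient ValueState<Long> likes;

    @Override
    public void open(Configuration parameters) {
        likes = getRuntimeContext().getState(
                new ValueStateDescriptor<>("likes", Long.class));
    }

    @Override
    public void processElement(Row row, Context ctx, Collector<Row> out) throws Exception {
        long current = likes.value() == null ? 0L : likes.value();
        long delta = (Long) row.getField("likes");   // column produced by the SQL query

        RowKind kind = row.getKind();
        if (kind == RowKind.INSERT || kind == RowKind.UPDATE_AFTER) {
            current += delta;        // "+" command
        } else {
            current -= delta;        // "-" command (-U / -D)
        }
        likes.update(current);
        out.collect(Row.of(ctx.getCurrentKey(), current));
    }
}
```

The changelog itself can be obtained from the SQL part with StreamTableEnvironment.toChangelogStream, keyed by the entity ID plus tile identifier, and then fed into a function like this one.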
Story 3 (Being Tired)
Now we have a system that we can write and also update. It’s pretty exciting. The next problem we might face is that when we launch the job, its performance may decline over time. We call this the job getting tired. It’s clear why: this is a stateful job, and we have state. The bigger the state, the less performant the job. It’s a pretty clear dependency. The typical recommendation: you have state, just apply a TTL. The state will expire, it will have a constant size, and the job will not get tired. It’s a fine recommendation, but it’s not really clear what kind of TTL to pick. For instance, in our case, the largest tile that we use in this system is a five-day tile. If we want counters to be accurate, the TTL must be at least five days. That doesn’t even solve the problem of lifetime counters, where we don’t have a window, and it’s not really clear what kind of TTL to pick at all. Assuming we’re fine with lifetime counters being slightly inaccurate, and assuming we’re fine with a five-day TTL in the context of lifetime counters, the problem is that five days is too much.
In our experience, a job processing hundreds of thousands of events per second and performing millions of operations per second against the state shows signs of being tired pretty quickly, just a few hours after launch. Five days is just too much. Of course, we can upscale the job, but it comes with a cost. What can we do about it? Remember that we now have two types of state. One is the SQL state, and the other is the Changelog aggregation state. The good news about the SQL state is that we don’t have to do anything, because we already survive job upgrades thanks to this Changelog mode.
From the SQL perspective, it doesn’t matter whether it lost state because of a job upgrade or because of a TTL. We just set a TTL on the SQL state and keep treating these Changelog entries as commands; nothing else to do. For the Changelog aggregation state, we can modify our function a bit. When we access the state and it has expired, we can just query our Scylla, because the data exists in Scylla. We modify our function, the part shown in green on the slide. Now we can set a TTL for the Changelog aggregation state as well. It will keep running and recovering itself from the main database. Now the jobs are no longer getting tired; they have consistent performance because the state has a consistent size.
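A sketch of that “self-healing” variant under the same assumptions as before: the keyed state gets a five-day TTL, and when a key comes back empty, the current total is re-read from the main database. FeatureStoreClient is a hypothetical wrapper around the Scylla driver; a real job would likely batch or make these lookups asynchronous:

```java
import java.io.Serializable;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

// Variant of the aggregator with bounded state: a 5-day TTL on the keyed state
// plus a read-through fallback to the main database when a key has expired.
public class SelfHealingAggregator extends KeyedProcessFunction<String, Row, Row> {

    public interface FeatureStoreClient extends Serializable {
        long readLikes(String key);   // hypothetical: read the persisted counter from Scylla
    }

    private final FeatureStoreClient featureStore;
    private transient ValueState<Long> likes;

    public SelfHealingAggregator(FeatureStoreClient featureStore) {
        this.featureStore = featureStore;
    }

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("likes", Long.class);
        // The largest tile is five days, so anything older can safely expire.
        descriptor.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(5)).build());
        likes = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Row row, Context ctx, Collector<Row> out) throws Exception {
        Long current = likes.value();
        if (current == null) {
            // State expired (or key never seen): recover the total from the main database.
            current = featureStore.readLikes(ctx.getCurrentKey());
        }
        long delta = (Long) row.getField("likes");
        RowKind kind = row.getKind();
        current = (kind == RowKind.INSERT || kind == RowKind.UPDATE_AFTER)
                ? current + delta : current - delta;
        likes.update(current);
        out.collect(Row.of(ctx.getCurrentKey(), current));
    }
}
```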
Story 4 (Co-living Happily)
Good. Now we can launch the job, update the job, and it’s not dying. The problem with the previous change, though, is that the jobs now not only write to the database, but also read from it. It’s actually a big deal for a job to read from the database, because for a database like Scylla or Cassandra or similar, reads are somewhat more expensive than writes, especially cold reads. If we read something which is cold, which isn’t in the database cache, a lot of stuff happens: we need to scan multiple files on disk to get the data, merge them, update the cache, and so on. What’s interesting about jobs is that they are more likely to hit cold reads than the service that serves features. What can happen is that we do something on the job side, for instance, launch a bunch of test jobs. We want to launch test jobs because we want to unlock this engineering velocity, and we want to experiment, and so on. Or maybe some job got a backlog for some reason and needs a huge backfill or whatever.
We launch these jobs, and they start hitting cold reads, especially if it’s a backfill and they are trying to process the backlog: they hit a lot of these cold reads. This thrashes the main database and affects the latency of the service that accesses the features. What do we do about that? The first thought may be that we need some throttling. The problem, though, is that it’s not really clear at what level we should apply the throttling. We cannot throttle at the individual worker level in the Flink job, because we have multiple jobs, and new jobs can come and go.
Instead, we can have some kind of coordinator between the jobs and Scylla, which is basically a proxy. It’s a tricky thing, though, because Scylla and the Scylla clients are really super optimized. If we want the same efficiency for the reads, this proxy should be optimized at least as well as Scylla itself, which is rather tricky. Overall, this solution is complex. It also has extra cost, because we need this component, which needs to be scaled appropriately. And it’s likely not efficient, because it’s not easy to write this proxy in a way that’s as efficient as Scylla itself.
The second option is data centers. Scylla, the same as Cassandra, has a data center abstraction. It’s a purely logical abstraction; it doesn’t need to be a real, physical data center. Basically, we can split our cluster into two logical data centers. Jobs will hit the data center for jobs, and the feature store will hit the data center for the feature store. It’s pretty nice, because it gives really great isolation, and we can also independently scale these data centers. The downside is that cluster management for Scylla becomes much more complex. It also comes with cost, because even though we can independently scale these data centers, it still means extra capacity. And the complexity of cluster management shouldn’t be underestimated, especially if you want real data centers. If, for instance, we wanted our database to be in two real data centers, now it’s in four. The complexity increases quite a lot.
The third option, the one that we ended up with, is so-called workload prioritization. There is a feature in Scylla called workload prioritization: we can define multiple service levels inside Scylla and attach different workloads to different service levels. How does it work? Access to any resource in Scylla goes through a queue of operations. For instance, we have a job queue and a service queue: the job has 200 shares and the service has 1,000 shares. What does that mean? It means that for any unit of work for the job queries, Scylla will perform up to five units of work for service queries. There is a scheduler that picks from these queues and forms the final queue. What does that mean? It means that in the end we will have consistent latency. Of course, job latency will be higher than service latency. This is fine, because the job is a background thing and doesn’t care about latency that much; it cares about throughput.
The service, on the other hand, cares about latency. Summarizing this final solution: basically, we don’t do anything with the cluster, with Scylla itself. We just set up these different service levels. From the jobs, we access the cluster using a user attached to the job service level. From the feature service, we access it using a user attached to the service’s service level. This is super simple. No operational overhead. Best cost, because we don’t need to do anything with the cluster. The only downside is that the jobs and the service connect to the same nodes in Scylla, so theoretically they are not entirely isolated. It’s a nice tradeoff between cost and safety.
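As a rough sketch of how this looks from the application side: the jobs and the feature service simply authenticate as different users, and the operator attaches each user’s role to its own service level. The host, role names, and the admin CQL in the comment are illustrative; check ScyllaDB’s workload prioritization documentation for the exact statements:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import java.net.InetSocketAddress;

// Sketch: the only thing the applications do differently is authenticate as
// different users, each attached (by the operator) to its own service level.
// Admin-side CQL, roughly (verify exact syntax against the Scylla docs):
//
//   CREATE SERVICE_LEVEL feature_jobs    WITH shares = 200;
//   CREATE SERVICE_LEVEL feature_service WITH shares = 1000;
//   ATTACH SERVICE_LEVEL feature_jobs    TO job_role;
//   ATTACH SERVICE_LEVEL feature_service TO service_role;
public final class ScyllaSessions {

    static CqlSession jobSession() {
        return CqlSession.builder()
                .addContactPoint(new InetSocketAddress("scylla.internal", 9042))
                .withLocalDatacenter("dc1")
                .withAuthCredentials("job_role", "secret")       // low-priority workload
                .build();
    }

    static CqlSession serviceSession() {
        return CqlSession.builder()
                .addContactPoint(new InetSocketAddress("scylla.internal", 9042))
                .withLocalDatacenter("dc1")
                .withAuthCredentials("service_role", "secret")   // latency-sensitive workload
                .build();
    }
}
```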
Story 5 (Being Lazy)
The final story is about being lazy. Now we can launch the jobs, they are running, and they don’t impact our main service. It’s time to think about the data model in the database to serve the queries. We need to be able to query these tiles to aggregate window counters. The natural data model is like this. Scylla has the notion of partitions. A partition is basically a set of rows ordered by some keys. We can map each entity ID to a partition. Inside the partition, we can store each feature in its own row. Basically, a feature will be identified by the timestamp of the tile and the feature name. This schema is nice because we don’t need to modify the schema when we add a feature. It’s schemaless in the sense that we can add as many feature types as we want, and the schema survives. However, if you do some math: we have 8,000 feed requests per second. On average, we rank around 2,000 candidates per request. For each candidate, we need to query 100 features.
For each of these features, we need to query around 20 tiles. Also, assume that our feature service has some local cache with an 80% hit rate. Multiplying all of this, we get more than 7 billion rows per second. This is the load that our database needs to handle in order to satisfy these requests. This is totally fine for Scylla; it can scale to this number. The problem is that it will use a lot of compute. Of course, our cloud provider, GCP, and Scylla itself will be happy to scale. Our financial officer might not be that happy. What can we do? Instead of storing each feature in an individual row, we can compact multiple features into the same row, like this, so that rows are now identified only by the tile timestamp. The row value is basically bytes, some serialized list of pairs: feature name to feature value, feature name to feature value. This is nice, because now we don’t have this 100x multiplier anymore; we will query 100 times fewer rows, and this number of rows per second looks much better.
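As a back-of-the-envelope check of that math, using the rounded figures from the talk (the exact result depends on the real tile count and cache hit rate):

```java
// Rough load estimate with the rounded figures mentioned in the talk.
public final class RowLoadEstimate {
    public static void main(String[] args) {
        double feedRequestsPerSec   = 8_000;
        double candidatesPerRequest = 2_000;
        double featuresPerCandidate = 100;
        double tilesPerWindow       = 20;     // rough number of tiles per feature query
        double cacheMissRate        = 0.20;   // assuming an 80% local cache hit rate

        double perFeatureRows = feedRequestsPerSec * candidatesPerRequest
                * featuresPerCandidate * tilesPerWindow * cacheMissRate;
        double compactedRows = perFeatureRows / featuresPerCandidate;

        System.out.printf("one row per feature: %.1e rows/s%n", perFeatureRows); // on the order of billions
        System.out.printf("one row per tile:    %.1e rows/s%n", compactedRows);  // ~100x fewer
    }
}
```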
The question may arise whether this is really a cost saving, because it could be that we just shifted the compute from the database layer to the job. We used to have this nice schema where we updated each feature independently, and it was nice. Now we have these combined features. In protobuf, it can be expressed in a message like this: we have FeaturesCombined with a map from string to some feature value. What does it mean? It means that whenever a feature is updated, we need to serialize all of them together, every single time. It may look like the cost of updating a single feature is now 100 times bigger. It’s a fair question. There are fairly easy steps to mitigate it. The first is a no-brainer: we don’t need to store strings, of course. We can always store some kind of identifier for the feature; we can always have a dictionary mapping feature names to IDs. Now we need to serialize a mapping of int to feature value, which, of course, is much easier for protobuf.
The second observation is that the protobuf format is pretty nice, in the sense that this map is actually equivalent to just a repeated set of key-value pairs. Basically, we have two messages. One is FeaturesCombined, and another is FeaturesCombinedCompatible, which is just a repeated MapFieldEntry. We can serialize FeaturesCombined and deserialize FeaturesCombinedCompatible, and vice versa; they’re equivalent in the bytes that get produced. Moreover, they’re actually equivalent to just a repeated bytes field, so basically an array of byte arrays. All three messages, FeaturesCombined, FeaturesCombinedCompatible, and FeaturesCombinedLazy, are equivalent in the bytes that protobuf produces. How does this help? It helps because in the Flink state we can store a map from feature ID to the bytes of the serialized MapFieldEntry. When we need to serialize all of these features, we just combine these bytes together, get this array of arrays, form the FeaturesCombinedLazy message, and serialize it with protobuf.
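A minimal sketch of the lazy combine step, assuming the three messages above all use field number 1 for their map/repeated field: writing each pre-serialized MapFieldEntry payload as a length-delimited field under that number yields exactly the bytes that serializing FeaturesCombined would produce.

```java
import com.google.protobuf.ByteString;
import com.google.protobuf.CodedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

// Sketch of the "lazy" serialization: each value in the Flink state is an
// already-serialized MapFieldEntry payload; combining them is just writing
// each one as a length-delimited field under the same field number (here, 1),
// with no per-feature re-encoding.
public final class LazyFeatureSerializer {

    static byte[] combine(List<ByteString> serializedEntries) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        CodedOutputStream cos = CodedOutputStream.newInstance(out);
        for (ByteString entry : serializedEntries) {
            cos.writeBytes(1, entry);   // tag + length + raw bytes
        }
        cos.flush();
        return out.toByteArray();       // parseable as FeaturesCombined / ...Compatible / ...Lazy
    }
}
```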
This protobuf serialization is super easy, because protobuf itself will just write these bytes one after another. This serialization is extremely cheap, much cheaper than serializing the original message. In fact, when we implemented this, we didn’t need to scale the job at all. In comparison to the other things the job is doing, this step is basically negligible.
Summary
Assuming that you’ve decided to build a feature platform: first, good luck. It’s going to be an interesting journey. Second, take a look at ScyllaDB and Redpanda. The odds are that they may impress you and become your friends. Third, Flink is still the king of real-time stream processing, but it takes time to learn it and to use it in the most efficient way. The fourth thought: there are multiple vendors now who build SQL-only stream processing. In my opinion, SQL is not enough. I don’t understand how we could have built what we built without the power of the Flink DataStream API. It’s probably possible via some user-defined functions or something, but it would likely look much uglier and be harder to maintain.
In my opinion, Flink’s DataStream API is really powerful. Finally, the lazy protobuf trick is a nice trick that can be used pretty much anywhere. For instance, in addition to this place, we also use it in our service to cache gRPC messages. Basically, we have a gRPC server, and there is a cache in front of it. We can store serialized data in the cache, and when we need to respond to a gRPC request, we just send these bytes over the wire without needing to deserialize the message and serialize it back.
Questions and Answers
Participant 1: You’re looking into C++ variants of various classically Java tools. Have you looked into Ververica’s new C++ version of Flink yet? Partially backed by a team at Alibaba, Ververica is launching a C++ rewrite of the core Flink engine, called VERA. It should have similar properties to Scylla’s rewrite of Cassandra. Have you looked into using that as another way to get more performance out of Flink?
Burmistrov: The question is about a solution which is a Flink rewrite in C++, pretty much like what happened with Scylla rewriting Cassandra from Java to C++, and Redpanda doing the same with Kafka. In fact, I don’t know about that C++ solution, but there are other solutions that claim to be Flink competitors and are written in Rust. The downside of all the ones I know is that they are SQL-only, which is harder. We took a look at several of them. Performance may indeed be good, but we didn’t manage to figure out how to adapt them to the same level of what we can do with Flink.
Participant 2: Can you comment a little bit on the developer experience of adding new features, and how often does that happen? You said you have 100 features. How often does that grow, and what’s the developer experience of that?
Burmistrov: What is the developer experience of adding features, and what does it look like in our solution? It’s far from great. For each feature, we have some kind of configuration file. We split features into so-called feature sets. Feature sets are logically grouped features: for instance, we have one model, and the user features for this model, or the post features for this model. They are somehow logically grouped. This configuration file is basically YAML that contains some settings and also a query. The query is Flink SQL, but it is simple, so it’s equivalent in any SQL dialect. This query is basically a select.
Then people can do something with the select, like transform something, whatever. They define the query, and then they can basically launch the job. There is a deployment pipeline: they can push the button, and the job gets upgraded. That’s the flow to define features. There is also the process of how a feature becomes available for training. We still use the so-called wait-and-log approach. Basically, throughout the lifetime of accessing the feature, we log the values, and using this log, the model gets trained. So there is a process: when we add features, we start accessing them for logging, and then, after enough time has passed, the model can start being trained on this data.
Participant 3: Can you maybe elaborate on why you chose Redpanda over Kafka?
Burmistrov: Why did we choose Redpanda over Kafka? The answer is cost. First of all, we wanted a managed solution, because we didn’t want to manage it ourselves; we didn’t have people to manage Kafka. We had the experience of managing Kafka in-house, and we just didn’t have a team to continue doing it. Then we started evaluating the solutions, Confluent and other vendors, and compared prices. Redpanda was the winner on cost. There were also a few other considerations, like remote read replicas, and they’re moving towards being a data lakehouse. Every vendor is actually moving in that direction. We just liked their vision.
Participant 4: The DataStream API for Flink is very similar to Spark Structured Streaming, in terms of the ability to do upserts on the tables. If jobs fail, we can use the checkpoints to restart them. What about disaster recovery strategies if the data is lost? What have you thought about in terms of data backup? The trouble then becomes that the checkpoints are not portable to another location, because of issues with hardcoding of the table IDs and things like that. Have you thought about that?
Burmistrov: What are the best practices for using Flink, or Spark Streaming, which is equivalent, in terms of disaster recovery, like if a job died or something happened? First of all, we have checkpoints enabled. They’re taken pretty regularly, like once per minute. Checkpoints get uploaded to cloud storage, the S3 equivalent in GCP. We have a history of checkpoints. We also take savepoints once in a while. So we have all this stuff ready for a job to recover from.
Sometimes, to be fair, with Flink at least, the state can actually get corrupted. The state can get corrupted in such a nasty way that all the checkpoints we store, let’s say the last 5 or 10 checkpoints, can all be corrupted, because the corruption can propagate from checkpoint to checkpoint. Now we have a job with unrecoverable state, what do we do? The good thing about the approach I described is that we don’t do anything, we just start the job from scratch, because it will recover from the main database by itself. There will probably be a little bit of incorrectness due to the last minute or so of data, but in general, it can get back to a running state pretty quickly.
Participant 5: I have a follow-up question on the developer experience. I can imagine that when you’re trying to understand things, especially the part where you talk about the Changelog and the Flink operators, as a developer, I would love to be able to interact with the state. I know that the queryable state feature in Flink was deprecated. I don’t know whether you were able to figure out a different way to see what’s in the state and help your ability to create new features and such.
Burmistrov: What’s the developer experience when it comes to figuring out what’s going on inside the job? Basically, we have two states: the SQL state and then this Changelog aggregation state. What do we do? It’s in the works now for us. We didn’t have this for a while and relied basically on the feature log further down the line. When a feature is computed and accessed, we log it, and we can basically compute a match rate over the raw data plus the feature log. Of course, it’s pretty tricky. We actually want to dump these intermediate Changelog operations to some OLAP database like ClickHouse or similar. That way, we will have the full history of what happened and the ability to query and see. It’s not ready yet, so we’re working on it too.