Transcript
Olimpiu Pop: Hello everyone. I’m Olimpiu Pop, a veteran editor for InfoQ, writing mainly for the Java Queue and occasionally on topics related to ML, data and other general technologies. Today, we have Felix, a principal staff engineer who solves global-scale problems for LinkedIn. Thank you for accepting our invitation, Felix. Please introduce yourself and introduce us to the problems you’re targeting these days.
Felix GV: Thanks for having me, Olimpiu. So, like you said, my name is Felix. I’ve been working at LinkedIn for a bit more than a decade now, as part of the Voldemort team, which transformed into what is now the Venice team. Venice is the system that I worked on. It’s a database that we built in-house and open-sourced in 2022. It is basically a derived database, meaning that we store data that’s been generated from other data. In particular, AI feature data sets are quite popular to store in there, so there are a lot of AI inference workloads using Venice. It’s been very exciting. The scope and scale of the system have kept growing over the years, so I’m looking forward to chatting about all of that.
Why are recommender systems good candidates for derived data storage solutions [01:40]
Olimpiu Pop: So where would you use VeniceDB? What is the problem that it actually solves?
Felix GV: One way to think about it is that you could split the ecosystem of databases along one dimension, which is: is it storing primary data or derived data? That’s one way to divide and conquer the space. Primary data is not what Venice focuses on, but just to put it in context so folks can understand: primary data is also called source-of-truth data. This is the type of data that you would store in something like Postgres, MySQL or Oracle. It doesn’t have to be in a relational database; you might also store primary data in something like HBase, Cassandra, DynamoDB, CosmosDB, etc. The point is that primary data typically is human-generated. These are things like messaging data between users of a communication platform, or maybe the profile data that users write on a social network site, or the posts that they write on something like Medium or Substack. All of those are human-generated primary data.
Now, let’s contrast that with derived data. Derived data, as opposed to primary data, is machine-generated. It could be that it’s a denormalisation of the primary data because we want to optimize for a different access pattern. For example, maybe it’s the same data conceptually, but it’s organized in a different way: the primary data is stored in a row-oriented format that makes it very fast for OLTP workloads, meaning transactional workloads, whereas the denormalised version of that might be in a column-oriented format for OLAP workloads, meaning analytical processing workloads. That’s one form of derivation, where the shape of the data changes. There are other forms of derivation where the content of the data itself changes. For example, the use case that Venice focuses on, not exclusively, but a popular use case within Venice, is ML feature data.
So this is data where you take a bunch of inputs. For example, at LinkedIn, an example use case is People You May Know, where the inputs are a variety of other data sets: the connections between users, who’s messaging whom, who is liking whose posts, the industries they’re a part of, the schools they went to. All of these things end up being analyzed, so to speak, or crunched through machine learning jobs that end up spitting out some opaque score. Maybe you can call it an embedding, or just some floating point values; that is the output. And these machine learning-oriented outputs are, in terms of content, completely different from the inputs. It’s unrecognizable for a human looking at this to see the relation, but of course they are related in the way that they’ve been generated.
Now this is another type of derived data, this ML feature data, and it can be loaded into a system like Venice for the developers of the “people you may know” service (which we call PYMK internally; I may be using one or the other interchangeably). The developers of PYMK will then use Venice to load all this feature data and be able to query it at very low latency from their online application. So when you go to the website or the mobile application and you go to your network page, you’ll see some recommendations of people you might know, and this is essentially what we call a recommender system. So that’s one of the very common use cases hosted in Venice.
Olimpiu Pop: Okay, thank you for the explanation. So let me see if I got it right. A primary database holds the things that you usually put in with your own hands. So it would probably be my profile, the jobs that I had and the other stuff that I’m putting into my profile, if we’re talking about LinkedIn. So if I’m an engineer defining a database, those are the tables that I just fill in, right?
Felix GV: Yes.
The source of data shouldn’t dictate the choice of the database. At a small scale, any database works [06:57]
Olimpiu Pop: Whether that’s an SQL-based, relational database or not. And the derived data is data generated mainly by machines, either optimizing in some way or rearranging things based on the data introduced in the primary database?
Felix GV: Yes.
Olimpiu Pop: So that it allows the system to function within given parameters, right?
Felix GV: Right. And by the way, where the data comes from, whether it’s human-generated or machine-generated, does not necessarily dictate which database technology you pick. For example, if you’re using Postgres as your primary database, you could have a job that siphons off the data from Postgres, runs a machine learning job on it, producing a bunch of embeddings, and then pumps those embeddings back into Postgres. Nothing prevents you from doing that; you could use the same database for your primary data and your derived data. At a small scale, anything works. So if you want, we can get into: what is the point of having a specialized system for hosting derived data? Why not just use Postgres for everything?
Olimpiu Pop: That was my next question.
Felix GV: Oh, sorry to preempt that.
Why was VeniceDB necessary in the already crowded database space? [08:21]
Olimpiu Pop: Yes. Well, it’s good to have a flow in that direction, but the database space is quite busy. There are a lot of databases covering all different aspects: relational databases, non-relational databases, key-value stores, you name it, from all different perspectives. So why a new database? Why VeniceDB?
Felix GV: Right. So by the way, I go into more detail about all of that in a QCon London talk that I gave, called “What is Derived Data, And Do You Have Any?”, so you can go check that out for a more in-depth explanation. But essentially what I would say is, as I alluded to earlier, at small scale anything works, and there may be benefits to reducing the overall complexity of your system by having fewer technologies in it. So if you can get everything working with just Postgres, by all means use only Postgres. That’s a totally sensible technical decision. At some point, though, you may start getting operational pressure from certain issues.
For example, one of the ways that primary data and derived data differ (that’s not a hard rule, by the way, but a tendency that we’ve seen over and over) is that derived data tends to have a very high volume of writes, or you could call it refreshes. Primary data is oftentimes heavily skewed towards reads. So that means you have, let’s say, a 10% ratio of writes to a 90% ratio of reads, or maybe even more skewed, like 5/95 or even 1/99 in terms of writes versus reads, in a primary database.
For the derived data sets, it is often the case that the write rate is very high, because let’s say you want to regenerate new ML recommendations every day, or maybe even several times a day, or maybe you’re doing it in a streaming fashion and it has a high amplification rate. Let’s say two persons connect: they add each other as contacts or as friends on a social network. Well, now there’s a downstream side effect, which is that a bunch of new people have become second-degree connections. And so, for just the two persons that connected, there is a side effect that hundreds or thousands of other people are now more likely to be connected to one another, because they’ve become second-degree connections whereas before they were third-degree connections.
So all of these factors result in the write rate being potentially much higher for those derived data sets. And so then you get to a point where you ask: does it make sense for me to have this Postgres database, which is mission-critical because it’s serving all of the requests for loading the profile, loading the user’s messages, all that stuff, get completely slammed once a day because a table needs to be rewritten entirely?
So then at that point, maybe you get into a situation where you say, “Well, geez, that’s untenable. I’m going to split those databases into separate instances”. So I’m going to have one Postgres instance where I host only the primary data. That one is more typical in the sense that it has, let’s say, 10% writes and 90% reads, and I optimize it for that. And then I have a second Postgres instance, which perhaps is connected to the first one via change capture and updated asynchronously, or maybe not, and it also gets these batch jobs that slam it once a day and regenerate entire tables, and I might apply different tunings there.
So already you have, you could say, a Balkanization of your infrastructure. You are isolating pieces that don’t play well together, that cause interference or noisy neighbor issues with one another. You’re separating them, but you’re still using the same technology in both silos; you’re still using Postgres on both sides. But then at some point you may get to a situation where you say, “Well, look, there’s a bunch of functionality that Postgres gives me that I’m not using anyway. Postgres is very general purpose, it does everything basically, but there’s a bunch of stuff I’m not using, and I still have some scalability issues when I need to rewrite these entire tables, and so on. Is there something better that I can do?”
Then at that point, you can start asking yourself: maybe there’s a specialized database technology that I can use instead, one that is very optimized at bulk loading batch data, optimized at supporting a high write rate, and still gives me very low latency, perhaps even better latency than what Postgres would give you. But in exchange, of course, nothing is free in this world: that system is much more specialized, so it doesn’t have all the bells and whistles that Postgres gives you. It’s more focused. So that’s kind of a natural progression as a company scales its data needs: it will go from a homogeneous stack, where it has fewer technologies and less complexity in terms of variety, towards a more varied, diverse stack where it has a bunch of different technologies and each of them is specialized for a given workload.
Olimpiu Pop: To conclude, Venice DB is a fairly niche database mainly targeted at derived data, with an emphasis on handling many writes, more writes than reads, more or less, right?
Felix GV: Yes, it certainly supports a very high write rate. It also supports high read rates. It supports both.
An aging product can be replaced more quickly than a modern one [15:22]
Olimpiu Pop: Okay, so it’s about the trade-offs: depending on the scenario, you can configure it based on your needs, focusing mainly on that. But still, on the question of why a new database: did it feel better to build it from scratch for the tailor-made problem you were trying to solve, or was there another reason as well?
Felix GV: As I alluded to, when I joined LinkedIn, I was in the Voldemort team, and Voldemort was responsible for that type of workload. It was also responsible for other workloads that ended up getting migrated to other non-Voldemort systems. But if we zoom in on just the ML, recommender-system kind of use cases that we’ve been talking about so far, those were served by a flavor of Voldemort that we called Voldemort Read-Only. It allowed swapping the whole data set in a batch fashion periodically, maybe once or a few times a day. That worked fine for many years. Voldemort had a run of about 10 years at LinkedIn, so it was very successful in that sense, but it also had a bit of an aging architecture.
So we had the choice, essentially: are we going to modernize Voldemort’s architecture to solve these pain points that we were facing, or are we going to replace it? And if we replace it, is there something off the shelf that we can pick that’s already been built by someone else, or do we need to build it from scratch? So that was the improve-versus-build-versus-buy decision. And the architectural limitations of Voldemort were thought to be sufficiently large that it was not that attractive to try to upgrade it in place, for various reasons.
Those limitations included data placement across partitions and replicas being very static. It was some XML file somewhere that a human had to go alter to manually shift data around when needed. It was kind of tedious in that way, whereas at that point we had new technologies, new building blocks that we could leverage to make it more dynamic, so that as machine failures happen, the data rebalances automatically, all that stuff. And there were other operational pain points of that kind that we wanted to improve.
In terms of the build versus buy: buy here doesn’t necessarily mean money, it could also mean open source that somebody else developed, so let’s say build versus adopt. There weren’t really any other significant contenders in that space at the time that we did this evaluation. This was circa 2014, when we started seriously looking into replacing Voldemort Read-Only with something else.
At the time, Twitter had something similar in design called SEA DB (C-S-E-A), but that was a proprietary system. They had talked about it in some blog posts, but that’s about it; we couldn’t use it. Then there were things like HBase, which had an HFile side-loading mechanism that could have fit in that space, but it was not very mature. I spoke with HBase committers and users that I knew personally at the time, and it was like, “Well, okay, it works, but there’s still a lot of scaffolding that you need to set up manually to prop it up”. It could have worked, but it wasn’t as attractive as we had hoped. And then I think there were probably some other proprietary solutions in other companies, but in terms of something adoptable off the shelf in the public space, either open source or commercial, there wasn’t really anything.
So Voldemort was open source, but nothing else was. So that’s kind of the thought process. By the way, we did reuse certain little parts of Voldemort, we didn’t throw it away entirely, but we did rewrite a large chunk of it.
Venice DB is a data storage system constructed on the shoulders of “giants” like Kafka, Apache Helix and RocksDB [19:40]
Olimpiu Pop: How do you build a database from scratch? Because usually when you’re talking about architecture, you’re talking a lot about loose coupling, but on the other side, when you’re talking about efficiency and performance, you’re talking about taking some decisions, some trade-offs, to make sure that you have the best outcome in terms of performance. What are those decisions? How do you handle it now that it’s a decade old and already in production? What are the decisions that you took initially that ensure the proper operation of the database now?
Felix GV: Well, we certainly did not build a database from scratch. We stand on the shoulders of giants, and I want to give credit where it’s due. We rely on some very critical dependencies. Those include Kafka, which is our data propagation layer and our write path; all the data comes in first through Kafka in the Venice architecture. We also use Apache Helix for cluster management. That’s what I alluded to earlier, where we wanted to get rid of Voldemort’s static partition placement configurations and instead make it dynamic; that’s handled via Apache Helix. By the way, Kafka and Helix both come out of LinkedIn. Then we’re also using other things like Netty, and nowadays we’re using RocksDB. We didn’t use it at first, but nowadays we’re using RocksDB, and there are a bunch of other dependencies.
So we didn’t build the whole database from scratch, but we used, I would say, some very solid industry-provided building blocks, and then we composed them together in new ways; we didn’t see anything similar that existed at the time. And maybe even now, Venice is still a fairly unique system. I think there are others in the derived data space, but it’s difficult to find a comparison point in terms of what exactly we provide with Venice. It’s a little bit unique in its own niche.
Loose coupling needs to be used with purpose wherever it’s needed [21:56]
Olimpiu Pop: Let me summarise that. You glued things together, so it looks more like tailor-made software development than building a database per se. So, VeniceDB is a system that stores derived data and puts together multiple building blocks already well-known in the open source space, such as Kafka for ingestion; you have Netty on the communication and transport side, and at the core of it you have RocksDB as the database engine. If I remember one of our previous conversations correctly, you designed VeniceDB to be able to swap the engine, but that wasn’t initially needed because the original engine served the purpose; that changed in the meantime. Can you walk us through that as well? Besides RocksDB, you’re in the process of adding another engine that serves another purpose. So walk us through that process.
Felix GV: The storage engine part of Venice, I think, is interesting. We did want the storage engine to be pluggable from the get-go, and as I hinted at, we did not start off with RocksDB. We actually started off with the storage engine that Voldemort was using, which is called BDB JE, the Berkeley DB Java Edition. It’s basically a B-tree held inside the Java heap. It provides pretty good read performance and okay write performance. That’s what we started with, and Venice was successful for a few years leveraging that building block. But the storage engine was hidden behind an abstraction, and at some point we had increasing scalability challenges in terms of tuning the garbage collection and all that stuff.
And so we had all this data sitting on the Java heap, which is not the ideal use case for the JVM. The JVM works really great for short-lived data, and long-lasting data can be fine, but the real killer is intermediate-lifespan data: data that lives long enough to graduate through the generations of the generational garbage collector, but eventually still needs to be collected out of the older generations. That’s a bad spot for the JVM, at least in those years. Everything is continuously improving in the Java space, so maybe it’s better nowadays, but in those years it was a bit painful, and so we had long garbage collection pauses and so on. So we looked at alternatives, and RocksDB was a promising one at that time. And so we did exercise our ability to use this storage engine abstraction: we plugged in a new one and essentially performed a whole migration.
At that point, we probably had fewer than a thousand different data sets. And we essentially migrated them invisibly; the users never knew about it. It was handled completely on the operator side, which in my opinion is the way technical migrations should ideally be done: with as little burden as possible on the user, where the infrastructure is able to just uplift itself in an invisible way. So that’s what we did. We swapped the BDB JE engine in favor of RocksDB, and that improved our performance quite a bit. So that’s one thing we did, essentially by leveraging decoupling in the right places.
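A minimal, hypothetical sketch of what such a pluggable storage-engine seam could look like (this is not Venice’s actual interface; the names are invented for illustration). The idea is that ingestion and read paths depend only on this contract, so an engine like BDB JE can be swapped for RocksDB behind it:

```java
// Hypothetical storage-engine abstraction, for illustration only.
// The rest of the system codes against this contract rather than
// against a concrete engine, which is what makes an operator-side,
// invisible engine migration possible.
public interface StorageEngine extends AutoCloseable {
  /** Write a key-value pair into the given partition. */
  void put(int partition, byte[] key, byte[] value);

  /** Read a value by key from the given partition; null if absent. */
  byte[] get(int partition, byte[] key);

  /** Delete a key from the given partition. */
  void delete(int partition, byte[] key);

  /** Release file handles and native resources. */
  @Override
  void close();
}
```

With a seam like this, a migration can proceed engine by engine and partition by partition, which is consistent with the invisible BDB-JE-to-RocksDB migration described above.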
And then more recently we have started plugging in new types of engines again. We had a pretty interesting hack week going on last month, where I teamed up with one of my colleagues named Kuros, and we decided to see if we could plug in an SQL engine instead of a key-value store. And so we chose DuckDB, which is a great open source project that I encourage everybody to check out. It’s super powerful and specialised in analytical workloads. We plugged it in, and now there is an option to load Venice data directly into DuckDB. Then you can have a populated DuckDB instance in your application that’s been fed off of Venice data, and then run any SQL queries on that.
So that expands the scope of capabilities that Venice offers; it’s no longer only what RocksDB provides, which is essentially a key-value access pattern. Now there is also this alternative mechanism for accessing Venice data via SQL. It’s still pretty early days on that front. There’s a new module you can use to access that capability. It’s still pretty cutting edge, but we’re hoping to polish it and stabilize it over time.
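As a flavor of what querying such a locally populated DuckDB instance might look like, here is a minimal sketch using DuckDB’s standard JDBC driver. The Venice-side loading step is not shown, and the file path and `member_features` table name are assumptions for illustration; only the DuckDB JDBC usage itself is standard:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuckDbQuerySketch {
  public static void main(String[] args) throws Exception {
    // Assume a client has already materialized Venice data into this
    // local DuckDB file; the path and table below are hypothetical.
    try (Connection conn =
             DriverManager.getConnection("jdbc:duckdb:/tmp/venice-data.duckdb");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT member_id, score FROM member_features "
                 + "ORDER BY score DESC LIMIT 10")) {
      while (rs.next()) {
        // Print the top-scoring rows from the locally loaded data set.
        System.out.printf("%d -> %.4f%n",
            rs.getLong("member_id"), rs.getDouble("score"));
      }
    }
  }
}
```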
Olimpiu Pop: Okay. So initially, and in what you’re currently using in production, VeniceDB is a key-value store, and that’s provided through RocksDB. It has SQL abilities through DuckDB, but that’s only just the beginning; it’s just the product of a hackathon, right?
Felix GV: Yes.
How to glue together a data storage engine with multiple open-source pieces in Java [27:40]
Olimpiu Pop: Okay, thank you. This leads me to the next question. RocksDB is written mainly in C++, while the primary language for Venice is Java. How do you interface them? Well, it’s already a decade old, so is it JNI?
Felix GV: Yes, that’s right. We use the RocksJava dependency, which is a JNI wrapper on top of RocksDB, a very solid open source library, and it works really great for us. JNI of course has a cost, but it’s very minimal in our experience. So yes, it’s been working great, and it allows us to essentially use each language where it shines the most. C++ of course is more complex to get right; it can have memory leaks or segfaults and so on if you don’t use it correctly, whereas Java is extremely safe. The garbage collector has got your back... well, not fully. You can still have memory leaks in Java too if you don’t do the right things, but it’s overall much more robust, much easier, very high productivity.
So Java is great, I think, for building distributed systems, which are already difficult enough to get right. Handling all of the replication, fault tolerance, failovers, all that stuff, handling all of that reliably is difficult enough without adding the complexity of an unmanaged-memory language. But for the parts where it matters the most, essentially managing the state of the database, we do lean on these very robust C++ dependencies such as RocksDB and now DuckDB; both of these are C++ libraries that we access via JNI. So I think it’s the best of both worlds, basically.
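For readers who haven’t used RocksDB from Java, here is a minimal sketch of the RocksJava API referenced above; the JNI boundary is entirely hidden behind ordinary Java calls (the database path is arbitrary):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksJavaSketch {
  static {
    // Loads the bundled native RocksDB library (the C++ side of the bridge).
    RocksDB.loadLibrary();
  }

  public static void main(String[] args) throws RocksDBException {
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/rocksjava-demo")) {
      db.put("hello".getBytes(), "world".getBytes()); // crosses into C++ via JNI
      byte[] value = db.get("hello".getBytes());
      System.out.println(new String(value)); // prints "world"
    }
  }
}
```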
Olimpiu Pop: That sounds quite right. I mean, Java is quite readable and fast to move with, regardless of what people in the space are saying now about Java being old and slow. In my opinion, it’s still a very powerful tool, especially for systems that need to be reliable, and that’s pretty much what I’m hearing from you as well. But still, I have to ask: as you mentioned, the system was built over the last decade, and Java has picked up the pace lately, with two releases per year and an LTS every fourth version. What Java versions are you currently supporting? What’s your preferred version? What’s the pick that you use?
Felix GV: So for now, I will say, unfortunately, we still have some internal stragglers at LinkedIn who are still on Java 8. There are very few of them, and we’re trying to squash them away, but... So, we are still targeting support for Java 8. We support Java 8, 11 and 17, and we run all our unit tests on all three of those.
JDK Versions Supported by Venice DB [30:55]
Venice provides three services that we, as Venice operators, own: the controller, the router and the server. In practice we run those services all on Java 17, which we’ve found gives us the best performance. But the client libraries, of which there’s a variety that we offer in the Venice ecosystem for different needs, are the ones where it’s important for us to keep supporting Java 8, at least for the time being. Honestly, I would love to drop support for Java 8. I’m really tired of being handcuffed to the language constraints of Java 8; I would love to start using some new language features. But for now we target support for those three: 8, 11 and 17. Hopefully we will add support for 21 soon. There are some very minor things getting in the way of that, but they should be solvable. So yes, that’s kind of the picture for the versions we currently support.
Olimpiu Pop: Sounds about right. I mean, pretty much everybody still has a small drag with Java 8; most people are still using it in a couple of places. How about Project Panama? JNI did its job, and it’s very good at what it does, but as you said, it has some trade-offs. Project Panama promises to take away some of the headaches that came with the original native interface. What are your plans in that direction?
Felix GV: That’s a good question. For now, we’re not really considering doing anything in that space, given that we still need to support those older JVMs. Once that gets unblocked, we’ll certainly be interested in leveraging those newer functionalities. But for now, it’s kind of a back-burner issue. We’re not really looking at it seriously yet.
Olimpiu Pop: Okay. So that translates to: the performance is good enough for now, you have to support the versions that you currently have, but you have a place where you can improve in the future.
Felix GV: Yes.
Optimise the hot paths of the Java application [33:19]
Olimpiu Pop: And how has running Java in production been? I mean, as you mentioned, it’s definitely quite simple to write at the language level, but in my opinion the platform itself is quite hard to manage; or at least, if you want to optimize it and run it at the optimal level, you need to know how to do it and where to look. How did you handle optimizations at the heap level, at the platform level? I know that you had some discussions online about the heap space as well. Can you share more in-depth about that? What were your learnings?
Felix GV: I would say the Venice code base, which is in Java, as we’ve been saying, actually has a fair amount of variety in how optimized it is. There are some code paths where performance doesn’t matter at all, and it’s really all about readability and maintainability; those are the only things we care about there. And then there are other paths which are hot paths, where we care very much about performance, and those can end up looking a little different.
So for the hot paths, we use a lot of tricks to make the Java code as efficient and performant as possible. We are careful not to generate too much garbage. We do object reuse where it makes sense. We use primitive types instead of objects where it makes sense. We use things like fastutil, for example, which is a library that gives you alternatives to the standard collections for when you want to store primitive types.
So for example, with an ArrayList you can have an ArrayList of Integer, but that’s the object type Integer, so there will be boxing around each item in the list. You could instead have an array of int, which is the primitive type, so that’s smaller. But of course the ArrayList is variable-size and the array is fixed-size, so how do you bridge the gap so that you can have a primitive equivalent of the ArrayList? These are all tricks that we use to minimize the memory footprint and so on. There are a lot of things we do in that space to try to reduce allocations and reduce garbage, and ultimately our goal in doing that is that we care about tail latency. We don’t care only about average latency; we care about the long tail, the P99, the P99.9, essentially the worst of the worst latencies that the system provides. Those end up getting affected by garbage collection pauses, so we do care about minimizing garbage in some parts of the system, not all of them.
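As a concrete illustration of the boxing trade-off described here, a minimal sketch comparing the standard boxed ArrayList with fastutil’s IntArrayList, which stores elements in a flat int[] while still growing dynamically:

```java
import it.unimi.dsi.fastutil.ints.IntArrayList;
import it.unimi.dsi.fastutil.ints.IntList;

import java.util.ArrayList;
import java.util.List;

public class BoxingComparison {
  public static void main(String[] args) {
    // Boxed: each element is a heap-allocated Integer object
    // (object header + value), referenced via a pointer in the backing array.
    List<Integer> boxed = new ArrayList<>();
    boxed.add(42); // autoboxes to an Integer

    // Primitive: elements live in a flat int[] backing array,
    // no per-element object and no boxing, yet still variable-size.
    IntList primitive = new IntArrayList();
    primitive.add(42);               // overload taking a plain int
    int value = primitive.getInt(0); // primitive read, no unboxing
    System.out.println(value);
  }
}
```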
Looking at the Java Heap [36:22]
Olimpiu Pop: Can you share how you managed to look more into this space? I know that there are off-the-shelf tools for this, but I also know that you built something yourself. Can you share more about that experiment?
Felix GV: Recently I shared online some work we did in terms of Java heap measurement. The use case, essentially, is that in various parts of the system we have buffers. So some pieces of data, let’s say, get consumed from Kafka; those come in mini-batches and land in some buffer, and then there’s another thread or thread pool somewhere that dequeues items from that buffer and processes them, puts them in the storage engine or whatever. This gives us better end-to-end throughput, but we don’t want this buffer in the middle to grow unboundedly, otherwise we’ll hit out-of-memory errors. So clearly the buffer has got to have some limit on it, but that’s actually challenging to get right, because it’s used for a large variety of data sets that all have different shapes.
We have more than 2,000 data sets in production. Some of them have large key-value pairs, others have small key-value pairs. And when we buffer items in that buffer that I talked about, there is the payload, which is the user-provided data. When I say user, I mean the Venice user, the internal user, right? There is the payload part, which is the key and the value that the internal user is writing to Venice. And then there’s the Venice metadata: various pieces of metadata we keep to ensure data completeness, data integrity and various other details, like the timestamp the data was written at, to monitor the lag of the system. All of these things are part of the metadata.
And so the point is, if I do a very naive implementation of this bounded buffer concept, I could say I want no more than, let’s say, 10,000 items in the buffer. Well, that’s one way to bound how much memory it’s going to take, but it ignores the fact that the payload part is variable in size. If I have 10,000 one-kilobyte payloads, that’s very different than if I have 10,000 one-hundred-kilobyte payloads. The memory cost is very different.
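A minimal sketch of what bounding a buffer by estimated bytes rather than item count could look like (this is hypothetical, not Venice’s actual implementation; the class and the `sizeOf` estimation hook are invented for illustration):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.function.ToLongFunction;

/** Hypothetical buffer bounded by estimated heap bytes, not item count. */
public class ByteBoundedBuffer<T> {

  private static final class Sized<T> {
    final T item;
    final long bytes;
    Sized(T item, long bytes) { this.item = item; this.bytes = bytes; }
  }

  private final LinkedBlockingQueue<Sized<T>> queue = new LinkedBlockingQueue<>();
  private final Semaphore byteBudget;     // remaining permits == bytes still available
  private final ToLongFunction<T> sizeOf; // payload + metadata + object overhead

  public ByteBoundedBuffer(int maxBytes, ToLongFunction<T> sizeOf) {
    this.byteBudget = new Semaphore(maxBytes); // sketch: caps the budget at an int
    this.sizeOf = sizeOf;
  }

  /** Blocks the producer once the byte budget is exhausted. */
  public void put(T item) throws InterruptedException {
    long bytes = sizeOf.applyAsLong(item);
    byteBudget.acquire((int) bytes);
    queue.put(new Sized<>(item, bytes));
  }

  /** Returns the consumed item's bytes to the budget. */
  public T take() throws InterruptedException {
    Sized<T> sized = queue.take();
    byteBudget.release((int) sized.bytes);
    return sized.item;
  }
}
```

The interesting part is the `sizeOf` hook: as explained next, an accurate estimate has to cover the variable-size payload, the Venice metadata and the Java object overhead.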
The point is we want to measure the entirety of what is present in that buffer. We want to measure the variable-size payload, and we also want to measure all of the Venice metadata. But there’s more: there’s a third thing in there, which is the overhead of the Java objects themselves. Every object in Java has a header, and the size of that header depends on a bunch of different JVM settings, including the Java version, but also whether it’s 32-bit or 64-bit, whether the heap size is under 32 gigs or not, and some other settings as well. All of that influences what the header size is. And so if we want an accurate measurement of the contribution of that bounded buffer to the heap, we need to take all of these components into account: the user payload, the Venice metadata and the Java object overhead.
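To make the object-overhead component concrete, here is a simplified shallow-size calculation for byte arrays, assuming one common configuration: a 64-bit HotSpot JVM with compressed oops (the default for heaps under 32 GB), meaning 12-byte object headers and 8-byte object alignment. This is a sketch; an accurate estimator has to account for the varying settings listed above rather than hard-coding one configuration:

```java
public final class HeapSizeEstimator {
  private static final int OBJECT_HEADER_BYTES = 12; // mark word + compressed class pointer
  private static final int ARRAY_HEADER_BYTES =      // object header + 4-byte length field
      OBJECT_HEADER_BYTES + Integer.BYTES;           // = 16 under these assumptions
  private static final int ALIGNMENT = 8;            // HotSpot default object alignment

  /** Round up to the JVM's 8-byte object alignment. */
  private static long align(long size) {
    return (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
  }

  /** Shallow size of a byte[] payload, including array header and padding. */
  public static long sizeOfByteArray(byte[] payload) {
    return align(ARRAY_HEADER_BYTES + payload.length);
  }

  public static void main(String[] args) {
    // A 1-byte array still costs 24 bytes: 16-byte header + 1 data byte, padded to 24.
    System.out.println(sizeOfByteArray(new byte[1]));   // 24
    System.out.println(sizeOfByteArray(new byte[100])); // 116, aligned up to 120
  }
}
```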
And so we did some work in that space. We also looked at some off-the-shelf libraries that do this; there are a few open-source ones, like Ehcache’s. Various systems have come up with their own kind of heap size estimator. In the end, none of the off-the-shelf solutions completely fit what we wanted, and so we ended up rolling our own. It’s not that much code, by the way: it’s a couple of classes, a few hundred lines, that’s about it. It’s not a huge piece of code. But it was pretty interesting, because as part of writing it, I had to brush up on a lot of things that I had kind of superficial knowledge about. I needed to really clarify all of those details of how the heap in Java works and how the object structures are laid out, and that ended up becoming this custom heap measurement module, so to speak, or class, inside the Venice code base that now gives us precise measurements of these objects. So that was a pretty interesting project, I thought.
Olimpiu Pop: Yes, that sounds quite interesting, because it ties into what you mentioned about working with different versions of Java. When you get started, it’s all about writing code, but then there is the long tail of things that matter when you’re running in production, like which versions you need to support for your organization. Because even if you would like to work with the shiny stuff in Java 21, most systems are probably on 11 or 17, and there are still some legacy systems anchored in 8. And when you’re running in production, there are a lot of things you don’t pay attention to while you’re writing, and then it’s about optimization, about finding the right balance for everything to work. That’s nice for those who are interested in it; it’s not for everyone, but somebody has to do it. That pretty much wraps up my questions. Is there anything else I should have asked you but didn’t?
Felix GV: Well, if people want to learn more, we have our documentation up at venicedb.org, which includes all of the conference talks that we’ve given and other podcasts on the subject of Venice. There’s also a page in the documentation about the stuff I just talked about, the Java heap, including a bunch of links for which I want to give credit to Aleksey Shipilëv. I used his blog posts extensively to brush up on my own knowledge, and they are a treasure trove; I have a bunch of links to them in our wiki. If anybody wants to learn more, hit us up. We have a community Slack instance that people can join freely, and we do a bi-weekly, meaning once every two weeks, community sync-up on Zoom, if anybody wants to come in and ask questions. So yes, that’s about it as far as the project is concerned.
Olimpiu Pop: Thank you for your time, Felix, and thank you everybody for listening to this podcast. If you want to challenge Felix and his team about the things they built in Venice DB, they are available either in the community meetings they have scheduled or online through the channels he already mentioned. Listen to the InfoQ podcast, and hopefully we’ll see you around the QCon events. Have a great day.
Felix GV: Thank you. Have a nice one.