Transcript
Ricardo Ferreira: I also consider that it is a very interesting time for us to be working in tech, because we are solving very interesting problems: new problems and exciting problems. Some of them are not necessarily exciting, because there's a lot of work and a lot to learn. For the particular theme that we're going to discuss here, which is integration, I would like us to take a trip down memory lane. Who remembers this book, "Enterprise Integration Patterns"? How many of you have actually spent some time working on integration problems in the past? Do you like it? Do you think it's a good field to be working in? I personally think it is one of the most amazing fields in computing, because I'm a huge believer that the most interesting problems, and the value that we create with software engineering, happen on the integration side.
Whenever you are building applications, yes, they inherently have value in them. But when you start integrating things, when you start connecting things, you start creating experiences that were not possible before. That book basically defined my entire career since I started, back in the day. I'm going to list some names here, and let me know if you've heard them: BizTalk, Sonic ESB, TIBCO Rendezvous, things like that. It doesn't have to be messaging middleware, either. File transfers count too: when you want to replicate your entire dataset from a branch office to corporate, or from corporate to a specific branch, that's also considered integration. The interesting thing is that that book captured virtually all the problems that we face to this day.
This book was written back in 2003, give or take, and I think it captured very well the very essence of the problems that we were facing. I stopped doing integration work roughly in 2015. I think we had all figured it out at that point. Integration was no longer something reserved to that word, integration. Every single developer building microservices, building IoT applications, building serverless applications, was basically doing integration at the end of the day.
If you go to your home right now and you say, "Alexa, turn off the lights", that's an integration scenario right off the bat. Integration was no longer something fancy. Then, when I started seeing the introduction of vector embeddings into the software engineering field, that's when I realized that everything that was exciting in the past had come back to be exciting again. That was a huge deal, specifically for me. The whole point of this presentation is to try to capture those integration experiences through the lens of vector embeddings. I personally believe that we're dealing with a new era of integration challenges, and I'm going to try to show them to you in this presentation.
What Are Vector Embeddings?
Also, I'm going to show some of the patterns that I was able to build, some of them, not all of them. The ones that I was able to capture are repeatable, which is the very essence of patterns, like in the book. What are design patterns, at the end of the day? They are repeatable solutions for recurring problems. Everything that I'm going to show here is something that I would very confidently say you're going to be able to reuse in every integration scenario involving vector embeddings. How many of you are familiar with the concept of vector embeddings? Let me ask this differently. How many of you are not familiar with vector embeddings? Those are the people that I am most interested in. I prepared a very quick demo for you to get at least a feeling for what a vector embedding looks like, because if you don't have that feeling, it's going to be hard to see the patterns in it and figure out, what is the big deal with this? What is the actual problem that we're dealing with here?
At the end of the day, we're talking about a numerical representation of a dataset, in the form of an array. That array can be either an array of bytes or an array of floats. I'm going to explain later on when it's going to be an array of bytes and when it's going to be an array of floats. Vector embeddings are basically what enable you to implement vector search, semantic search, recommendation systems, and AI features.
Demo (Vector Embeddings)
Let me actually jump to the demo that I've created. Here goes the demo. This example is basically yet another example of doing search. It doesn't matter which vector database we're using here. What matters is that we're able to search for the title of the movie, or for something like Ben Affleck, the actor that belongs to the movie. We call this full-text search. There's no big deal with this. You know the movie School of Rock, with Jack Black? Imagine that you don't know that the title of the movie is School of Rock, and the only thing you know is, the dude who teaches rock. This is what we call semantic search. When we create this ability to search for things, we don't necessarily write down exactly how the text was laid out at the database level. It was able to find School of Rock, and Get Him to the Greek, and The Doors. I don't know about that last one, but it has some relationship in the end. I have a database of movies, 4,520-ish of them. Each one of them is stored as a JSON document.
One of the fields of this JSON document is its plot embedding. We have the plot field, which is a string. When I was uploading this dataset to the database, I took that string and created this plot embedding. This is what we call a vector embedding. It is the actual numerical representation of that text string, and it is what allows us to do semantic search, or RAG, or whatever AI use case we're building these days. What I would like to accomplish with this quick demo is for you to look at this. This is a vector. Remember the embedding model that I used here; I think it was from Hugging Face. This one, specifically, this array, has 384 dimensions, or positions. And this is not even considered a huge embedding.
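To make this concrete, here is a minimal sketch of producing an embedding like the one in the demo. The exact model is an assumption on my part: all-MiniLM-L6-v2 is a popular Hugging Face sentence-transformers model that happens to output 384 dimensions, matching the demo.

```python
# Minimal sketch: generate a 384-dimensional embedding with a Hugging
# Face model. The model name is an assumption; all-MiniLM-L6-v2 is a
# common sentence-transformers model that outputs 384 dimensions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

plot = "A wannabe rock star poses as a substitute teacher at a prep school."
embedding = model.encode(plot)

print(len(embedding))   # 384 positions
print(embedding[:5])    # the first few floats of the array
```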
If you look at OpenAI embeddings, for example, we're talking about 1,536 dimensions. It can get bigger and bigger, depending on how precise you want your vector search to be. The point is this: one thing, when you're dealing with integration scenarios, is to replicate a field like the plot. There's no big deal in replicating strings across the wire. When you start replicating those big arrays all over the place, across different microservices, different regions, different datastores, that's when the complication starts. That's when you start remembering that in the EAI patterns we used to have things like CBR, Content-Based Routing.
Then, depending on route A or route B, I would inspect the payload of the message, and depending on what is contained in that payload, I would make the routing decision. How do you do this with vector embeddings? When you start thinking about vector embeddings in the world of integration, you're going to realize that you have very interesting challenges ahead.
I've put this quote from Sam Altman here strategically, because it captures the very essence of what we're going to see. Why do I think this quote is important? Because you're going to have the feeling, or at least I have felt this way, that the patterns I'm going to show you are not necessarily trivial to understand. I'm going to do my best to share all of them with you. You may start thinking, this is too much complexity for me to worry about. Then, when you come back to this perspective here, you're going to say, it's worth it. We're going to come back to this quote throughout the presentation.
Vector Synchronization Challenges
Let's talk about the challenges that we have with vector embeddings. I think the first and most important of them is that data changes. Remember when I said that the vector embedding you have in the database was the result of the plot field? This is an example of Python code that uses LangChain to do this: grab a text string, and then serialize it to create the vector embedding. The other one is the equivalent Java code using a framework called Spring OM to do the same thing, which is to take a string value and run a vectorization process, using annotations and reflection, to create a field that is a byte array.
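A minimal Python sketch of the LangChain approach described above might look like this. The langchain-huggingface integration package and the model name are assumptions on my part, since the talk doesn't show the exact code.

```python
# Sketch of the LangChain approach: take the plot string from a document
# and derive a vector embedding field from it. The package and model
# name are assumptions, not the talk's exact code.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

movie = {
    "title": "School of Rock",
    "plot": "A wannabe rock star poses as a substitute teacher.",
}
# The derived field lives next to the source field it was computed from;
# if "plot" changes, "plot_embedding" silently goes stale.
movie["plot_embedding"] = embeddings.embed_query(movie["plot"])
```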
The point is, if you have a data model like this, you can almost expect that the original source data that created the value is going to change. It's going to. It's not something that happens just from time to time. No, it's going to happen. It's a certainty. The data source may change. Another thing that can affect this is that we can use different embedding models and different dimensions from time to time. This is very common when you are building search-oriented applications or AI applications. You start off your MVP project using Hugging Face, because you can deploy it locally, it's cheaper, and you don't have to pay anything for the service.
When you go to production, you decide, I want to increase the confidence of my embedding model, and I'm going to use OpenAI's embeddings. You are arguably going to change the number of dimensions as well. Or, sometimes, it might be something related to your business logic. Something like, ok, in this region over here, regardless of what the embedding contains, if the search comes from this region, certain results cannot appear. That's the type of thing where, when you are dealing with loose fields like strings, floats, or Booleans, you have control over how to manipulate them. If everything is compressed and shipped into an embedding, it's very hard for you to tackle this problem.
The other problem, which is also very common, is that there isn't always a one-to-one relationship between a given source of data and the embeddings derived from it. I'm going to give two examples. The first one is that we heavily use a "best practice", and I'm being ironic with the double quotes here: instead of creating one huge chunk that produces one embedding, the best practice is to break the source down into multiple smaller chunks, and then create an embedding for each of them. By all means, this is a best practice for efficiency purposes. If you want to be efficient, this is the best way to do it. The problem is that it creates a proliferation of one-to-many relationships between your source data and your embeddings. That's why I ironically call it a best practice. It's almost like an anti-pattern in the world of integration. What is a best practice for one scenario becomes a bad practice for another.
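As a rough illustration of that fan-out, here is a sketch using LangChain's text splitter. The chunk sizes and file name are illustrative, not from the talk.

```python
# Sketch of how chunking turns one source document into many embeddings,
# creating the one-to-many relationship discussed above.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

long_document_text = open("contract.txt").read()  # hypothetical source document
chunks = splitter.split_text(long_document_text)

# One source document fans out into N derived vectors. When the source
# changes, every one of them becomes suspect.
chunk_embeddings = [embeddings.embed_query(c) for c in chunks]
print(f"1 document -> {len(chunks)} chunks -> {len(chunk_embeddings)} embeddings")
```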
The other one is that embeddings don't necessarily rely on a single datastore shared across multiple applications. That, by itself, if you are into the microservices architecture style, is an anti-pattern: each microservice has its own datastore. So if one piece of source data changes, you have to replicate data to multiple datastores, concurrently, reliably, and, most of the time, atomically. That in itself is a huge problem. We have some known approaches that some companies use. They basically come from the DNA of batching and ETL that we have been practicing for decades, and they are proven, reliable, good practices. The problem with some of those practices, like "the content team will flag which data needs to be updated", is that they rely on human labor, or on "we're just going to rebuild the vectors every night", like an ETL batch that runs at midnight.
The problem with this is scale, and the kind of scale that we're dealing with here. Vector embeddings are very computationally expensive to create, whether you're running the code within your own applications or doing it asynchronously in a batch. If you accumulate all the embeddings that have to be recomputed into a single night, chances are the batch is going to spill over into the next day and not finish, because they are computationally expensive. "Ricardo, but you can always scale up your hardware resources". Yes, you can do that, but what about the tradeoff of the cost of doing that, given the time frame you have? All of those things start to get into the equation. That's a hard problem to tackle.
Three Dimensions of Change
We're about to start talking about the actual patterns. What I would like you to reflect on a little bit, in case you were not able to capture the essence of it, is that there are three pillars of change that may trigger the need for you to replicate your vector embeddings. The first one, the most obvious one, is data changes. Source data changes, so vector embeddings need to change. The second one is application changes.
Like I mentioned before, the MVP of your application starts out using Hugging Face, and when you go to pre-production or production, you refactor your application to use OpenAI embeddings, or whatever other embedding models you have available. This is also going to trigger a need for vector synchronization. And then there are business changes, such as regulations, business rules, or something that is going to force you to think, ok, this is a very particular use case for the company or organization that I'm working for, and I'm going to need to deal with it.
Vector Sync Patterns
Let's talk about some of the patterns that you can use to address those three pillars of change. I created a summary of all five patterns. We're going to talk about the dependency-aware propagator, the semantic change detector, the versioned vector registry, the business rule filter chain, and the adaptive sync orchestrator. As you can see here, for each trigger, each pattern has either a primary value, a supporting value, or a high value. Where a pattern has primary value, it means the pattern was designed to solve that particular trigger.
For example, the semantic change detector and the dependency-aware propagator were basically created for the need of data changes. For application changes, the primary value comes from the versioned vector registry, because the problem there is versioning. For business changes, it's going to be the business rule filter chain. It doesn't mean that you cannot use the other ones in conjunction with them. It only means that your first instinct for each trigger is going to be to use that pattern.
1. Dependency-Aware Propagator
Dependency-aware propagator. The problem that we're discussing here is detecting when a data source changes, and then reliably replicating and synchronizing that change to recreate the embeddings, considering that there might be either a one-to-one relationship between the source data and the embeddings or, more likely, a one-to-many relationship. The way you're going to do this, you have to do it as decoupled as possible from your source data.
Decoupled to the point that you don't have to go bothering the development teams to do this themselves in the application code or in the datastores that they manage, considering they use a microservice-oriented architecture. What are the problems that we're trying to solve here? How do we detect when source data changes, efficiently? How do we propagate the change to the vectors? And how do we deal with the high volume of this? The solution is to use CDC, Change Data Capture. It's a technology that has been around for quite some time, even before open-source technology in this space became mainstream, like Debezium on top of Kafka Connect.
If you go back 10 or 20 years, you would have something like Oracle GoldenGate that would do the same thing. Arguably, virtually every database management system has something built-in that allows us to do this. The beauty of this type of solution is that it reads all the committed transactions from the transaction log, so it doesn't interfere with the performance of the database. All of this is to make sure that you are not going to deteriorate the performance of the applications. The point is, you also have to create a dependency graph, and that dependency graph has to be built somewhere. More importantly, it has to exist.
This is not exactly a sequence diagram. It's more like a combination of a sequence diagram and a component diagram, to talk UML there. You're going to have a bunch of source systems, and one of the characteristics that is very unique to embeddings is that the source data is not necessarily sitting in datastores. If all of it were sitting in datastores, the challenge would be relatively easy to tackle, because Debezium, for example, has connectors specialized for each specific database. But what if your source data is an object sitting in an S3 bucket? Or what if it's just a file, a PDF file in the file system? You're going to need to create a layer in your architecture to handle change detection. Some of it will be built-in CDC, and some of it you have to implement yourself: watchers, polling-type mechanisms, checksum change detection, something like this.
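For those non-database sources, a checksum-based watcher could look something like this minimal sketch. All the names here are illustrative, not part of any specific framework.

```python
# Sketch of a checksum-based change watcher for sources that CDC cannot
# cover, such as files on disk or objects in a bucket. Names are
# illustrative assumptions.
import hashlib
import time

def checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def watch(path: str, emit_change_event, interval_seconds: int = 60):
    last = checksum(path)
    while True:
        time.sleep(interval_seconds)
        current = checksum(path)
        if current != last:
            # Publish a change event to the event bus (e.g., a Kafka
            # topic) so the vector processor can re-embed the content.
            emit_change_event({"source": path, "checksum": current})
            last = current
```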
Then you're going to have the event pipeline, which is basically going to trigger the execution of the vector processor. All of this, as we're going to discuss at the end, needs to be as decoupled as possible. The event bus is one of the critical components in this architecture. For the event bus, you can use whatever you want. In the projects that I've worked on in the past, I used Apache Kafka, because all the teams had expertise in it: Apache Kafka for the messaging and streaming layer, and Apache Flink for the actual processing layer. The vector processors, for example, were implemented as Apache Flink jobs.
As you can see here, the event bus is going to detect when a change was created. The job is going to reach out to the dependency resolver: this change impacts which vectors? This, this, and that. Then it triggers the executions, some of them concurrently, not necessarily in sequence, depending on the criteria that you set in the dependency registry. Do they have to be atomic, or can they be updated in a loosely coupled way? I'm going to talk later about the need for historical data, because that's where you combine this with another pattern.
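A dependency registry along these lines is what the resolver consults. This is a bare in-memory sketch with illustrative key formats; a real implementation would use a durable store.

```python
# Sketch of a dependency registry: which derived vectors does a given
# source change impact? In-memory storage purely for illustration.
from collections import defaultdict

class DependencyRegistry:
    def __init__(self):
        self._deps = defaultdict(set)  # source key -> impacted vector IDs

    def register(self, source_key: str, vector_id: str):
        # Called by application code at the moment an embedding is created.
        self._deps[source_key].add(vector_id)

    def resolve(self, source_key: str) -> set[str]:
        # Called by the propagator when a change event arrives.
        return self._deps[source_key]

registry = DependencyRegistry()
registry.register("movies/573a1398", "plot_embedding/573a1398-chunk-0")
registry.register("movies/573a1398", "plot_embedding/573a1398-chunk-1")

# A change event for that movie fans out to every impacted vector.
impacted = registry.resolve("movies/573a1398")
```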
From the application perspective, from the code perspective, implementing this pattern always has an impact, because it would be beautiful if all of this could be executed asynchronously without the developer touching any code. But in order for the dependency registry to actually be filled with data, the developers have to refactor their code so that when they create a vector, they register the dependencies right there. By the time the code creates an embedding, it also has to register the dependencies. There's human interaction here. Whoever is responsible for this, you have to go to your application development teams, talk to them, educate them, give them examples. Probably, you have to expose this through some sort of API or something like that. A lot of work to do. These are some of the metrics for when you implement this pattern; I think this is important.
One of the things that is very important from day 1 is to start implementing observability. I'm not going to go over all the metrics here. Some are metrics the technical teams can follow to see if things are going well, but some of them are metrics created specifically for the business teams, so they can evaluate whether this whole investment in infrastructure is paying off. One of them, for example, is the change fanout. I can create a summary of this metric at the end of the month and say, this month we tripled the amount of vector embeddings that needed to be updated, and that explains why there's an increase in infrastructure cost. It's important to capture those metrics, and because everything is implemented using Kafka, for example, or whatever streaming messaging technology you use, you can capture them from the bus. It's very easy to implement them.
2. Semantic Change Detector
Semantic change detector. Remember when I mentioned that computing and reprocessing vector embeddings is computationally expensive? I meant that. It is. What you want to avoid is changes that do not actually warrant updating all the vector databases, because the changes might be so small or minimal that you don't need to pay the whole reprocessing cost.
For example, one of the practices people use when creating vector embeddings is to associate metadata with them. Think about metadata as a header on top of your vector embedding; you're not changing the vector embedding itself. Let's say, for example, that you want to avoid reprocessing an entire vector embedding if you only changed the metadata. The pattern you're going to use is the semantic change detector. How does this pattern work? You still have the ability to detect when a change occurred, so the relationship with the previous pattern is very strong here. Then you're going to use an analysis pipeline. The way we've implemented this in the past, the analysis pipeline has three stages.
In the first stage, you do light analysis: size changes, field changes, very simple detection. You can eliminate some changes right off the bat. You categorize them into significant changes and insignificant changes, so the pipeline is going to try to bucket changes into those two clusters. If you are not able to decide based on the light analysis, you do a text similarity analysis on the source data itself. Did the whole text change considerably? You can use regular expressions to detect that. It's very simple. Say 90% of the characters changed; that can be a criterion. If you're still not able to decide, you rely on the deepest stage, semantic analysis. This one is computationally expensive, because you are basically comparing the old vector with the new vector.
Then you can say, the size of the array changed, or the values at each position changed significantly. What I'm trying to say is that you want to avoid this third stage as much as possible, for efficiency purposes. The point is, you're going to create a processing order, and for all the significant changes, you start again with the vector processor. This is the second pattern where we have the vector processor component. That in itself is a pattern: you can look at the vector processor as something reusable in multiple situations, a reusable, shareable component across the other patterns. There's an important thing here. Every time you come up with significant changes, you have to have machine learning inference in here as an optimization engine, so you can calibrate, on a monthly or quarterly basis: I'm having a lot of false positives here, I'm not able to detect with high accuracy by the end of the month. You want to continuously calibrate this.
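To make the three stages concrete, here is a minimal sketch of such a detector. The thresholds are illustrative assumptions that you would calibrate over time, exactly as described above.

```python
# Sketch of the three-stage analysis pipeline: cheap checks first,
# falling through to the expensive vector comparison only when needed.
# All thresholds are illustrative and would be recalibrated over time.
import difflib
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def is_significant(old_text, new_text, old_vec=None, new_vec=None) -> bool:
    # Stage 1: light analysis -- size and simple field-level checks.
    if old_text == new_text:
        return False
    if abs(len(new_text) - len(old_text)) / max(len(old_text), 1) > 0.5:
        return True
    # Stage 2: text similarity on the source data itself.
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    if ratio < 0.1:   # roughly 90% of the characters changed
        return True
    if ratio > 0.95:  # nearly identical text
        return False
    # Stage 3: deep semantic analysis -- compare old and new vectors.
    # This is the computationally expensive path you want to avoid.
    if old_vec is not None and new_vec is not None:
        return cosine_similarity(old_vec, new_vec) < 0.9
    return True  # when in doubt, reprocess
```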
The only way to do this is to actually store and log all your decisions. Then you can do a human analysis: ok, this is what's going on, this is the pattern that we're not catching. You go back to your light, text, or deep semantic analysis and recalibrate. Same with the other stages. In order for this to happen, you have to do a little bit of code refactoring. You either let the CDC handle the whole change analysis asynchronously, which is approach number one, or you can bring this analysis up front into your applications if they're critical. I don't recommend this.
You know why I wouldn't recommend it? Because that analysis is computationally expensive. If you bring this process into your application, you're likely going to increase the usage of memory, CPU, and bandwidth. It's better if it's executed asynchronously. These are some of the metrics related to the semantic change detector. The false positives and the false negatives are the most important ones for the business side. The analysis time is also important for the technical team. Maybe you're using algorithms and implementations that are very elaborate but not necessarily efficient, so the analysis time is taking too long, and you want this to be fast, whether for the light, the text, or the deep analysis.
3. Versioned Vector Registry
Versioned vector registry. The types of changes that trigger this are when you change your embedding models, or when you change the dimensions of your vector embeddings. Why is this important? Think about it: if you change the number of dimensions and you simply replicate all the changes and synchronize the vectors across your datastores and microservices, you might end up in situations where the code that was dealing with v1 breaks. You have to make sure that you have v1 and v2 available concurrently for a period of time, so you can eventually retire v1 and only use v2. That's why this pattern is called the versioned vector registry.
How do you actually solve that problem? You're going to have some version management system, so you can store all those versions of your vectors. That's a lot of storage, which is another complication by itself that you have to manage. What this allows you to do is put a time period on how long concurrent versions stay alive: a week, a month. They need to be shut down eventually. Chances are that, because of the code that you created, you may want two versions to coexist for a while. The repercussions of this in your code are considerable, because imagine that you were doing a semantic search, and now your code has to make sure that it searches on v1 of your vector embeddings, but also considers v2, or v3, or v4, or v5.
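A bare sketch of what such a registry could look like follows. The version names, models, and retirement windows are illustrative; text-embedding-3-small is an OpenAI model that produces 1,536 dimensions.

```python
# Sketch of a versioned vector registry: each embedding version records
# the model and dimensions that produced it, plus a retirement deadline.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class VectorVersion:
    version: str            # e.g., "v1", "v2"
    model: str              # the embedding model that produced it
    dimensions: int         # e.g., 384 vs. 1536
    retire_after: datetime  # when this version can be shut down

class VersionedVectorRegistry:
    def __init__(self):
        self.versions: list[VectorVersion] = []

    def register(self, v: VectorVersion):
        self.versions.append(v)

    def active(self, now=None):
        now = now or datetime.now()
        return [v for v in self.versions if v.retire_after > now]

registry = VersionedVectorRegistry()
registry.register(VectorVersion("v1", "all-MiniLM-L6-v2", 384,
                                datetime.now() + timedelta(weeks=4)))
registry.register(VectorVersion("v2", "text-embedding-3-small", 1536,
                                datetime.max))

# A query embedded with v1's model cannot be compared against v2's
# vectors, so a search embeds the query once per live version, queries
# each version's index, and merges the results afterwards.
for v in registry.active():
    print(f"search index '{v.version}' using {v.model} ({v.dimensions} dims)")
```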
Otherwise, your searches are not going to find the results that were loaded before. This is the pattern that most impacts the way you write your code to deal with vector embeddings. It's the most intrusive. It's the one where your development teams are likely going to be screaming at you, and they're not going to be happy with the type of refactoring that you are going to force them to do. It is always a tradeoff: yes, I know it's a pain, but what is the value that you are accomplishing with it? You can actually get some advantages. Metrics-wise, I would say that migration progress is one of the most important ones on the technical side. How long does it take for you to actually complete these migrations? For the business side, I would say it is the version breakdown by vectors.
At the end of the month, you can show the business: you see this vector over here? It ended up with these three versions, and last month we only had one version. What changed? Why are the development teams changing the embedding models and the dimensions of the vectors so frequently? I worked at Amazon as well; I no longer work for Amazon. At Amazon they have this saying, "the dogs that are not barking". Have you heard the expression before? The dogs that are not barking is when everybody thinks that everything is ok, just because the dogs are not barking. When you start asking, why is this version changing so frequently? That's the kind of question that triggers the discussions for investigating what's going on. That's something that you should be aware of.
4. Business Rule Filter Chain
This one, the business rule filter chain, is basically for situations where some business changes trigger the need for the computation of new vector embeddings. It is basically a rules chain pipeline that, depending on the changes that are detected, applies pre-processing and post-processing, and then triggers the vector processing as well.
This is basically to make sure that whenever you have a change, instead of going in blindly and reprocessing, like the dependency-aware propagator does, you put this filter in the middle of the process to make sure that it obeys rules A, B, and C that have been established by the business team, the product owners, or whoever is responsible for the applications. This is yet another processing overhead that happens over the wire every time you need to recompute the vector embeddings. I think it pays off when you are dealing with something like regulations that you have to be in compliance with.
Basically, the way it works, you're going to have this query rule context. The implementation varies. You can use whatever framework you want for this, like Drools, which is a very popular open-source project for dealing with business rules. It doesn't actually matter which framework you use to implement it. What matters is that you bring in this kind of evaluation, so that by the time you read your vector or write your vector, you apply those pre-processing and post-processing rules as part of the update.
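Here is a minimal sketch of such a filter chain in Python. The rules and field names are invented for illustration; in practice a rules engine like Drools would hold the logic.

```python
# Sketch of a business rule filter chain: each rule can veto or transform
# a vector update before (pre) and after (post) the embedding is
# recomputed. All rules and names here are illustrative.
class RuleChain:
    def __init__(self, pre_rules=None, post_rules=None):
        self.pre_rules = pre_rules or []
        self.post_rules = post_rules or []

    def process(self, change, recompute_embedding):
        for rule in self.pre_rules:
            change = rule(change)
            if change is None:
                return None  # a rule vetoed the update entirely
        vector = recompute_embedding(change)
        for rule in self.post_rules:
            vector = rule(change, vector)
        return vector

def block_restricted_regions(change):
    # Example compliance rule: content from certain regions must not be
    # re-embedded, e.g., for regulatory reasons.
    if change.get("region") in {"region-a", "region-b"}:
        return None
    return change

def tag_compliance_metadata(change, vector):
    # Post-processing: attach metadata without touching the vector itself.
    return {"embedding": vector, "metadata": {"reviewed": True}}

chain = RuleChain(pre_rules=[block_restricted_regions],
                  post_rules=[tag_compliance_metadata])
result = chain.process({"region": "region-c", "text": "updated plot"},
                       recompute_embedding=lambda change: [0.1, 0.2, 0.3])
```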
Again, this is yet another infrastructure concern that you are bringing to the development teams. Are you getting a feeling for how much disruption this is going to cause in the way developers build applications? Remember the quote from the beginning? That's what you have to keep in mind. These are the metrics that I think are important. I think the rule processing time is definitely the one that the technical team needs to be aware of, because if the rules are taking too long to evaluate, maybe they were not created with efficiency in mind. This is usually the case, because the people who create those rules are not necessarily developers. They express the rules in natural human language, and efficiency is not necessarily their first concern. The rule change impact is probably the most important one for the business side: how much compliance you are creating.
5. Adaptive Sync Orchestrator
Lastly, this one is more of a collateral pattern. It's not necessarily as important as the previous ones. I'm not sure if you got that feeling, but the previous ones address exactly the three dimensions of change that I mentioned before. When this one was presented to me for the first time, it came from a business need to prioritize certain changes. You know when you have three development teams, A, B, and C, and then B says, no, that change has to take priority because it's more important for us? This type of cultural influence on how those patterns execute creates a need for some prioritization.
The way to do this is to use an orchestration engine where you have several strategies for selection and prioritization. This prioritization can be determined in the code. We're going to show a code snippet in a moment. Either you prioritize in the code, right where you just created the vector: this is hugely important, it has to be updated in the next millisecond. Or, you don't say anything in the code, and you let the background process take care of the whole prioritization. Maybe it ends up being batched, or scheduled for later, or it ends up in the cluster of immediate processing.
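A sketch of what that in-code prioritization might look like follows. The API and priority names are assumptions, not from the talk.

```python
# Sketch of an adaptive sync orchestrator: the caller can demand
# immediate processing in code, or omit the priority and let the
# orchestrator decide. Names and strategies are illustrative.
import heapq
import time

PRIORITIES = {"immediate": 0, "scheduled": 1, "batched": 2}

class SyncOrchestrator:
    def __init__(self):
        self._queue = []

    def submit(self, vector_id: str, priority: str = "batched"):
        # Timestamp acts as a tiebreaker: older requests drain first.
        heapq.heappush(self._queue,
                       (PRIORITIES[priority], time.time(), vector_id))

    def next_batch(self, available_workers: int):
        # Resource-aware: only dequeue as much work as the
        # infrastructure can absorb right now.
        return [heapq.heappop(self._queue)[2]
                for _ in range(min(available_workers, len(self._queue)))]

orchestrator = SyncOrchestrator()
orchestrator.submit("plot_embedding/573a1398", priority="immediate")
orchestrator.submit("plot_embedding/573a1399")  # defaults to batched
print(orchestrator.next_batch(available_workers=1))
```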
The criteria for this are going to be dictated either by the development team, in the code or somewhere else, and they also have to take into consideration the resources that are available. Because all of this, the Flink jobs, the Kafka infrastructure, all the infrastructure that takes care of executing the vector processor, might become overloaded by the amount of changes at a given point in time. You have to have some notion of how many resources you have available to process all of this. You also look at when the last update was.
The timestamps are going to dictate priority. For example, you can use time to make sure that, ok, this one hasn't been updated for more than two days, so it has to get priority over the others. Adaptive sync orchestration is not something that's going to be very common. It depends on the development teams needing some sort of prioritization. Still, it's a very useful pattern to think about from day one, because eventually this type of discussion might happen. For the technical side, the strategy switching metric is probably the most important one, because people are going to ask, in the end, why did we change the prioritization from this to that? Why did that one become more important than mine? This is the type of report that people would like to see.
Event Types, Patterns, and Relationships
This is more of an overview, for you to see what those patterns look like when used in combination. As you can see here, the most important one is likely going to be the dependency-aware propagator. That's the one that actually does the heavy lifting at the end of the day. Then the versioned vector registry, mostly because the most common change is going to be a change of embedding model or a change of dimensions. Those two I categorize as operational patterns.
Then the semantic one and the adaptive one, because the frequency with which they happen, and how often they actually need to process something, are not going to be so high. As the glue for everything, you can see the event bus here. This, by itself, is something that you have to design very carefully, because if it's the glue connecting all the patterns, it can easily become your bottleneck eventually. If you implement it correctly, it's not going to, because that's the characteristic of an event bus, but everything running on top of hardware resources can become a bottleneck. That's a basic principle of computer science.
Event Bus as the Central Nervous System
Speaking about the event bus, I mentioned it before, but just to highlight: I think Apache Kafka and Apache Flink are a very good combination of technologies for implementing this, mostly because they're very well known. There are lots of resources available online. There are companies backing these technologies. They're proven. This is not something like, yes, we created it two years ago and we're experimenting with it. You can trust that an application built on top of Kafka and Flink is going to work. I think the most important characteristic the two of them provide is scalable processing for high-volume changes. There was a project that we tried to implement with RabbitMQ and AMQP in the past.
The tradeoff that we found there is that RabbitMQ doesn't have persistence enabled by default, for performance reasons, and when you enable persistence, you start slowing down the entire messaging system, which is counterintuitive. Kafka has persistence enabled by default; saving to disk is not what slows down the messaging system. That's one of the characteristics that is very important for scalability. Apache Flink has very good extensions for plugging your observability frameworks on top, whether you are using OpenTelemetry for instrumentation or some proprietary monitoring/observability approach. Apache Flink exposes all the events for you to work with, and you can turn them into metrics sources.
Apache Avro as a Serialization Framework
We made a very quick evaluation of frameworks for serialization, because at the end of the day, think about it: you are moving events all over the place, from Debezium, through Kafka, to the systems that run the Flink jobs, so you have to have a predictable format in which all the events pass through. We evaluated Apache Avro and Protocol Buffers, protobuf.
We found that Avro has better alignment when you are working with Kafka and Flink, especially because of the Flink Avro connector, which means developers don't have to worry about the implementation aspects of Avro serialization. Whereas with Protocol Buffers, it's annoying to maintain the versioning and the ID numbers for every single field in your protobuf schema. Protobuf is very good for efficiency; there are some benchmarks that show protobuf is faster than Avro, arguably. Don't quote me on that. The point is, for developer experience and flexibility, I think Avro is the better option. Also, it has support for both byte arrays and float arrays.
The reason this is important is that, depending on the vector store you are using, some vector stores store your data as a JSON document, and JSON documents only deal with arrays of floats. With some other vector stores, the data is not stored as JSON; it's more like an opaque, hash-like data structure, and everything is stored as an array of bytes. The point is, Apache Avro supports both, so it's very flexible for carrying all those embeddings around. Now the quote that I put at the beginning is going to make sense: that's a lot of complexity that we're trying to bring into our lives, with complicated infrastructure and the development pain that we bring to the development teams.
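To illustrate, here is a sketch of an Avro schema that carries the embedding as a union of bytes and an array of floats, using the fastavro library. The record and field names are illustrative.

```python
# Sketch of an Avro schema for a change event whose embedding can be
# either raw bytes or an array of floats, via a union type. Assumes the
# fastavro library; the record and field names are illustrative.
import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "VectorChangeEvent",
    "fields": [
        {"name": "source_key", "type": "string"},
        {"name": "model", "type": "string"},
        {"name": "dimensions", "type": "int"},
        # Union: some vector stores want floats (JSON-style documents),
        # others want an opaque byte array.
        {"name": "embedding",
         "type": ["bytes", {"type": "array", "items": "float"}]},
    ],
})

event = {
    "source_key": "movies/573a1398",
    "model": "all-MiniLM-L6-v2",
    "dimensions": 384,
    "embedding": [0.0123, -0.0456, 0.0789],  # or b"..." for the bytes branch
}

buf = io.BytesIO()
schemaless_writer(buf, schema, event)  # serialize for the event bus
```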
As Sam Altman says, I think the difference in a successful implementation is thinking about the design of your systems upfront, not reactively. That's a lot of infrastructure and concerns to bring to your development teams. At least consider the possibility of having to deal with some of those patterns, the dependency-aware propagator being the most important one, in my opinion.
Key Takeaways
I think this is very simple to understand: vector staleness is a multi-dimensional challenge. It's not something that you can solve with a product, sprinkling a little bit of code patterns here and there. It's something that you have to think about thoughtfully. Effective sync requires pattern-based solutions, mostly because if you're able to capture those patterns across multiple projects, departments, and organizations, they're going to mature better, instead of being ad hoc solutions for every product. I think the glue for everything has to be event-driven architectures. There are ways to do this in a batching way, or in an application-centric way, but I think you're going to start dealing with performance problems and coupling between all the layers of the patterns, so it's not very good.
Questions and Answers
Participant 1: The question I had was with regards to the synchronization problem, specifically for the vectors. Isn't the point of doing embeddings that we need to be able to handle many-to-many relationships? A bunch of different people are going to phrase the same thing in many different ways, and that is the whole point of doing vector embeddings, semantic search. Do we actually need to worry that much about it? It seems counterintuitive. Why would you want to synchronize if that's the problem that you're trying to solve?
Ricardo Ferreira: I think the essence of what you're asking is, what is the need for synchronization? It's to avoid the staleness of data. A common example is when you have a search-oriented application like the one I showed at the beginning, and you stop finding items that you used to find before, because the embeddings are no longer in sync with the actual content that was originally used to create them. Those are the triggers that force you to think, yes, I have a synchronization need here. I agree with you that it might not be an immediate need.
If your business relies heavily on the accuracy of those vector embeddings, it's going to be a little clearer whether you actually need those synchronization patterns. I think the same goes for RAG use cases: when you're sending chunks retrieved via vector embeddings to the LLM, and the LLM starts hallucinating a little because your embeddings are not as accurate as they used to be. That's another trigger for, "Yes, I need to re-sync them".
Participant 2: You mentioned Flink. I have personally not used Flink much. From what I understand, it’s like a stateful consumer of Kafka, that’s the way I think of it. I’m curious, what are some of the use cases where you would require stateful computation over this event bus of updates?
Ricardo Ferreira: I think the most value Flink provides, being stateful as it is, is the ability to reprocess something that, for some reason, was forcibly stopped in the middle. It has this concept of built-in snapshots. It's one thing to reprocess something at the data level; Kafka can do that very well with its concept of offsets. But a compute process might be interrupted in the middle. Other technologies that we had in the past didn't have the ability to continue from where they were stopped; they basically resumed from the beginning. Depending on the situation you were dealing with, that could lead to inconsistency and corruption of data, especially when you're dealing with vector synchronization, because arrays of floats and arrays of bytes can easily break.