Deploy MultiModal RAG Systems with vLLM

News Room | Published 10 October 2025 | Last updated 10 October 2025, 10:38 AM

Transcript

Stephen Batifol: We’re going to talk about multimodal RAG systems. I’ll be using vLLM and I’ll be using Pixtral from Mistral. I will talk a tiny bit about vector search and vector databases, so that you have a better idea of how everything works behind the scenes. There’s a lot of talk from vector databases, lots of people saying, mine is better, mine does that. I’m just going to give you a quick idea of what to look for, and which index to pick and when. Then, in my opinion, we don’t talk enough about embedding models. We’ll talk about those. You will then see me also talk about vLLM, Pixtral, and how vision-language models work. Then, at the end, I have a live demo.

I’m Stephen Batifol. I was a developer advocate for Milvus, which is an open-source vector database, and now I’m joining Black Forest Labs, which is an image generation company.

What is Vector Search?

What is vector search? For the past two years, vector databases have been like, you need to use that. A lot of people don’t really have an idea of what it is and how it works. We’re going to go through that. Why is it so popular? Why is it everywhere? Why is it still important nowadays? It’s because vectors unlock your unstructured data. Think of images, audio, videos, user documents. Then, what’s happening is that you’re going to put that through your deep learning model. Then, that’s how you get vector embeddings. You then store them in a vector database. Then, from now on, you can perform search. You can perform RAG, Retrieval-Augmented Generation. You can also perform other things. A lot of people are talking about RAG. For the past two years, we’ve been talking about RAG, but there are other things. There’s drug discovery.

A friend of mine is a CTO at an AI protein company; they create proteins using AI, and then they search through the proteins using vector search. This is also the kind of thing that you can unlock. It is also very good for recommendation systems and anomaly detection. How it works is that you’re going to transform the unstructured data, and then you project everything into a vector space. Here, for the demo, it’s only in three dimensions. The blue points you see here are vectors. You have to imagine that you’re trying to take things that are semantically similar and make them close together. Here, you see on the image, you have the image of a banana, and then you have the text, banana. Those two are close because they are semantically similar.

Same for the cat, for the text of the cat, and then the image of the cat. Those two, they’re pets, basically. We like cats and dogs. Then you can also see that they are very far from the different fruits. This is why we project everything into a vector space. Obviously, we don’t work in three dimensions. Most of the embedding models are more like in the thousands, 2,000, 3,000, and we’ll see that later on. How do you find the points that are very close to you? Actually, we’re just going to compute nearest neighbors. You can brute-force it. This is actually a real thing. You have an index that I will talk about later on, which is called FLAT, which is actually brute-forcing that. You’re computing the distance to all the vectors to find the one that is the closest to you. The problem with this one is that it doesn’t scale, obviously. It is amazing up until you have hundreds of thousands of vectors. When you arrive in the millions, it’s not going to work.

If you’re in the billions, don’t even think about it. What we do instead is approximate nearest neighbor search. We’re using machine learning models that have been trained to approximate the closest neighbors. This is what we call indexes. Indexes, it’s a bit of a fancy word. I find it fancy, very complex, very opaque. Actually, behind the scenes, it’s just a machine learning model. Then you have different tradeoffs. You have to take into account model complexity, quality, and search speed. Those three are always intertwined.
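To make the FLAT idea concrete, here is a minimal sketch of brute-force nearest-neighbor search with NumPy; the corpus and query are random placeholders standing in for real embeddings:

```python
import numpy as np

# Toy corpus of embeddings (one row per document); in practice these come from your embedding model.
corpus = np.random.rand(100_000, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)

# Normalize so that a dot product equals cosine similarity.
corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

# FLAT / brute force: compare the query against every single vector, then take the top-k.
scores = corpus_n @ query_n
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```

This is exact and simple, which is why it is the only approach that gives 100% recall, but the cost grows linearly with the number of vectors, which is the scaling problem the indexes below address.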

If you want a higher search speed, you will have to lower the model quality and the model complexity. Maybe you want a more complex model instead; then you will likely pay for it twice: the search speed will be lower, and the index will take longer to build. Same if you want a model of better quality: the search speed will be lower, and the index building time will be higher. This index building time is usually very important.

In the end, I’m going to insist on three different indexes, just to make sure that you can follow, because I work for a vector database, so I work with indexes every day, but it might not be clear to everyone. FLAT, this is the brute force. You have a query vector, and then you’re literally computing the distance to all the vectors that you have in the vector space. This is nice. When you have small scale, it actually works. Then, when you have more data, usually you go for IVF, which is the inverted file index. With this one, instead of computing the distance to all the vectors, you’re going to create centroids, and then you’re going to compute the distance to those centroids.

Once you find the closest one, then you’re going to compute the distance to all the vectors within this centroid. For this one, you have different parameters that you have to tune. For example, how many centroids are you going to look at? This one requires a bit more tuning, but actually, if you tune it well, you can get very good performance. Last but not least, HNSW. It’s Hierarchical Navigable Small World graphs. This is one of the most popular ones. This is the one that is used by vector databases by default. It’s a graph-based index, meaning that it’s very performant, but building it is going to take quite a long time. If you want to update your index very often, then you likely don’t want this one, because you have to rebuild the graph all the time. How it works is that you have an entry point, and then you’re going to try to find the nearest neighbors, and the graph is divided into different levels.

At the first level, you have only a couple of vectors. Then you’re going to find the closest one to you. Once you do that, you go a layer down. There you have a bit more vectors, and again, you’re going to try to find the closest one, and again and again. You’re going to go down through the different layers, and at the end, you arrive at the last layer, where you have all the vectors, and that’s how you find the nearest neighbors. There are actually lots of indexes. I just talked about three, but here we can see we have GPU-based indexes, which is the one on the top left. With a GPU index, you have to imagine the latency is very good, very low. The throughput is also very high. The index speed, the time to build the index, is very low, but it’s way more costly than if you were to use a RAM-based index. HNSW, I just mentioned it. It’s a good middle ground.

Then you have other ones like DiskANN, which is a disk-based index. This one is very cost-efficient. We’re very happy here. It also actually has very good accuracy, but the latency is going to be higher, and the time to build the index is also much higher. It’s all about the balance and all about what you want. There’s not one that will fit everything.

As a summary, so you can have a better idea about indexing. FLAT, very simple, very accurate. Actually, if you want 100% recall, this is the only one that is able to do that, but it’s slow at scale. IVF, fast enough actually for most tasks. I believe that a lot of people should use IVF, but you need tuning, and the accuracy can vary. HNSW is the classic one, very high query per second, great accuracy, but it is slow to build, and it’s costly to update. Finally, the one I mentioned as well that is running on disk, DiskANN. This one will allow you to scale beyond your RAM because you will actually build the graph directly on disk. The problem with that, though, is obviously you have a higher latency. Everything is a tradeoff. Have a look at what you need for your company.
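To show how these choices surface in practice with Milvus (the vector database used later in the talk), here is a hedged sketch with pymilvus. The collection name, field name, and parameter values are only examples, and the exact API varies a bit between pymilvus versions, so treat it as illustrative rather than definitive:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumes a running Milvus instance

# Declare how the default vector field should be indexed.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",            # graph-based: high QPS, great accuracy, slow/costly to build
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
# For IVF, you would instead use index_type="IVF_FLAT" with params={"nlist": 1024},
# and for exact search at small scale, index_type="FLAT" with no extra params.

client.create_collection(
    collection_name="docs",
    dimension=768,                # must match your embedding model's output size
    index_params=index_params,
)
```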

Embedding Models

Then, I mentioned embedding models. They are, in my opinion, one of the most important pieces that you can think of. I don’t think people talk about them enough. We talk a lot about LLMs. We talk a lot about foundational models, but never about those small embedding models. It is extremely important to choose the right one because you may have the best LLM, you may have the best vector database, you may have the best agentic system. If you don’t have good embeddings, you will likely never reach the full potential of your RAG system.

Then you may ask, how do you find the best one then for you? This is from Hugging Face. They have a leaderboard directly about embedding models. You can see you have a rank. I want you to forget this rank because this is not the way you should look at the leaderboard. Here at the moment, Gemini is actually the first one. You may go and be like, ok, I’m going to use Gemini. They are obviously first ranking. This is the best model for me. That’s not the way you should approach it, in my opinion.

First, you should have a look at the embedding dimensions. This is the third column, where you see Gemini is 3,072. This is the dimension that I’ve shown at the beginning where you had three dimensions. Now you’re projecting things into 3,072 dimensions. This one is very important because, depending on the vector database that you’re going to use, it may or may not be able to support it. pgvector, for example, which is a Postgres extension, works with paging, which means it doesn’t support embedding dimensions of more than 2,000. It does if you quantize your embeddings, but that means you’re going to have a tradeoff in quality. Those are the things that you have to look at.

If you want to use an embedding model with pgvector, for example, without quantization, you would have to go for the multilingual one, which is fourth in the leaderboard. Have a look at those. Also, that’s not it. Have a look at the different tasks that you have. Here you see in the leaderboards, we can see classification tasks, we can see clustering tasks, instruction retrieval, and then you have way more tasks as well, actually, behind the scenes.

Recently, actually, what’s very nice from Hugging Face is that they released different filters. Really, first, find the languages that you want to work with. Are you working, for example, with German and English documents? I’m based in Berlin, so I work, actually, with a lot of documents that are German and English. I then, therefore, use embedding models that have been trained on those two languages, because then I can either write in German or I can write in English, and then they will understand it. Also, make sure you have the correct task, because maybe you’re going to have a retrieval system, but maybe you have a very different task, and then embedding models have been trained on those different tasks as well. Same for the domains. Are you working in the medical industry? Are you working in the legal industry? Embedding models have also been trained on different types of data, so make sure you select a domain that is actually very good for you.

The modalities: I’m going to talk about multimodal RAG, so I will have to use an embedding model that supports images and text, but maybe you want to use text only or images only; then you have to find the right one. Finally, do you want an open-source one, or do you not mind just going through an API? Those are all the filters that you have. Then, once you apply the right filters, maybe you will find your best embedding model. Maybe it was ranking 12th, but it may be way better for you and your use case. Lately, all the labs are pushing to be first in the ranking, but just because you’re first doesn’t mean that you’re the best for a given use case. Please use embedding models that have been trained on similar data.
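Once you have picked a model from the leaderboard, using it is usually only a few lines. A sketch with the sentence-transformers library; the model name here is just an example, so substitute whatever matches your languages, task, domain, and dimension budget:

```python
from sentence_transformers import SentenceTransformer

# Example model; pick one that matches your language, task, and domain, and whose
# dimension your vector database supports.
model = SentenceTransformer("intfloat/multilingual-e5-large")

docs = [
    "Rückgaberecht für Bestellungen in Deutschland",   # German document
    "Refund policy for UK orders",                     # English document
]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024) for this model; check it fits your database's limit
```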

Now we have an idea of how vector search works and which embedding models to use, but how does everything work together? We’ve seen this one already. You have your unstructured data, you transform it into vectors, you get the embeddings. You store that in your vector database. Then, the new part now is that you have your user coming, and they have a query. Maybe they’re giving you an image, maybe they’re giving you some text, but they want to learn something. Again, you’re going to transform that into vectors to get the vector embeddings, and this is where you’re going to perform the similarity search that we talked about before, and this is where the index is very important. This is what we’re doing now. The index is finding the closest points.

Then once we’re happy, we get the results, and we give you back those results. This is what vector databases, and databases in general, are working on and working to improve. Similarity search, in theory, is very simple. It’s very simple when you have a couple of hundred thousand vectors, but once you reach scale, this is the hard part. How are you going to update your index? How are you going to handle upserts? How are you going to handle all of those things? Last point, about evaluation. I could give an entire talk about how to evaluate RAG. There’s a lot of vibe coding lately, but please don’t vibe check your RAG. Have a proper evaluation, because, first, you can’t fix what you don’t measure, as Yann mentioned. I used to work in MLOps when it was cool, and when I joined my previous company, we had a couple of models running in production, and we had no idea, actually, if they were performing well or not. We had no evaluation. We couldn’t check the latency of the models and those things.

Then you have no idea if they’re good or not. Same for RAG. If you can’t measure it, you can’t fix it. Also, a very famous person called Hamel is always saying, look at your data. Always look at your data. Don’t assume that your data is perfect. Test each part, and not just the whole pipeline. Test the RAG system, test the embedding models, test the retrieval, test all those parts, and not only the entire thing. Use LLM as a judge, this is also very powerful, to score answers. Measure retrieval recall as well, and not only the model output. Once you have that, then you have a better evaluation system, and then you will not only have your gut feeling, you will actually have a strategy. Why is it so important? It’s that so many companies are training LLMs.

Then you have a new one. A couple of months ago, Claude was the cool one, now, apparently, it’s Gemini 2.5, which means if you change your LLM, then you want to make sure that it’s actually still very good for your system. If you don’t have an evaluation, then you have no idea, and then you’ll be like, ok, I guess it works, but actually, you want to be confident. That’s why. Vibe coding is fine, but no vibe checks for evaluation. Summary, pick the right index, choose embeddings carefully, and then evals.
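As a small illustration of measuring retrieval separately from the model output, here is a sketch of recall@k over a hand-labeled set of queries; the queries, document IDs, and the retrieve() function are hypothetical placeholders for your own pipeline:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that show up in the top-k retrieved results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

# Hypothetical labeled evaluation set: query -> IDs of documents that should be retrieved.
eval_set = {
    "what was the class about?": ["doc_12", "doc_40"],
    "how does backpropagation work?": ["doc_7"],
}

def evaluate(retrieve):
    """retrieve(query) must return a ranked list of doc IDs from your retrieval pipeline."""
    scores = [recall_at_k(retrieve(q), relevant, k=5) for q, relevant in eval_set.items()]
    return sum(scores) / len(scores)
```

A number like this, tracked over time, is what lets you swap LLMs or embedding models with confidence rather than a gut feeling.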

Retrieval-Augmented Generation (RAG)

Quickly, about RAG. I have a quick question: who thinks RAG is dead? Do you think it’s useless? A couple of people. Still, maybe I can convince you otherwise. Llama 4 was released. This is a benchmark that was everywhere on socials. It’s about long context: it checks whether LLMs actually deliver the long context they pretend to have. Let’s see how they compare on this benchmark. The best one is o1: up until 120,000 input tokens, we have 53% accuracy, which means it’s actually not very accurate. They claim that they have very good accuracy even at millions and millions of input tokens.

Actually, when you test them, you can see that it’s still not the case. I’m going to give credit where credit is due. Gemini 2.5 Pro, which is making a lot of noise, which is in the middle, you can see is actually very good, even up to 120,000 tokens. It has a dip at 16K; no one knows why. Up to 8K, and after 16K, it’s very good. Maybe give it longer prompts, so you reach 32K rather than 16K when you use it. The only problem, though, is that it’s going to cost you. The latency will be quite high. Also, the cost is going to be high. Because 120,000 tokens is about a 180 to 200 page PDF. If you go through an IPO, the PDF is 300 pages long, just to give you an idea. Llama 4, which supposedly has an input context of 10 million tokens, we can see at the bottom, unfortunately. The accuracy for Maverick is 28% at 120,000 tokens, and for Scout, it is 15%. I imagine we can scale it up. We can go to 10 million. That’s fine.

Then what matters in the end is the accuracy that you have, and the latency, and the cost. I just want to highlight that because labs are pushing a lot for like, ok, we have very high input tokens, and that’s cool, but if your accuracy is not good, then I don’t want to use it. Hopefully, I convinced you that RAG is still here, because RAG, in the end, is only retrieval plus LLM. The R part of RAG is actually very important. You want to make sure how you’re going to use metadata filtering, how you’re going to make sure that you have full-text search, and all those things.

How can we make it better, then? We can use the best LLM, we can use the best embedding models, but then we’ve seen on the benchmark before that actually they’re still not very good. Retrieval makes or breaks RAG. If the context is not clear to you, don’t expect the model to understand it. This is usually what I talk about with customers. It’s like, when you get chunks back, if you don’t understand those, and you’re the one asking the question, then the model will not understand it either. It’s something like, make sure that you understand the context, and if you don’t, you have to fix something with your RAG system.

Good retrieval means making sure that the retrieval has the right context, making sure you’re getting all the relevant documents, but also making sure that you don’t have too much noise. Because then LLMs will be very confused, and they’ll be like, this is cool, but they’re going to be confused. Even though reasoning models are getting better at that, where they will actually tell you, “This part is not relevant, I’m just going to skim through it”, it still holds: if you have good retrieval, you will have better performance in the end. Also, similarity search is not magic. Some models are not trained on your data. They will not understand internal jargon, they will not understand acronyms. Embedding models also compress information, which means that some details will get lost on the way. This is inevitable. This is how it works.

Finally, people still search with keywords. You may have the best embedding models, the best vector search; if people are searching with keywords because they want a specific product, then you will very likely not be able to find it. We have something else, which is BM25, a very old technology now, also called full-text search, or keyword search. It’s a ranking function that is used to estimate the relevance of a document for a given query. It’s great because it’s a great baseline. It’s fast. It’s simple. It scales. So far, you’d be like, what’s wrong with it? It’s also really good with acronyms, really good with internal terms. The only problem, though, is that it doesn’t understand meaning, so it only matches words.

On one hand, you have semantic search, which is very good at understanding meaning, and on the other, you have BM25, which is not good at that, but is really good at understanding acronyms and internal jargon. You can see where I’m going. We’re going to combine those two, and this is what we call hybrid search. Hybrid search is combining those two: it’s similarity search plus BM25. At the beginning, you can see here, you have your documents. They go directly into your vector database, and on the other side, you’re computing BM25.

Once you have your query, you’re actually going to run the query. You’re going to try to find the best chunks from your vector database. You get those top K. You do the same for BM25. You get the top K for BM25, then everything is beautiful. You have, on one hand, the BM25 results coming in, and on the other, the vector store results. What’s happening now is that you’re going to use a fusion algorithm. This is what we usually call a re-ranker. It basically fuses those two results together. You give that back to the LLM, and then, hopefully, you have better answers. This is actually a very good way of getting better results. It doesn’t add much complexity to your system.
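A common fusion algorithm for this step is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming you already have two ranked lists of document IDs, one from BM25 and one from the vector search; the document IDs are made up:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)   # earlier rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

bm25_top_k = ["doc_3", "doc_7", "doc_1"]     # keyword / full-text results
vector_top_k = ["doc_7", "doc_2", "doc_3"]   # similarity search results
print(rrf([bm25_top_k, vector_top_k])[:3])   # fused order handed to the LLM
```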

Metadata filtering is also extremely important. You’re going to have more signal. You’re going to have less noise. What you can do is you can store metadata yourself. You can track the source, the author, when was it last updated. Then you can also add your own ones, the company name, the region, the country. You can add all of those. That way, you can really use filters to then narrow down the search scope, and then you’re going to improve your precision. Also, if you want to have information that is very fresh, you can also filter that way by recency. You don’t have to go through all the data that is 5 years old, when actually you only want the last one. With good metadata, you have better context, and then you get better answers. This is what I want you to go out with. I’m just going to give you a quick example of how it can improve the results. Let’s say you have a query, which is, what’s the refund policy for UK orders? You have two chunks, without filtering. You have the first one, which is like, customers have 30 days to request a refund.

Second one, refunds are processed within 5 business days. Then, you see that, and myself, I actually don’t have any idea. I don’t understand it. I don’t know which one is correct. Don’t expect your LLM to know which one is correct. It’s not magic. It’s not going to be like, yes, this one looks beautiful. Usually, it will pick one of those two. It will give you an answer. Then that’s what we call hallucination. If we have region metadata, then we can filter it by the UK, and then, for example, you get this one, which is UK orders can be returned within 14 days for a full refund. Now somehow, you’re like, ok, this actually looks legit. You still have to make sure it’s correct, but at least now, it’s actually about the UK and not about any other country where they talk about refunds. This is the kind of thing you can do with metadata filtering. There’s more.
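With Milvus, that kind of metadata filter is an expression passed alongside the vector search. A hedged sketch, reusing the client and the embedding model from the earlier sketches; the collection name, field names, and filter syntax are illustrative and can differ between pymilvus versions:

```python
# Assumes a "support_docs" collection that stores "region", "text", and
# "last_updated" metadata fields next to each embedding.
results = client.search(
    collection_name="support_docs",
    data=[model.encode("What's the refund policy for UK orders?").tolist()],
    filter='region == "UK"',          # narrow the search scope before ranking
    limit=3,
    output_fields=["text", "region", "last_updated"],
)
```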

RAG now, with agentic systems, is usually just one component, a tool that you have. What I want you to remember is that even if it’s just a tool, you have to make sure it’s a great tool. You have to make sure it’s optimized. You have to make sure you get really good retrieval. You also have GraphRAG, which allows you to boost your retrieval with a knowledge graph. This one is really powerful, but only if you need a knowledge graph and you’re working, for example, with entities or things like that, because it’s really going to increase the latency. You also have to convert everything into a graph representation.

Building a Self-Hosted Multimodal RAG System, Using Milvus and vLLM

Now we have a better idea. Now we can try and see how we can build a self-hosted multimodal RAG system. I will only be using open source. I want you to imagine that we’re all back at uni, and we just missed our class. We’re a student, and thankfully now all the classes are recorded. Even if we miss it, it’s fine, we can still catch up. Let’s say you’re using text-only RAG. Then what you’re going to do is have a transcription of the audio of the class. This class was about backpropagation in neural networks. This is something that is very visual to learn. Usually, you need to see the backpropagation yourself. If you only use text RAG, then you’re actually not going to be able to catch up properly. Instead, you’re going to use multimodal RAG with audio, video, and text. Now somehow, you can search through the images. You can search through the audio.

This is going to be the demo I’ll show you later on, where you can actually ask questions about the class, and it will find the best image to illustrate the point. This is what the stack looks like. We take a lot of different inputs, think images, video, text. We put them through multimodal embeddings. We store them in the vector database. Then when we search, the question also goes through the multimodal embeddings and the vector database, and we give that back to our LLM with the question plus the context that we just retrieved. This is then sent to Pixtral, so we can also send images directly.
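Here is a hedged sketch of the ingestion side of that stack, using a CLIP-style model through sentence-transformers as the multimodal embedding step. The model name, file paths, chunks, and collection layout are assumptions, and Pixtral itself only comes in later at generation time:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP-style model: projects images and text into the same vector space.
clip = SentenceTransformer("clip-ViT-B-32")

# Hypothetical inputs: frames sampled from the lecture video plus transcript chunks.
frames = [Image.open(f"frames/frame_{i:04d}.jpg") for i in range(0, 300, 30)]
chunks = [
    "Backpropagation computes gradients layer by layer.",
    "The chain rule ties the whole network together.",
]

frame_vecs = clip.encode(frames)   # image embeddings
text_vecs = clip.encode(chunks)    # text embeddings, same space

# Store both modalities in one collection (assumes a "lecture" collection with a
# 512-dimensional vector field, created like the index sketch earlier).
rows = [
    {"id": i, "vector": v.tolist(), "modality": "image", "source": f"frame_{i}"}
    for i, v in enumerate(frame_vecs)
] + [
    {"id": 1000 + i, "vector": v.tolist(), "modality": "text", "text": chunks[i]}
    for i, v in enumerate(text_vecs)
]
client.insert(collection_name="lecture", data=rows)
```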

Then, hopefully, we get reliable answers. The tech stack I’m going to use is Milvus for the vector database, vLLM for inference and serving, Koyeb for the infrastructure layer, and Pixtral for the multimodal model. I’m using the open-source version, so this is the one that I’m going to self-host. You’re going to see later on, hopefully, if it works, that I’m going to self-host it and then we’ll deploy it. Koyeb, why am I using them? First, they’re very easy to use. They have autoscaling. If I were to deploy this into production, then I can really autoscale it, as long as I have money in my bank account. Then I can also scale to zero. If there’s no demand at all, then you can scale to zero, which is nice.

Then it’s going to start again if you have a new user coming. You can build and deploy almost everything. Think of anything from ComfyUI to vector databases to Postgres to vLLM and a lot of different things; you can deploy them. They support basically everything. They’re distributed globally. You can really go everywhere and then be happy.

Milvus, which I’m going to use, it’s part of the Linux Foundation, also open source. It supports a massive amount of vectors. The biggest it supports is the distributed version, which scales up to 100 billion vectors. Then you have different offerings. Standalone, which scales up to about 100 million vectors. Milvus Lite, which is just running on your laptop. It has a lot of different features. Metadata filtering, I’ve mentioned it. Bulk import, if you want to import a lot of data suddenly. Disk-based index, full-text search as well. Then GPU vector search. This one, actually, I love when I work with customers on this one. It’s very useful. You need a lot of money.

Apart from that, it’s very useful. It’s very fast. Yes, GPU vector search, really good. Then, Pixtral from Mistral. This is natively multimodal, with very strong performance on multimodal tasks. It’s very good at instruction following as well. The architecture, I’ll go a tiny bit more into detail on the next slide. It’s a vision encoder that they actually trained from scratch. When I was talking to some Mistral engineers, they were like, yes, we were not happy with CLIP from OpenAI, so we’re just going to train our own. That’s what they’ve done. The multimodal decoder is based on Mistral NeMo. It supports different image sizes and aspect ratios. The architecture is the following. You can see here, we have two different images. They both go through the vision encoder. What the vision encoder is doing is actually generating tokens for those images.

Then you want to flatten all those tokens so that you can feed them directly into your decoder. How are you going to know that this is the end of the first line of the image? You insert special tokens here. You can actually see my mouse. You insert this b token, which means image break, and it marks that you reached the end of the first line of the image. Then you do that for the second line. Once you reach the end of the image, you have the special token e, which means image end. This is how you can support images with different aspect ratios. They basically all do this. This is not specific to Pixtral; it is common to vision-language models in general. They transform everything into tokens.

Then they flatten everything so they have a nice sequence. How does it work after that? As an input, you have your text that is transformed into tokens by the transformer that you have. Then you have the vision transformer encoder that I just mentioned, which is also transforming your images into tokens. Then you combine those directly in the multimodal transformer decoder that you have at the top. This one understands all those tokens. Then it can really play around and understand: ok, I have this text, I have this image. Everything is tokens to it, and it understands and combines them. Then as an output, it will give you text. This is then how you can support different aspect ratios, different images.

This is basically how multimodal vision-language models work. Then I’m using vLLM. The reason why I’m using it: very fast, easy to use, open source. They actually give a lot back to the community. They are doing some end-to-end inference optimization, which we’ll have a look at later on. They also support different hardware, because a lot of projects support NVIDIA by default, but there are also different hardware providers. They also support a wide range of models. This is really nice.
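A hedged sketch of what serving looks like: vLLM exposes an OpenAI-compatible server, so you talk to it with the plain OpenAI SDK (which is also why the demo error later mentions OpenAI, it is just the SDK, not the API). The model name, port, image URL, and flags are examples and depend on your vLLM version:

```python
# Start the server first (shell), e.g.:
#   vllm serve mistralai/Pixtral-12B-2409 --tokenizer-mode mistral
from openai import OpenAI

openai_client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = openai_client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown on this slide?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/slide.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```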

Real World Challenges

Now you have your perfect tech stack. Just because you have the best tech stack in the world doesn’t mean that you don’t have challenges. How are you going to solve those? First, you have to look at two things. Those two things, when you deploy, are very important: latency versus throughput. Latency is the response time. Usually, it’s measured in time to first token, and also the time between each token generation. Those are the metrics that you’re going to look at. Very important in real-time systems and interactive applications. Then, on the other hand, you have throughput, which is the output of tokens per second. This is also something very important. Extremely important if you need to process a large amount of data, or you have many users at the same time. The problem: those two values are interconnected.

If you reduce the latency, which is good, you can actually end up with a decreased throughput. Not everything is always beautiful. You have to check how you can mix those. Also, a higher batch size will allow you to have a higher throughput. What is batch size, actually? This is the number of requests or inputs that your model can process at once. It’s going to affect speed, memory usage, and throughput. You have two different strategies, plus a third one that is getting popular.

The first one is very naive. Individual requests come one by one, and you treat them one by one. This doesn’t scale. If you’re ChatGPT and you treat them one by one, you’re never going to get an answer out. What we do is dynamic batching, where we wait either for a certain amount of time or until the batch is complete, then we send all those to the LLM so that we can get the responses back to our users. Those are different strategies. The batch size is very important when you deploy your LLM with vLLM in production.
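A toy sketch of the dynamic batching idea, wait until the batch is full or a time window expires, then run everything at once; serving engines like vLLM do this (and continuous batching) for you, so this is only to illustrate the mechanism:

```python
import asyncio

async def dynamic_batcher(queue: asyncio.Queue, run_batch, max_batch=8, max_wait=0.05):
    """Collect requests until the batch is full or max_wait seconds pass, then run them."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]                      # block until the first request arrives
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                           # one forward pass for the whole batch
```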

Now you have an idea of those, but you still have to fit your model on a GPU. Quick napkin math, if you want to know if you can fit it: it’s the number of parameters in billions times the size of the data type in bytes. Llama 3 8B, for example, in FP16 is going to take approximately 16 gigabytes of RAM. Pixtral, the one I’m running, in FP16 is going to use 24 gigabytes of RAM. Llama 4 Behemoth, which is 288 billion parameters, in FP16 is going to be 576 gigabytes of RAM. Of course, we optimize them. We’re going to quantize them. That’s still a lot of RAM. That is only to load the model. You’re still not generating any tokens here. You just loaded the weights. You have to keep that in mind. Then you have different GPUs. I’m actually using an A100 for Pixtral. An A100 has 80 gigabytes of RAM, which means when I’m loading my model in FP16, I’m losing 24 gigabytes to the weights.
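The napkin math as a tiny helper, using the numbers from the talk:

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough memory just to load the weights: parameters (in billions) x bytes per parameter."""
    return params_billions * bytes_per_param

print(weights_gb(8, 2))     # Llama 3 8B in FP16   -> ~16 GB
print(weights_gb(12, 2))    # Pixtral 12B in FP16  -> ~24 GB
print(weights_gb(288, 2))   # 288B params in FP16  -> ~576 GB, before generating a single token
```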

Then I have the rest that I can use to actually generate tokens. If you want to deploy Llama 4, you’re still going to need multiple GPUs to be able to deploy it, because it’s so big that even really big GPUs will not be able to fit it. How do you do that? Two ways. Either you replicate the model on different devices. Very easy to do, it’s literally a copy-paste. You’re going to have better throughput and no inter-GPU communication, but your batch size will be lower. Or, what we usually do and what people are doing, is you split it. You chunk the model, and you split it across devices. The problem: you have communication overhead, but you can fit much bigger models. Also, you have a larger batch size. Then you’re happy. Your users are happy. You can see it now a bit more.

On the first part, on the top part, it’s like if we copy, paste the model onto different GPUs. The orange part you see at the top is the space for working. This is to generate the tokens. You can see that if you repeat it, then you’re not going to have a lot of space for working. If you divide it, and if you divide the model in two, then you have more space for working, which means you have more space to generate outputs, which means you can have a higher batch size, which means you can have more users. This is why we’re doing it. This is the reason why we have parallelism.

How do you split a model? Two ways, again. Pipeline parallelism. Spoiler alert, you don’t use that. You can see we have four layers on a model here. We have the first two layers that are on the GPU at the bottom part. The GPU at the top part is still waiting, because it’s waiting until the first two layers are done with the computation. Then once you’re done, layer 3 and 4 are coming. That’s beautiful. Then you have a lot of idle time. You don’t want that. What you do is tensor parallelism. This one, you split up the large matrices that are internal to the models. Think of attention mechanism and think of those. Here we have four layers again. We have half the layer 1 on the GPU at the bottom and the GPU at the top.

Then you compute those at the same time. Then now you have more intercommunication cost, but you’re not waiting for anything. Then in the end, happy people as well. This is a benchmark by vLLM, which is actually showcasing the impact that tensor parallelism has. If you’re not parallelizing anything, which is in gray, the available memory is going to be very low. Whereas if you do parallelize it and you split it into two GPUs, you can see you have way more resources available for your model, which means you have higher throughput, which means you can generate more tokens, which means your users are happier.
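In vLLM, tensor parallelism is a single argument. A hedged sketch; the model name is only an example, and the same knob exists on `vllm serve` as --tensor-parallel-size:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits the model's weight matrices across 2 GPUs,
# leaving more memory per GPU for the KV cache and therefore larger batches.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain backpropagation in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```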

Inference Optimization

Another reason why you want to optimize your models: you want to quantize them to reduce the RAM requirements. We’ve seen it for Llama 4, for example: 288 billion parameters goes to more than 500 gigabytes. If you quantize it, then you’re at 288 gigabytes of RAM. What you want to do is accelerate the linear layers. This is the attention mechanism and all those layers where we have a lot of computation, which is in pink here, with time on the y-axis. You can see those are really time consuming. It’s been shown now that if you quantize properly and not too hard, it has a negligible impact on the model quality. You have two ways of quantization. What you want to do is weight plus activation quantization, on the right, where you quantize your weights and the activations as well. That way, you reduce the data movement on your GPU. You also have more Tensor Cores that you can use.
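In vLLM, quantization is again mostly a flag. A hedged sketch; whether FP8 (or AWQ, GPTQ) is available depends on your GPU and vLLM version, and the model name is only an example:

```python
from vllm import LLM

# Quantize the weights to FP8 on load (weight + activation quantization), roughly
# halving the memory needed versus FP16, usually with a small quality impact.
# Pre-quantized checkpoints (AWQ, GPTQ, FP8) can also be loaded directly by
# pointing `model` at them instead.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```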

Finally, for the optimization, you have the KV Cache, which is what actually makes all the magic happen to generate the tokens. This is usually a rectangular shape. On the x-axis, it’s the maximum sequence length. For Llama 4, this is 10 million. The problem with the default one is that you have a lot of space that is wasted, because 80% of the users are not using all the input tokens that are available.

Then you’re only keeping a lot of space available for those 20%, which means you can’t have a higher batch size, and then you’re going to struggle. What you do is KV Cache with paged attention. This is very similar to what operating systems do with memory, where they split the memory into small chunks. Same for LLMs. That was invented by vLLM. This is open source. If you use them, you have that available by default. The attention mechanism’s cache is now split into smaller parts, which means you’re not wasting a lot of resources, which means users are happy. Those were the different parts of inference optimization.

Demo

This is actually the video I wanted to show you. This is backpropagation explanation. This is the 3Blue1Brown YouTube channel. Imagine I’ve missed this class — that would be an amazing teacher — but this is the teacher we have, and we missed it. I have then my Streamlit UI here. I’m going to load the index already, because if I have to process those, it’s going to take a couple of minutes. Actually, I can show you the data that I have.

For the text, you can see I have this data. Then if I click and I view it, you can see I have the embedding, which we don’t understand, because those are just floats. Then I’ve added actually a lot of metadata, so the file name, the file type, the file size, a lot of those things that I’ve mentioned before. Because if I have something actually at scale, then I could filter through that. Let me check if it’s working. My index has been loaded, so I can be like, “What was the class about?” Then, now it’s checking. It should have been working, but it was not working. OpenAI, I’m going to try again. No, it’s not working. “What was the video about?” It’s not working.

Questions and Answers

Participant 1: I just have a question, because I’m using RAG, but there is another concept, it’s named CAG. It’s Cache-Augmented Generation. Did you use it? What do you think about it?

Stephen Batifol: What do I think about CAG instead of RAG, which is Cache-Augmented Generation?

Participant 1: It’s Cache-Augmented Generation.

Stephen Batifol: Cache, so not the one from Anthropic, then?

Participant 1: No. From IBM, actually.

Stephen Batifol: I’ve seen this one, but I haven’t used it at all. I don’t have an opinion on it.

Participant 2: I was just wondering, because I saw the error was related to OpenAI. Are you using OpenAI in your architecture?

Stephen Batifol: No, it’s the OpenAI SDK, and then vLLM is just supporting it. That’s why I was using it.

Participant 2: Where was the interface where you were seeing all the data?

Stephen Batifol: This one is called Attu. It’s only an interface if you use Milvus, the vector database. It just allows you to talk to it, and see it.

Participant 3: Like I have multiple components, and one of them is chunking. What is the best strategy of chunking, especially when you’re dealing with a variety of documents?

Stephen Batifol: There’s not one good strategy for chunking. Same as for indexing, everything is a tradeoff. It depends on the data. If you have really long PDFs, chunking page by page and then adding some context about each chunk is really good. Anthropic, for example, calls this contextual retrieval. You add context to your chunks, saying this is about DoorDash IPOing, or this is about those things. Then you will have a better idea. There’s not really one rule for chunking, unfortunately, yet.

Participant 3: So is there a limitation where, depending on the variety of documents, we can’t use RAG?

Stephen Batifol: No, it’s just going to depend on the chunking strategy you have for your documents. If you have PDFs with very complex images and data there, maybe have a look at ColPali and similar approaches, which take screenshots of the PDF and then transform those into tokens, as we’ve seen. Then you search through that. That’s very good if you have complex PDFs. If you have Excel files, it’s going to be very different. It just depends on your data, unfortunately.

 
