Building Embedding Models for Large-Scale Real-World Applications

News Room | Published 13 February 2026 | Last updated 11:47 AM

Transcript

Sahil Dua: Let’s start with a simple scenario of show me cute dogs. You go on any search engine, and you write, show me cute dogs. It’s very likely that you will get a very nice photo like this. What happens under the hood? How is the search engine able to take that simple query and look through the billions, even trillions, of images that are available online? How is it able to find this one or similar photos from all that? Usually, there is an embedding model doing this work under the hood. Today, we’ll dig deep into what an embedding model is, what it does, how it works, and where it is used. We’ll cover some practical tips on how to put these models in production and the challenges we face at large scale, and we’ll also look at how we can mitigate those issues and use these models reliably.

I’m co-leading the team at Google that’s building the Gemini embedding models, as well as the infrastructure. Recently, I had the pleasure to work on the Gemini Embedding paper. I’m really proud of this team, because together, we have built the best embedding model that’s available on all the known benchmarks. Before Google, I was working at booking.com. I was building machine learning infrastructure. This actually was the topic of my talk at QCon 2018. Besides that, I wrote a book called, “The Kubernetes Workshop.”

Outline

Let’s look at what we are going to cover today. We’ll start with embedding models. What are they? What is their importance? What are the use cases? We’ll look at the architecture. How are these models formed? How are they able to generate these embeddings? Next, we’ll look at the training techniques. How are these models trained? Then we’ll see, once you have trained these larger models, how you can distill them into smaller models that can actually be used in production. Next, we will see how we can evaluate these models. It might be non-trivial. Then we will look at, once you have these models and you are happy with the quality, how do you put them in production? How do you make sure that they are running reliably without any issues? In the end, we’ll summarize with the key takeaways that you can take home or to the office and start applying to your applications immediately.

Embedding Models and Their Applications

Let’s start with embedding models. What are they, and where are they used? An embedding model is basically a model that takes any kind of input. It could be a string. It could be an image. It will generate a digital fingerprint of that input. That’s what we call a vector or an embedding. It’s a list of numbers that uniquely represents the meaning of a given input. For example, the query, show me cute dogs, will have an embedding.

Similarly, any other input, like an actual picture of dogs, will also have an embedding. The key idea for embedding models is that the embeddings of similar inputs are going to be closer to each other in the embedding space. Usually, we use cosine similarity to find the similarity or closeness between any two given vectors or given embeddings. Now, on the other side, it will also make sure that embeddings of different inputs are going to be far apart from each other. For example, if you have a query called, show me cute cats, but there is an image of cute dogs, it will generate its embeddings and will make sure that these embeddings are going to be far apart from each other using the same cosine similarity or similar similarity measure.
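To make the closeness idea concrete, here is a minimal sketch of cosine similarity between embedding vectors. The vectors are made-up toy values, not outputs of any real model; real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings for illustration only.
query_dogs = np.array([0.9, 0.1, 0.3, 0.0])   # "show me cute dogs"
image_dogs = np.array([0.8, 0.2, 0.4, 0.1])   # photo of cute dogs
query_cats = np.array([0.1, 0.9, 0.0, 0.3])   # "show me cute cats"

print(cosine_similarity(query_dogs, image_dogs))  # high -> similar meaning
print(cosine_similarity(query_cats, image_dogs))  # lower -> different meaning
```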

Let’s look at some of the common applications. The most fundamental application is retrieving the best matching documents, passages, images, or videos, whatever the use case is. Just like we saw in the example in the beginning, you would write, show me cute dogs, and it’s able to look through billions of web pages or images and find just the right one that matches your given query. Embedding models are usually the ones doing this retrieval task. Most search engines work this way, and it doesn’t need to be a full-scale search engine. It could be something like search on Facebook, for example. All of these are powered by embedding models, which are able to sift through a huge amount of data and find just the right information for your query.

The second use case that’s very common is generating personalized recommendations. We are able to capture the user preferences in these embedding models and generate outputs that are very specific to what the user wants. For example, if you have a shopping website and a user buys an iPhone, the next time the user comes, they are more likely to buy an accessory related to the iPhone. Using these embedding models, we are able to capture the past behavior, the history, and predict the right products that are relevant. Similarly, take Snapchat, for example. I recently read a blog post saying that they are using exactly these embedding models to power the search for which stories to show. For a given user, what is the most relevant story that we should show? The next and one of the most popular use cases these days is RAG applications. RAG stands for Retrieval-Augmented Generation.

As the name suggests, we are augmenting the generation of the large language models using retrieval. What happens in a RAG application is that you use a large language model to generate the responses. Before you do so, you find the most relevant pieces of information that are useful for the model to give the output. These retrieved passages or documents are then added to the model’s context. This guides the whole generation process so that the model is able to generate more factually correct and accurate results, and it helps reduce hallucinations. Last but not least, there is a use case that’s more behind the scenes: training the large language models themselves.

Usually, we have huge amounts of data that’s used to train. What we can do is we can generate embeddings for all of the data points and find the near duplicates based on their similarities using cosine similarity. Then we are able to remove the redundant data. This helps to improve the quality as well as the efficiency of the large language model training. These are the common use cases. Not an exhaustive list, but some of the main use cases that I’ve seen in recent times.

Architecture of Embedding Models

Now that we know what embedding models are at a high level, and we know their importance and their applications, let’s look at the architecture. This is the architecture of an embedding model. Let’s look at each of the components one by one. We don’t need to look at everything together. The first component is a tokenizer. It takes an input string, breaks it down into multiple parts, and each of these parts is called a token. Then it replaces these tokens with their corresponding token IDs. The input is a string, and the output is a list of token IDs. Next, we have an embedding projection. Now that we have broken the string into multiple tokens, we are going to replace each token ID with its corresponding embedding, or vector, because the model doesn’t know what these token IDs are supposed to mean on their own. The embedding projection is a huge vocabulary table: for every token in the vocabulary, there is a corresponding representation, and we substitute it in so that the output of the embedding projection is a list of token embeddings.

The next component is actually the heart of most of the models that we’re using these days, the transformer. What the transformer does is take these token-level embeddings, which have no context of what’s around those tokens, and output a much richer representation of token embeddings. For each token, it will look at the surrounding tokens, add that information, and enrich the embeddings so that the output is token-level activations, which you can also consider to be embeddings. At this point, each one contains the context of the whole sequence, not just the one token.
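Here is a minimal sketch of the tokenizer, embedding projection, and transformer steps using the Hugging Face transformers library. The checkpoint name is just a small public example encoder, not the model discussed in the talk.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any small public encoder works for illustration.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "show me cute dogs"

# 1. Tokenizer: string -> token IDs
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])              # a list of token IDs

# 2. Embedding projection + 3. transformer: token IDs -> contextual token embeddings
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state   # shape: [1, num_tokens, hidden_dim]
print(token_embeddings.shape)
```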

Next, we have a pooler. The pooler’s job is very simple: take these token-level embeddings and generate a single embedding. There are a lot of different techniques that we can use. For example, mean pooling, where we take the average of all of these token-level embeddings and generate a single average embedding. This is the most commonly used method. There are a few other methods. For example, we can take only the first token embedding, drop all the other tokens, and consider that to be the representation of the entire sequence.

Similarly, we can take the last token to be the representation of the entire sequence. Most commonly, we just use mean pooling, because it allows us to take the information from all of the tokens and combine it into a single embedding. Now we have gone from the input string to an embedding. There is another optional component, called the output projection layer. This is a linear layer that takes the pooled embedding and generates another embedding of a different size. A lot of times, you want your embedding model to generate an embedding of a fixed dimension. You can control that dimension using this component. There is one more technique where, if you don’t want to fix the output embedding size, you can co-train multiple embedding sizes.
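Continuing the previous snippet, here is a sketch of masked mean pooling plus the optional output projection. The 384 hidden size matches the example checkpoint above; the 256 output size and the freshly initialized projection are illustrative only (in practice this layer is trained with the model).

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()        # [batch, tokens, 1]
    summed = (token_embeddings * mask).sum(dim=1)      # [batch, hidden_dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Optional output projection to a fixed embedding size, e.g. 256 dimensions.
hidden_dim, output_dim = 384, 256
projection = torch.nn.Linear(hidden_dim, output_dim)

pooled = mean_pool(token_embeddings, inputs["attention_mask"])   # from the previous snippet
embedding = torch.nn.functional.normalize(projection(pooled), dim=-1)
print(embedding.shape)   # [1, 256]
```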

Coming back to co-training multiple embedding sizes: we take a dimension d, whatever that number is, and we can co-train smaller embeddings along with it, like d/2, d/4, down to d/16. This allows us to co-train these multiple embedding sizes so that, at production time, we can decide which embedding size to use. Research shows that these smaller embeddings can be almost as good as the larger ones. That’s like getting smaller embeddings almost for free and with high quality.
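A minimal sketch of this Matryoshka-style co-training idea is below: apply the same loss at several truncated sizes, where each smaller embedding is simply the first k dimensions of the full one. The dimensions and the `loss_fn` hook are assumptions; `loss_fn` can be any pairwise contrastive loss, such as the in-batch loss sketched later in the training section.

```python
import torch
import torch.nn.functional as F

d = 768
nested_dims = [d, d // 2, d // 4, d // 8, d // 16]   # co-trained sizes

def matryoshka_loss(query_emb, doc_emb, loss_fn, dims=nested_dims):
    """Apply the same contrastive loss at several truncated embedding sizes."""
    total = 0.0
    for k in dims:
        q = F.normalize(query_emb[:, :k], dim=-1)    # first k dimensions
        p = F.normalize(doc_emb[:, :k], dim=-1)
        total = total + loss_fn(q, p)
    return total / len(dims)

# At serving time, just slice and renormalize:
# small_embedding = F.normalize(full_embedding[:, :d // 4], dim=-1)
```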

Now, putting it all together, what was the input? A string. In the end, we get an embedding. The same logic applies to other modalities. For example, if you have an image, instead of going through the text tokenizer, we’ll have a vision encoder. A vision encoder is just a special type of model that takes an image and breaks it apart into multiple patches. You can think of patches as the tokens. For text, we break it down into tokens. For images, we break it down into patches. Again, it will replace those patches with their corresponding vectors. Then the same steps follow: the output is passed through the transformer, the pooler, and the projection layer. The same thing happens with video. Most commonly, the video is represented as a list of frames, that is, a list of images.

The same thing happens here: each of the images is replaced with its corresponding patch embeddings, and then we create a final embedding that captures all of the information in a single video. Now, we’re going to simplify a bit. We’re not going to look at each of these components separately. For the rest of the slides, we’re going to look at this whole box as an embedding model, a model that takes an input and generates a final embedding. Usually, we have two sides of inputs. One is a query, and the other one is a document. What we do is we create two embeddings, one for the query and one for the document. We want to make sure that if the query and document are similar, their embeddings should be closer to each other in the vector space.

Training Techniques

Now that we know what embedding models are made of, let’s look at how we can train them. The most common technique that we use is called contrastive learning. As I said earlier, we want to make sure that for any two inputs that are similar, their embeddings are closer, and for any two inputs that are not similar, their embeddings are far apart. This is what the training data usually looks like. We have pairs of queries and documents, where each example has a given query and its corresponding relevant document. For any given query and document pair, for example, query1 and document1, we want to make sure that their embeddings are closer, that the similarity score is higher. To challenge the model more, we also want to consider all the other documents and treat those as negatives. In short, we take query1. We want to make sure that its similarity with document1 is high.

At the same time, we want to make sure that its similarity with all the other documents in the batch is minimized. This is captured very well by a loss called the in-batch cross entropy loss. It’s very simple. This is a simplified representation where we want to maximize the similarities between positives and minimize the similarities with these in-batch negatives, because these are the negatives that we just take from within the batch. There is another addition to that. We can challenge the model more by adding a hard negative for each example. The way it works is, let’s say your query is, find me the best Italian restaurants in London. If we have an Asian restaurant in London, it’s easy for the model to know that this is not an Italian restaurant, so it’s not a good match.

To make things challenging, we can add a hard negative which is going to be semantically similar, maybe an Italian restaurant in New York. This teaches the model that being an Italian restaurant is not enough; it needs to pay attention to the location as well. So we add some hard negatives for each example, and we just modify the loss to maximize the positive similarity, minimize the similarity with the in-batch negatives, and also minimize the similarity with the hard negatives.
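To make the loss concrete, here is a minimal PyTorch sketch of the in-batch cross entropy loss, plus the variant with one explicit hard negative per example. The temperature value and function names are illustrative choices, not the exact setup used for any particular model.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Row i treats doc i as the positive; every other doc in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T / temperature                     # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

def loss_with_hard_negatives(query_emb, doc_emb, hard_neg_emb, temperature=0.05):
    """Same idea, but each query also competes against an explicit hard negative
    (e.g. "Italian restaurant in New York" for a London query)."""
    q = F.normalize(query_emb, dim=-1)
    candidates = torch.cat([F.normalize(doc_emb, dim=-1),
                            F.normalize(hard_neg_emb, dim=-1)], dim=0)
    scores = q @ candidates.T / temperature            # [batch, 2 * batch]
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)
```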

That’s the training technique. How do we actually prepare the data? Let’s say we have a bunch of text data; how do we prepare it to train these embedding models? There are two techniques. One is supervised learning, and the other one is unsupervised learning. In supervised learning, what we do is use next sentence prediction. Let’s say we take this text from Wikipedia. I just searched for London, and these are the first two lines of the article. We will split them into two separate sentences, and we will say that the left input, the query, is going to be the first sentence, and the document that we need to match is going to be the next sentence. This means that you can take any text corpus that you have and convert it into this next sentence prediction task to train your embedding models.

The other method is unsupervised learning. In this case, we use what is called span corruption. We take the same sentence and corrupt some span of it. For example, in this case, we are going to mask out "London is the capital" and only keep "of both England and the United Kingdom". On the other side, we are also going to take the same sentence, but now we’re going to corrupt a different span. We feed that as a positive example so the model learns that even though these spans are corrupted and masked, it still needs to predict embeddings that are close to each other in the embedding space. The second sentence can be handled the same way.
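Here is a small sketch of both data-preparation ideas: building (query, document) pairs from consecutive sentences, and corrupting a random span to build positive pairs from a single sentence. The word-level masking and the 40% ratio are simplifying assumptions; real pipelines usually operate on tokens.

```python
import random

def next_sentence_pairs(sentences):
    """Turn consecutive sentences into (query, positive document) pairs."""
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]

def span_corrupt(sentence: str, corrupt_ratio: float = 0.4) -> str:
    """Mask out a random contiguous span of words."""
    words = sentence.split()
    span_len = max(1, int(len(words) * corrupt_ratio))
    start = random.randrange(0, len(words) - span_len + 1)
    return " ".join(words[:start] + ["[MASK]"] + words[start + span_len:])

sentence = "London is the capital of both England and the United Kingdom"
# Two differently corrupted views of the same sentence form a positive pair.
pair = (span_corrupt(sentence), span_corrupt(sentence))
print(pair)
```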

Let’s look at how we can convert these large language models that you see everywhere these days into an embedding model. The first stage is preparing the data. We covered two techniques, supervised and unsupervised. Optionally, you can also add hard negatives. The second is choosing the architecture. I will cover this in more detail in the later slides. What’s most important here is how we choose the size, as well as the output embedding dimension. The next step is very important: we take the large language model that’s good at generating text and convert it into an embedding model. We load the model weights into the embedding model and change the attention to be bidirectional so that it can look at the whole sequence as input. The next stage is training. We usually have two-stage training. The first stage is called pre-training.

The goal of this stage is to take the large language model and turn it into an embedding model. We train it on a lot of data, which is usually noisy and usually slightly lower quality. The main goal is that instead of generating text tokens, the model now learns to generate embeddings. The next stage is usually fine-tuning, where we take data that is very specific to the task we have. For example, let’s say your task is a RAG application. You would take some given input and, on the document side, the best matching document that needs to be retrieved, so that the model is able to generate truthful results.

Distilling Large Models for Production

Next, let’s look at how we can distill these large models into smaller models for production. What is distillation? Distillation is basically the process of training a large model and then distilling it into a smaller one. We have a large model, and then we train the smaller model using this large model. There are three techniques that we use for distillation. The first one is scoring distillation. In scoring distillation, we use the teacher model’s similarity scores to train the student model. This is what it looks like. We have a query and a document. We generate the embeddings using the teacher model and compute the similarity score that the teacher model predicts, and then we pass the same input through the student model that’s being trained. We make sure that whatever similarity score it produces is close to the similarity score that the teacher model generated. We usually use some loss which can compare these two scores, for example, a mean squared error loss, which can compare them and teach the student to predict similar scores.
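A minimal sketch of scoring distillation is below, assuming `teacher` and `student` are callables that map a batch of inputs to embeddings; those names and the use of cosine similarity plus MSE are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scoring_distillation_loss(teacher, student, query_inputs, doc_inputs):
    """Match the student's query-document similarity score to the teacher's."""
    with torch.no_grad():                                 # the teacher is frozen
        t_q = F.normalize(teacher(query_inputs), dim=-1)
        t_d = F.normalize(teacher(doc_inputs), dim=-1)
        teacher_score = (t_q * t_d).sum(dim=-1)           # per-pair cosine similarity

    s_q = F.normalize(student(query_inputs), dim=-1)
    s_d = F.normalize(student(doc_inputs), dim=-1)
    student_score = (s_q * s_d).sum(dim=-1)

    return F.mse_loss(student_score, teacher_score)
```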

The second approach is embedding distillation. Instead of using only the final score, we use the embeddings themselves. For example, we have a teacher model and a student model. We put an input through the teacher model and it generates some embedding. We do the same thing with the student model and it generates another embedding. We teach the student model that its embedding should be very close to the teacher model’s embedding. We can also combine both of these things together: scoring plus embedding distillation. In this case, it combines both signals, the actual embeddings that the models generate plus the final similarity score, and uses both of them to train the student model. This is what it looks like. This is what we saw for the scoring distillation, where we take the scores between the query and document for the teacher as well as the student, and we try to match them.

On top of this, we add another component, which compares the query embedding from the teacher and the query embedding from the student. Similarly, we add another component, which compares the document embedding from the teacher and the document embedding from the student. This is called the embedding distillation loss. We basically combine them together so that we’re able to use the teacher model’s power to train the smaller model.
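Extending the previous sketch, here is one way to combine the scoring and embedding distillation losses. It assumes the teacher and student output embeddings of the same dimension (or that a projection head already aligns them); the loss weights are arbitrary placeholders.

```python
def embedding_distillation_loss(teacher_emb, student_emb):
    """Pull the student's embedding toward the teacher's for the same input."""
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

def combined_distillation_loss(teacher, student, query_inputs, doc_inputs,
                               score_weight=1.0, emb_weight=1.0):
    with torch.no_grad():
        t_q, t_d = teacher(query_inputs), teacher(doc_inputs)
    s_q, s_d = student(query_inputs), student(doc_inputs)

    # Scoring part: match the query-document similarity scores.
    t_score = F.cosine_similarity(t_q, t_d, dim=-1)
    s_score = F.cosine_similarity(s_q, s_d, dim=-1)
    score_loss = F.mse_loss(s_score, t_score)

    # Embedding part: match the query and document embeddings directly.
    emb_loss = (embedding_distillation_loss(t_q, s_q) +
                embedding_distillation_loss(t_d, s_d))

    return score_weight * score_loss + emb_weight * emb_loss
```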

You don’t always have a student model architecture, so you might need to create a custom architecture. What are the considerations for designing the student architecture? The first one is the model’s depth and width. Depth is basically how many layers the model has. The higher the depth, that is, the more layers, the better the quality you will get, because the model is able to capture very complex relationships between the different texts. The cost is higher latency. Similarly, you can have a wider model, meaning the model dimension is higher. Again, the higher the width of the model, the better the quality, but the tradeoff is again higher latency. We need to find the right balance between the depth of the model and the width of the model. The second consideration is the type of attention that we’ll use. We have different types of attention. The first type is multi-head attention. What you need to know about this is that it has better quality, but it also has higher memory usage.

The second type is multi-query attention, where you have lower memory usage, so that’s good, but it also results in slightly lower quality. Here is the catch. Most large language models go for multi-query attention, because they want to reduce the memory consumption of these large models. We don’t have that problem. We are creating an architecture for a student model, which is going to be small for production. We can afford to use multi-head attention, despite the fact that it uses slightly more memory, because it gets better quality. That’s what we should optimize for, because the models are already so small. How do we make sure we get the best quality out of them? This is one of the ways. The third thing is choosing the right output dimension. There are a bunch of different costs involved when we serve these models. I will talk about that in detail later on as well. One cost is the storage cost. The other is the memory cost.

The third one is the nearest neighbor search cost. All of these costs are going to be directly proportional to the output embedding size. It’s really important that we keep the size of the embedding small, while keeping the quality at a level we are happy with. This is where Matryoshka Representation Learning can be useful, because you’re able to train multiple embedding sizes in the same process, and use whichever suits your serving requirements.

To summarize, how do we distill models? There are three distillation techniques. One is scoring-based distillation. The other one is embedding-based. Then we can combine both of them to use scoring plus embedding-based distillation. The other thing is, how do we choose the student architecture? A couple of things are important. We can use multi-head attention, because it gives us better quality at the same size, even though it uses slightly more memory; that’s OK. And we should choose a smaller embedding size.

Evaluating Embedding Models

We have models that we have trained. How do we evaluate them? As the saying goes, if you cannot evaluate a model, you can’t really improve it. It’s really important that we find a robust evaluation to understand the strengths and the weaknesses of an embedding model. How do we do the retrieval evaluation? Usually, it has these steps that we’ll go through one by one. First of all, you prepare a test set of queries and a set of candidates that you want to find the results from. This set of candidates can be in the thousands, millions, or billions; it doesn’t really matter. The next thing is to generate embeddings for both the queries and the documents from your embedding model. Once you have these embeddings, you want to compute the top nearest neighbors. The way this goes is you go through each of your queries. For each query, you look at its cosine similarity with all the documents and pick the top K candidates. Once you have these top K candidates, we can compare with the golden candidates. Golden candidates are usually the true labels. For example, for a given query, you will have the right document that the model should be able to retrieve.

Then we can compute different metrics, for example the recall metric, which tells you how often the relevant documents are actually retrieved in the top K results. You can also have the NDCG metric, which is basically a ranking metric. It tells you how good the ranking of the results in the top K is. You can also use mean reciprocal rank, which tells you where exactly in the top K candidates the relevant document is placed.
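For reference, here is a small sketch of recall@K and mean reciprocal rank over a test set. The document IDs are made-up placeholders, and each query is assumed to have a single golden document.

```python
def recall_at_k(retrieved_ids, golden_id, k=10):
    """1.0 if the golden document appears in the top-k retrieved results, else 0.0."""
    return 1.0 if golden_id in retrieved_ids[:k] else 0.0

def reciprocal_rank(retrieved_ids, golden_id, k=10):
    """1/rank of the golden document in the top-k, 0 if it is missing."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == golden_id:
            return 1.0 / rank
    return 0.0

# Average over the whole test set of (retrieved list, golden id) pairs.
results = [(["d3", "d7", "d1"], "d7"), (["d2", "d9", "d4"], "d5")]
print(sum(recall_at_k(r, g, k=3) for r, g in results) / len(results))      # 0.5
print(sum(reciprocal_rank(r, g, k=3) for r, g in results) / len(results))  # 0.25
```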

This seems to be fairly straightforward, but what happens in reality is that we don’t really have golden labels. We have billions of documents, we have billions of images to choose from, and there can be more than one document that is good or relevant for a given query. How do we evaluate a model when we don’t have golden labels? All the steps stay the same, except for the last step. Instead of comparing with the golden label, for each query, we fetch the top candidates and send these predictions to an auto-rater model. You can think of an auto-rater model as a language model that generates an output telling you how relevant a given document is to a given query. Instead of comparing with the golden labels, we send these to the auto-rater model and generate scores, and this tells you how good the retrieved results were.

One metric that I really like here is the position weighted average score. This is very important because this metric takes into account that when you retrieve, let’s say, 10 results, as a human, you’re more likely to put more weight on the first few results, because you’re more likely to see those before you get to the later results. The position weighted average score gives more weight to the first few results compared to the results that show up later in the ranking.
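A minimal sketch of a position-weighted average over auto-rater scores is below. The talk does not specify the weighting, so the 1/log2(rank+1) scheme here is only one plausible choice.

```python
import math

def position_weighted_average(scores):
    """Average auto-rater relevance scores, weighting earlier ranks more heavily.
    The 1/log2(rank+1) weighting is an illustrative choice, not a fixed standard."""
    weights = [1.0 / math.log2(rank + 1) for rank in range(1, len(scores) + 1)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Auto-rater scores (0 = irrelevant, 1 = relevant) for the top 5 results:
print(position_weighted_average([1.0, 1.0, 0.0, 0.0, 1.0]))   # hits near the top score higher
print(position_weighted_average([0.0, 0.0, 1.0, 1.0, 1.0]))   # same hits, lower score
```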

Serving Embedding Models at Scale

Now we have looked at what embedding models are, how to train them, how to distill them, how to evaluate them. Let’s get into how we serve those models at large scale. When we have these embedding models in production, there are usually two sides. One is the query side: when the user types, show me cute dogs, you generate its embedding. This is on the critical serving path, so you need to make sure that it is very fast. On the other side, you have billions of documents or billions of images or passages to retrieve from, so it’s not really possible to generate embeddings for all of these on the fly. You need to somehow pre-compute these and cache the results so you can use them on the serving path. Let’s rephrase this. There are two sides of embedding models. One is real-time query serving, where you take the query and generate the embedding.

On the other side, it’s offline document indexing, where you run a large-scale inference where you generate the embeddings for all of your corpus. It could be documents, could be images, videos, or passages, whatever the corpus is, you will generate its embeddings using the embedding model. Once you have the embeddings, you need to create some structure around it so that you’re able to query the top K candidates very quickly because this is going to be on the serving path. The way this goes is, user types a query, you generate its embedding, you send it to the index that you created, and it gives you the top K candidates back based on the similarity.

A lot of things can go wrong with this. Let’s look at the challenges: what can go wrong, how we can mitigate those issues, and how we can serve these models reliably. The first aspect is the query latency. As I said earlier, this is going to be on the critical path, so you need to make sure that it’s super-fast, because the user is literally waiting while you’re generating this query embedding. One of the ways we can do this is by enabling server-side dynamic batching. What this means is that when a lot of requests come to your model, instead of serving them one by one, you group them on the server side and serve them as an entire batch. Serving them separately has extra cost, extra latency, and slightly lower GPU or TPU utilization.
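Here is a very rough sketch of server-side dynamic batching with a queue and a background worker. It assumes a `model.encode(list_of_texts)` style interface and the batch-size and wait-time values are arbitrary; a real serving stack would usually rely on the batching built into the inference server instead.

```python
import queue
import threading
from concurrent.futures import Future

MAX_BATCH_SIZE = 32
MAX_WAIT_SECONDS = 0.005   # how long to wait for more requests to accumulate

request_queue: "queue.Queue[tuple[str, Future]]" = queue.Queue()

def embed_query(text: str) -> Future:
    """Called per incoming request; returns a future resolved by the batch worker."""
    future: Future = Future()
    request_queue.put((text, future))
    return future

def batch_worker(model):
    """Collect requests up to MAX_BATCH_SIZE (or MAX_WAIT_SECONDS), then run one batched forward pass."""
    while True:
        text, future = request_queue.get()            # block until at least one request arrives
        batch = [(text, future)]
        while len(batch) < MAX_BATCH_SIZE:
            try:
                batch.append(request_queue.get(timeout=MAX_WAIT_SECONDS))
            except queue.Empty:
                break
        texts = [t for t, _ in batch]
        embeddings = model.encode(texts)              # one forward pass for the whole batch
        for (_, fut), emb in zip(batch, embeddings):
            fut.set_result(emb)

# threading.Thread(target=batch_worker, args=(model,), daemon=True).start()
```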

The next thing you can do is quantize the model weights. This means you take the model weights and reduce their precision; because they take less memory, computation gets faster, and this makes your query predictions faster. Usually, if you do the quantization right, there is no quality drop, or a very minimal quality drop. The third thing is to use a smaller query model, and this is actually something that we use in almost all the applications: you have two embedding models. One is the query model, which is much smaller in size because it needs to be on the serving path, and the other one is the document embedding model, which is only used during offline indexing, so we can afford slightly more latency on the document side.
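As one possible starting point, here is a sketch of PyTorch dynamic int8 quantization; `model` is assumed to be a PyTorch encoder such as the one loaded earlier, and this is just one of several quantization approaches.

```python
import torch

# Dynamically quantize the linear layers to int8; weights are stored in lower
# precision and dequantized on the fly during the forward pass.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Serve `quantized_model` in place of the original, and re-run your evaluation
# set to confirm the quality drop (if any) is acceptable.
```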

Next, let’s look at the document indexing cost. Throughput matters a lot because you could have billions or trillions of documents depending on what scale you work at. Storage costs can also be huge, so we’ll look at how we can optimize both of these things. The first solution is to use a larger batch. As we saw on the query side, if you batch on the server side, you’re able to serve the requests faster. We can do the same thing here, except that we don’t need to wait for the server to do dynamic batching. We can already use a large batch of inputs and send that to the model rather than sending one input at a time.

The other thing is, of course, to use more GPUs or TPUs. This is easy to say but very hard to acquire because it’s very expensive, but if we can parallelize the large inference pipeline more, it will run faster because you’re able to process more inputs at the same time. The third thing, and I will keep repeating this because it is one of the most important points, is that you need to reduce the embedding size. Usually we use something like a 64, 128, or 256 embedding size because that’s easier to store as well as faster to serve.
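A minimal sketch of the offline indexing loop with large batches is below; `doc_model.encode` is an assumed interface, and in practice this loop would be sharded across many workers or accelerators.

```python
import numpy as np

def index_corpus(doc_model, documents, batch_size=1024):
    """Offline indexing: encode the corpus in large batches and stack the embeddings."""
    all_embeddings = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        all_embeddings.append(doc_model.encode(batch))   # one forward pass per large batch
    return np.vstack(all_embeddings)

# doc_embeddings = index_corpus(document_model, corpus)  # then build the ANN index from these
```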

Next, we’ll look at the nearest neighbor search latency. As I said earlier, we have the query embedding, and now we need to go into the index and find the best matching documents. This can be non-trivial because you need to choose the right kind of structure. This is going to be on the critical serving path, so you need to make sure that it’s fast. Some of the solutions: once again, use smaller embeddings, because the cost of searching through the index is directly proportional to the embedding size. The other one is, don’t go for the exact match. There are two options here. One is you look for the exact match, where you have a 100% guarantee that you will always find the best matches. This is going to be very slow because it’s essentially brute force: you’re looking through all your documents just to find the right ones. Instead, what we do in production is use approximate algorithms. You can check a lot of different benchmarks for these algorithms and choose the best one that fits your use case.

The third thing is that usually this index is stored in a vector database. There are a lot of solutions out there, and you need to choose depending on what your requirements are and where your stack is. For example, if you’re using Google Cloud, there’s Spanner k-NN. On AWS, there’s OpenSearch, which allows you to store these embeddings and query them quickly. There are other solutions as well; you just need to find the best one that matches your use case. There’s one more issue. On the internet, nothing is static. New documents are added. New images are added. Once you have this index, it can become stale very quickly, and you need to work around that. One obvious solution is to rerun the pipeline you have for generating the document embeddings periodically. A smarter thing you can do is not to run it for all of your corpus again and again, but to update your index in an incremental manner so that you only process the documents or images that were newly added.
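As one concrete option among many, here is a sketch of an approximate nearest neighbor index with FAISS; the random embeddings, dimension, and cluster counts are placeholders, and the IVF index type is just one of several approximate structures you could benchmark.

```python
import faiss
import numpy as np

dim = 256
doc_embeddings = np.random.rand(100_000, dim).astype("float32")   # stand-in for real embeddings
faiss.normalize_L2(doc_embeddings)             # so inner product equals cosine similarity

# Approximate index: cluster the corpus, then search only a few clusters per query.
nlist = 1024                                   # number of clusters
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_embeddings)
index.add(doc_embeddings)
index.nprobe = 16                              # clusters visited per query (speed/recall knob)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)          # top-10 approximate nearest neighbors
```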

We looked at all these things, all the things to consider for production. What if you say, I don’t have the resources? I don’t have the resources to train my own model or create my own embedding model. It could be that you don’t have enough training data. You don’t have GPUs or TPUs to train the model. Or you just don’t have time, and you want an embedding model tomorrow. Of course, you can use some of the off-the-shelf models that are available online, but there are some things that you need to be very careful about. The first thing is the intended use case. You need to make sure that the model you’re picking is trained for your specific use case. For example, if your use case is, find related products, you need to make sure that it was trained on data that makes it good at shopping use cases.

If your use case is RAG, you need to make sure that the model you’re picking is already trained for RAG applications so that it performs well on your use case. The next very important thing is the languages. Some of the models that you see online are trained on English only, or on only 20 or 30 languages. You need to be careful: look at what languages the model is trained on and make sure that it fits your use case. If you have a use case where your inputs can be in 100 different languages, you need to pick the right model and make sure the model is already trained on those languages. The next one is the training data. Let’s say you’re using these models for a RAG application. You need to make sure that the model has seen some data from your domain.

For example, if your RAG application is related to shopping, you want to make sure that it has seen some shopping data so that it already has some domain knowledge. The next is the model size and efficiency. We already talked about this. It’s on the critical path, so you need to make sure it’s fast. The model size and its serving efficiency should match your requirements. Again, the output embedding dimension is very important. Make sure that the size you need is available in the model that you pick. The next is licensing. This is very important because a lot of times some of the public models are trained on data that cannot be used commercially. You need to be aware of what kind of data was used for a particular model and make sure that you only use the ones that you’re legally allowed to serve or to use in your business. And of course, there is the cost, depending on whether you’re able to run that model on your own hardware or you’re running it somewhere else; you need to know what the cost of serving it is.

The other thing that’s a bit softer is the community support as well as the documentation. You need to make sure that if you face any issues with that solution, you’re able to get help from the community. Last, and still very important, is performance on the benchmarks. There are a lot of benchmarks available for judging the quality of embedding models. You need to look at the performance of these models on the benchmarks to get a better idea of the best quality model that you can use that also satisfies all the other requirements.

Key Takeaways

We have gone through all these components. Let’s look at some of the takeaways. What are the key points that you can take and start applying at your work or in your personal projects? The first one is that embedding models power most of the search and RAG applications. Very important, you cannot really ignore them. The next is that evaluating these models can be a little tricky when you don’t have strict golden labels, so you might need to use an extra model just to score the retrieved results so that you can get an idea of how good they were. The third thing is, you have large models, but you need to distill them into smaller production-friendly sizes so that you can serve them reliably with low latency.

The fourth one is, when you use these models in production, there will usually be two sides. One is the real-time serving and the other one is offline indexing. You need to make sure that you’re aware of these two methods. For real-time serving, you can use dynamic batching that we talked about, or quantization. For offline indexing, you can use a larger batch size to run through billions of documents faster, or you can also use a smaller embedding size to reduce the cost. Then in the end, if you are picking up a model off the shelf, just a ready-made model that’s available online, there are several things that you need to consider before you choose the right candidate.

Questions and Answers

Participant 1: In the evaluation section, when you said if there are no golden labels available, you would use an auto-rating model. How is the auto-rating model able to assess the similarity between the pair that you spoke about?

Sahil Dua: When we are evaluating the model, how is the auto-rater model able to judge how good the results are? Usually, we use very large language models like Gemini or ChatGPT, whatever models you have available, and these models are already very good at judging, given a query and a document, whether they are related and how well they are related. If you find that your domain is very specific, let’s say the legal domain, you need to make sure that the model is actually aware of the technicalities there, so it’s possible that you need to fine-tune the model on your particular data to know what is good and what’s not. Usually, if it’s a generic use case, you should be able to use any language model to predict the scores.

Participant 2: If you’re using Gemini and LLMs as an auto-rater to evaluate your embedding models, what do you use to evaluate your Gemini or your LLM in the first place?

Sahil Dua: If you’re using the Gemini or ChatGPT models to evaluate embedding models, what are we using to evaluate these language models? It depends. As I said, there are two ways. One is you can just use an off-the-shelf model that is already trained. You can use that to predict how good the results are. In this case, you don’t need to evaluate those models because you already know that they are good at language understanding and predicting the best response. If it’s the second case where you’re fine-tuning your own model, you need to have another set of evaluations. For example, you have a bunch of scores: you have input, output, and a corresponding relevance score. You can train the model on this and hold out a small set of test data that you can then use to evaluate how good this model is. You’re training a model and evaluating a model so that you can use that model to evaluate an embedding model. That’s very common.

 
