GenAI at Scale: What It Enables, What It Costs, and How To Reduce the Pain

News Room | Published 8 September 2025 (last updated 9:50 AM)

Transcript

Kurtz: My name is Mark Kurtz. I was the CTO at a startup called Neural Magic. We were acquired by Red Hat at the end of last year, and I'm now working under the CTO arm at Red Hat. I'm going to be talking about GenAI at scale: essentially, what it enables, a quick overview on that, what it costs, and generally how to reduce the pain. Running through the structure in a bit more detail, we'll go through the state of LLMs and real-world deployment trends: how things are generally shaping up today, as well as where we're headed.

Then we'll cover challenges and key decisions in production as we move more and more AI, especially LLMs, into deployment, and dive into specifics on the tools you can use to optimize those deployments. Specifically, we'll dive into vLLM, run through some model compression as well, and fine-tuning with InstructLab, just trying to go over some open-source tools that you can actually use for these deployments. Then, finally, we'll conclude with tuning your deployments.

The State of LLMs and Real-World Deployment Trends

Let's run through the state of the industry today. The first part is why LLMs. LLMs, especially when ChatGPT 3.5 came out, really made a big difference in terms of the usability of these really large models specifically targeted at generating text. Ultimately, they're able to understand and generate natural, human-like text with an accuracy we really haven't seen in models before. They're trained to predict the next token as part of pre-training. Then the really key thing that enabled ChatGPT to blow up is tuning for alignment preferences. Essentially, after you've trained it to predict the next token across nearly the entire web, you align those models to human preferences. What that means is that we ask it to continually generate answers.

Then we're ranking, as humans or with a model in between, whether that answer is good or not. That's a key point: LLMs are trained to make humans happy. That is where you get a lot of hallucination issues and things like that. What makes people happy, and at least what they're generally tuned for, is agreeableness and acting like you have the correct answer. Some issues come out of that. In general, there are a lot of use cases, which we'll run through, that are still heavily enabled by these new interfaces. With that, nearly every company out there right now is working to integrate LLMs, or at least figure out how to do so, mainly at the direction of their boards. Nearly every board is saying, go figure out how to make money on this. It's pretty much in every single industry right now and every single company.

The last numbers I saw, maybe 8% of companies are not looking into GenAI at all. Everybody else is actually looking into it to see what they can do with it. Current state, only 3% to 10% of prototypes actually make it to production. Of anything you may be building with LLMs, a pretty small percentage is actually going to get deployed. With that, about $106 billion is going to be spent on deployments this year, just on deployments, leaving training out completely. Training is an entire other beast that costs a lot of money as well. Looking at the projected state, where we're headed in about the next two years, most analysts see that going up to 30%.

At least, they're projecting those prototypes making it to production to go up to 30% as the tooling evolves and as people's understanding evolves of how to work with these and what applications they work well with. With that, inference spend is projected to hit about $255 billion by 2030. Obviously that's very conditional on technologies and where things go, but it's most analysts' best projection right now.

Where is all of that money going? Let's dive into some use cases. Pretty much the biggest right now is code and content generation, where we're augmenting human workflows. If people have heard of Cursor, Cursor is almost a billion-dollar company now, and all that they're doing is augmenting software engineers: essentially, a really advanced autocomplete. It works really well for that, primarily because of that point about being trained on the next token: it knows what you're most likely to type next and what you want it to put out. It does make mistakes. It hallucinates. That's where this augmentation comes in a lot, which is that it's a really nice workflow to make humans much more productive.

Generally, the numbers come out at about 30%, though it really depends on who you ask. Some will say it's going to replace humans. I do not agree with that at all. I don't think it's ever going to do that, at least with LLMs; we need a new technology to do that. It is able to augment and gain efficiency, and the average comes out at around 30% efficiency gains. This is what's coming out of Google and Microsoft in terms of the active software engineering flows that they're seeing. Then the next top use case is going to be summarization, where we distill key insights from long documents: things like meeting notes, reviews, and articles. This is getting really popular now that Gemini is getting rolled out across the Google suite and Copilot across Microsoft, where you can have a bot that's actually listening to your meeting, transcribing it, and then summarizing that to send out notes afterwards. Really helpful in that use case.

Additionally, another active use case is Amazon reviews. If people have started to notice, there is a little summary that pops up above all of the reviews. They're essentially feeding every review into a chatbot and asking it to summarize them, with some constraints on it to make sure it's not hallucinating, things like that. It works really well to distill those key insights into easily consumable information for humans. Then the final one is question answering systems. This is either answering queries from internal sources for things like onboarding or internal Q&A and support, or external sources, where some companies, some airlines, and things like that are starting to actively deploy chatbots out there. There are a lot of guardrails that go around that.

Generally, they are implemented as a RAG system, Retrieval-Augmented Generation. Essentially, you just have a big database of all the information that you may be able to collect. When a question comes in, you’re going to source a few of those documents that look the most relevant to the question based on some vectors that another model creates. You’re going to feed all of that into the chatbot to say, answer this question based on all of this context. The nice thing about that is that it’s really significantly less likely to hallucinate, especially when Sam Altman sets his target for hallucinations at about 10%. That’s where they want to get to. Whenever we go into a Retrieval-Augmented Generation setup, it makes it really easy to, one, ensure that the model is sourcing from some actual information, and not just making something up, especially for internally private information.
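
To make that concrete, here is a minimal sketch of the retrieve-then-generate flow, assuming a hypothetical `vector_db` and `embed` function for the retrieval side and an OpenAI-compatible chat endpoint (such as one served by vLLM) for generation; the model name and prompt wording are illustrative, not from the talk.

```python
# Minimal RAG sketch: retrieve the top-k most relevant documents, then ask the
# model to answer strictly from that context. `embed` and `vector_db` are
# hypothetical stand-ins for whatever embedding model and store you use.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g., a local vLLM server

def answer(question: str, vector_db, embed, k: int = 4) -> str:
    # 1. Embed the question and pull the k most similar documents.
    docs = vector_db.search(embed(question), top_k=k)
    context = "\n\n".join(d.text for d in docs)

    # 2. Ask the chat model to answer only from the retrieved context.
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",   # example model
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the answer is not in the context, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```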

Additionally, it’s also really easy to check. If we see words that are coming out in the summary or the answer that are not in the context, then we know it’s most likely hallucinating. This is why a lot of these production systems are going out. Either humans are in the loop and it’s augmenting them, or there are easy pathways to be able to validate hallucinations and see if it looks like the correct answer or not.

Then, going into production for a lot of prototypes, there's a pretty long general flow. First, we start off with rapid prototyping. We're going to focus on usefulness, not performance. We're just trying to see, does an LLM make sense for the application that we're trying to deploy it in? Do something as quick as possible, as cheap as possible, just to see if it shows any promise towards usefulness. Next, generally, we'll move into accuracy evaluations. We're going to define some eval criteria. Most companies will go through and set up some academic benchmarks. This is something like LM Eval Harness; these are essentially the metrics that all of the models are optimized to maximize. It's things like math tests to see how well the baseline model is doing.

After that, we get into more useful metrics, at least ones that correspond with the real world, things like an LMArena-style harness, which is just doing chatbot versus chatbot. You can put an LLM in the middle to try and evaluate which chatbot looks like it's giving the best response. It's a really quick way to validate which one is outperforming, especially for your specific data, if you want to plug that in. Then, finally, the big thing is internal evaluations and human evaluations, where you're always essentially going through and defining your own data, so you can have some quick checks on that, and running it through some type of human feedback system, so you can tell before you deploy whether it looks useful and whether it's going to exceed what you might have right now. After we've defined all that, we're going to go through a bunch of architectures and sizes. How much you do here completely depends on your budget and your team size.
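
As a rough illustration of that chatbot-versus-chatbot idea with an LLM in the middle, a minimal pairwise-judge loop over your own eval prompts might look like the sketch below; the endpoint, model names, and judge prompt are all assumptions rather than how LMArena itself is implemented.

```python
# Rough sketch of pairwise LLM-as-judge evaluation on your own prompts.
# In practice the two candidates and the judge may sit behind different endpoints.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def judge(prompt: str, answer_a: str, answer_b: str, judge_model: str) -> str:
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Question: {prompt}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
                       "Which answer is better? Reply with exactly 'A', 'B', or 'tie'.",
        }],
    )
    return resp.choices[0].message.content.strip()

eval_prompts = ["How do I reset my password?", "Summarize our refund policy."]  # your own data
wins: dict[str, int] = {}
for prompt in eval_prompts:
    a = generate("candidate-model-a", prompt)      # hypothetical model names
    b = generate("candidate-model-b", prompt)
    verdict = judge(prompt, a, b, "judge-model")
    wins[verdict] = wins.get(verdict, 0) + 1
print(wins)
```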

The point is to make sure that you're comparing at least more than one model. Models are trained, tuned, and, most importantly, aligned very differently. Each company has its own proprietary formula for doing that alignment. They will give very different answers, and they will be tailored to different use cases, so explore those. After we've defined our evals, we can dive in and see which models work the best. Optionally, we can fine-tune on our own data; I'll go through that in a lot more detail a little bit later. Fine-tuning definitely helps, both towards alignment with your brand as well as being able to create smaller models.

The big point here is that if we fine-tune on our data, generally, we can make it significantly more accurate. After that, we’ll move into inference performance testing. We want to ensure the model is fast enough and cost effective. I’ll define those a little bit more later. Then we’ll generally move into some type of limited deployment. Or if we have already been deployed and we’re trying to improve it, some A/B testing. That means we’re going to release to a subset and monitor and gather feedback on how that model is doing. Then, finally, we can go into a full launch and deploy accurately and performantly at scale.

Challenges and Key Decisions in Production

Next, moving into challenges and key decisions in production: where are the complications that come out of everything we just ran through? Because the rest of this talk is about hosting your own models and what that costs, the first thing is to decide whether you should even consider hosting your own and when you should go about that. These are the top reasons people decide to host their own models. One, data privacy. We want to make sure that the data flowing through our systems, especially if we have HIPAA or any other considerations, stays within our systems. We're not giving that to a ChatGPT, things like that. This is especially prevalent because lawsuits against OpenAI have recently come out with constraints that they cannot delete any chats at all. A big consideration there. Two, model lifecycle control. We want to avoid external API changes so that we can have constant interfaces.

A lot of issues also came out of OpenAI deprecating ChatGPT 3.5; a lot of systems were built on top of that, and when it went away, those systems essentially all broke. I think they gave about a month's heads-up on it. The big point there is that all the prompts that you're writing, all the flows that you're using, things like that, are not immediately transferable to another model. You have to tweak those, and that's where a lot of this fragility comes in and why you want to control that lifecycle. Another piece is cost optimization. We can create smaller and optimized models, which I'll walk through in a lot more detail later, so that we can control how much we're willing to spend rather than being given a set price per token.

Finally, customization. A big point here is aligning with your brand, terminology, and data. These models are trained on pretty much the entire web and aligned based on whatever the company that created them wanted to do, so that may not match your terminology, your brand, things like that, and we need to tune those models to our use case. So when does it make sense to do this, based on those considerations? If you have a few or fewer deployed applications, generally just stick with a hosted API, unless you have strong rationale based on those four key points. This is primarily because you're not going to get enough scale to justify a team managing it, or even enough scale to amortize costs across applications. If you only have one use case that you're hitting very infrequently, GPUs are expensive: that GPU is going to keep running and keep costing you money, so you want to make sure you can keep it fully utilized.

If you have more than a few, we want to ensure that we have a team essentially dedicated to the creation and deployment of LLMs. This can be anywhere from one or two people for a startup up to a team of around 40 for a large enterprise; it really depends on the size and scale of your company. Then it definitely makes sense to bring it in-house. If you're running more than a few apps and you can dedicate those resources, absolutely bring that in. You immediately gain all of the advantages that I just went through, and you will be able to run these significantly cheaper than what you can get from those open APIs.

Let's dive into what makes it hard to deploy, then. First, let's define accuracy SLOs, service level objectives. A model that's wrong is obviously not helpful. Accuracy, as we walked through on the eval side, must exceed some usability threshold before we can go out with it. Otherwise, we'll hit issues with hallucinations or off-brand responses, things like that, ultimately leading to a poor UX and a lack of trust from users. The core point here comes down to model size: the larger the model, the more likely it is to be accurate, especially for your use case. There can be some significant differences as we scale those models up in size.

The other side of it is inference performance SLOs, where essentially latency can't get in the way. I think probably everybody is familiar with the Google Search SLOs. I think it was about 2 seconds they were aiming for, maybe it was a second, but they saw that a 5% deviation from that led to essentially a 10% deviation in their traffic. Really small fluctuations in hitting a latency threshold can mean life or death for your application in terms of how many users want to use it. The big things that guide LLM inference performance are going to be time to first token (TTFT), inter-token latency (ITL), and request latency. Breaking those down, LLMs are autoregressive. This means they generate one token, essentially one word, at a time. For each token that we generate, we're executing that model again and again, building on top of the previous tokens. We're executing those models a lot. Time to first token is the time that it takes to process the initial request and get the first response to a user.

After that, inter-token latency is how long it takes to generate each subsequent token. We'll go through compute breakdowns and how those change in a bit more detail later. Otherwise, throughput must scale. We need to be able to handle whatever we define as production-level scale to meet our users at essentially that latency and at that accuracy. Small gaps can definitely mean big failures there. Ultimately, we need to be "fast enough" and "accurate enough." Those are in quotations because they are very much determined by the use case as well as by your own company and what your users are willing to tolerate.
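
As a small illustration, TTFT and ITL can be approximated directly from a streamed response against an OpenAI-compatible endpoint (such as a vLLM server); the endpoint and model name here are assumptions, and streamed chunks only approximate individual tokens.

```python
# Sketch: measure time-to-first-token (TTFT) and inter-token latency (ITL) for
# one streamed request against an OpenAI-compatible endpoint (e.g., vLLM).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
chunk_times = []

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # assumed model name
    messages=[{"role": "user", "content": "Write a short note about latency SLOs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

if chunk_times:
    ttft = chunk_times[0] - start
    # Average gap between streamed chunks approximates inter-token latency.
    itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
    print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms over {len(chunk_times)} chunks")
```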

Now, running through the resource requirements. Just fitting the model, looking at hardware requirements: Llama 4 109B, so 109 billion parameters, is 218 gigabytes of just weights at 16-bit precision. That means we need three 80-gigabyte GPUs, and each of those GPUs is about 30 to 35 grand, just buying it outright. Then with Llama 4 400B, we actually get beyond being able to fit this on a single server. Generally, most servers will go up to 8-by-80. There are the new B-series GPUs from NVIDIA that do go a little bit above 80 gigabytes; they essentially put two GPUs together with high-bandwidth memory. But we're looking at getting outside of deploying on one single server: we need essentially two servers, ten 80-gigabyte GPUs, just to fit the weights. The KV cache you can essentially think of as a user session. These are the activations that are kept around so we don't need to redo a lot of compute.

Most importantly, these are essentially all the activations for a single user that we need to keep around, and it's about 2.5 megabytes per token. Going back to it, a token is roughly equivalent to a word, though words are often broken up into multiple tokens. What that means is about 2.5 gigs of KV cache for a medium-sized request and 25 gigs for a very long context. Long context is things like summarization of large documents or reasoning requests; medium-sized requests are more day-to-day chat use cases, things like that.

Then, scaling out users just based on memory, we're looking at 32 parallel users on an 8-by-80-gig system, and roughly 32 parallel users on a 2-by-8-by-80-gig system with the 400-billion-parameter model. That's a lot of money just to store some weights. Scaling users based on SLOs, just using Transformers (pip install transformers) and running the model through it in Python, we can essentially get two parallel users with some reasonable SLOs, essentially a 1-second TTFT and a 50-millisecond ITL, for that 109-billion-parameter use case, and about three parallel users for the 400-billion one. You can see where this is going to start to add up. This is why, ultimately, your CFO is worried about LLMs, as you can see from these numbers.
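
As a back-of-the-envelope sketch of the memory math behind these numbers (16-bit weights at 2 bytes per parameter, and the roughly 2.5 MB of KV cache per token quoted above; real capacity depends on architecture, precision, context mix, and scheduler overheads):

```python
# Back-of-the-envelope memory math for the numbers in this section:
# 16-bit weights are 2 bytes per parameter, and KV cache is taken at the
# roughly 2.5 MB per token quoted above (architecture-dependent in practice).
import math

GPU_GB = 80                 # memory per GPU
BYTES_PER_PARAM = 2         # FP16/BF16
KV_MB_PER_TOKEN = 2.5

def weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9

def gpus_to_fit(params_billion: float) -> int:
    return math.ceil(weight_gb(params_billion) / GPU_GB)

def kv_gb(tokens: int) -> float:
    return tokens * KV_MB_PER_TOKEN / 1000

print(weight_gb(109), gpus_to_fit(109))   # ~218 GB of weights -> 3 x 80 GB GPUs
print(weight_gb(400), gpus_to_fit(400))   # ~800 GB of weights -> 10 x 80 GB GPUs
print(kv_gb(1_000), kv_gb(10_000))        # ~2.5 GB medium request, ~25 GB long context

# Parallel sessions by memory alone on an 8x80 GB server with the 109B model;
# real capacity depends on the context-length mix and scheduler overheads.
free_gb = 8 * GPU_GB - weight_gb(109)
print(free_gb / kv_gb(10_000), "long-context sessions fit in the remaining memory")
```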

For the 109-billion-parameter model, we're almost at a million dollars a month just for a small-sized startup. Let's run through the tradeoff triangle, weighing those. Ultimately, we're balancing speed: real-time latency is crucial, and high throughput requires more compute. Accuracy: we need it for trust, alignment, and usability, and larger models cost more; they take more compute, more resources. Then cost: we want to keep our infrastructure and GPU usage under control so we're not breaking the bank, and aggressive optimization towards that is either going to hurt speed or accuracy. Essentially, we can pick two out of those three, but we'll run through the right tooling for how we can shift that triangle rather than trying to change the tradeoffs. Tradeoffs will always be there, but we can significantly improve on that triangle at least.

How vLLM Unlocks Efficient Serving

One of them is vLLM, and we're going to walk through what vLLM is. It's a purpose-built LLM serving solution rather than a general ML solution that does training and inference and everything else. Ultimately, its job is to avoid wasteful compute and memory overheads, primarily because we can freeze the graph at inference time: nothing's changing, no weights are changing, things like that. We can add a lot of optimizations, and we can also tailor kernels and all of that underlying software for the exact models that we're looking at, and specifically the matrix-multiply sizes that they're executing. Its other goal, outside of running as performantly as possible, is to make sure you can easily scale to multiple users.

vLLM is essentially the most popular open-source LLM inference engine out there right now, based on stars as well as day-to-day usage. It's built from the ground up towards those high-performance pieces I was just running through. At a very high level, under the hood vLLM has a scheduler and the KV cache manager. I won't dive into detail on that, but that is the session state you essentially need to keep around. Then you have a CPU allocator and a GPU allocator; essentially, the CPU's goal is just to feed the GPU as fast as possible. One of the big things underneath it is PagedAttention, which essentially does virtualized memory for the KV cache. I don't want to dive into too much detail, but if people have questions after, definitely ask them. There are also the tuned kernels and graphs that I was talking through, and multi-process scheduling: we want to orchestrate the CPU so that the GPU is constantly at full utilization.

Ultimately, what that means is we can get about 24x faster inference performance over a baseline Transformers run. There's a quick Python example and a quick server example. It's really simple to get up and running: just pip install vLLM, and either of those commands is going to work. You can either run it in Python and actively send requests to it, or you can run it as a server over HTTP. Now, recalculating the cost, we can see how this significantly shifted. We went from almost a million dollars a month down to less than about $50,000 a month for that 109-billion-parameter model in that startup use case.
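
The slide examples aren't reproduced in the transcript, but a minimal version in the spirit of the vLLM quickstart looks roughly like this (the model name is just an example):

```python
# Offline batch inference with the vLLM Python API (pip install vllm).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")        # example model
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What does an inference server do?"], params)
print(outputs[0].outputs[0].text)

# For serving over HTTP instead, recent versions expose an OpenAI-compatible
# server via the CLI, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
```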

In conclusion, you should use a dedicated inference server like vLLM: up to 24x faster, fewer GPUs needed, so lower cost and less scaling, and purpose-built for real-world, production-ready inference. There's a lot more we're going to run through, but this is where most people stop. Most people are not optimizing the models, they're not fine-tuning them, things like that, so they stop at that roughly $50,000 bill. There's a lot of compute in there, and as you can imagine, with 200 gigabytes of weights there has to be some redundancy, and that's what we're going to explore.

Reduce, Reuse, Compress

First is what I call type 1 compression, where we're going to keep the same model and the same architecture, but improve our inference performance through dimensionality reduction. There are a few different ways you can do dimensionality reduction, but ultimately our goal is to make sure we're maintaining accuracy while we're doing it. With that, we're going to target some combination of memory usage, memory bandwidth, and compute requirements. If we can relax any of those, then we will have better inference performance.

The most popular approaches along these lines are quantization, where we're reducing the precision (I'll go into a little more detail on that), and pruning, where we're actually removing weights from the model. For quantization, as you can see in that little distribution there, we're reducing the precision. Most models run at FP16 or BF16, and we're going to reduce that down to 8-bit or 4-bit, things like that. We're taking a really big distribution and trying to force it into a much smaller one, that smaller one being beneficial for a lot of different reasons that I'll run through. We want to project into that smaller domain, all while maintaining accuracy. Why does it work? Mainly because these models are trained with Stochastic Gradient Descent, and that training process is extremely noisy, with a lot of different updates happening.

Essentially, those weights become very resilient to small fluctuations in their outputs, because we constantly have shifts like that during training. If you look at the gradients while training, they're constantly flipping, and the network becomes robust to that noise. Quantization algorithms are all focused on exploiting that so we can find the optimal projection. Ultimately, that means that we're going to need less memory, both in terms of total memory to store the weights and memory bandwidth as we're transferring those weights around, and less compute, at least if we're on supported hardware for the target scheme.
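
As a toy illustration of that projection into a smaller domain, here is plain symmetric round-to-nearest INT8 quantization of a weight tensor; production schemes such as GPTQ or AWQ use per-channel or per-group scales and error-correcting updates, so treat this purely as the basic idea:

```python
# Toy symmetric INT8 quantization: project FP32 weights onto 256 integer
# levels with a single scale, then dequantize to see the rounding error the
# network has to absorb. Production schemes use per-channel/per-group scales
# and error-correcting algorithms (GPTQ, AWQ, and so on).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)   # stand-in weight matrix

scale = np.abs(w).max() / 127.0                                  # one scale per tensor
q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)      # 1 byte per weight
w_hat = q.astype(np.float32) * scale                             # dequantized view

print(f"storage: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
print(f"mean abs rounding error: {np.abs(w - w_hat).mean():.2e}")
```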

Quantization in production: if you go to a number of different organizations on Hugging Face, a lot of people are publishing quantized models, including Red Hat AI, where I work. These are the normal schemes that you'll see. We have W4A16, which means that the weights are kept at 4-bit with 16-bit activations. We get a reduction in memory usage and memory bandwidth, but no compute reduction, because we need to up-convert the weights to 16-bit so we can multiply them with the activations. We end up with about 3.7x compression, so 3.7x fewer GPUs needed, and about a 3x performance increase for latency-sensitive use cases. INT W8A8 is where we have 8-bit weights and 8-bit activations. We get a reduction in memory usage, memory bandwidth, and this time compute, because now we can multiply the 8-bit weights with the 8-bit activations. Most hardware has specific 8-bit kernels that run about twice as fast compared to 16-bit multiplications.

Generally, that's 2x compression and about a 3x speedup, and you can get a lot of savings in between, so you can serve more users, things like that. Then the final one is FP W8A8: 8-bit floating-point weights and 8-bit activations, pretty much all the same pieces as the INT W8A8 setup, but it's a lot easier to compress models, with far fewer concerns about using advanced algorithms to preserve accuracy. It's just generally easier to deploy. You can get 2x compression and a 3x speedup, but you are limited to essentially the latest hardware. For NVIDIA, you're looking at Ada Lovelace, Hopper, and later, and the latest AMD parts have it as well.

Let's dive into pruning. We're going to remove unimportant weights. That 200 gigabytes of weights that I was talking about, let's see if we can remove some of it so we can significantly bring down the size. We're going to zero out low-impact connections, as you can see in that little graph: the connections between the nodes are the weights, and we're going to see if we can zero some of those out and just remove them, all while maintaining accuracy. Why does it work? When we're doing Stochastic Gradient Descent, it's really important that it starts off exploring a really large search space, just so that it can try out essentially every different solution. When it actually starts converging, though, it's only converging on a small part of that optimization space.

Converging on a small part of that optimization space means that we're only using "a few" weights to do it. How many you can get rid of depends on what you're doing, but we'll run through that a little bit. That convergence settles on a smaller pathway, and pruning algorithms are all focused on figuring out how to remove the unimportant weights and how to update the remaining ones to correct for any error. The impact: we get lower memory usage and bandwidth from sparse weight storage, and on supported hardware we can enable compute speedups. Unstructured pruning is the most general form: any individual weight can be set to zero across the entire network. We get a reduction in memory usage and memory bandwidth, but no compute reduction, mainly because most hardware cannot support unstructured sparsity; it's a really hard problem to solve for. With this, you can get roughly 2.5x compression and about a 1.5x speedup for latency-sensitive use cases.

For LLMs, we've been able to get up to 70% sparsity, so we can wipe out 70% of the network's weights and still maintain the same accuracy. 2:4 semi-structured pruning is something very specific to NVIDIA. We essentially order the weights in a giant array, and for each block of four weights, we set any two of them to zero. It's a little more constraining than unstructured, so we're fixed at 50% sparsity, but you get a reduction in memory usage, memory bandwidth, and compute, as long as we're running on NVIDIA Ampere or later. That gets us about 1.8x compression, less than the unstructured case, but we're trading that off because we can get a 1.5x speedup for general server use cases. The latency use case is when we're running one user or minimal users, because we're bounded by memory; here we can get a general speedup across all use cases. And it's composable with quantization.
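
As a toy sketch of that 2:4 pattern, the following keeps the two largest-magnitude weights in every block of four and zeroes the rest; real pruning methods also update the remaining weights to correct for the error introduced:

```python
# Toy 2:4 semi-structured pruning: in every contiguous block of 4 weights,
# zero the 2 smallest-magnitude entries, giving exactly 50% sparsity in a
# pattern that NVIDIA sparse tensor cores (Ampere and later) can accelerate.
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    flat = w.reshape(-1, 4).copy()                   # blocks of 4 weights
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]   # 2 smallest-magnitude per block
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)).astype(np.float32)
w_sparse = prune_2_of_4(w)
print((w_sparse == 0).mean())   # 0.5 -> exactly half the weights removed
```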

Quick example: LLM Compressor is an open-source library that enables a lot of this. It makes it really easy, productizes the latest state-of-the-art research, and plugs in with Hugging Face and PyTorch. A quick Python example looks like that. I won't dive into the details, but just know it's pretty simple to go through, define which algorithm you want to run, and apply it to a model. Recalculating the cost: remember we were at about $50,000, or a little bit less than that, for the 70B baseline right about here, but you can see how we're shifting this line down further. Quantized is about half, and sparse-quantized is further down still. We can enable much better scale, as well as significantly bring down our GPU cost, just by doing quantization and further compounding that with pruning.
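
The slide example isn't reproduced in the transcript, but a minimal one-shot W4A16 quantization in the spirit of the LLM Compressor documentation looks roughly like this; the model name and calibration dataset are examples, and exact import paths vary between versions:

```python
# Rough one-shot W4A16 quantization with LLM Compressor (pip install llmcompressor).
# Mirrors the project's documented examples; older releases expose oneshot under
# llmcompressor.transformers, and argument names may shift between versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",    # example model
    dataset="open_platypus",                     # small calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```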

Reduce, Reuse, Replace

Another one is type 2 compression. This is where we're actually going to replace the model completely: we replace the larger model with a smaller one, and use that larger model to teach the smaller one so we can increase its accuracy. We're going to sacrifice some capabilities in the model, but in general, it's going to help us hit our SLOs with far fewer resources. The targets are the same: memory usage, memory bandwidth, and compute requirements. We're running a smaller model, so naturally all of those go down. The most popular approach is knowledge distillation, where we teach through training; I'll run through that.

Then there's data distillation, where we're going to teach through data. Knowledge distillation first, where we're teaching a student: we use a larger, more accurate model as a teacher and train the smaller model to mimic the outputs of the larger model. We're keeping the model compact while trying to preserve the accuracy of that larger model. Let's dive into why it works and why we can do that.

Standard training generally uses cross-entropy loss, things like that, but the main point is that there are a bunch of words the next token could be, and it only sets one of those as the correct answer. There are a lot of synonyms for a lot of words, so by using that larger model, we're actually teaching the smaller model what the plausible alternatives are for each word it's targeting, rather than just saying there's only one correct answer. This is all enabled through LLM Compressor as well, and there's a new integration with Axolotl. Axolotl is a really popular post-training, fine-tuning pathway, and it enables all of this as well.
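
For reference, the textbook form of that distillation loss softens both teacher and student next-token distributions with a temperature and penalizes the KL divergence between them, mixed with the usual cross-entropy on the hard labels; this is the generic formulation, not the specific recipe used by LLM Compressor or Axolotl:

```python
# Textbook knowledge-distillation loss: KL divergence between temperature-
# softened teacher and student next-token distributions, mixed with the
# ordinary cross-entropy against the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: the teacher spreads probability over plausible alternatives
    # instead of putting everything on a single "correct" token.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard next-token cross-entropy.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kd + (1 - alpha) * ce
```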

The other one is data distillation, where we teach through data. We're going to use a larger model to generate a clean, high-quality dataset, generally prompting it with a small set of examples from our real data so we can generate a lot more. Then we train the smaller model on that clean dataset. Why does it work? Generally, the teacher captures a lot of rich relationships and nuances in the data and prompts, especially the specialized nuances of the data your use case converges on.

Ultimately, therefore, we're going to generate outputs that are cleaner and more aligned. The other big piece is that it enables a lot more scaling. You can think of it as: most people do not have hundreds of thousands or millions of labeled examples, but most people can pretty quickly generate 100 or 1,000 labeled examples. We can use those patterns, plus the larger model's broader knowledge of general language constructs, to generate hundreds of thousands of examples that the smaller model can learn from much faster. Naturally, we end up with that smaller model trained faster and with higher accuracy than the same small model trained without distillation. You're not going to match the larger model's accuracy, but you can get pretty close.
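
A rough sketch of the data side: prompt a larger teacher model (here assumed to sit behind an OpenAI-compatible endpoint) with a handful of real seed examples and have it emit many more synthetic examples to train the small model on; the prompt wording, model name, and seed examples are all placeholders:

```python
# Sketch of data distillation: use a few real seed examples to prompt a larger
# teacher model into generating many more synthetic training examples.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

seed_examples = [   # placeholders for a handful of real labeled examples
    {"question": "How do I rotate an API key?", "answer": "Go to Settings > Keys and ..."},
    {"question": "What is our refund window?", "answer": "30 days from purchase, unless ..."},
]

def generate_synthetic(n: int, teacher: str = "teacher-model") -> list[dict]:
    few_shot = "\n\n".join(f"Q: {e['question']}\nA: {e['answer']}" for e in seed_examples)
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=teacher,
            temperature=0.9,
            messages=[{
                "role": "user",
                "content": "Here are examples of our support Q&A style:\n\n"
                           f"{few_shot}\n\nWrite one new, different Q and A in the same "
                           "style, as JSON with keys 'question' and 'answer'.",
            }],
        )
        try:
            samples.append(json.loads(resp.choices[0].message.content))
        except json.JSONDecodeError:
            continue   # teachers occasionally emit malformed JSON; skip those
    return samples
```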

Ultimately, this greatly improves that real-world performance. InstructLab is a popular pathway for this. It's a comprehensive toolkit for synthetic data generation, enabling templates, generation, and training through simple APIs, all open source as well. Its job is essentially to pull in the latest research around synthetic data generation and productize it. There's a quick bash example, which is very rough because there are quite a few details in the steps: essentially, you download a teacher, tell it to generate given some template, download a student, and then train the student on the generated data.

Looking at the recalculated costs, because we're able to shift the model down from a 70-billion-parameter model to an 8-billion one, we're essentially under $1,000 per month for a small startup. Then there are compounding gains, where we zoom in on that 8B: if we further quantize and prune it, we can get to sub-$500 per month. All of that from where we started, at about a million per month, so significant gains.

The final one is type 3 compression, where we extend the model. I'm going to give a very brief overview of it, because it's not productionized yet, but it's where we're not going to change the architecture or the size; we're actually going to add additional modules to the model, targeting smarter inference. I'll go through what smarter means. We're trading off, generally, memory bandwidth and compute for specific SLO targets. You can either add compute and get faster latency, or you can add memory bandwidth to get more throughput, so you can save on compute. The most popular approach is going to be speculative decoding. This is where we add a smaller speculative model whose entire job is to predict what it thinks the larger model is going to do, and it's way cheaper to run that smaller model.

Then we use the larger model to validate whether the smaller model was correct. The larger model still runs every token, but it runs them all at once, so it's easier to parallelize. You get a lot of benefits on the compute side while getting a significant speedup from that smaller model, ultimately leading to faster latency, though we're running more compute overall. It's not productionized yet, but it is coming soon; we're actively working on that right now, and a few repos are going to be out on that.
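
Conceptually, the draft-and-verify loop looks like the sketch below (a greedy-acceptance variant with hypothetical `draft_model` and `target_model` interfaces); production implementations accept or reject drafted tokens probabilistically and run everything batched on the GPU:

```python
# Conceptual speculative decoding loop (greedy variant): a small draft model
# proposes k tokens, the large target model scores them all in one forward
# pass, and we keep the longest prefix the target agrees with.
# `draft_model` and `target_model` are hypothetical interfaces.

def speculative_decode(prompt_tokens, draft_model, target_model,
                       k: int = 4, max_new_tokens: int = 128):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # 1. The cheap model drafts k tokens one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))

        # 2. The expensive model verifies all k positions in one batched pass.
        verified = target_model.next_tokens(tokens, draft)   # one prediction per drafted position

        # 3. Accept the matching prefix, then take the target's own token at the
        #    first mismatch so output matches what the target alone would produce.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens += draft[:n_accept]
        if n_accept < k:
            tokens.append(verified[n_accept])
    return tokens
```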

Tuning Your Deployments

Now let's dive into tuning your deployments. Assumptions are only going to go so far. The prior estimates that I was going through essentially assume an average distribution and some generous SLOs. Your workloads will definitely vary, so benchmark with your data so you know how much it's going to cost and where you can optimize. Understand the profiles that you're running. Prompt-heavy requests are going to be compute-bound: you're going to take up a lot more compute on that GPU. Prompt-heavy is going to be things like summarization, where you have a really large document and all of it is getting processed at once to generate that first token. The same goes for reasoning models: they generate a ton of tokens, but all of those end up being more prompt-heavy. Decode-heavy requests are going to be memory-bandwidth-bound for the most part.

After we get past that first token, we're running a bit less compute, and what we're actively doing is mainly just transferring weights through the GPU, which makes it much more memory-bound. This will be things like content generation, where you're asking it to write a blog for you: a small prompt, and it's going to generate a lot of tokens on top of that. Ultimately, that mix is going to affect what your latency, throughput, and cost look like. Generally, what we recommend is running a performance sweep with your data and then comparing the hardware and model combinations with that, so we can determine concurrent requests per server at a given SLO.

Ultimately, what that means is we can figure out how many servers we need to run to meet our user demand. A big open-source repo for this is GuideLLM. It's a comprehensive toolkit for accurate, automated performance benchmarking, with tailored metrics and statistics for LLM deployment evaluations. It has all of the metrics already built in, so you can just measure and use those, along with the latest best practices around that eval. As a quick example, this is it running in action: it's running a sweep at a bunch of different RPS, requests per second, so you can figure out at what request rate you'll meet your SLOs.
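
GuideLLM automates this properly, but as a rough illustration of what such a sweep measures, a hand-rolled version against an OpenAI-compatible endpoint might look like the following; the endpoint, model name, prompts, and SLO value are placeholders:

```python
# Rough concurrency sweep: replay samples of your own workload at increasing
# concurrency and find the highest level that still meets a p95 latency SLO.
import asyncio
import statistics
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPTS = ["Summarize: ...", "Draft a reply to: ..."] * 16   # use real workload samples
SLO_P95_S = 5.0                                              # placeholder SLO

async def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="my-model", messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return time.perf_counter() - start

async def sweep():
    for concurrency in (1, 2, 4, 8, 16, 32):
        batch = PROMPTS[:concurrency]
        latencies = await asyncio.gather(*(timed_request(p) for p in batch))
        p95 = statistics.quantiles(latencies, n=20)[-1] if len(latencies) > 1 else latencies[0]
        print(f"{concurrency:>3} concurrent: p95={p95:.2f}s")
        if p95 > SLO_P95_S:
            break   # the previous level is the most you can serve within the SLO

asyncio.run(sweep())
```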

To give an example of what this looks like, here's Granite 3.1 8B running on two different hardware types; hardware types definitely matter as well. You can see on the graph on the right that the 4-bit model works really well at low requests per second (requests per second being our x-axis), but as we get closer to throughput-bound operation, with a lot of users in parallel, that's where the 8-bit quantization starts winning out, giving us better overall latency as RPS scales.

Conclusion

LLM deployments are definitely complicated. Inference is expensive, models should be compressed, and you need to balance latency, throughput, and accuracy. Open-source tools make it easy: vLLM unlocks efficient, scalable serving; LLM Compressor allows you to apply those productionized compression algorithms; InstructLab lets you do synthetic data generation so you can fine-tune your own models; and GuideLLM is there so you can ensure you're meeting your SLOs as you move into deployment. What to do next? Make sure you know your use case, your workload, and your budget. You'll have to go through a little bit of math to map that out. Ultimately, measure your performance and not just general cost.

Questions and Answers

Participant 1: One thing I was wondering about is as new models come out, do you go through that same process and how does that factor into your cost of doing all that?

Kurtz: It can. The main thing there is, take the Llama series, for example: at least up until the latest Llama 4 release, the iterations they were doing from Llama 1, Llama 2, and Llama 3 all used the same base architecture. Generally, a 7B or an 8B model from any of those is going to perform similarly; the only gains you're going to get are in accuracy, because they did better training on it. Llama 4 definitely flipped that, and sizes changed. They actually went with what's called a Mixture of Experts model, which is a lot more complicated. They take up a lot more memory relative to how much compute they run, and memory bandwidth starts to become a lot more important. So the answer is, it depends. Most times you can get by with just comparing sizes of models, because the architectures are fairly similar. But there can be differences where you have a Mixture of Experts architecture versus a more traditional dense generative decoder architecture; those two are definitely going to be very different in terms of performance profiles.

Participant 2: I was curious if you could dive a little more into how you measure the quality of the output. For example, if you’re summarizing something, I’m a human, I can read a couple different summaries and say, this looks good, that looks good, this maybe looks better. How do you measure that at scale, or how do you ensure once a pipeline’s running that the quality bar is consistent with what you want?

Kurtz: There are a lot of complications in there. Summarization is actually one of the easier use cases. With summarization, you'll go through different stages of what you're evaling. The most basic is that you have some test set that's already pre-labeled: essentially, you have a bunch of content and then a target summary, and you're just evaluating the model against that. There are things like ROUGE score, which essentially looks at how many words overlap between what the model generated and the ideal, human-written summary, and also at the spans: how many words in a row match as well. Ultimately, you're going back to some baseline and then measuring the difference from that baseline. That'd be your first pathway.

The second is for active usage, where you don't have a source of truth. Then, generally, you're going to try to validate the output against the number of words that are actually contained in the original source. It doesn't tell you whether the language is understandable, but it will give you a general sense of whether it summarized content that actually exists and is generally correct. If you can trust that the model is capable enough to generate legible answers, then you can go off of that. The final piece is always going to be constant human quality checks: generally, you'll look at active data that came through and then spot-check it every once in a while with a human to validate.
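
A tiny sketch of that containment check for the unlabeled, in-production case: compute the fraction of a summary's non-trivial words that appear in the source document and route low-coverage items to a human spot check (libraries like rouge-score handle the reference-based metrics mentioned above):

```python
# Tiny hallucination screen for summaries: the fraction of summary words that
# appear in the source document. Low coverage is a signal to route the item to
# a human spot check; reference-based metrics like ROUGE cover the labeled case.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def tokenize(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}

def source_coverage(summary: str, source: str) -> float:
    summary_words = tokenize(summary)
    if not summary_words:
        return 0.0
    return len(summary_words & tokenize(source)) / len(summary_words)

doc = "The quarterly report shows revenue grew 12% while costs stayed flat."
good = "Revenue grew 12% and costs stayed flat."
bad = "The company announced a merger with a competitor."
print(source_coverage(good, doc), source_coverage(bad, doc))   # high vs. low coverage
```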

Participant 3: My question was around strategies that are proven to be effective when you’re switching models, so from 3.5 to 4, all the churn and problems that people faced, were there any strategies that worked?

Kurtz: You definitely hit a lot of issues. The people that were able to scale it out most successfully had a few test sets on hand. You can do meta prompt engineering, where you ask an LLM to generate a prompt and then test that through the model. Then, again, you're looking at quality and how closely it matches the reference baseline. A lot of people were doing that: essentially, you can do a bunch of iterations and tune the prompt with an LLM. That's going to be your most automated pathway; you'll always have to double-check it afterwards, but it is the most scalable. Otherwise, you fall back to a human manually tweaking the prompt, going through the output, and going through your eval set again.

 
