Transcript
Yu: I’m Cody. I’m a staff software engineer and a tech lead for large language model performance at Anyscale, and I’m going to talk about how to scale out batch inference with Ray. I’m also a committer on vLLM, SGLang, and Apache TVM, which are well-known open-source projects. Before that, I was an expert engineer at Boson AI and a senior applied scientist at AWS AI.
The GenAI Era
Before we dive into batch inference for large language models with Ray, I want to start with the statement that we are in the GenAI era. For example, today a lot of people are talking to chatbots, and some companies already use a chatbot or large language model for their online customer service.
A more complicated use case could be using large language models as multi-agent systems to process a complex task. We all know that models today can generate not only text, but also images or even videos. Even the images in these slides were generated by OpenAI GPT-4. This illustrates how useful and important these applications are in the GenAI era.
Batch Inference
At the same time, the demand for batch inference is also growing. This is mainly because we now have multimodal data sources: cameras, microphones, sensors, and PDF files. By processing these sources, you get raw data in different formats, structured or unstructured. For example, you can have images extracted from a camera or sensor, audio, tabular data, video, or even raw text.
To make that data useful for your real applications and services in production, you need either embedding models or large language models to process it. Then you can save the results back to a vector database, use them for model training or classification, or extract knowledge from the data for various applications. In short, as large language models become more in demand, the demand for batch inference grows even faster. That’s why we want to dive into batch inference and make it more cost-effective and higher throughput. Scaling out batch inference brings up certain challenges.
The first thing you have to deal with is scalability, because that raw data can easily grow to hundreds or thousands of gigabytes, or even more. Scalability becomes the biggest challenge. Along with high scalability comes the reliability issue: to save money, we usually prefer to use Spot Instances from public cloud providers. Spot Instances can cost as little as 10% of the on-demand price, but they can be preempted at any time by the cloud provider during peak hours.
To build a system on Spot Instances, we have to have good reliability: we must be able to deal with failures due to preemption and resume failed jobs without a human in the loop. Compute is also one of the most important factors, because different stages in your data pipeline may require different hardware, which can be CPUs or GPUs, or different kinds of GPUs. Dealing with multiple stages in your data pipeline and heterogeneous computing also becomes a big challenge. There is also flexibility, because you will need different models to process different data.
For simple data, you may need just a simple object detection model. For complicated data, you may need a large language model to analyze the semantics or extract information, so the batch inference pipeline also needs the flexibility to work with different open-source models or custom models trained by yourself. Finally, SLAs. When you’re doing batch inference in production, you cannot just throw data at the system and get a result without considering the cost. You will set throughput, cost, or latency as your SLA before you go into production to make sure you won’t go bankrupt processing that data.
Multi-Layer Approach
At Anyscale, we solve this problem by introducing a multi-layer approach, with three layers. From the bottom up, we have Ray Core, which is a scalable, general-purpose AI compute engine that deals with scalability and reliability.
On top of that, we build Ray Data, which is an efficient and scalable data processing pipeline on Ray. Because Ray Core already handles scalability and reliability, we can build Ray Data on top of it easily; I will show how we implement that later on. On top of that, we need a powerful large language model inference engine. We use the open-source vLLM, which is the most popular open-source large language model inference engine at the moment, and we construct our batch inference pipeline on top of it.
Ray – Scalable AI Compute Engine
Let’s start with Ray. Here’s a Ray overview. Ray is basically a distributed framework for large-scale cluster and resource management and task allocation. On top of Ray, we already have a rich ecosystem of application-level libraries. For example, we have Ray Tune and Ray Train for training jobs, Ray Serve for online model serving, and RLlib, which is a reinforcement learning library built on Ray.
There is also Ray Data, which I will introduce very soon, for data processing. All of those libraries implement different functionality, but what they have in common is dealing with large-scale workloads. At the bottom, we have remote functions and classes, which we call tasks and actors, and then Ray Core, a general-purpose distributed execution layer that handles task allocation, scheduling, and actor management.
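As a minimal sketch of this tasks-and-actors model (a toy example of my own, not code from the talk), a remote function becomes a task and a remote class becomes a stateful actor:

```python
import ray

ray.init()  # connect to (or start) a local Ray cluster

# A remote function becomes a Ray task that can be scheduled anywhere in the cluster.
@ray.remote
def preprocess(chunk):
    return [x * 2 for x in chunk]

# A remote class becomes a Ray actor: a long-lived, stateful worker process.
@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

futures = [preprocess.remote(list(range(i, i + 4))) for i in range(3)]  # three parallel tasks
counter = Counter.remote()
print(ray.get(futures), ray.get(counter.incr.remote()))
```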
Anyscale has been using Ray to push scalability to thousands of instances running workloads for our customers. In short, we have a head node and worker nodes, and there can be a thousand worker nodes. On each node, we have the driver and the workers. On the head node, we have two important components: the GCS and the dashboard server. The GCS is used for actor scheduling, placement group scheduling, and node resource views. The dashboard server serves the dashboard metrics, system logs, and job APIs.
We highlight these two components because we realize that when you scale up to thousands of nodes, scalability and reliability are important factors, but beyond that, making sure all the metrics, logs, and current status can be easily monitored and interpreted by users also becomes important. That’s why, in recent Ray development, we have also focused on the GCS and dashboard server. We want to highlight that those two components have become much more robust on large clusters, mainly because we have optimized their event loops very carefully.
For example, we have dedicated threads for high-value and simple RPCs, we move expensive data fetching to background threads, and we break up expensive handlers. I won’t dive into too much detail on Ray Core, but if you’re interested in the details of Ray, check out the Anyscale documentation or contact us.
Ray Data – Scalable Data Preprocessing for Distributed ML
Once we have Ray to deal with scalability, we can build Ray Data, a scalable data processing library for distributed machine learning. Let’s start with the key challenges Ray Data solves. First is heterogeneous computing. Like I mentioned in the beginning, different stages of your data pipeline may need different hardware in terms of CPUs and GPUs. In Ray Data, we enable streaming execution: we chunk the data into multiple blocks and process the blocks in a streaming way.
In this way, we can make sure the resources at each stage are maximally utilized all the time. In this example, you can see we load and preprocess data on CPU and do the inference on GPU. We want to minimize GPU idle time, because the GPU cost is usually 4x the CPU cost, or even more. The second challenge is, like I mentioned, reliability, which is an issue if we want to build a data pipeline on very cheap Spot Instances. Ray Data, along with Ray Core, makes sure the reliability is there.
Specifically, we have fault tolerance and checkpointing to make sure all jobs complete with correct results. We also have job resumability, so that whenever instances are preempted by the cloud provider, we can make sure every job still gets executed exactly once. The third challenge is dealing with a complicated ecosystem. There are a thousand ways to process your data using various existing, mature systems, like HDFS and S3 for object storage.
There are also Spark and Dask as computing frameworks, and that’s just the data part. For machine learning and AI, people usually use TensorFlow or PyTorch to run inference for their models. Ray Data already integrates all of those popular frameworks, so you can easily use Ray Data to tie those components together and build your scalable data pipeline.
Let me dive into the Ray Data implementation. First is the Ray Data engine. On the slide, we show a Ray Data batch inference example. In this example, we first read images from S3 buckets and apply a custom transform function to process the data any way you like. After that, we send the processed data to inference. The inference is a class that hosts a model on GPUs. After the data has been processed, we upload the results back to S3, and all of this happens in a streaming way.
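As a rough sketch of the pipeline just described (my own illustration, assuming a recent Ray version; the bucket paths and the `load_model` helper are placeholders):

```python
import ray

# Read images from S3 (placeholder bucket path).
ds = ray.data.read_images("s3://my-bucket/images/")

# Custom per-row transform, executed on CPU.
def transform(row):
    row["image"] = row["image"] / 255.0
    return row

# Stateful callable class: each actor loads the model once and runs inference on GPU.
class Inference:
    def __init__(self):
        self.model = load_model()  # hypothetical helper that returns a model on "cuda"

    def __call__(self, batch):
        batch["pred"] = self.model(batch["image"])
        return batch

results = (
    ds.map(transform)                                     # CPU stage
      .map_batches(Inference, num_gpus=1, concurrency=4)  # GPU stage: 4 actors, 1 GPU each
)
results.write_parquet("s3://my-bucket/outputs/")          # upload results back to S3
```

Ray Data runs this as a streaming pipeline, so the CPU and GPU stages overlap instead of running one after another.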
By writing this simple code, you have already constructed a highly reliable, highly scalable data pipeline using Ray Data. What happens under the hood is that we first construct a logical plan, which describes what the data pipeline should do. With this code snippet, we define four stages in the pipeline: read the data, transform the data on CPU, perform inference on GPU, and then upload the results on CPU. However, although this data pipeline looks pretty simple and straightforward, it could be inefficient if executed naively.
After this, Ray Data compiles the logical plan into a physical plan. The physical plan defines how to execute your data pipeline. In this simple example, we can see the read and transform are fused into a single MapOperator. That means we save data communication, data rewrites, and disk I/O by streaming read and transform together.
Once we have the physical plan, during the actual execution the data is processed by Ray tasks or actors with different resource requirements. Like we mentioned, we can specify how many CPU cores you need to process the input data or how many GPUs you need to run inference on the data. By specifying these requests, what happens under the hood is that Ray Data schedules the different tasks on different hardware dynamically.
In this example, we have three nodes: two CPU nodes and one GPU node. Even the GPU node has CPU cores. Ray Data will schedule the Ray tasks on CPU or GPU based on the requirements. Of course, we try to minimize data movement by scheduling tasks that share data onto the same nodes. Like I mentioned, it’s important to have the streaming executor, so we schedule tasks dynamically, manage resources, and handle back-pressure. All of this is to ensure the GPU is consistently saturated. This becomes a very important metric when evaluating the efficiency of a data pipeline with GPUs.
Talking about streaming execution, I do want to highlight one thing that plays an important role in efficiency: how do you communicate data between tasks and nodes? As you can imagine, because we may schedule the tasks of different stages on different nodes, you need an efficient way to move data between nodes. We use an abstraction called the Ray object store, which is basically a storage layer, like a file system, so that a MapOperator can upload its intermediate results to the object store and the next stage can consume them.
To achieve the best efficiency, this abstraction can be implemented in different ways. For example, between nodes you can use a shared file system or AWS S3 object storage. Within the same node, this can even be shared memory or local disk I/O. By carefully selecting the most suitable backing store under the hood, we can optimize the data transfer between tasks.
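As a tiny illustration of the object store idea (my own example, not from the talk), a producer puts a block into the store and a downstream task consumes it:

```python
import numpy as np
import ray

ray.init()

# Put a data block into the local node's shared-memory object store.
block = np.random.rand(1000, 1000)
ref = ray.put(block)

@ray.remote
def consume(data):
    # Ray resolves the ObjectRef argument to the underlying array, reading it
    # zero-copy from shared memory on the same node, or fetching it over the
    # network if the task was scheduled on another node.
    return float(data.sum())

print(ray.get(consume.remote(ref)))
```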
Here is a case study using Ray Data. This example is batch embedding generation. In this simple pipeline, we first load PDFs on CPU, then extract and clean the text on CPU. Those steps are easy to stream; the most expensive part is extracting embeddings for the text using a sentence transformer model. After this pipeline, we have an embedding vector for each PDF file, and these embeddings can be used for data retrieval or PDF search.
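A hedged sketch of such an embedding stage, assuming the `sentence-transformers` package and treating the bucket paths and model name as placeholders (the PDF-to-text extraction is elided):

```python
import ray

class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        # Example model name; any sentence transformer would do.
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

# One row per extracted, cleaned page of text (placeholder path).
pages = ray.data.read_text("s3://my-bucket/extracted-pages/")
embeddings = pages.map_batches(Embedder, batch_size=64, num_gpus=1, concurrency=20)  # 20 GPU actors
embeddings.write_parquet("s3://my-bucket/embeddings/")
```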
To generate these embeddings, we used Ray Data with 20 GPUs at 100% GPU utilization. For 2K PDFs, which include 30k pages and yield about 140k embeddings, the 20 GPUs take only about 4 minutes of runtime to finish all the processing. In the trace on the right-hand side, we can see that once the tasks kick off, all the GPUs are saturated essentially all the time for the entire 4 minutes.
That means you pay less than $1 to process the 2K PDFs. We have more examples available on the Anyscale blog, feel free to check them out. We have more case studies with much higher scalability, and I encourage you to check out the other use cases.
Ray Data + LLM
Now I want to dive into the large language model part. In the previous example, you saw you can use Ray Data to construct a pipeline and leverage a machine learning model to process PDF files. However, once we get to large language models, more challenges come up. First of all, what’s the motivation? We know that large language models are much more powerful for various tasks: you can have them do summarization, tagging, sentiment analysis, keyword extraction, or unstructured data processing. You can do much more than with traditional machine learning models.
However, it also brings challenges. The most important one is in the name: it’s called a large language model, so it’s basically too large to achieve high throughput in a naive way. Here is one possible data pipeline with a large language model. In the data preprocessing stage, you are no longer just processing the data the way you like. You also need to tokenize your input data, and you probably also need to encode your images if your data has image inputs. This is because a large language model doesn’t take text and images directly as input.
Instead, it takes tokens. Tokens are basically integer representations of the inputs. For example, an English word can be split into one or multiple tokens. The large language model consumes the tokens and does the mathematical computation, like matrix multiplications and so on, to calculate the features and hidden states, and then determines the response. You have to do tokenization to transform the text and images into tokens and send them to the large language model.
After we get the response, the response is also in token format, so you have to do detokenization to convert those tokens back to text. We can now see that we have a more complicated pipeline, and this large language model often has to be executed on multiple GPUs instead of one. Those are the challenges we need to deal with.
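To make the tokenize/detokenize steps concrete, here is a small example using a Hugging Face tokenizer (GPT-2 as a stand-in model; my own illustration, not code from the talk):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

token_ids = tok.encode("Batch inference with Ray")  # text -> list of integer token IDs
print(token_ids)

text = tok.decode(token_ids)                        # token IDs -> text (detokenization)
print(text)
```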
vLLM + Ray Data – RayLLM-Batch
By the end of this talk, I want to introduce RayLLM-Batch. This is our solution, combining vLLM and Ray Data, to achieve scalable batch processing. Before that, I want to introduce vLLM. vLLM is the most popular large language model inference framework. It’s fully open source and it’s also part of the Linux Foundation. It now has more than 30k stars on GitHub, and we merge more than 200 PRs every month.
There are notable contributors, including Anyscale, AI21, Alibaba, and AWS. You can see that a lot of companies are actively contributing to vLLM to make it better and add more features for large language model inference. There are also many adopters that use vLLM in their products, including open-source projects and companies.
I want to talk a bit about the key features we use in vLLM and what we have contributed. Before that, I want to take a step back and briefly introduce how large language models do inference. Unlike a traditional machine learning model, where you simply send the input, run the model forward, and get the output, a large language model uses so-called autoregressive decoding to do inference.
This is because if you run the model just once, you generate just one token. Like I mentioned, if you want the model to respond with a sentence or a paragraph, you need many tokens, and those tokens have to be generated one by one in an iterative fashion. That’s why we call it autoregressive. For example, you have the prompt, and after tokenization it is converted to a number of tokens.
Then in one forward pass, we predict just the next token, the next word. Then we send it back to the model. This is called autoregressive: the output becomes the input of the next iteration to generate the next token. You have to do that iteratively to generate the complete response. You can see this is a long-latency process. If you use ChatGPT or other chatbot providers, you can see the response is always output in a streaming way. That’s not just a UX choice; the model generates the output exactly that way.
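A minimal greedy decoding loop makes the autoregressive pattern explicit (a toy sketch with GPT-2; real engines like vLLM batch requests and cache KV states rather than recomputing everything each step):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

ids = tok.encode("Ray Data is", return_tensors="pt")
for _ in range(20):                                    # one forward pass per generated token
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()                   # greedily pick the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # feed it back in as input
print(tok.decode(ids[0]))
```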
Now, from the production or vendor point of view, we want to maximize throughput on the server side to reduce cost. We cannot dedicate the full GPU to generating one response at a time; we have to batch requests and generate them together to minimize our cost and maximize system throughput. One important technique here is called continuous batching. Basically, we batch the requests coming from users and decode them together. In this example, the first request comes in, and we first do the prefill to generate the KV-cache for the input prompt.
After that, we get into the decode stage to generate the response tokens one by one. Say a second request comes in during this period. Because we have a prefill-first policy, we pause the decode process and first do the prefill for the second request. After that, we let the second request join the decode group so we batch-decode together. At this moment, our batch size becomes 2 because we’re decoding two requests together.
This is called continuous batching. Two more requests come in, so we pause the decode process again, do their prefill, and then let all the requests join and decode together. This is normal batch processing with continuous batching. We can see an obvious challenge with this algorithm.
First, the decode process has to be interrupted when a new request comes in. From the user’s point of view, you may see the model generating some text, then pausing for a while, then continuing. That’s because of this behavior. Second, if there is a long prompt, say request 3 has a 1.5k-token input, this slows down the entire batch: the latency between two consecutive decoded tokens becomes much longer than usual.
To solve this problem, an important feature that has been introduced and integrated into vLLM is called chunked prefill. The basic idea is to split long prompts into multiple chunks, and batch the prompt chunks together with decode tokens.
Again, in this example, when a new request comes in, instead of pausing the current decode, we let the current decode join the prefill batch, so we do prefill and decode together in the same batch. Let’s continue: now request 3 comes in, and again we don’t have to pause the decoding of the two existing requests. We can directly admit the new request, do the chunked prefill for request 3, followed by request 4, while all the decoding continues together.
This becomes an ideal case where we can balance the batch sizes across time frames, and we also make sure the decode process isn’t interrupted. This feature was contributed mainly by Amey, the chunked prefill author and a PhD student, and SangBin from Anyscale. Here is an experimental result of enabling chunked prefill for model serving: we get roughly a 1.4x improvement in ITL. ITL stands for inter-token latency.
From the user’s point of view, the tokens come out much faster than before. We also suffer about a 30% TTFT slowdown. TTFT stands for time to first token. This is easy to understand, because long prompts now need multiple batches to be fully processed. That’s what this technique sacrifices. Because the ITL improvement is much larger, the overall end-to-end latency is still reduced, so the overall system throughput is also improved by enabling chunked prefill.
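Enabling this in vLLM is a matter of engine configuration; a hedged sketch (flag names as in recent vLLM releases, model name is an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget, so long prefills are split into chunks
)
outputs = llm.generate(["Summarize this document: ..."], SamplingParams(max_tokens=128))
```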
The next feature I want to introduce is prefix caching. This one is very straightforward. We want to reuse the KV-cache from another request with the same prefix so that we don’t have to compute the same hidden states for the same tokens again and again. There are two obvious cases where prefix caching helps a lot. The first is a shared system prompt. If you’re hosting a service, a chatbot or something like that, you probably have the same system instructions across all requests.
From, for example, OpenAI’s point of view, they will have a very long system instruction telling the model: you are an AI assistant, your responses should follow rules 1, 2, 3, and those instructions are consistent across all requests. We want to share that and avoid recomputing those instruction prompts. The second example is multi-turn conversation. When you’re talking to a chatbot and you send the model a new question, the previous chat history is reused from the previous round, so that is also a shared prefix. That’s why we want prefix caching.
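A hedged sketch of turning this on in vLLM (automatic prefix caching is a single engine flag in recent releases; the model name is an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

# The shared system prompt is computed once; its KV-cache blocks are reused by later requests.
system = "You are a helpful assistant. Follow rules 1, 2, and 3."
prompts = [system + "\n\nUser question: " + q for q in ["What is Ray?", "What is vLLM?"]]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```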
In vLLM, we have hash-based automatic prefix caching. The idea is pretty straightforward: we chunk your prompts into multiple blocks, and then use the token IDs as the hash key. When a new request comes in, we use the token IDs to check whether we have already computed those blocks in the system. If so, we directly reuse the block without recomputing it. This feature was led by Zhuohan from Berkeley and me from Anyscale. Another feature is called speculative decoding.
The idea is also very straightforward. When you’re serving a large model, like a 70B model, the latency is pretty high. Sometimes the model responds with a very simple output, and in those cases a small model may already fulfill your requirement.
With speculative decoding, we deploy both a small model and a large model at the same time. When a request comes in, we use the small model to generate speculative tokens and let the large model verify them, because the verification cost is much lower than generating the tokens. If the small model can guess the tokens accurately, we can significantly reduce the latency of generating the response. This is an important feature contributed by Cade and Lily from Anyscale.
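A hedged sketch of draft-model speculative decoding in vLLM (the exact parameter names have changed across vLLM versions and the model names are examples; check the docs for your release):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",             # large target model
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # small draft model
    num_speculative_tokens=5,                               # tokens proposed per verification step
)
outputs = llm.generate(["Explain KV-cache in one sentence."], SamplingParams(max_tokens=64))
```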
The last feature I want to introduce is pipeline parallelism. Like I mentioned in the beginning, a large language model may not fit into a single GPU due to its number of parameters, so usually you need more than one GPU to host a single replica of the model. That means you have to parallelize the model execution in some way. There are two well-known model parallelism strategies: tensor parallelism and pipeline parallelism. Tensor parallelism is the straightforward approach.
Basically, you parallelize the execution of the matrix multiplications and aggregate the results after each layer. The implementation is relatively simple, but it also introduces very high communication overhead. It is good for improving latency on a single node, or if your GPUs have very high-bandwidth inter-GPU communication, such as NVLink. Pipeline parallelism, on the other hand, is a low-communication-overhead solution that partitions the decoder layers across GPUs and pipelines the execution. This is very good for throughput, because you can saturate all the GPUs across the pipeline stages.
This is often very useful for offline batching on commodity GPUs that don’t have NVLink, where inter-GPU communication has to go through PCIe. We already actively use pipeline parallelism in vLLM for batch inference. This feature was contributed by Murali from CentML and SangBin from Anyscale. Here is an example of pipeline parallelism. One thing I do want to highlight is that although pipeline parallelism is an intuitive solution to achieve high throughput for batch inference, it’s not straightforward to get the best throughput out of the box.
For example, in this case, we use 8 L4 GPUs to serve a single Llama 3.1-70B model. TP8 stands for tensor parallelism across eight GPUs, and PP8 stands for pipeline parallelism across eight GPUs. Although pipeline parallelism should give us better throughput, the throughput shown here is actually worse than tensor parallelism. This is because we haven’t optimized the pipeline efficiency in this setting. Why?
First of all, we have unbalanced pipeline stage execution times. The basic architecture of a large language model is a stack of repeated decoder layers, but at the beginning and the end of the model you also have the embedding layers. The first embedding layer converts tokens into embedding vectors, and the last layer converts hidden states back into vectors over your vocabulary.
That means the first and last stages in your pipeline will have higher execution times than the other stages if you partition the decoder layers evenly. We have to take these factors into account when partitioning the layers. Specifically, in this evaluation we allocate fewer decoder layers to the first and last stages to balance the execution time of all pipeline stages. By doing so, we achieve throughput similar to tensor parallelism, but that still isn’t good enough, because we expect much higher throughput.
Another bottleneck is unbalanced batch sizes between pipeline stages. Imagine a pipeline with eight stages, as in this example. In the first timestep, you’re processing a prefill with 2K tokens, followed by a decode batch with just 10 tokens. Although the batch with 10 tokens can be processed much faster, there is a prefill in front of it blocking its execution. This imbalance creates pipeline bubbles and reduces pipeline efficiency.
Our solution is to enable chunked prefill, so we can make sure every batch has a similar number of tokens and the execution time of each batch is balanced across all pipeline stages. The result is pretty amazing: we get almost 2x the throughput simply by enabling chunked prefill. And this is not the end. We can achieve even higher throughput by tweaking another configuration.
In this example, if we change the configuration to use two GPUs for tensor parallelism and four pipeline stages, we can achieve even higher throughput. This is because with four pipeline stages we can balance the execution time better, given the number of decoder layers in this 70B model. You can imagine that this configuration tuning is case by case, and there is no single optimal solution across all workloads and models.
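As a hedged sketch, the TP=2 x PP=4 layout described above maps to two vLLM engine arguments (the model name is an example; 2 x 4 = 8 GPUs total):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,      # 2-way tensor parallelism within each pipeline stage
    pipeline_parallel_size=4,    # 4 pipeline stages across the 8 GPUs
    enable_chunked_prefill=True, # keeps per-stage batches balanced, as discussed above
)
```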
RayLLM-Batch – Ray Data + vLLM
Finally, let me get to my final goal: combining vLLM, a powerful LLM inference engine, with Ray Data to construct a highly scalable batch inference pipeline. RayLLM-Batch is an Anyscale library built for large-scale, cost-optimized batch inference. The key features are: you can bring any open-source model, it handles fault tolerance, and you can use Spot Instances at scale to save money. We also have an optimizer for custom workloads. There is a simple code snippet that’s being used by our customers.
First, you just define your workload: for example, how you read the data and how you want to process it. Then you specify how many GPUs you have and want to use for batch inference. RayLLM-Batch deals with the rest.
Beyond that, we have the inference engine optimizer. Like I mentioned in the pipeline parallelism example, you have to optimize the engine configuration in terms of number of GPUs, number of pipeline stages, batch sizes, whether to enable chunked prefill, and so on. This is actually pretty complicated. On the right-hand side is a rather scary usage listing dumped from open-source vLLM. There are basically thousands of different configurations you can play with to reach optimal performance, and that is pretty painful.
At Anyscale, we have the inference engine optimizer, which is pretty straightforward. We have an autotuner to explore the possible configurations and then evaluate them on our GPU clusters. Because Ray is already very scalable and reliable, we simply build this optimizer on top of Ray so that we can tune a lot of configurations in a very short period of time.
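This is not Anyscale’s actual optimizer, but as a rough sketch of the idea, you can enumerate engine configurations and benchmark each one as a Ray task so the sweep runs in parallel; `run_benchmark` is a hypothetical helper that launches vLLM with a given config and returns measured throughput:

```python
import itertools
import ray

search_space = {
    "tensor_parallel_size": [1, 2, 4],
    "pipeline_parallel_size": [1, 2, 4],
    "enable_chunked_prefill": [True, False],
}

@ray.remote(num_gpus=4)  # reserve GPUs for each trial (adjust to the config being tested)
def evaluate(config):
    throughput = run_benchmark(config)  # hypothetical: run a short batch and measure tokens/s
    return config, throughput

configs = [dict(zip(search_space, vals)) for vals in itertools.product(*search_space.values())]
results = ray.get([evaluate.remote(c) for c in configs])
print("best config:", max(results, key=lambda r: r[1]))
```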
Case Studies
Finally, a case study. We use a synthetic dataset with 320 million tokens, and we enable chunked prefill, prefix caching, and pipeline parallelism as I introduced before. From this result, we can see the engine optimizer can help reduce the total processing time by up to 34%, as shown on the right-hand side.
Also, prefix caching can help reduce total processing time by up to 64% with an 80% shared prefix. Diving into this figure: on the left-hand side we have Llama 3.1-8B on L40S GPUs, and the right-hand side is the 70B model. We can see that by enabling the engine optimizer and prefix caching, we can reduce the cost by almost an order of magnitude. This again illustrates that different workloads and different models require different optimizations.
Takeaways
We mentioned that large-scale batch inference has become a high-demand workload in the AI era, and that Ray Data can push the scalability of batch inference to thousands of nodes. Large language models make batch inference even more practical. vLLM has enabled most of the features needed for high-throughput LLM inference, and it works out of the box. Combining vLLM with Ray Data into RayLLM-Batch gives an ideal solution for large-scale LLM batch inference.
Questions and Answers
Participant 1: Are there any examples of what you’ve used RayLLM-Batch to build, for example?
Yu: Yes, we do have customer examples. One example I can share at a high level is a customer that uses RayLLM-Batch to process their customer data and generate possible next action items, like how to deal with each customer: whether we should give them a discount, or reach out to them more frequently, or something like that. They process millions of customer records using RayLLM-Batch.
Another example is a customer with purchase histories, a grocery retailer. They have customer purchase histories and want insights from them. They feed the purchase history with some system prompt and ask the model to analyze it, make recommendations, and get a sense of which items are becoming popular across their customers. Those are examples actively using RayLLM-Batch at the moment. We believe there are a lot more applications, and we’re looking forward to seeing what else people can do.
Participant 2: With vLLM, you went over several different inference optimization techniques that are being used. Which one do you think might contribute more towards inference optimization?
Yu: To answer this question, you have to make a lot of assumptions about your workloads and models. For example, if your model has 8 billion parameters, it can usually fit into a single GPU. The most important features could be chunked prefill and prefix caching, and probably FP8 execution and quantization. If you want to serve a much larger model, like a 70B, or even the 405B from Llama 3, then you need multiple GPUs.
In that case, the most important feature could be pipeline parallelism. It really depends on your use case and workload. For example, if your workload doesn’t have any prefix sharing, every one of your requests has a different prompt, or the prompts are pretty short, then prefix caching may not be what you want. On the other hand, if you have 80% shared prompts, then prefix caching plays the most important role.
Participant 3: The tuner you are using, is it more of a grid search, or do you use properties like the size of activations and the size of weights to make an educated guess first, and start from there?
Yu: Our tuner is configured in a flexible way. We use different search algorithms based on the actual use case. For example, for certain cases where we have already manually shrunk the tuning space to a very small size, like 10 or 20 configurations, we just use exhaustive grid search.
For broader use cases, where we may have thousands of configuration points, we’re experimenting with different search algorithms. Currently on our radar are Bayesian optimization, tree-structured search, and even reinforcement learning based search algorithms. There are a lot of tuning libraries off the shelf, so we’re basically just using those. We’re not inventing a new search algorithm; we’re just seeing which algorithm is the most suitable one for our use cases. We want to make the tuning more efficient.
Participant 3: On the same line, for a model that fits in one or two GPUs, is there any point in going further with tensor parallelism, because it’s just going to add overhead? Unless you’re looking for lower latency, is there any point in going the tensor parallel route? Let’s say my model fits in two GPUs, and I’m happy with the latency, is there any point in going tensor parallel on four GPUs?
Yu: That’s right. Tensor parallelism has two benefits. One is lower latency, given that you parallelize the matrix multiplication across GPUs. The second is that it naturally balances the workload: like I mentioned, to make pipeline parallelism efficient, you have to deal with pipeline bubbles and make sure the execution time of each pipeline stage is balanced. With tensor parallelism, we don’t have this problem, because the workload is naturally balanced across all GPUs. That’s why we call it simple and easy to optimize.
Participant 3: On chunked prefill, what happens if, say, you have continuous batching with a max batch size of 128, and after some time 16 of my prompts are done? They have reached their stopping condition, but the rest of the prompts are still active. Does the batch have to waste computation on those 16 queries that are already done?
Yu: This is naturally handled by continuous batching, with or without chunked prefill. At each timestep, we reconstruct the batch on the fly. If, for example, request 1 finishes at a certain time, it will just be removed from the batch and returned directly to the user. It doesn’t have to stay there until all the requests are done. That’s the flexibility of continuous batching.
Participant 1: For companies or organizations that are getting started with this, you mentioned some overhead tradeoffs. Is it easy for, let’s say, a startup to get started? What are the main challenges if they’re doing that?
Yu: Based on all the customers we have been dealing with and the people we collaborate with in the open-source community, it’s actually straightforward if you want to start with the open-source solutions and tweak them yourself to handle those tradeoffs. Usually, things get more challenging when you’re scaling out to hundreds or thousands of nodes.
Usually what we see from our customers is that they prototype at a small scale, like five nodes or so. Once they want to go to a higher scale, they seek help from us. That’s how they deal with it. Certainly, tweaking the performance for small-scale problems is straightforward if you spend time on it.
Participant 4: I’d be interested to hear the challenges you have with this prefix caching. Does it lead to hallucinations? Have you seen accuracy problems once you start batching or caching the prefix?
Yu: In terms of output, prefix caching guarantees exactly the same output as before, because we just avoid recomputation. We’re not using approximation or anything like that. As long as you prefix cache correctly, the output is guaranteed to be the same. Of course, there are more advanced techniques, like approximate prefix caching: if my prompt has only a slight difference compared to a previous one, can I directly reuse its result? In that case, you may still get a reasonable response, but the accuracy may change.
Participant 4: Because I know proximity matters, and it’s a long tail: as you go further out, it matters less. Is that automatically determined, or are you manually defining that?
Yu: In the end, we have automatic prefix caching. We chunk the prompts into several blocks, and then we just see how many blocks you can match from the beginning. If you can match 50% of your blocks, you get a 50% compute reduction. If you can only match 30%, you can save only 30%. It depends on two things: one, whether your prompt is actually shared with a previous request, and two, whether the computed result is still in the cache. It may be evicted if you have many requests with different prompts.