At QConSF 2024, Cody Yu presented how Anyscale’s Ray can scale out batch inference more effectively. The problems Ray can help with include scaling to large datasets (hundreds of GBs or more), ensuring reliability across spot and on-demand instances, orchestrating multi-stage heterogeneous compute, and balancing cost against latency.
Ray Data offers scalable data processing solutions that maximize GPU utilization and minimize data movement costs through optimized task scheduling and streaming execution. The integration of Ray Data with vLLM, an open-source framework for LLM inference, has enabled scalable batch inference, significantly reducing processing times.
“The demand for batch inference is getting higher and higher. This is mainly because we now have multi-modality data sources. You have cameras, mic sensors, and PDF files. And then, by processing these files, you will get different kinds of raw data in different formats, which are either unstructured or structured.” – Cody Yu
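As a rough illustration of the Ray Data and vLLM integration described above, the sketch below streams a dataset through a class-based UDF that holds one vLLM engine per GPU actor; the model name, dataset paths, batch size, and concurrency are placeholder assumptions rather than details from the talk.

```python
import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        # One vLLM engine per actor; Ray schedules each actor onto its own GPU.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
        self.params = SamplingParams(temperature=0.0, max_tokens=64)

    def __call__(self, batch):
        # batch is a dict of columns; generate responses for the whole batch at once.
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch

ds = ray.data.read_parquet("s3://my-bucket/prompts/")  # hypothetical input with a "prompt" column
ds = ds.map_batches(
    VLLMPredictor,
    batch_size=64,    # rows handed to each engine call
    num_gpus=1,       # one GPU per actor replica
    concurrency=4,    # number of actor replicas
)
ds.write_parquet("s3://my-bucket/responses/")  # hypothetical output location
```

Because Ray Data executes this as a streaming pipeline, reading, preprocessing, and GPU inference overlap rather than waiting for the full dataset to materialize.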
Yu also discussed features such as continuous batching, which improve system throughput and efficiency, and highlighted a case study on generating embeddings from PDF files efficiently and cost-effectively with Ray Data: the job cost less than $1 while running on roughly 20 GPUs.
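A pipeline like that PDF-embedding case study might be expressed in Ray Data roughly as follows, with a CPU parsing stage feeding a GPU embedding stage; the PDF parser (pypdf), the embedding model, the file paths, and the resource settings are illustrative assumptions, not details from the presentation.

```python
import ray
from pypdf import PdfReader                             # assumed PDF parsing library
from sentence_transformers import SentenceTransformer   # assumed embedding model

def parse_pdf(row):
    # CPU stage: extract text from each PDF file.
    reader = PdfReader(row["path"])
    row["text"] = " ".join(page.extract_text() or "" for page in reader.pages)
    return row

class Embedder:
    def __init__(self):
        # GPU stage: each actor loads one copy of the embedding model.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

paths = ["docs/a.pdf", "docs/b.pdf"]                     # hypothetical input files
ds = ray.data.from_items([{"path": p} for p in paths])
ds = ds.map(parse_pdf)                                   # scales out on CPU workers
ds = ds.map_batches(Embedder, batch_size=128, num_gpus=1, concurrency=2)
ds.write_parquet("s3://my-bucket/embeddings/")           # hypothetical output location
```

The point of splitting the stages is that cheap CPU parsing and expensive GPU embedding can scale independently, which is what keeps the GPUs busy and the overall cost low.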
The discussion also covered the importance of pipeline parallelism in balancing execution times across the different stages of the LLM inference pipeline. By tuning batch sizes and employing chunk-based batching, the pipeline can be balanced so that no single stage becomes a bottleneck, improving throughput while making better use of computational resources across heterogeneous hardware. Ray Tune might also potentially be used to optimize batch-processing workflows through hyperparameter tuning.
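Since the talk only floats Ray Tune as a possibility, the following is a hypothetical sketch of how a batch-inference parameter such as batch size could be swept for throughput; run_batch_job and simulate_pipeline are stand-in functions, not code from the session.

```python
import time
from ray import tune

def simulate_pipeline(batch_size):
    # Dummy stand-in for launching the real Ray Data batch-inference job.
    time.sleep(0.1)
    return 10_000  # pretend this many rows were processed

def run_batch_job(config):
    start = time.time()
    rows = simulate_pipeline(batch_size=config["batch_size"])
    # Returning a dict reports it as the trial's final result.
    return {"rows_per_second": rows / (time.time() - start)}

tuner = tune.Tuner(
    run_batch_job,
    param_space={"batch_size": tune.grid_search([16, 32, 64, 128])},
    tune_config=tune.TuneConfig(metric="rows_per_second", mode="max"),
)
best = tuner.fit().get_best_result()
print(best.config)
```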
The session also briefly covered dynamic request batching in Ray Serve, which improves service throughput by processing multiple requests together and leveraging ML models’ vectorized computation. This is particularly useful for expensive models, where it helps keep hardware fully utilized. Batching is enabled with the ray.serve.batch decorator, which requires the decorated method to be asynchronous.
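A minimal sketch of that pattern is shown below; the deployment class, batch limits, and the stand-in "model" call are illustrative, not code from the talk.

```python
from typing import List

from ray import serve
from starlette.requests import Request

@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, texts: List[str]) -> List[str]:
        # Ray Serve gathers up to max_batch_size concurrent calls (or waits
        # batch_wait_timeout_s) and passes them here together as a list.
        return [f"processed: {t}" for t in texts]  # stand-in for a vectorized model call

    async def __call__(self, request: Request) -> str:
        text = (await request.json())["text"]
        # Each caller passes a single item; the decorator handles the batching.
        return await self.handle_batch(text)

app = BatchedModel.bind()
# serve.run(app)  # then POST JSON such as {"text": "hello"} to the endpoint
```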
Continuing the presentation, the speaker highlighted advancements in large language model (LLM) inference, focusing on the vLLM framework, speculative decoding, and inference engine optimization. vLLM is an open-source LLM inference engine known for its high throughput and memory efficiency. It features efficient key-value cache memory management with PagedAttention, continuous batching of incoming requests, and optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
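For readers unfamiliar with vLLM, a minimal offline-inference call looks roughly like the following; the model identifier, prompts, and sampling settings are placeholders rather than examples from the talk.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the history of distributed computing in one sentence.",
    "Explain continuous batching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# PagedAttention KV-cache management and continuous batching happen inside
# the engine; callers simply submit a batch of prompts.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```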
The presentation also covered speculative decoding, a technique that accelerates text generation by using a smaller draft model to propose multiple tokens, which a larger target model then verifies in parallel. This method reduces inter-token latency in memory-bound LLM inference, enhancing efficiency without compromising accuracy.
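The core draft-and-verify loop can be illustrated with a toy sketch, which is not vLLM’s implementation: draft_next and target_next are stand-in functions, and acceptance here is greedy rather than the rejection sampling used in practice.

```python
import random

random.seed(0)

def draft_next(prefix):
    # Cheap draft "model": fast but only approximates the target.
    return (sum(prefix) * 7 + 3) % 50

def target_next(prefix):
    # Expensive target "model": agrees with the draft most of the time.
    return draft_next(prefix) if random.random() < 0.8 else random.randrange(50)

def speculative_step(prefix, k=4):
    # 1. The draft model proposes k tokens autoregressively (cheap but sequential).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks the proposals; in a real engine all k positions
    #    are scored in a single batched forward pass rather than a Python loop.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token == expected:
            accepted.append(token)      # draft matched the target: keep the token
            ctx.append(token)
        else:
            accepted.append(expected)   # first mismatch: emit the target's token and stop
            break
    return accepted  # one target pass can yield up to k tokens here

print(speculative_step([1, 2, 3]))
```

When the draft model agrees with the target often enough, each expensive target pass emits several tokens instead of one, which is where the latency reduction comes from.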
Readers interested in learning more about batch inference with Ray may watch InfoQ.com in the coming weeks for a copy of the full presentation.