Table of Links
Abstract and 1 Introduction
2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.2 KV Cache Manager
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
4.5 Scheduling and Preemption
4.6 Distributed Execution
5 Implementation
6 Evaluation and 6.1 Experimental Setup
6.2 Basic Sampling
6.3 Parallel Sampling and Beam Search
6.4 Shared Prefix
6.5 Chatbot
7 Ablation Studies
8 Discussion
9 Related Work
10 Conclusion, Acknowledgement and References
6.3 Parallel Sampling and Beam Search
We evaluate the effectiveness of memory sharing in PagedAttention with two popular sampling methods: parallel sampling and beam search. In parallel sampling, all parallel sequences in a request can share the KV cache for the prompt. As shown in the first row of Fig. 14, vLLM's improvement over the Orca baselines grows as the number of sequences to sample increases. Similarly, the second row of Fig. 14 shows the results for beam search with different beam widths. Since beam search allows for more sharing, vLLM demonstrates even greater performance benefits: its improvement over Orca (Oracle) on OPT-13B with the Alpaca dataset grows from 1.3× in basic sampling to 2.3× in beam search with a width of 6.
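To make this sharing mechanism concrete, below is a minimal sketch of a reference-counted block manager in which parallel samples reuse the prompt's physical KV cache blocks and only diverge via copy-on-write. The class and method names (`BlockManager`, `fork`, `copy_on_write`) are illustrative assumptions, not vLLM's actual API.

```python
# Sketch of reference-counted KV cache block sharing with copy-on-write.
# Illustrative only; not vLLM's actual implementation.

class BlockManager:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}            # physical block id -> reference count

    def allocate(self):
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        # Parallel sampling: every sample reuses the prompt's physical
        # blocks, so forking only increments reference counts.
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)       # new logical table, same physical blocks

    def copy_on_write(self, block_table, idx):
        # Called before a sequence writes to a block it shares with others.
        block = block_table[idx]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            block_table[idx] = self.allocate()  # caller copies the KV data over
        return block_table[idx]


mgr = BlockManager(num_blocks=16)
prompt_blocks = [mgr.allocate() for _ in range(4)]     # prompt spans 4 blocks
samples = [mgr.fork(prompt_blocks) for _ in range(3)]  # 3 parallel samples
# All 3 samples occupy the same 4 physical blocks until one of them writes.
```

Because forking is just a reference-count bump, the prompt's KV cache occupies physical memory once regardless of how many sequences are sampled in parallel.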
Fig. 15 plots the amount of memory saving, computed as the number of blocks saved by sharing divided by the total number of blocks without sharing. We show 6.1%–9.8% memory saving on parallel sampling and 37.6%–55.2% on beam search. In the same experiments with the ShareGPT dataset, we observed 16.2%–30.5% memory saving on parallel sampling and 44.3%–66.3% on beam search.
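As a toy illustration of how this metric behaves (the numbers below are assumed, not taken from the paper): with parallel sampling, the prompt's blocks are stored once rather than once per sample, so the saving grows with the number of samples and with the prompt-to-output length ratio.

```python
# Hypothetical example of the memory-saving metric:
# saved blocks / total blocks without sharing. Numbers are assumed.
num_samples = 4
prompt_blocks = 10       # blocks holding the shared prompt KV cache
output_blocks = 15       # blocks per sample's generated tokens

without_sharing = num_samples * (prompt_blocks + output_blocks)  # 100
with_sharing = prompt_blocks + num_samples * output_blocks       # 70
saving = (without_sharing - with_sharing) / without_sharing
print(f"{saving:.1%}")   # 30.0% -- the prompt is stored once, not 4 times
```

Beam search shares more than just the prompt (beams also share the blocks of their common generated prefix), which is why its measured savings are substantially larger.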
Authors:
(1) Woosuk Kwon, UC Berkeley (equal contribution);
(2) Zhuohan Li, UC Berkeley (equal contribution);
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.