Boosting LLM Decode Throughput: vAttention vs. PagedAttention | HackerNoon

News Room
Published 13 June 2025 (last updated 13 June 2025 at 10:52 PM)

Table of Links

  • Abstract and 1 Introduction
  • 2 Background
  • 2.1 Large Language Models
  • 2.2 Fragmentation and PagedAttention
  • 3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
  • 3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
  • 4 Insights into LLM Serving Systems
  • 5 vAttention: System Design and 5.1 Design Overview
  • 5.2 Leveraging Low-level CUDA Support
  • 5.3 Serving LLMs with vAttention
  • 6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
  • 6.2 Hiding memory allocation latency
  • 7 Evaluation
  • 7.1 Portability and Performance for Prefills
  • 7.2 Portability and Performance for Decodes
  • 7.3 Efficacy of Physical Memory Allocation
  • 7.4 Analysis of Memory Fragmentation
  • 8 Related Work
  • 9 Conclusion and References

7.2 Portability and Performance for Decodes

To evaluate decode performance, we focus on long-context scenarios (16K tokens) because the latency of the attention kernel becomes significant only for long contexts [4]. We evaluate the following configurations:

vLLM: We use vLLM v0.2.7 as the primary baseline. vLLM pioneered PagedAttention and uses a custom paged kernel for decodes, derived from FasterTransformer [4].

Figure 9. Decode throughput with varying batch sizes using context length 16K for each request (FA: FlashAttention, bs: block size). We evaluate vLLM and FlashAttention with two different block sizes: 16 and 128. vLLM performs best with block size 16 because its attention kernel is more efficient with smaller block sizes. FlashAttention’s GPU kernel is up to 2.85× faster than the best version of vLLM’s kernel (Yi-6B, 16*16K). However, smaller blocks add CPU overhead; e.g., FlashAttention with block size 16 is worse than with block size 128. vAttention provides gains similar to those that the best version of FlashAttention provides over vLLM, but without user-level physical memory management and without a PagedAttention kernel.

FA_Paged: For the second baseline, we integrate the FlashAttention kernel into vLLM’s serving stack. This represents a state-of-the-art PagedAttention kernel that includes optimizations such as sequence parallelism and in-place copy of new key and value vectors into the KV-cache. We evaluate the paged kernels of vLLM and FlashAttention with two different block sizes – 16 and 128 – to capture the effect of block size on performance.
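To make the paged layout concrete, here is a minimal sketch (C++, illustrative names; not vLLM’s or FlashAttention’s actual code) of how a Block-Table translates a token’s logical position within a request into a slot in the shared physical KV-cache pool. A paged attention kernel must perform this indirection for every key/value vector it reads, which is why such a kernel has to be written against the Block-Table layout.

```cpp
// Minimal Block-Table sketch (illustrative, not production code).
#include <cstdint>
#include <vector>

struct BlockTable {
    int block_size;               // tokens per KV-cache block, e.g. 16 or 128
    std::vector<int32_t> blocks;  // physical block id for each logical block of the request
};

// Translate a token's logical index within a request into a slot in the
// physical KV-cache pool shared across requests.
int64_t kv_slot(const BlockTable& bt, int64_t token_idx) {
    const int64_t logical_block = token_idx / bt.block_size;
    const int64_t offset        = token_idx % bt.block_size;
    return static_cast<int64_t>(bt.blocks[logical_block]) * bt.block_size + offset;
}

int main() {
    // A request whose KV-cache occupies physical blocks 7, 2 and 11 (block size 16).
    BlockTable bt{16, {7, 2, 11}};
    return kv_slot(bt, 20) == 2 * 16 + 4 ? 0 : 1;  // token 20 -> block 2, offset 4
}
```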

FA_vAttention: For vAttention, we integrate the vanilla kernel of FlashAttention into vLLM’s serving stack. The kernel works with a virtually contiguous KV-cache to which we dynamically allocate physical memory using 2MB pages.
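The low-level mechanism behind this is the CUDA virtual memory management API (§5.2): reserve a large, virtually contiguous region for the KV-cache up front, then map physical 2MB pages into it on demand as decoding proceeds. The sketch below illustrates the idea with the CUDA driver API; it is a simplified illustration (single GPU, no error handling), not vAttention’s actual implementation.

```cpp
// Sketch of on-demand physical memory mapping with the CUDA driver VMM API
// (simplified; real code should query the allocation granularity with
// cuMemGetAllocationGranularity and check every return code).
#include <cuda.h>
#include <cstddef>

constexpr size_t kPageSize = 2ull << 20;  // 2MB physical pages

// Reserve virtual address space for the worst-case KV-cache size. No physical
// memory is committed yet, but the pointer is contiguous, so an unmodified
// (non-paged) attention kernel can index into it directly.
CUdeviceptr reserve_kv_cache(size_t max_bytes) {
    CUdeviceptr base = 0;
    cuMemAddressReserve(&base, max_bytes, kPageSize, /*addr=*/0, /*flags=*/0);
    return base;
}

// Back the next 2MB of the reservation with physical GPU memory as the
// sequence grows past the currently mapped region.
void map_next_page(CUdeviceptr base, size_t mapped_bytes, int device) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = device;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, kPageSize, &prop, 0);               // allocate a physical page
    cuMemMap(base + mapped_bytes, kPageSize, 0, handle, 0);  // splice it into the reservation

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base + mapped_bytes, kPageSize, &access, 1);
}

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr kv = reserve_kv_cache(1ull << 30);  // reserve 1GB of virtual space
    map_next_page(kv, 0, 0);                        // map the first 2MB on demand
    return 0;
}
```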

Figure 9a shows the decode throughput of Yi-6B, Llama3-8B and Yi-34B with varying batch sizes, where the initial context length of each request is 16K tokens and we generate 256 tokens per request. We compute decode throughput based on the mean latency of 256 decode iterations. We summarize the key takeaways below.
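As a quick illustration of the metric (with made-up numbers, not the paper’s measurements): each decode iteration produces one token per request in the batch, so throughput follows directly from the batch size and the mean iteration latency.

```cpp
// Decode throughput from mean iteration latency (illustrative numbers only).
#include <cstdio>

int main() {
    const double batch_size     = 8;      // requests decoded together (assumed)
    const double mean_latency_s = 0.020;  // mean latency of one decode iteration (assumed)
    // One token is generated per request per iteration.
    const double tokens_per_s = batch_size / mean_latency_s;
    std::printf("decode throughput ~ %.0f tokens/s\n", tokens_per_s);  // prints ~400
    return 0;
}
```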

First, vAttention outperforms vLLM (both block sizes) and FA_Paged (block size 16), while roughly matching the best configuration of FA_Paged (block size 128). The maximum improvement over vLLM is 1.97× for Yi-6B, 1.3× for Llama3-8B and 1.6× for Yi-34B. The relative gains over vLLM also increase as the batch size grows. For example, the gain increases from about 1.1× to 1.97× as the batch size increases from 1 to 8 for Yi-6B. This is because the latency of attention computation grows in proportion to the total number of tokens in the batch (see Figure 9b), whereas the cost of linear operators remains roughly the same [25, 26, 41]. Therefore, the contribution of the attention kernel to the overall latency – and consequently the gain from a more efficient kernel – increases with the batch size. While FA_Paged (block size 128) provides gains similar to vAttention, note that FA_Paged requires a new implementation of the GPU kernel whereas vAttention simply leverages the vanilla kernel of FlashAttention.
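A rough way to see why the attention kernel’s share of latency grows with batch size: decode iterations are memory-bandwidth bound, the linear operators re-read the same model weights regardless of batch size, while attention must read every request’s KV-cache. The back-of-the-envelope model below uses assumed Llama-3-8B-like dimensions (and ignores grouped-query attention) purely for illustration; it is not taken from the paper.

```cpp
// Rough memory-traffic model for one decode iteration (assumptions, not paper data).
#include <cstdio>

int main() {
    const double hidden = 4096, layers = 32, ctx = 16384;  // assumed model/context config
    const double bytes_per_elem = 2;                       // fp16
    const double params = 8e9;                             // ~8B parameters (assumed)
    // Linear operators: weights are read once per iteration, independent of batch size.
    const double weight_bytes = params * bytes_per_elem;

    const int batches[] = {1, 2, 4, 8};
    for (int batch : batches) {
        // Attention: reads K and V for every token of every request in the batch
        // (full multi-head attention assumed; GQA would shrink this by the head ratio).
        const double kv_bytes = 2.0 * batch * ctx * layers * hidden * bytes_per_elem;
        std::printf("batch %d: KV-cache traffic / weight traffic = %.2f\n",
                    batch, kv_bytes / weight_bytes);
    }
    return 0;
}
```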

Second, Figure 9b confirms that the performance difference between vLLM and FA_Paged/vAttention is indeed due to the attention kernels. In the worst case, the latency of vLLM’s best PagedAttention kernel (block size 16) is up to 2.85× higher than that of the FlashAttention kernel for Yi-6B, up to 1.45× higher for Llama-3-8B, and up to 2.62× higher for Yi-34B.

Finally, throughput can be sensitive to block size even when memory capacity is not a constraint. For example, as discussed in §3.3, vLLM’s attention kernel has a significantly higher latency with block size 128 than with block size 16 (also see Figure 9b). In the worst case, block size 128 degrades vLLM’s throughput by 36%. While block size has a smaller impact on FlashAttention, using a small block size can still hurt throughput due to CPU overheads, particularly the overhead of creating Block-Tables for every iteration (§3.3). For example, FlashAttention with block size 128 delivers 7% higher throughput than block size 16 for Llama-3-8B (531 vs 494 tokens per second with batch size 32).
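The Block-Table overhead is easy to quantify: the framework materializes one table entry per block of every request each iteration, so shrinking the block size by 8× multiplies the CPU-side work by 8×. A quick arithmetic sketch for the 16K-token, batch-32 setup used here (illustrative, not the paper’s profiling code):

```cpp
// Block-Table entries per decode iteration for two block sizes (illustrative).
#include <cstdio>

int main() {
    const int context_len = 16384;   // tokens per request (16K, as in the experiment)
    const int batch_size  = 32;
    const int block_sizes[] = {16, 128};
    for (int block_size : block_sizes) {
        const int entries_per_req = (context_len + block_size - 1) / block_size;  // ceil division
        std::printf("block size %3d: %4d entries/request, %6d entries/iteration\n",
                    block_size, entries_per_req, entries_per_req * batch_size);
        // block size  16: 1024 entries/request, 32768 entries/iteration
        // block size 128:  128 entries/request,  4096 entries/iteration
    }
    return 0;
}
```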

Figure 10. Latency of decode iterations with and without overlapping memory allocation with compute (batch size=4, context length=32K). Spikes show the latency impact of synchronous memory allocation.

Table 7. Physical memory allocation bandwidth (GB per second) for vAttention with different page sizes.



[4] For short contexts, the computation time of the feed-forward network dominates inference latency [25].

Authors:

(1) Ramya Prabhu, Microsoft Research India;

(2) Ajay Nayak, Indian Institute of Science (contributed to this work as an intern at Microsoft Research India);

(3) Jayashree Mohan, Microsoft Research India;

(4) Ramachandran Ramjee, Microsoft Research India;

(5) Ashish Panwar, Microsoft Research India.
