Why Your Phone’s AI is Slow: A Story of Sparse Neurons and Finicky Flash Storage | HackerNoon

News Room · Published 26 August 2025

Table of Links

Abstract and 1. Introduction

  2. Background and Motivation
  3. PowerInfer-2 Overview
  4. Neuron-Aware Runtime Inference
  5. Execution Plan Generation
  6. Implementation
  7. Evaluation
  8. Related Work
  9. Conclusion and References

2 Background and Motivation

2.1 LLM Inference and Metrics

LLM inference consists of two stages: the prefill stage and the decoding stage. During the prefill stage, the user’s prompt is processed by the LLM in a single iteration, generating the first token. The decoding stage then generates tokens sequentially, one at a time, in an autoregressive manner: the token produced during the prefill stage serves as the input for generating the second token, the second token in turn serves as the input for generating the third, and this sequence continues until the output is complete or an end-of-sequence (EOS) token is produced.
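To make the two stages concrete, the following is a minimal sketch of such a prefill-plus-decoding loop, assuming a generic `model(tokens)` callable that returns next-token logits; the names are illustrative and not any particular system’s API.

```python
# Minimal sketch of prefill + autoregressive decoding. `model(tokens)` is a
# hypothetical callable returning next-token logits; names are illustrative,
# not any particular system's API.
import numpy as np

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)

    # Prefill stage: the whole prompt is processed in one iteration and
    # yields the first generated token.
    logits = model(tokens)
    tokens.append(int(np.argmax(logits)))

    # Decoding stage: one token per iteration, each new token fed back in.
    for _ in range(max_new_tokens - 1):
        if tokens[-1] == eos_id:
            break
        logits = model(tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens
```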

The two stages exhibit distinct computational patterns, necessitating the optimization of two key metrics: the time to first token (TTFT) during the prefill stage and the time between tokens (TBT) during the decoding stage. The prefill stage processes all prompt tokens within a single iteration, imposing a considerable computational burden; in contrast, the decoding stage processes only one token per iteration, resulting in comparatively lower computational demands. Consequently, an LLM inference system must employ computing strategies tailored to each stage in order to optimize both metrics efficiently.
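As a rough illustration of how the two metrics map onto the loop above, here is a hedged sketch that wraps timing around the prefill call (TTFT) and each decoding iteration (TBT); the timing hooks are an assumption, not the paper’s evaluation harness.

```python
# Illustrative measurement of TTFT and TBT around the generation loop above;
# `model` is the same hypothetical callable, and the timing hooks are an
# assumption rather than the paper's evaluation harness.
import time
import numpy as np

def timed_generate(model, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    start = time.perf_counter()

    logits = model(tokens)                      # prefill: whole prompt at once
    tokens.append(int(np.argmax(logits)))
    ttft = time.perf_counter() - start          # time to first token

    decode_times = []
    for _ in range(max_new_tokens - 1):
        if tokens[-1] == eos_id:
            break
        t0 = time.perf_counter()
        logits = model(tokens)                  # decode: one token per iteration
        tokens.append(int(np.argmax(logits)))
        decode_times.append(time.perf_counter() - t0)

    tbt = sum(decode_times) / len(decode_times) if decode_times else 0.0
    return tokens, ttft, tbt                    # tbt = average time between tokens
```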

2.2 Predictable Sparse Activations

Mainstream LLMs, such as GPT-4 and Llama-2, employ a decoder-only transformer architecture. This architecture consists of multiple transformer layers, with each layer containing an attention block and a Feed-Forward Network (FFN) block. The attention block establishes relationships between tokens in the sequence, while the FFN block interprets and processes these relationships as structured by the attention block. Recent LLMs usually adopt Group Query Attention [27], which reduces the number of weights in the attention block, so that the FFN block occupies nearly 80% of the total weights. The activation functions in the FFN block, such as ReLU-family functions [3, 28, 40], lead to a significant occurrence of sparse activations [19, 39]: most neurons (represented as rows or columns in the FFN weight matrices) are inactivated because their computations have minimal impact on the final output.
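A scaled-down toy example of this effect, using random weights rather than a trained model: ReLU zeroes out every neuron whose pre-activation is negative, and only the surviving rows/columns of the FFN matrices contribute to the output.

```python
# Scaled-down toy example of FFN activation sparsity with a ReLU activation.
# Weights are random, so only about half of the neurons end up inactive here;
# trained ReLU-family LLMs exhibit far higher sparsity, which is what makes
# skipping inactive neurons worthwhile.
import numpy as np

d_model, d_ffn = 1024, 4096                 # illustrative, scaled-down sizes
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_model, d_ffn)) * 0.02
W_down = rng.standard_normal((d_ffn, d_model)) * 0.02

x = rng.standard_normal(d_model)
h = np.maximum(x @ W_up, 0.0)               # ReLU: negative pre-activations become exact zeros
active = np.flatnonzero(h)                  # indices of activated neurons

print(f"active neurons: {active.size}/{d_ffn} ({100 * active.size / d_ffn:.1f}%)")

# Only the active columns of W_up / rows of W_down matter for the output:
y_sparse = h[active] @ W_down[active]
assert np.allclose(y_sparse, h @ W_down)
```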

Fortunately, the activation of neurons in the FFN can be predicted before each FFN block is computed, as explored by prior works [21, 29, 30, 40]. For instance, PowerInfer [30] and DejaVu [21] place small MLP networks before each FFN block to predict its dynamic neuron activations. With these accurate predictors, they can significantly reduce the number of neuron computations within the FFN, thereby accelerating the inference process.
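A hedged sketch of the predictor idea follows: a small two-layer MLP runs before the FFN block and guesses which neurons will activate, so only those weights need to be fetched and multiplied. The shapes and predictor weights (P1, P2) are illustrative, not PowerInfer’s or DejaVu’s actual architecture.

```python
# Hedged sketch of an activation predictor: a small two-layer MLP, run before
# the FFN block, estimates which neurons will activate so only their weights
# are fetched and computed. P1/P2 are hypothetical predictor weights; this is
# not the actual predictor architecture of PowerInfer or DejaVu.
import numpy as np

def predict_active_neurons(x, P1, P2, threshold=0.5):
    """Return indices of FFN neurons predicted to activate for input x."""
    h = np.maximum(x @ P1, 0.0)                     # small hidden layer
    scores = 1.0 / (1.0 + np.exp(-(h @ P2)))        # per-neuron activation probability
    return np.flatnonzero(scores > threshold)

def sparse_ffn(x, W_up, W_down, P1, P2):
    idx = predict_active_neurons(x, P1, P2)
    h = np.maximum(x @ W_up[:, idx], 0.0)           # compute only predicted-active neurons
    return h @ W_down[idx]                          # skip the rest entirely
```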

2.3 Smartphone Storage Analysis

A smartphone usually lacks sufficient DRAM to hold an entire LLM. Consequently, a portion of the model’s weights may be stored in external storage, such as the Universal Flash Storage (UFS) 4.0 used with the Snapdragon 8 Gen 3. In this section, we analyze the performance characteristics of smartphone UFS, which guide the I/O design of PowerInfer-2.

2.3.1 Read Throughput and Block Size

First, we evaluated the random and sequential read throughputs of UFS 4.0 [1]. A notable feature is that the read bandwidth of UFS varies with the read block size. Generally, whether for sequential or random reads, the larger the block, the greater the read bandwidth. For example, when the block size is set to 512KB, sequential and random read bandwidths reach their maxima of 4 GB/s and 3.5 GB/s, respectively. When the block size is reduced to 4KB, bandwidth is at its minimum, with random read bandwidth at 450 MB/s.
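The following sketch shows how such a random-read microbenchmark could be structured on Linux/Android, varying the block size over a hypothetical weight file; on a real device the page cache would need to be dropped (or O_DIRECT with aligned buffers used) for the numbers to reflect raw UFS bandwidth.

```python
# Rough sketch of a random-read microbenchmark over a (hypothetical) weight
# file, varying the block size. On a real device the page cache must be
# dropped, or O_DIRECT with aligned buffers used, for the numbers to reflect
# raw UFS bandwidth rather than cached reads.
import os, random, time

def random_read_bw(path, block_size, total_bytes=256 * 1024 * 1024):
    fd = os.open(path, os.O_RDONLY)
    try:
        file_size = os.fstat(fd).st_size
        n_blocks = total_bytes // block_size
        offsets = [random.randrange(0, file_size - block_size) & ~4095
                   for _ in range(n_blocks)]            # 4KB-aligned offsets
        start = time.perf_counter()
        for off in offsets:
            os.pread(fd, block_size, off)
        return total_bytes / (time.perf_counter() - start) / 1e9   # GB/s
    finally:
        os.close(fd)

for bs in (4 * 1024, 64 * 1024, 512 * 1024):
    print(f"{bs // 1024:>4}KB blocks: {random_read_bw('model.weights', bs):.2f} GB/s")
```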

2.3.2 Random Read and Data Range

UFS random reads exhibit an interesting phenomenon: random-read performance is influenced by the range over which the random reads are issued. Specifically, a smaller random read range results in higher bandwidth. In UFS 4.0, as shown in Fig. 1b, if the 4KB random read range is set to 128MB, 256MB, or 512MB, the bandwidth for the 128MB range is the highest, reaching 1 GB/s, while the 512MB range has the lowest bandwidth, falling below 850 MB/s. Notably, this phenomenon is not as apparent with other block sizes. As a result, the bandwidth of 4KB random reads within a 128MB range exceeds that of 8KB and 12KB block sizes.
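A variant of the sketch above restricts the random 4KB reads to a window at the start of the file, mirroring the read-range experiment; the window sizes follow Fig. 1b, but the exact behavior depends on the device’s UFS controller, and the file name remains illustrative.

```python
# Variant restricting the 4KB random reads to a window at the start of the
# file, mirroring the read-range experiment; 'model.weights' and the window
# sizes are illustrative, and the exact effect depends on the UFS controller.
import os, random, time

def ranged_random_read_bw(path, window_bytes, block_size=4096,
                          total_bytes=128 * 1024 * 1024):
    fd = os.open(path, os.O_RDONLY)
    try:
        n_blocks = total_bytes // block_size
        offsets = [random.randrange(0, window_bytes - block_size) & ~4095
                   for _ in range(n_blocks)]
        start = time.perf_counter()
        for off in offsets:
            os.pread(fd, block_size, off)
        return total_bytes / (time.perf_counter() - start) / 1e9   # GB/s
    finally:
        os.close(fd)

for window_mb in (128, 256, 512):
    bw = ranged_random_read_bw('model.weights', window_mb * 1024 * 1024)
    print(f"{window_mb}MB range: {bw:.2f} GB/s")
```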

2.3.3 Read Throughput and CPU Core

A third observation is that read bandwidth is influenced by the CPU core issuing the read command: a higher core frequency correlates with increased read bandwidth. As shown in Table 1, when a big core running at 3.3GHz issues 4KB random reads, the bandwidth reaches 1 GB/s; when a little core running at 2.2GHz issues the same reads, the bandwidth is only about 760 MB/s. This correlation arises because the CPU core initiating the read must also run the UFS driver; a higher frequency therefore enables faster processing of UFS-related I/O operations, including interrupts and queue management.
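This effect could be probed by pinning the issuing process to a specific core before running the read benchmark, as in the sketch below; the core IDs are hypothetical (big cores might be CPUs 4-7 on a given phone), and `os.sched_setaffinity` is Linux-specific.

```python
# Sketch of pinning the issuing process to a single core before benchmarking
# 4KB random reads, to reproduce the big-core vs. little-core observation.
# Core IDs are hypothetical, and os.sched_setaffinity is Linux-specific.
import os, random, time

def read_bw_on_core(path, core_id, block_size=4096, n_reads=8192):
    os.sched_setaffinity(0, {core_id})      # pin this process to one core
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offsets = [random.randrange(0, size - block_size) & ~4095
                   for _ in range(n_reads)]
        start = time.perf_counter()
        for off in offsets:
            os.pread(fd, block_size, off)
        return n_reads * block_size / (time.perf_counter() - start) / 1e6  # MB/s
    finally:
        os.close(fd)
```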

2.3.4 Read Throughput and Core Number

The last observation is that unlike NVMe, the UFS storage in mobile devices has only one command queue, inherently lacking internal concurrency capabilities. Therefore, initiating I/O commands using multiple cores does not result in higher I/O bandwidth compared to using a single core. As shown in Table 1, using multiple cores for 4KB random reads even deteriorates the I/O performance by up to 40% as a result of contention in the UFS command queue.
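One simple way to observe this is to issue the same set of 4KB random reads from one thread and then from several threads and compare aggregate bandwidth, as in the sketch below; the file name and thread counts are illustrative only.

```python
# Illustrative comparison of single- vs. multi-threaded 4KB random reads.
# Because UFS exposes a single command queue, adding threads is not expected
# to raise bandwidth and may lower it; file name and thread counts are
# illustrative only.
import os, random, time
from concurrent.futures import ThreadPoolExecutor

def issue_reads(path, offsets, block_size=4096):
    fd = os.open(path, os.O_RDONLY)
    try:
        for off in offsets:
            os.pread(fd, block_size, off)
    finally:
        os.close(fd)

def bench(path, n_threads, n_reads=32768, block_size=4096):
    size = os.path.getsize(path)
    offsets = [random.randrange(0, size - block_size) & ~4095
               for _ in range(n_reads)]
    chunks = [offsets[i::n_threads] for i in range(n_threads)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(lambda c: issue_reads(path, c, block_size), chunks))
    return n_reads * block_size / (time.perf_counter() - start) / 1e6   # MB/s

for t in (1, 2, 4):
    print(f"{t} thread(s): {bench('model.weights', t):.0f} MB/s")
```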

Summary: When some model weights need to be stored on a mobile device’s storage medium, an efficient LLM system must fully consider the performance characteristics of the storage medium to maximize I/O bandwidth and minimize the performance overhead associated with I/O operations.

:::info
Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::


:::info
This paper is available on arxiv under CC BY 4.0 license.

:::

[1] Since LLM inference involves only weight reading, we did not consider the performance of write operations.
