TurboSparse Mobile: 22x Faster Mixtral Inference On PowerInfer-2

Last updated: 2026/03/04 at 10:36 AM

News Room Published 4 March 2026

Table of Links

Abstract and 1. Introduction

Related Work and Background
Analysis

3.1 Limitations about Existing ReLUfication

3.2 dReLU
Are Neurons in Expert still Sparsely Activated?
dReLU Sparsification
Experiments Results

6.1 Downstream Tasks Performance

6.2 Sparsity of Sparsified Models
Practical Inference Speedup Evaluation

7.1 Experiments Setting

7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

7.4 Deploy LLMs on mobile phones
Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

7.4 Deploy LLMs on mobile phones

We also serve TurboSparse-Mixtral-47B with PowerInfer-2, which supports LLM inference on mobile phones. PowerInfer-2 exploits sparse activation during LLM inference and introduces a computational engine for heterogeneous XPUs. It can sustain high-speed inference even when the model parameters exceed DRAM capacity. As shown in Table 9, PowerInfer-2 running TurboSparse-Mixtral-47B achieves a 22.2× speedup over llama.cpp running the original Mixtral-47B. This gain comes primarily from PowerInfer-2 fully exploiting the extremely high sparsity that TurboSparse exhibits during inference.
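To illustrate why high activation sparsity can translate into less work, here is a minimal pure-Python sketch. It is not PowerInfer-2's implementation. It assumes a gated feed-forward layer with ReLU applied to both the gate and the up projection (an assumption based on the dReLU name in the outline), and all function names and the per-neuron weight layout are hypothetical. A neuron whose gate or up activation is zero contributes nothing to the output, so its remaining weights never need to be read.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def dense_ffn(x, w_gate, w_up, w_down):
    """Reference path: evaluate every neuron, including those that output zero."""
    out = [0.0] * len(w_down[0])
    for g_row, u_row, d_row in zip(w_gate, w_up, w_down):
        act = relu(dot(g_row, x)) * relu(dot(u_row, x))
        for j, d in enumerate(d_row):
            out[j] += act * d
    return out


def sparse_ffn(x, w_gate, w_up, w_down):
    """Skip path: returns (output, number of neurons that contributed)."""
    out = [0.0] * len(w_down[0])
    used = 0
    for g_row, u_row, d_row in zip(w_gate, w_up, w_down):
        g = relu(dot(g_row, x))
        if g == 0.0:
            continue  # gate is off: the up and down rows are never read
        u = relu(dot(u_row, x))
        if u == 0.0:
            continue  # up is off: the down row is never read
        used += 1
        for j, d in enumerate(d_row):
            out[j] += (g * u) * d
    return out, used


if __name__ == "__main__":
    # Hypothetical toy sizes and random weights, for illustration only.
    random.seed(0)
    d_model, n_neurons = 8, 64
    x = [random.gauss(0, 1) for _ in range(d_model)]
    w_gate = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
    w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
    w_down = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
    out, used = sparse_ffn(x, w_gate, w_up, w_down)
    assert out == dense_ffn(x, w_gate, w_up, w_down)
    print(f"neurons that contributed: {used} of {n_neurons}")
```

Both paths return the same output; the skip path simply touches fewer weights. The sketch discovers inactive neurons by computing the gate first, which is only one possible way to find them, and it says nothing about how a real engine schedules this work across heterogeneous processors or storage.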

:::info
Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::

:::info
This paper is available on arXiv under a CC BY 4.0 license.

:::
