TurboSparse Mobile: 22x Faster Mixtral Inference On PowerInfer-2

Table of Links

Abstract and 1. Introduction

Related Work and Background
Analysis

3.1 Limitations about Existing ReLUfication

3.2 dReLU
Are Neurons in Expert still Sparsely Activated?
dReLU Sparsification
Experiments Results

6.1 Downstream Tasks Performance

6.2 Sparsity of Sparsified Models
Practical Inference Speedup Evaluation

7.1 Experiments Setting

7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

7.4 Deploy LLMs on mobile phones
Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

7.4 Deploy LLMs on mobile phones

We also serve TurboSparse-Mixtral-47B with PowerInfer-2, a framework that supports LLM inference on mobile phones. PowerInfer-2 exploits sparse activation during LLM inference and introduces a computational engine for heterogeneous XPUs, so it can sustain high-speed inference even when the model parameters exceed DRAM capacity. As shown in Table 9, PowerInfer-2 running TurboSparse-Mixtral-47B achieves a 22.2× speedup over llama.cpp running the original Mixtral-47B. This gain comes primarily from PowerInfer-2 fully exploiting the extremely high sparsity that TurboSparse exhibits during inference.
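To make the link between sparse activation and speed concrete, here is a minimal sketch of a ReLU-gated feed-forward layer that skips work for inactive neurons. It is an illustration only, not PowerInfer-2's engine: the function names (`dense_ffn`, `sparse_ffn`), the plain-ReLU activation, and the toy layer sizes are all assumptions made for this example. A real system would also need to know which neurons are active before computing them, for example by predicting them; this sketch computes every activation and only skips the down-projection rows.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_ffn(x, w_up, w_down):
    """Reference path: every neuron's down-projection row is read."""
    hidden = [relu(dot(row, x)) for row in w_up]
    out = [0.0] * len(w_down[0])
    for h, row in zip(hidden, w_down):
        for j, w in enumerate(row):
            out[j] += h * w
    return out


def sparse_ffn(x, w_up, w_down):
    """Sparse path: rows of w_down for zero-activation neurons are skipped.

    Returns the output and the number of w_down rows actually read.
    """
    out = [0.0] * len(w_down[0])
    rows_read = 0
    for up_row, down_row in zip(w_up, w_down):
        h = relu(dot(up_row, x))
        if h == 0.0:
            continue  # inactive neuron contributes nothing to the output
        rows_read += 1
        for j, w in enumerate(down_row):
            out[j] += h * w
    return out, rows_read


if __name__ == "__main__":
    # Hypothetical toy sizes: 8 inputs, 32 hidden neurons, 4 outputs.
    random.seed(0)
    x = [random.uniform(-1, 1) for _ in range(8)]
    w_up = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(32)]
    w_down = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
    out, rows_read = sparse_ffn(x, w_up, w_down)
    print(f"read {rows_read} of {len(w_down)} down-projection rows")
```

Both paths produce the same output, because a neuron with zero activation adds nothing to the sum. The fewer neurons fire, the fewer weight rows have to be read and multiplied, which matters most when those weights do not all fit in DRAM.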

:::info
Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::

:::info
This paper is available on arXiv under the CC BY 4.0 license.

:::
