Table of Links
Abstract and 1. Introduction
2. Related Work and Background
3. Analysis
3.1 Limitations about Existing ReLUfication
3.2 dReLU
4. Are Neurons in Expert still Sparsely Activated?
5. dReLU Sparsification
6. Experiments Results
6.1 Downstream Tasks Performance
6.2 Sparsity of Sparsified Models
7. Practical Inference Speedup Evaluation
7.1 Experiments Setting
7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference
7.4 Deploy LLMs on mobile phones
8. Conclusion and References
A. Appendix / supplemental material
B. Limitation
C. Broader Impact
7.4 Deploy LLMs on mobile phones
We also serve TurboSparse-Mixtral-47B with PowerInfer-2, which supports LLM inference on mobile phones. PowerInfer-2 exploits the sparse activations that arise during LLM inference and introduces a computational engine for heterogeneous XPUs. It can perform high-speed inference even when the model parameters exceed DRAM capacity. As shown in Table 9, PowerInfer-2 running TurboSparse-Mixtral-47B achieves a 22.2× speedup over llama.cpp running the original Mixtral-47B. This gain comes primarily from PowerInfer-2's ability to fully exploit the extremely high sparsity that TurboSparse exhibits during inference.
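To illustrate why activation sparsity can reduce inference work, the following is a minimal sketch in plain Python. It is not PowerInfer-2's engine or API. It assumes a simplified feed-forward block with a single ReLU-activated up projection followed by a down projection. The function names (`ffn_dense`, `ffn_sparse`) and the weight layout (`w_down_cols[j]` holds the output contribution of hidden neuron `j`) are hypothetical and chosen only for this example. The sparse variant skips every hidden neuron whose activation is zero and still produces the same output as the dense variant.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(rows, x):
    """Dense matrix-vector product: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def ffn_dense(x, w_up, w_down_cols):
    """Reference FFN: h = ReLU(W_up x), y = sum over all j of h[j] * w_down_cols[j]."""
    h = relu(matvec(w_up, x))
    d_out = len(w_down_cols[0])
    y = [0.0] * d_out
    for j, hj in enumerate(h):
        for i in range(d_out):
            y[i] += hj * w_down_cols[j][i]
    return y


def ffn_sparse(x, w_up, w_down_cols):
    """Same FFN, but the down projection only visits neurons with h[j] > 0.

    Returns the output and the number of active neurons.
    Assumption: this sketch still computes the full up projection; it only
    demonstrates that zero activations contribute nothing downstream.
    """
    h = relu(matvec(w_up, x))
    active = [j for j, hj in enumerate(h) if hj > 0.0]
    d_out = len(w_down_cols[0])
    y = [0.0] * d_out
    for j in active:
        for i in range(d_out):
            y[i] += h[j] * w_down_cols[j][i]
    return y, len(active)


# Small demo with random weights (illustrative sizes, not the paper's models).
random.seed(0)
d_in, d_hidden, d_out = 8, 64, 8
x_demo = [random.gauss(0.0, 1.0) for _ in range(d_in)]
w_up_demo = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
w_down_demo = [[random.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(d_hidden)]

y_ref = ffn_dense(x_demo, w_up_demo, w_down_demo)
y_fast, n_active = ffn_sparse(x_demo, w_up_demo, w_down_demo)
assert all(abs(a - b) < 1e-9 for a, b in zip(y_ref, y_fast))
print(f"active neurons: {n_active} of {d_hidden}")
```

With random weights roughly half of the neurons are active, so the saving here is modest. The benefit grows as the fraction of active neurons shrinks: in this sketch, the work in the down projection is proportional to the number of active neurons rather than to the hidden size.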
:::info
Authors:
(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;
(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Li Ma, Shanghai Artificial Intelligence Laboratory;
(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);
(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.
:::
:::info
This paper is available on arXiv under the CC BY 4.0 license.
:::
