Authors:
(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;
(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Quantifying the Impact of Parameters on Model Performance & 4. Unified Mixed-Precision Training
5 Prevalence of Parameter Heterogeneity in LLMs
6 Quantization Experiments and 6.1 Implementation Details
6.2 Effect of Base LLM Quantization
6.3 Effect of Chat LLM Quantization
6.4 Comparison of Parameter Selection Criteria, Conclusion, & References
Quantization Strategies for LLMs. Various quantization strategies have been proposed to reduce the precision of weights and activations while maintaining acceptable accuracy. They fall broadly into post-training quantization and quantization-aware training [14]. Post-training quantization methods, such as OBD, OBS, and GPTQ, quantize the pre-trained model directly without fine-tuning [15, 10, 8]. Quantization-aware training methods, such as LLM-QAT [18], instead incorporate quantization operations into the training process and jointly optimize the quantized model. Other works explore mixed-precision quantization [13] and adaptive quantization bins [7] to achieve a better trade-off between accuracy and efficiency.
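For reference, the sketch below illustrates the simplest form of post-training quantization: round-to-nearest weight quantization with per-row symmetric scales, applied to a pre-trained weight matrix without any fine-tuning. It is a minimal baseline for intuition only, not the algorithm of GPTQ or any other cited method; the function names and the 4-bit setting are illustrative assumptions.

```python
import torch

def rtn_quantize(weight: torch.Tensor, n_bits: int = 4):
    """Round-to-nearest post-training quantization with symmetric
    per-row (per-output-channel) scales. Illustrative sketch only."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                     # avoid division by zero
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                    # integer codes + fp scales

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight matrix for inference."""
    return q.to(scale.dtype) * scale

# Usage: quantize one linear layer's weight without retraining.
w = torch.randn(4096, 4096)
q, s = rtn_quantize(w, n_bits=4)
w_hat = dequantize(q, s)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

Quantization-aware training differs in that the rounding step is simulated inside the forward pass during training, so the model parameters adapt to the quantization error rather than being quantized after the fact.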
Outliers in Language Model Quantization. Modeling parameter outliers in LLM quantization is not a new idea. Prior work examines outliers primarily from the perspectives of magnitude [18, 7] and activations [4, 6]. From the magnitude perspective, QLoRA assumes that parameters follow a Gaussian distribution and designs information-theoretically optimal quantization bins based on this assumption [7], while [18] keeps outlier parameters in 16-bit precision. From the activation perspective, [17] migrates the outlier amplifier to subsequent modules through an equivalent transformation. SqueezeLLM also measures outliers from the perspective of parameter impact [13]. To the best of our knowledge, our work is the first to systematically reveal the outliers (heterogeneity) of parameter impact across different models, and we show that the imbalance in parameter impact is more pronounced than the imbalance in parameter magnitude (§ 6.4). Furthermore, we propose a method that unifies the optimization of outlier (cherry) parameters and normal parameters, addressing the optimization challenges posed by heterogeneous parameters.
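To make the magnitude-based view of outliers concrete, the sketch below keeps the largest-magnitude fraction of weights in 16-bit precision and applies plain round-to-nearest 4-bit quantization to the remainder. The function name, the 1% outlier ratio, and the per-tensor scale are illustrative assumptions; this is a toy magnitude baseline, not the cherry-parameter (impact-based) criterion proposed in this paper nor the exact procedure of [18].

```python
import torch

def split_outliers_by_magnitude(weight: torch.Tensor, outlier_ratio: float = 0.01):
    """Toy mixed-precision scheme: keep the top `outlier_ratio` of weights
    (by absolute value) in fp16 and 4-bit quantize the rest."""
    k = max(1, int(outlier_ratio * weight.numel()))
    threshold = weight.abs().flatten().topk(k).values.min()
    outlier_mask = weight.abs() >= threshold           # ~top 1% by |w|

    # 4-bit symmetric per-tensor round-to-nearest on the non-outlier weights.
    qmax = 7
    dense = torch.where(outlier_mask, torch.zeros_like(weight), weight)
    scale = dense.abs().max().clamp(min=1e-8) / qmax
    q_dense = torch.clamp(torch.round(dense / scale), -8, 7)

    w_hat = q_dense * scale                            # dequantized normal weights
    w_hat[outlier_mask] = weight[outlier_mask].half().float()  # fp16 outliers
    return w_hat, outlier_mask

# Usage: compare reconstruction error with and without preserved outliers.
w = torch.randn(1024, 1024)
w_hat, mask = split_outliers_by_magnitude(w, outlier_ratio=0.01)
print("weights kept in fp16:", mask.sum().item())
print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())
```

An impact-based criterion would replace the `|w|` ranking above with a measure of how much each parameter's perturbation changes model performance, which is the heterogeneity this paper studies.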