What Makes Vision Transformers Hard to Quantize? | HackerNoon

News Room | Published 17 November 2025

Table of Links

Abstract and 1. Introduction

  2. Related work

  3. Method

    3.1. Uniform quantizer

    3.2. IGQ-ViT

    3.3. Group size allocation

  4. Experiments

    4.1. Implementation details and 4.2. Results

    4.3. Discussion

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. More implementation details

B. Compatibility with existing hardware

C. Latency on practical devices

D. Application to DETR

2. Related work

Network quantization. Network quantization aims at reducing the bit-widths of the weights and activations of neural networks. QAT methods simulate the quantization process by applying a round function to the weights and activations of the network. Since the derivatives of the round function are either zero or infinite, they approximate the gradients (e.g., using the straight-through estimator [3]) to train the network with backpropagation. These methods also adjust the derivatives of the round function [17, 18] or train quantization parameters jointly with network weights based on task losses [11, 16]. For better convergence of the training process, many heuristics have been introduced, e.g., progressively shrinking bit-widths [44] or freezing parts of the network weights [29, 43]. Networks quantized with QAT show performance comparable to or even better than their full-precision counterparts. However, the quantization process is computationally demanding, requiring a significant amount of training time.

PTQ offers an alternative approach to quantizing neural networks. Instead of training full-precision models and simulating the quantization process at training time, PTQ methods calibrate quantization parameters (e.g., quantization intervals) using a subset of training samples. Early efforts focus on optimizing the quantization parameters to minimize the difference between floating-point and quantized values [2, 28]. Another line of research considers the distributions of weights and/or activations to design quantizers. For instance, the work of [12] observes that network weights follow a bell-shaped distribution and, based on this, introduces piecewise linear quantizers that assign different quantization intervals according to the magnitudes of activations, performing better than uniform quantizers. Recent PTQ methods learn to round network weights either up or down by using a reconstruction error of layer outputs [27] or by exploiting the Hessian of training losses [19], and they have proven effective on CNN architectures (e.g., ResNet [13], MobileNetV2 [31]).

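To make the QAT discussion above concrete, the sketch below shows a uniform fake-quantizer whose non-differentiable round function is bypassed with a straight-through estimator (STE), so gradients can still flow during backpropagation. It is a minimal PyTorch illustration of the general idea, not the formulation of any specific cited method; the min-max calibration of `scale` and `zero_point` and the chosen bit-width are assumptions made for the example.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Uniform fake-quantizer with a straight-through estimator (STE).

    Forward: quantize x to 2**bits levels and de-quantize back to float.
    Backward: treat round() and clamp() as the identity for gradients.
    """

    @staticmethod
    def forward(ctx, x, scale, zero_point, bits=8):
        qmin, qmax = 0, 2 ** bits - 1
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale  # de-quantize back to float

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through to x; no gradient for
        # scale, zero_point, or bits in this simplified sketch.
        return grad_output, None, None, None

def fake_quantize(x, bits=8):
    # Min-max calibration of the quantization interval (an assumption;
    # the PTQ methods cited above optimize these parameters more carefully).
    scale = (x.max() - x.min()).clamp(min=1e-8) / (2 ** bits - 1)
    zero_point = torch.round(-x.min() / scale)
    return STEQuantize.apply(x, scale, zero_point, bits)

# Example: fake-quantize activations during training.
x = torch.randn(4, 16, requires_grad=True)
y = fake_quantize(x, bits=4).sum()
y.backward()  # gradients reach x thanks to the STE
```
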
Transformer quantization. While ViTs [10] and their variants [25, 34] have become increasingly popular in computer vision, the unique structure and characteristics of ViT architectures make network quantization challenging. For example, PTQ methods for CNNs [2, 19, 27, 28] do not perform well on softmax attentions and GELU activations in transformers, and directly applying them to ViT quantization results in significant performance degradation [26]. To date, only a limited number of PTQ methods have been developed for ViTs. The work of [26] estimates quantization parameters that maximize similarities between full-precision and quantized outputs of linear operations, and proposes to preserve the relative order of attention values after quantization. APQ-ViT [9] introduces a calibration metric to minimize the discrepancies between full-precision and quantized outputs, while maintaining the power-law distribution of softmax attentions. PTQ4ViT [40] introduces twin uniform quantizers to handle the asymmetric distributions of softmax attentions and GELU activations effectively. Most PTQ methods for ViTs exploit a single quantizer for all channels, and thus do not account for the distributions of activation values across channels, which typically exhibit extreme scale variations. Recent works [21, 23] attempt to alleviate this scale variation problem efficiently. FQ-ViT [23] considers inter-channel scale variations for LayerNorm [1] and exploits channel-wise quantizers under the constraint that the ratios of quantization intervals are power-of-two values. This enables bit-shift operations, computing the mean and variance of LayerNorm at the integer level. The scale reparameterization technique introduced by RepQ-ViT [21] makes it possible to use layer-wise quantizers, instead of channel-wise ones, by adjusting the affine factors of LayerNorm and the weights of FC layers. However, this technique applies only to the activations of LayerNorm, and does not fully address the inter-channel scale variations in other layers of transformers.

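The inter-channel scale variation mentioned above is easy to see numerically: when a single layer-wise quantizer must cover channels with very different dynamic ranges, the small-magnitude channels are mapped onto only a handful of quantization levels. The toy comparison below contrasts the reconstruction error of one shared quantizer against independent per-channel quantizers; the synthetic activations, the 4-bit setting, and the min-max calibration are illustrative assumptions, not the procedure of any specific cited method.

```python
import torch

def uniform_quantize(x, bits=4):
    """Min-max uniform quantization of a tensor (illustrative only)."""
    scale = (x.max() - x.min()).clamp(min=1e-8) / (2 ** bits - 1)
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 2 ** bits - 1)
    return (q - zero_point) * scale

torch.manual_seed(0)
# Activations for 8 channels whose dynamic ranges differ by ~100x,
# mimicking extreme inter-channel scale variations in ViT activations.
channel_scales = torch.logspace(-1, 1, steps=8)     # per-channel std from 0.1 to 10
acts = torch.randn(256, 8) * channel_scales          # shape: (tokens, channels)

# (a) one layer-wise quantizer shared by all channels
err_layer = (uniform_quantize(acts) - acts).pow(2).mean()

# (b) an independent quantizer per channel
err_channel = torch.stack(
    [(uniform_quantize(acts[:, c]) - acts[:, c]).pow(2).mean() for c in range(8)]
).mean()

print(f"layer-wise MSE:   {err_layer.item():.4f}")
print(f"channel-wise MSE: {err_channel.item():.4f}")  # typically much smaller
```
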
Similar to ours, the works of [4, 7, 32, 36] adopt group quantization techniques for transformers. For instance, Q-BERT [32] and VS-Quant [7] divide consecutive channels uniformly into a number of groups without considering the dynamic range of each channel, so the channels assigned to each group do not follow similar distributions. PEG [4] alleviates this issue by sorting the channels w.r.t. the dynamic ranges of their activations during calibration, before grouping them. Quantformer [36] proposes to use a differentiable search [6, 24] for QAT in order to group channels of activation maps. In the group quantization techniques of [4, 7, 32], however, the channels assigned to particular groups are fixed after calibrating pretrained networks for PTQ, which makes them ill-suited for ViTs, whose channel distributions vary across input instances. In contrast, our approach applies group quantization along channels of activation maps and tokens of softmax attentions dynamically at runtime for each input instance, without additional parameters for PTQ.

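As a rough illustration of the dynamic grouping idea (in contrast to the static grouping of [4, 7, 32]), the sketch below recomputes, for each input instance, which channels share a quantizer: channels are sorted by their dynamic range and split into groups of similar range, with one quantizer per group. This is only a simplified approximation of instance-aware group quantization; the group count, the range-based sorting, and the min-max calibration are assumptions rather than the authors' exact algorithm.

```python
import torch

def uniform_quantize(x, bits=4):
    """Min-max uniform quantization (illustrative helper)."""
    scale = (x.max() - x.min()).clamp(min=1e-8) / (2 ** bits - 1)
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 2 ** bits - 1)
    return (q - zero_point) * scale

def dynamic_group_quantize(acts, num_groups=4, bits=4):
    """Group channels with similar dynamic ranges at runtime and share one
    quantizer per group. `acts` has shape (tokens, channels)."""
    # Per-channel dynamic range, computed for this specific input instance.
    ranges = acts.max(dim=0).values - acts.min(dim=0).values
    order = torch.argsort(ranges)               # channels sorted by range
    groups = torch.chunk(order, num_groups)     # contiguous groups of similar range
    out = torch.empty_like(acts)
    for idx in groups:
        out[:, idx] = uniform_quantize(acts[:, idx], bits)  # one quantizer per group
    return out

# Example: the grouping is recomputed per input, so channel-to-group
# assignments can change from one instance to the next.
torch.manual_seed(0)
acts = torch.randn(197, 384) * torch.rand(384) * 10   # (tokens, channels) for one image
quantized = dynamic_group_quantize(acts, num_groups=8)
print((quantized - acts).pow(2).mean().item())
```
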
:::info
Authors:

(1) Jaehyeon Moon, Yonsei University and Articron;

(2) Dohyung Kim, Yonsei University;

(3) Junyong Cheon, Yonsei University;

(4) Bumsub Ham, Yonsei University (corresponding author).

:::


:::info
This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed (Attribution-NonCommercial-NoDerivs 4.0 International) license.

:::
