
IGQ-ViT: Instance-Aware Group Quantization for Low-Bit Vision Transformers | HackerNoon

News Room · Published 17 November 2025, last updated 4:39 PM

Table of Links

Abstract and 1. Introduction

  2. Related work

  3. Method

    3.1. Uniform quantizer

    3.2. IGQ-ViT

    3.3. Group size allocation

  4. Experiments

    4.1. Implementation details and 4.2. Results

    4.3. Discussion

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. More implementation details

B. Compatibility with existing hardware

C. Latency on practical devices

D. Application to DETR

A. More implementation details

A.1. Weight quantization

For weight quantization, we exploit a distinct quantizer for each output channel, following [21]. We set the upper and lower bounds of each weight quantizer to the (100−ϵ)-th and ϵ-th percentiles of the weight values in the corresponding output channel, where ϵ is a hyperparameter.
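As a rough sketch of this per-channel percentile clipping scheme (the bit-width, default ϵ, and function name below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def quantize_weights_per_channel(w, n_bits=4, eps=0.1):
    """Uniform quantization with a distinct quantizer per output channel.

    Clipping bounds are the eps-th and (100 - eps)-th percentiles of the
    weight values in each output channel, as described above. Returns
    dequantized ("fake-quantized") weights for inspection.
    """
    # w: (out_channels, in_features); percentiles taken along the input axis
    lo = np.percentile(w, eps, axis=1, keepdims=True)
    hi = np.percentile(w, 100.0 - eps, axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2 ** n_bits - 1)
    zero_point = np.round(-lo / scale)
    # quantize to the integer grid, then map back to floating point
    q = np.clip(np.round(w / scale) + zero_point, 0, 2 ** n_bits - 1)
    return (q - zero_point) * scale
```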

A.2. Hyperparameter settings

A.3. Perturbation metric for Mask R-CNN models

For Mask R-CNN and Cascade Mask R-CNN models, we modify the perturbation metric for each RoI in Eq. (9) as follows:

Figure A. Our instance-aware group quantization framework could be implemented using existing DNN accelerators, with a slight modification to the implementation of [7]. One might leverage a group assignment module that computes the group indices for each channel using Eq. (5), and the resulting indices are used to select the channels that belong to each group during matrix multiplication. The corresponding rows of the weight buffers are also selected.

B. Compatibility with existing hardware

We believe that IGQ-ViT could be efficiently implemented on existing neural network accelerators, with a slight modification to the implementation of VSquant [7]. Specifically, [7] divides the channels of activations into a number of groups, and the activation values assigned to each group are processed with separate multiply-accumulate (MAC) units. The outputs from each MAC unit are then scaled with different quantization parameters. In contrast, IGQ-ViT dynamically splits channels according to their statistical properties for each input instance. Compared to [7], IGQ-ViT requires additional computations, which include computing the min/max values of each channel and assigning channels to the quantizers with the minimum distance. To address this, one might leverage a group assignment module that computes the min/max values of each channel, followed by obtaining group indices using Eq. (5) (Fig. A). The resulting indices are then used to select the channels that belong to each group. Finally, a separate MAC unit, with a single quantization parameter, is applied to the activation values within each group. Note that computing the group indices for each channel is computationally cheap in terms of BOPs (see Table 1 in the main paper), and using an indexing scheme for efficient computation is common practice in real devices. For example, the work of [41] implements a module providing indices to dynamically detect sparsity patterns of weight and activation values in each group. It then uses the indices to skip groups of zero-valued weights and activations for efficiency (see Fig. 15.2.3 in [41]).
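The group assignment step can be sketched as follows. The L1 distance between a channel's (min, max) statistics and a quantizer's bounds is an illustrative stand-in for the criterion defined by Eq. (5) in the main paper, and the function name is hypothetical:

```python
import numpy as np

def assign_channels_to_groups(x, quantizer_bounds):
    """Assign each activation channel to the quantizer whose range is
    closest to the channel's own min/max statistics.

    x: (tokens, channels) activations for one input instance
    quantizer_bounds: (num_groups, 2) array of (lower, upper) bounds
    """
    # per-channel min/max statistics, stacked as (channels, 2)
    stats = np.stack([x.min(axis=0), x.max(axis=0)], axis=1)
    # L1 distance from every channel's statistics to every quantizer's bounds
    dist = np.abs(stats[:, None, :] - quantizer_bounds[None, :, :]).sum(axis=2)
    return dist.argmin(axis=1)  # group index per channel
```

In a hardware realization, these indices would drive the channel selection feeding each group's MAC unit, as in Fig. A.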

Figure B. Run-time latency for the tasks of (a) image classification and (b) object detection/instance segmentation using an NVIDIA RTX 3090. For group quantization, we use a group size of 8 for all layers. For Mask R-CNN models, we use Swin-T as the backbone.

C. Latency on practical devices

To further validate the efficiency of IGQ-ViT, we conduct a simulation in PyTorch to compare the latencies of prior group quantization techniques [4, 7, 32] and ours. A key challenge is that most quantization methods exploit a fake quantization approach, following [20], which mimics the quantization process by discretizing the network’s weights and activations into a finite set of floating-point values, thereby serving as a surrogate for the true quantization process. This approach is inappropriate for estimating latency on real hardware, as it does not change the actual precision of the data but merely introduces the concept of lower precision during calibration. Accordingly, we directly convert the data formats of weights and activations into 8-bit representations to measure the latency more accurately. Specifically, we simulate IGQ-ViT for linear operations using Eq. (7), which requires low-bit matrix multiplication between weights and activations within each group, along with the summation of the outputs for each group in full precision. Since PyTorch does not support convolutional or linear layers that take low-bit matrices as input, we have implemented their 8-bit counterparts. Note that we have implemented the group assignment algorithm (i.e., Eq. (5)) in full precision.
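A minimal sketch of this simulated per-group linear operation, assuming symmetric 8-bit quantization and a single weight scale for simplicity (both are simplifications for illustration; the function and parameter names are hypothetical):

```python
import numpy as np

def simulate_group_linear(x, w, groups, act_scales, w_scale):
    """Simulate Eq. (7): low-bit matrix multiplication between weights
    and activations within each channel group, with the per-group
    partial outputs rescaled and summed in full precision.

    x: (tokens, channels) activations; w: (channels, out_features) weights
    groups: (channels,) group index per channel
    act_scales: one activation scale per group; w_scale: weight scale
    """
    out = np.zeros((x.shape[0], w.shape[1]))
    for g, s in enumerate(act_scales):
        idx = np.where(groups == g)[0]
        # 8-bit quantization of this group's activations and weight rows
        xq = np.clip(np.round(x[:, idx] / s), -128, 127).astype(np.int32)
        wq = np.clip(np.round(w[idx] / w_scale), -128, 127).astype(np.int32)
        # integer matmul within the group; rescale and accumulate in FP
        out += (xq @ wq) * (s * w_scale)
    return out
```

With sufficiently fine scales, the output approaches the full-precision product, while each group's matmul touches only 8-bit operands.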

We compare in Fig. B the run-time latency of IGQ-ViT with its variants. We can see that IGQ-ViT introduces marginal overhead compared to layer-wise quantization and a consecutive grouping strategy, while achieving high quantization performance (see Table 5 in the main paper). This suggests that dynamic grouping of channels has a limited impact on actual latency.

D. Application to DETR

We show in Table A the results of quantizing a DETR model with a ResNet-50 [13] backbone on COCO [22]. To the best of our knowledge, PTQ for ViT [26] is the only PTQ method that provides quantization results for a DETR model under the 6/6-bit setting. We can see that IGQ-ViT outperforms it by 0.8% for a group size of 12.

Table A. Results of quantizing a DETR model with a ResNet-50 [13] backbone on COCO [22].

:::info
Authors:

(1) Jaehyeon Moon, Yonsei University and Articron;

(2) Dohyung Kim, Yonsei University;

(3) Junyong Cheon, Yonsei University;

(4) Bumsub Ham, a corresponding author from Yonsei University.

:::


:::info
This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed (Attribution-NonCommercial-NoDerivs 4.0 International) license.

:::
