OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

2026-05-25 | Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study how to run big AI models like language and vision transformers on devices with limited memory and slow math operations. They focus on an efficient method called Power-of-Two quantization that uses bit-shifts instead of multiplications but faces problems when using very few bits, causing loss of detail. To fix this, they introduce a new approach called Orthogonal Residual Projection (ORP) that improves the quantization accuracy through clever math operations using simple shift-and-add steps. Their method speeds up model calibration and works well on real hardware, showing good performance even with very low bit precision. This makes running large models faster and more efficient on small devices.

Large Language Models, Vision Transformers, quantization, Power-of-Two quantization, Multiply-Accumulate (MAC), bit-shifts, Orthogonal Residual Projection, model calibration, perplexity, RTL synthesis
Authors
Maoyang Xiang, Bo Wang, Tao Luo
Abstract
The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28 nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.
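
To make the shift-and-add idea in the abstract concrete, the following is a minimal sketch of generic Power-of-Two quantization with one residual term. It is not the authors' ORP solver or its dual-basis projection: it uses a plain greedy scheme (quantize the weight, then quantize the leftover error), and every name and constant in it (`pot_quantize`, `two_term_pot`, `shift_add_dot`, `MIN_EXP`, `MAX_EXP`) is a hypothetical choice made for illustration.

```python
import math

MIN_EXP = -8  # smallest exponent kept (hypothetical choice, not from the paper)
MAX_EXP = 0   # largest exponent kept (hypothetical choice, not from the paper)


def pot_quantize(w):
    """Round w to a signed power of two in the log domain.

    Returns (sign, exp) with sign in {-1, 0, 1}; sign 0 encodes a zero term.
    """
    if w == 0.0:
        return 0, MIN_EXP
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(MIN_EXP, min(MAX_EXP, exp))  # clamp to the representable range
    return sign, exp


def pot_value(term):
    """Real value represented by a (sign, exp) term."""
    sign, exp = term
    return sign * 2.0 ** exp


def two_term_pot(w):
    """Greedy two-term approximation: w ~ s1 * 2**e1 + s2 * 2**e2."""
    first = pot_quantize(w)
    residual = w - pot_value(first)
    second = pot_quantize(residual)
    # Drop the second term when clamping would make the error worse.
    if abs(residual - pot_value(second)) >= abs(residual):
        second = (0, MIN_EXP)
    return first, second


def shift_add_dot(xs, ws):
    """Dot product of integer activations xs with two-term PoT weights.

    Uses only left shifts and adds/subtracts on a fixed-point accumulator
    whose unit is 2**MIN_EXP; the final scaling converts back to a float.
    """
    acc = 0
    for x, w in zip(xs, ws):
        for sign, exp in two_term_pot(w):
            shifted = x << (exp - MIN_EXP)  # x * 2**exp in accumulator units
            if sign > 0:
                acc += shifted
            elif sign < 0:
                acc -= shifted
    return acc * 2.0 ** MIN_EXP
```

As a usage example under these assumptions, the weight 0.375 rounds to 0.5 with a single term, and the residual term recovers it exactly as 0.5 - 0.125, so `two_term_pot(0.375)` returns `((1, -1), (-1, -3))`. The second term is what adds resolution between neighbouring powers of two, which is the general effect the abstract attributes to a residual lattice.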