HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers
2026-03-12 • Computer Vision and Pattern Recognition • Machine Learning
AI summary
The authors address the problem of making Vision Transformers run efficiently on devices with limited power and memory. They create a new method called Hierarchical Auto-Pruning (HiAP) that automatically finds smaller, faster versions of these models during one training step, without needing manual rules or multiple complicated stages. HiAP uses special gates to remove unimportant parts at both large scales (like entire attention heads) and small scales (like individual neurons). Their approach balances reducing memory use and computation, leading to efficient models with good accuracy on ImageNet, comparable to more complex pruning methods but simpler to use.
Keywords: Vision Transformers, structured pruning, sparsity, Gumbel-Sigmoid gates, attention heads, feed-forward network (FFN), FLOPs, ImageNet, end-to-end training
Authors
Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis
Abstract
Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates that prune entire attention heads and FFN blocks, and micro-gates that selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large weight matrices and the compute-bound cost of the arithmetic itself. HiAP naturally converges to stable sub-networks using a loss function that combines structural feasibility penalties with an analytical FLOPs term. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
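The two ingredients named in the abstract, a stochastic Gumbel-Sigmoid gate and an analytical FLOPs term over hierarchical (macro head / micro dimension) gates, can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names, the per-head/per-dimension cost model, and the use of plain Python instead of an autodiff framework are all illustrative.

```python
import math
import random

def gumbel_sigmoid(logit, tau=1.0, hard=False, rng=random):
    """Binary-Concrete / Gumbel-Sigmoid sample for one pruning gate.

    logit: learnable gate parameter (higher -> more likely kept).
    tau:   temperature; smaller values sharpen the relaxation toward {0, 1}.
    hard:  if True, discretize the gate (e.g. for the final pruned network).
    """
    u = rng.random()
    noise = math.log(u) - math.log(1.0 - u)            # Logistic(0, 1) noise
    y_soft = 1.0 / (1.0 + math.exp(-(logit + noise) / tau))
    return float(y_soft > 0.5) if hard else y_soft

def expected_flops_penalty(head_logits, dim_logits,
                           flops_per_head, flops_per_dim):
    """Analytical expected-FLOPs estimate from gate keep-probabilities.

    Macro gates (heads) multiply the cost of their micro gates (intra-head
    dimensions): a dimension only contributes FLOPs if its head survives.
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    total = 0.0
    for h_logit, dims in zip(head_logits, dim_logits):
        p_head = sigmoid(h_logit)
        total += p_head * flops_per_head
        total += p_head * sum(sigmoid(d) * flops_per_dim for d in dims)
    return total
```

In a training loop of this shape, the penalty would be scaled by a budget coefficient and added to the task loss, so that gates on low-importance heads and dimensions are pushed toward zero while the relaxation stays differentiable.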