Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
2026-03-23 • Machine Learning
AI summary
The authors improve DoRA, a technique for adapting neural network weights that normally needs a lot of memory because it materializes a large intermediate matrix. They introduce a factored computation that breaks this calculation into much smaller pieces, along with fused GPU kernels that do the work in a single pass. Together, these changes make DoRA use less memory and run faster across a range of GPUs without losing accuracy: their tests show consistent speedups and results nearly identical to the original method.
DoRA, LoRA, low-rank adaptation, weight norm, GPU memory, Triton kernels, vision-language models, matrix factorization, numerical stability, peak VRAM
Authors
Alexandra Zelenin, Alexandra Zhuravlyova
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
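The factored norm in the abstract can be sketched directly: the row-wise squared norm of W + sBA expands into a base term ||w_i||^2, a cross term 2s·w_i·(BA)_i, and a Gram term s^2·||(BA)_i||^2, none of which needs the dense [d_out, d_in] product. The snippet below is an illustrative NumPy sketch of that identity, not the paper's Triton implementation; all shapes and variable names are assumptions chosen for the demo.

```python
import numpy as np

# Illustrative shapes (assumptions, far smaller than the paper's d_in = 8192, r = 384).
rng = np.random.default_rng(0)
d_out, d_in, r, s = 64, 128, 8, 0.5
W = rng.standard_normal((d_out, d_in))   # base weight
B = rng.standard_normal((d_out, r))      # low-rank factor B
A = rng.standard_normal((r, d_in))       # low-rank factor A

# Dense reference: materializes the full [d_out, d_in] product BA.
dense = np.linalg.norm(W + s * (B @ A), axis=1)

# Factored form: only O(d_out * r + r^2) intermediates.
WAt = W @ A.T                                  # [d_out, r]
G = A @ A.T                                    # [r, r] Gram matrix of A
base = np.sum(W * W, axis=1)                   # ||w_i||^2
cross = 2.0 * s * np.sum(WAt * B, axis=1)      # 2 s * w_i . (BA)_i
gram = s * s * np.sum((B @ G) * B, axis=1)     # s^2 * ||(BA)_i||^2
factored = np.sqrt(base + cross + gram)

assert np.allclose(dense, factored)
```

In fp64 the two paths agree to machine precision; the paper's contribution is doing this (plus the downstream rescaling) in a numerically stable fused kernel in bf16, where the naive cross/Gram composition can suffer cancellation.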