PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

2026-06-04 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors introduce a preconditioning (PC) layer, a reparameterization of the weight matrices in large language models that keeps the weights well conditioned throughout training. The layer reshapes each weight matrix's singular-value spectrum with a low-degree polynomial, and after training it can be merged back into the original architecture, so inference is no slower. In Llama-1B pre-training, the method shows an advantage over standard transformers with both the AdamW and Muon optimizers. The authors also prove that, for certain deep linear networks, uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima.

preconditioning, weight parameterization, singular-value spectrum, polynomial preconditioner, LLM training, transformers, AdamW optimizer, Muon optimizer, gradient descent, geometric convergence

Authors

Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun

Abstract

We propose a preconditioning (PC) layer, a weight parameterization based on a polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.
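To make the spectrum-reshaping idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state which polynomial is used, so the function name `odd_poly_precondition` and the coefficients (1.5, -0.5), borrowed from a Newton-Schulz step, are assumptions chosen for illustration. The sketch relies only on a standard fact: for W = U Σ Vᵀ, an odd matrix polynomial c₀W + c₁(WWᵀ)W keeps U and V and maps each singular value σ to c₀σ + c₁σ³.

```python
import numpy as np

def odd_poly_precondition(W, coeffs):
    """Return sum_k coeffs[k] * (W W^T)^k W.

    Hypothetical helper: the singular vectors of W are unchanged and each
    singular value s is mapped to sum_k coeffs[k] * s**(2k + 1).
    """
    gram = W @ W.T
    term = W.copy()            # holds (W W^T)^k W, starting at k = 0
    out = np.zeros_like(W)
    for c in coeffs:
        out += c * term
        term = gram @ term
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W /= np.linalg.norm(W, 2)      # scale so the largest singular value is 1

# Assumed coefficients (not from the paper): p(s) = 1.5 s - 0.5 s^3,
# which is increasing on [0, 1] and lifts small singular values toward 1.
coeffs = (1.5, -0.5)
W_eff = odd_poly_precondition(W, coeffs)

s_before = np.linalg.svd(W, compute_uv=False)
s_after = np.linalg.svd(W_eff, compute_uv=False)
print("condition number before:", s_before[0] / s_before[-1])
print("condition number after: ", s_after[0] / s_after[-1])

# "Merging": W_eff is an ordinary matrix, so it can be stored once after
# training and used like any plain linear layer at inference time.
x = rng.standard_normal((3, 6))
y = x @ W_eff.T
```

With this choice of polynomial the ratio of largest to smallest singular value shrinks, which is the kind of conditioning control the abstract describes. During training one would presumably differentiate through the polynomial with respect to the underlying W; that part is omitted here.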
