S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
2026-04-01 • Computation and Language • Machine Learning
AI summary
The authors introduce a new tuning method called S0 tuning that adjusts a single initial state matrix in each recurrent layer of a language model, without changing the model's weights or adding extra computation when the model runs. They tested this method on several models and found it improves performance on coding and math tasks, sometimes more than existing methods like LoRA. The tuning file is small and easy to switch without reloading or merging model weights. However, the method does not improve performance on unrelated tasks like text-to-SQL. Overall, the authors show that tuning recurrent state initialization is an efficient way to adapt hybrid language models when labeled data is limited.
S0 tuning · recurrent state initialization · HumanEval · LoRA · greedy pass@1 · prefix-tuning · hybrid language models · zero-inference-overhead · parameter-efficient fine-tuning (PEFT) · cross-domain transfer
Authors
Jack Young
Abstract
With roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, at zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (a GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 ± 1.7 pp (10 seeds). On FalconH1-7B (a Mamba-2 hybrid), S0 tuning reaches 71.8 ± 1.3% and LoRA reaches 71.4 ± 2.4% (3 seeds), statistically indistinguishable at this sample size, while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp across all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 tuning and LoRA, but at a per-step inference cost. Taken together, these results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
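To make the recipe concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of S0 tuning on a toy linear recurrence: every weight matrix stays frozen, and gradient descent updates only the initial state s0. The model, shapes, target, and training loop are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of S0 tuning on a toy linear recurrence
#   s_{t+1} = A s_t + B x_t,   y = C s_T.
# All weights (A, B, C) are frozen; only the initial state s0 is trained.
rng = np.random.default_rng(0)
d, T = 4, 5
A = 0.5 * rng.standard_normal((d, d))   # frozen recurrent transition
B = rng.standard_normal((d, d))         # frozen input projection
C = rng.standard_normal((1, d))         # frozen readout
x = rng.standard_normal((T, d))         # a fixed "training" input sequence
y_target = 1.0                          # a verified target for this sequence

def forward(s0):
    s = s0
    for t in range(T):
        s = A @ s + B @ x[t]            # frozen recurrence, tuned start state
    return (C @ s).item()

# The output is linear in s0, so its Jacobian with respect to s0 is
# C @ A^T (the T-fold matrix power), giving a closed-form gradient of
# the squared error. A real S0-tuning run would use autograd instead.
g = (C @ np.linalg.matrix_power(A, T)).ravel()
lr = 0.5 / (g @ g)                      # step size scaled for stable descent

s0 = np.zeros(d)                        # the only trainable parameter
for _ in range(50):
    err = forward(s0) - y_target
    s0 -= lr * err * g                  # gradient step on s0 alone

# After tuning, the frozen model fits the target; A, B, C are untouched,
# and s0 (the analogue of the ~48 MB state file) is all that changed.
```

In a real hybrid model the tuned object would be one state matrix per recurrent layer, so switching tasks amounts to loading a different set of initial states, with no weight merging.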