Universal YOCO for Efficient Depth Scaling

2026-04-01

Computation and Language

AI summary

The authors introduce Universal YOCO (YOCO-U), a method for scaling inference-time compute in large language models more efficiently. It combines two techniques, YOCO's efficient decoder-decoder attention and recursive computation with shared parameters, to deepen model computation without a proportional increase in cost. YOCO-U keeps the KV cache constant in size and reuses shallow efficient-attention layers across iterations, which improves handling of long contexts. Experiments show YOCO-U is competitive on general and long-context benchmarks, suggesting the combination is a promising direction for scaling large language models.

Large Language Models · Transformer · Inference-time Scaling · KV Cache · YOCO Architecture · Recursive Computation · Self-Decoder · Efficient Attention · Token Utility · Long-Context Benchmarks
Authors
Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei
Abstract
The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
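The abstract's core mechanism can be illustrated with a toy sketch: one shallow block is applied recursively with shared parameters (the Universal Self-Decoder), it emits a single global KV cache once, and every cross-decoder layer reads that same cache, so cache size stays constant regardless of depth. This is a minimal numpy illustration of the idea, not the authors' implementation; all function names, shapes, and the tanh/softmax choices are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 8, 4  # toy model dimension and sequence length

# ONE shared weight matrix: parameter sharing means the same block is
# reused for every recursive iteration (hypothetical stand-in for a
# shallow efficient-attention layer).
w_self = rng.standard_normal((d, d)) / np.sqrt(d)

def universal_self_decoder(x, num_iterations=3):
    # Recursive computation: extra representational depth at no extra
    # parameter cost, confined to the shallow self-decoder.
    for _ in range(num_iterations):
        x = np.tanh(x @ w_self)
    return x

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-decoder layers each have their own weights, but all attend to
# the SAME cached representation, so the global KV cache is constant
# in model depth.
w_cross = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2)]

def forward(x):
    h = universal_self_decoder(x)
    kv = h  # global KV cache, produced once by the self-decoder
    for w in w_cross:
        attn = softmax(x @ kv.T) @ kv  # cross-attention to the shared cache
        x = np.tanh(attn @ w)
    return x, kv

x = rng.standard_normal((seq, d))
out, cache = forward(x)
print(out.shape, cache.shape)  # one cache, however many cross layers
```

Note the asymmetry the abstract emphasizes: looping happens only in the cheap self-decoder, while the cross-decoder's cache footprint does not grow with either recursion count or layer count.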