Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
2026-04-27 • Computation and Language • Machine Learning
AI summary
The authors explore how to improve large language models by combining Transformer components with other sequence-modeling blocks, without retraining from scratch. They introduce HyLo, a method that upcycles pretrained Transformer models to work efficiently with very long text contexts by mixing in efficient attention layers and linear blocks, together with staged training techniques. This approach lets models handle much longer inputs (up to 32 times longer) with far less memory, enabling workloads that were previously too large to run. Their experiments show HyLo outperforms previous methods on both short and long texts, even with much less training data.
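As a rough sketch of the upcycling layout described above (module names, the toy mixers, and the 1-in-4 attention ratio are illustrative assumptions, not HyLo's actual design), a hybrid stack keeps full attention in a few layers and swaps the rest for linear-time blocks that carry no per-token cache:

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Stand-in for an efficient attention layer (MLA in the paper);
    plain multi-head attention here, causal mask omitted for brevity."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class LinearBlock(nn.Module):
    """Stand-in for a linear-time sequence block (Mamba2 / Gated DeltaNet
    in the paper); a causal depthwise convolution as a cheap fixed-state
    token mixer."""
    def __init__(self, d_model: int, kernel: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel - 1, groups=d_model)

    def forward(self, x):
        h = self.norm(x).transpose(1, 2)      # (B, T, D) -> (B, D, T)
        h = self.conv(h)[..., : x.size(1)]    # trim right pad => causal
        return x + h.transpose(1, 2)


class HybridStack(nn.Module):
    """Keep attention every `attn_every` layers, replace the rest with
    linear-time blocks. The ratio is an assumption for illustration."""
    def __init__(self, num_layers: int = 12, d_model: int = 256,
                 attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if i % attn_every == 0
            else LinearBlock(d_model)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)               # (batch, seq, d_model)
    print(HybridStack()(x).shape)             # torch.Size([2, 16, 256])
```

In the paper's recipe the attention layers become MLA and the linear blocks are Mamba2 or Gated DeltaNet; the stand-ins above only mirror the structural split and how it removes most of the per-token cache.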
Transformer models, hybrid sequence models, long-context modeling, Multi-Head Latent Attention (MLA), linear sequence modeling, teacher-guided distillation, KV-cache memory, post-training optimization, Llama, hybrid architectures
Authors
Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum
Abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emph{HyLo} (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90\%$, enabling up to 2M-token prefill and decoding in our \texttt{vLLM} inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, LM-Harness common-sense reasoning, and RULER-64K.
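The abstract credits teacher-guided distillation with stabilizing optimization but does not spell out the objective. A common token-level formulation, sketched here as an assumption (the temperature, mixing weight, and KL direction are illustrative defaults, not values from the paper), pairs the frozen Transformer teacher with the upcycled hybrid student:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Token-level distillation: KL(teacher || student) on temperature-
    softened logits, mixed with cross-entropy on the gold tokens."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (T * T)                                # undo the 1/T gradient scaling
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return alpha * kd + (1 - alpha) * ce


if __name__ == "__main__":
    B, T_len, V = 2, 8, 100                    # batch, sequence, vocab
    student = torch.randn(B, T_len, V, requires_grad=True)
    teacher = torch.randn(B, T_len, V)         # frozen teacher's logits
    gold = torch.randint(V, (B, T_len))
    print(distillation_loss(student, teacher, gold))
```

Because the upcycled student starts from the teacher's own weights, matching the teacher's token distributions gives the new MLA and linear blocks a dense training signal while the model relearns its short-context behavior.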
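The abstract's reported KV-cache reduction of more than $90\%$ is easy to sanity-check with back-of-envelope arithmetic. Every size below (layer count, KV heads, head width, latent width, and the 1-in-4 hybrid ratio) is an illustrative assumption, not the paper's configuration:

```python
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """Full attention: keys + values cached per token, per layer (fp16)."""
    return seq_len * n_layers * 2 * n_kv_heads * head_dim * bytes_per / 2**30


SEQ = 2_000_000                                # 2M-token context

# Hypothetical 3B-class dense Transformer.
full = kv_cache_gib(SEQ, n_layers=28, n_kv_heads=8, head_dim=128)

# Hybrid: 1 in 4 layers keeps attention, now MLA caching one 512-dim
# compressed latent per token; linear layers cache nothing per token.
hybrid = SEQ * (28 // 4) * 512 * 2 / 2**30

print(f"full KV cache: {full:6.1f} GiB")       # ~213.6 GiB
print(f"hybrid cache:  {hybrid:6.1f} GiB")     # ~13.4 GiB, ~94% smaller
```

Under these assumptions the linear layers eliminate their per-token cache entirely and MLA compresses the remaining attention layers' cache into a small latent, which is what makes 2M-token prefill memory-feasible where a dense Llama-style baseline runs out of memory far earlier.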