Fast Byte Latent Transformer

2026-05-08

Computation and Language · Artificial Intelligence · Machine Learning
AI summary

The authors address the slow text generation of byte-level language models. They develop BLT Diffusion, a model that produces many bytes in parallel rather than one at a time, substantially speeding up decoding. They also introduce two speculative-decoding-style methods that trade some of this speed for higher generation quality. Together, these techniques cut memory-bandwidth costs and make byte-level models more practical for real use.

Keywords
byte-level language model, autoregressive generation, diffusion model, speculative decoding, next-byte prediction, parallel decoding, memory bandwidth, transformer, model verification
Authors
Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer
Abstract
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
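The draft-then-verify idea behind BLT Self-speculation can be illustrated with a toy sketch of generic byte-level speculative decoding: a cheap drafter proposes a short run of bytes, and a single "full-model" pass verifies them all at once, accepting the longest agreeing prefix. The functions `draft_next` and `true_next` below are invented stand-ins (not the paper's models or API), chosen only so the drafter agrees with the verifier most of the time; the point is that the number of full-model passes falls well below the number of bytes generated.

```python
K = 4  # draft length per verification round (illustrative choice)

def draft_next(b):
    # Toy cheap drafter: guesses the next byte as prev + 1 (mod 256).
    return (b + 1) % 256

def true_next(b):
    # Toy "full model": agrees with the drafter except when the previous
    # byte is a multiple of 5, simulating occasional disagreement.
    return (b + 2) % 256 if b % 5 == 0 else (b + 1) % 256

def speculative_generate(prefix, n_bytes, k=K):
    """Generate n_bytes after prefix; return (output, full_model_passes)."""
    out = list(prefix)
    full_passes = 0
    while len(out) - len(prefix) < n_bytes:
        # Draft k bytes autoregressively with the cheap model.
        draft, last = [], out[-1]
        for _ in range(k):
            last = draft_next(last)
            draft.append(last)
        # One full-model pass scores all k draft positions in parallel.
        full_passes += 1
        last = out[-1]
        for d in draft:
            t = true_next(last)
            out.append(t)  # always emit the verified byte
            last = t
            if t != d:
                break      # stop at the first disagreement
    return bytes(out[len(prefix):len(prefix) + n_bytes]), full_passes

generated, passes = speculative_generate(b"\x00", 8)
print(list(generated), passes)  # 8 bytes in 3 full-model passes, not 8
```

Here 8 bytes cost only 3 full-model forward passes; pure byte-by-byte autoregression would need 8. BLT-DV applies the same accept-or-correct verification after diffusion-based drafting instead of an autoregressive drafter.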