SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation
2026-06-18 • Computer Vision and Pattern Recognition
AI summary
The authors explain that generating images with models that predict one token at a time is slow because these models treat an image as a flat list and ignore its 2D structure. They propose Spatially Speculative Decoding, which uses the 2D layout of the image to predict more than one token per step: the next token in the row and the token directly below it. The method speeds up image generation by up to 13.3x while keeping fidelity high on DPG-Bench and GenEval. The results suggest that exploiting the 2D structure of images can substantially speed up these models.
Autoregressive models, Visual generation, Discrete tokens, 1D sequence, 2D spatial locality, Spatially Speculative Decoding, Inference bottleneck, Image generation, Computational efficiency, High-resolution generative models
Authors
Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao
Abstract
Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
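To make the two prediction targets named in the abstract concrete, the sketch below computes, for a token in a flattened grid, the flat index of its horizontal neighbor and of the token directly below it. It assumes row-major (raster) flattening, which the abstract does not state. The function name `spatial_targets` and the use of `None` at grid boundaries are hypothetical and not taken from the paper. It illustrates index geometry only, not the SSD drafting or verification procedure.

```python
from typing import Optional, Tuple


def spatial_targets(
    index: int, height: int, width: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return flat indices of the right and below neighbors of `index`.

    Assumption: the height x width token grid is flattened in row-major
    (raster) order. None marks a neighbor that falls outside the grid.
    """
    if not 0 <= index < height * width:
        raise ValueError("index outside the token grid")
    row, col = divmod(index, width)
    # Horizontal neighbor: the next token in the same row, if any.
    right = index + 1 if col + 1 < width else None
    # Vertical neighbor: same column, one row down, if any.
    below = index + width if row + 1 < height else None
    return right, below


if __name__ == "__main__":
    # A 2 x 3 grid flattened as [0 1 2 / 3 4 5].
    for i in range(6):
        print(i, spatial_targets(i, height=2, width=3))
```

In this 2 x 3 example, token 0 has targets (1, 3): its horizontal neighbor is one step away in the 1D sequence, while the token below it is a full row width away.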