VideoSketcher: Video Model Priors Enable Versatile Sequential Sketch Generation
2026-02-17 • Computer Vision and Pattern Recognition
AI summary
The authors study how to make computers draw sketches step by step, as humans do, instead of producing a finished picture all at once. They use large language models to plan the order of strokes and video diffusion models to render smooth, high-quality drawing visuals. Their method represents sketches as short videos in which strokes appear one after another, and it learns from very few real example drawings. This approach produces detailed, well-ordered sketches and supports flexible features such as different brush styles and interactive drawing. Overall, the work helps AI better mimic the natural process of sequential sketching.
Sketch Generation · Sequential Drawing · Text-to-Video Diffusion Models · Large Language Models · Stroke Ordering · Visual Appearance · Fine-tuning · Synthetic Shape Compositions · Autoregressive Generation · Brush Style Conditioning
Authors
Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker
Abstract
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
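To make the first fine-tuning stage concrete, below is a minimal, hypothetical Python sketch of how one training example for stroke-ordering could be synthesized: a short video in which simple shapes are progressively drawn on a blank canvas in a controlled order, paired with a text instruction describing that order. The shape set, the growth schedule, the caption template, and names such as make_example are illustrative assumptions rendered with Pillow, not the authors' actual pipeline.

```python
# Hypothetical synthetic-data generator for the stroke-ordering stage:
# builds one (frames, caption) pair with controlled temporal structure.
import random
from PIL import Image, ImageDraw

CANVAS = 256
SHAPES = ["circle", "square", "triangle"]  # assumed shape vocabulary

def draw_shape(draw, shape, box):
    """Outline a single shape inside the given bounding box."""
    x0, y0, x1, y1 = box
    if shape == "circle":
        draw.ellipse(box, outline="black", width=3)
    elif shape == "square":
        draw.rectangle(box, outline="black", width=3)
    else:  # triangle, drawn as a closed polyline
        mid = (x0 + x1) / 2
        draw.line([(x0, y1), (x1, y1), (mid, y0), (x0, y1)], fill="black", width=3)

def make_example(num_shapes=3, frames_per_shape=4):
    """Return (frames, caption): a progressive-drawing clip plus the
    ordering instruction that describes it."""
    order = random.sample(SHAPES, num_shapes)
    boxes = []
    for i in range(num_shapes):
        y = random.randint(40, 140)
        boxes.append((20 + i * 75, y, 80 + i * 75, y + 60))

    frames = []
    canvas = Image.new("RGB", (CANVAS, CANVAS), "white")
    for shape, box in zip(order, boxes):
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for t in range(1, frames_per_shape + 1):
            frame = canvas.copy()
            # Grow the shape over several frames to mimic the
            # continuous formation of an individual stroke.
            f = t / frames_per_shape
            grown = (cx - f * (cx - x0), cy - f * (cy - y0),
                     cx + f * (x1 - cx), cy + f * (y1 - cy))
            draw_shape(ImageDraw.Draw(frame), shape, grown)
            frames.append(frame)
        draw_shape(ImageDraw.Draw(canvas), shape, box)  # commit finished shape

    caption = "Draw the " + ", then the ".join(order) + "."
    return frames, caption

if __name__ == "__main__":
    frames, caption = make_example()
    print(caption)  # e.g. "Draw the square, then the circle, then the triangle."
    frames[0].save("order_demo.gif", save_all=True,
                   append_images=frames[1:], duration=120)
```

Each generated pair couples a temporally structured clip with an explicit ordering instruction, which is the property this stage isolates; visual appearance is then distilled separately from the handful of human-drawn sketching processes, as the abstract describes.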