TransText: Transparency Aware Image-to-Video Typography Animation
2026-03-18 • Computer Vision and Pattern Recognition
AI summary
The authors developed a new method called TransText to create animations of text with transparent parts (like letters with see-through areas) more easily. Instead of changing complex parts of image models, they treat transparency as another kind of color, letting existing models handle it without messing up their learned knowledge. This way, their method produces clearer and more detailed animated text effects than previous attempts. Their tests show it works better at keeping the text's look and transparency consistent in videos.
image-to-video models · glyph animation · alpha channel · variational autoencoder (VAE) · latent space · transparency encoding · cross-modal consistency · RGB color space · visual generative models · latent spatial concatenation
Authors
Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel
Abstract
We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly encode transparency (the alpha channel) as an extra latent dimension appended to the RGB space, which necessitates retraining the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm that jointly models appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly enforcing strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
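To make the Alpha-as-RGB idea concrete, the following is a minimal sketch of the two operations the abstract names: replicating the alpha channel into a three-channel, RGB-compatible image, and concatenating the RGB and alpha signals spatially so a single frozen RGB model processes both on one canvas. All function names and shapes here are hypothetical illustrations, not the authors' implementation, and the concatenation is shown on raw pixels rather than true VAE latents for simplicity.

```python
import numpy as np

def alpha_as_rgb(rgba: np.ndarray):
    """Split an RGBA image (H, W, 4) into an RGB image and an
    RGB-compatible alpha image (alpha replicated across 3 channels),
    so both can pass through an unmodified RGB encoder."""
    rgb = rgba[..., :3]
    alpha = np.repeat(rgba[..., 3:4], 3, axis=-1)
    return rgb, alpha

def spatial_concat(rgb_signal: np.ndarray, alpha_signal: np.ndarray):
    """Concatenate along the width axis: one spatial canvas carries
    both appearance and transparency, keeping them aligned frame by
    frame without adding a latent dimension."""
    return np.concatenate([rgb_signal, alpha_signal], axis=1)

# Toy example: a 4x4 "glyph" that is opaque in the top-left corner.
rgba = np.zeros((4, 4, 4), dtype=np.float32)
rgba[..., :3] = 0.8          # uniform light glyph color
rgba[:2, :2, 3] = 1.0        # opaque region; everywhere else transparent

rgb, alpha = alpha_as_rgb(rgba)
canvas = spatial_concat(rgb, alpha)
print(canvas.shape)          # (4, 8, 3): RGB on the left half, alpha on the right
```

In the actual framework, the concatenation would happen on VAE latents rather than pixels, but the structural point is the same: transparency travels through the pipeline as just another RGB-like signal, so the pre-trained generative manifold is never modified.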