Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Mahapatra, Aniruddha; Mai, Long; Zhang, Yitian; Bourgin, David; Liu, Feng

arXiv:2501.05442 (cs)

[Submitted on 9 Jan 2025]

Abstract: Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.
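
The token-budget arithmetic behind the abstract's comparison can be sketched as follows. This is an illustrative calculation only, not the authors' code: the 64-frame 256x256 clip, the 8x spatial compression factor, the plain `ceil(T / r)` temporal downsampling rule, and the helper names `latent_frames`, `token_count`, and `subsample` are all assumptions chosen for the example. It also assumes the comparison is made at a matched latent size, which the abstract does not state explicitly.

```python
import math


def latent_frames(num_frames: int, temporal_ratio: int) -> int:
    """Latent frame count, assuming plain ceil(T / r) temporal downsampling."""
    return math.ceil(num_frames / temporal_ratio)


def token_count(num_frames: int, height: int, width: int,
                temporal_ratio: int, spatial_ratio: int = 8) -> int:
    """Latent positions for one clip (spatial_ratio=8 is an assumed value)."""
    t = latent_frames(num_frames, temporal_ratio)
    h = math.ceil(height / spatial_ratio)
    w = math.ceil(width / spatial_ratio)
    return t * h * w


def subsample(frame_indices: list, stride: int) -> list:
    """Keep every `stride`-th frame (temporal subsampling)."""
    return frame_indices[::stride]


frames = list(range(64))  # hypothetical 64-frame clip at 256x256

# Baseline: 4x temporal tokenizer on the full clip.
baseline_4x = token_count(len(frames), 256, 256, temporal_ratio=4)

# Path A: subsample frames by 4, then apply the 4x temporal tokenizer.
path_a = token_count(len(subsample(frames, 4)), 256, 256, temporal_ratio=4)

# Path B: a 16x temporal tokenizer applied directly to the full clip.
path_b = token_count(len(frames), 256, 256, temporal_ratio=16)

print(baseline_4x, path_a, path_b)
```

Under these assumptions, paths A and B produce the same number of latent positions, one quarter of the 4x baseline. Only path B sees every input frame, so the sketch shows the budget at which the two encoders are compared rather than anything about their reconstruction quality.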

Comments:	Project website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.05442 [cs.CV]
	(or arXiv:2501.05442v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05442

Submission history

From: Aniruddha Mahapatra
[v1] Thu, 9 Jan 2025 18:55:15 UTC (3,770 KB)
