VidTwin: Video VAE with Decoupled Structure and Dynamics

Wang, Yuchi; Guo, Junliang; Xie, Xinyi; He, Tianyu; Sun, Xu; Bian, Jiang

arXiv:2412.17726 (cs)

[Submitted on 23 Dec 2024]

Abstract: Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at this https URL.
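
As a rough illustration of two ideas in the abstract, the sketch below averages a latent over its spatial dimensions and computes a compression rate. It is not the released VidTwin code. It assumes a (T, H, W, C) latent layout, averages over both H and W at once, and assumes that "compression rate" means total latent elements divided by input video elements; the abstract does not spell out any of these. The function names and all shapes are hypothetical.

```python
from math import prod


def spatial_mean(latent):
    """Average a nested-list latent of shape (T, H, W, C) over H and W.

    Returns a (T, C) nested list: one channel vector per frame.
    Assumption: both spatial axes are averaged together.
    """
    out = []
    for frame in latent:
        h, w, c = len(frame), len(frame[0]), len(frame[0][0])
        sums = [0.0] * c
        for row in frame:
            for vec in row:
                for k, value in enumerate(vec):
                    sums[k] += value
        out.append([s / (h * w) for s in sums])
    return out


def compression_rate(video_shape, latent_shapes):
    """Ratio of total latent elements to input video elements.

    Assumption: this element-count ratio is what "compression rate" means.
    """
    latent_elems = sum(prod(shape) for shape in latent_shapes)
    return latent_elems / prod(video_shape)


# Hypothetical toy shapes, chosen only to exercise the functions.
toy_latent = [[[[1.0], [2.0]], [[3.0], [4.0]]]]  # T=1, H=2, W=2, C=1
dynamics = spatial_mean(toy_latent)
rate = compression_rate((8, 32, 32, 3), [(2, 4, 4, 4), (8, 4)])
print(dynamics, f"{rate:.4%}")
```

Under these assumptions, spatial averaging discards where something happens in a frame while keeping one value per channel at every time step, which is consistent with the abstract's description of a compact latent for rapid motion.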

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.17726 [cs.CV]
	(or arXiv:2412.17726v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.17726

Submission history

From: Yuchi Wang
[v1] Mon, 23 Dec 2024 17:16:58 UTC (3,044 KB)