Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Li, Bing; Zheng, Cheng; Zhu, Wenxuan; Mai, Jinjie; Zhang, Biao; Wonka, Peter; Ghanem, Bernard

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.08659 (cs)

[Submitted on 12 Jun 2024]

Abstract: While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the incompatibility of the reused layers that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.
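
To make the factorization described in the abstract concrete, the following toy sketch processes a grid of latents first across views (per frame) and then across frames (per view). It is an illustration under stated assumptions, not the authors' architecture: the function names `attend`, `align`, and `factorized_block` are hypothetical, the attention has no learned weights, and `align` is only a placeholder affine map standing in for the alignment modules, whose actual form the abstract does not specify.

```python
import math

def attend(seq):
    """Parameter-free self-attention over a list of equal-length feature vectors."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vec[i] for w, vec in zip(weights, seq)) / total
                    for i in range(dim)])
    return out

def align(vec, scale=1.0, shift=0.0):
    """Hypothetical stand-in for an alignment module: an elementwise affine map."""
    return [scale * x + shift for x in vec]

def factorized_block(latents):
    """latents[v][t] is the feature vector for view v at frame t."""
    n_views, n_frames = len(latents), len(latents[0])
    # Viewpoint-space component: for each frame, mix information across views.
    by_frame = [attend([latents[v][t] for v in range(n_views)])
                for t in range(n_frames)]  # indexed as by_frame[t][v]
    out = []
    for v in range(n_views):
        # Time component: map into the (assumed) temporal latent space,
        # mix across frames of this view, then map back.
        track = [align(by_frame[t][v]) for t in range(n_frames)]
        track = attend(track)
        out.append([align(vec) for vec in track])
    return out

if __name__ == "__main__":
    # 2 views, 3 frames, 2-dimensional features (arbitrary toy values).
    latents = [[[float(v + t), 1.0] for t in range(3)] for v in range(2)]
    result = factorized_block(latents)
    print(len(result), len(result[0]), len(result[0][0]))
```

The point of the sketch is only the ordering of operations: each step attends along one axis of the view-by-frame grid, so layers specialized for multi-view consistency and layers specialized for temporal coherence can be applied separately, with a mapping between their latent spaces in between.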

Comments:	Our project page is at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.08659 [cs.CV]
	(or arXiv:2406.08659v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.08659

Submission history

From: Bing Li
[v1] Wed, 12 Jun 2024 21:44:04 UTC (10,838 KB)
