StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Zhai, Shangjin; Ye, Zhichao; Liu, Jialin; Xie, Weijian; Hu, Jiaqi; Peng, Zhen; Xue, Hua; Chen, Danpeng; Wang, Xiaomeng; Yang, Lei; Wang, Nan; Liu, Haomin; Zhang, Guofeng

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.05763 (cs)

[Submitted on 10 Jan 2025]

Title:StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Authors:Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang

View PDF HTML (experimental)

Abstract:Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.05763 [cs.CV]
	(or arXiv:2501.05763v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05763

Submission history

From: Nan Wang [view email]
[v1] Fri, 10 Jan 2025 07:41:47 UTC (24,780 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators