MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Sun, Mingzhen; Wang, Weining; Zhu, Xinxin; Liu, Jing

arXiv:2303.03684 (cs)

[Submitted on 7 Mar 2023 (v1), last revised 16 Mar 2023 (this version, v2)]

Abstract: Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.
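
The idea of representing each component as its own group of discrete tokens can be illustrated with a toy vector-quantization sketch. This is not the authors' MOSO-VQVAE: the function names (`quantize`, `tokenize_clip`), the 2-D feature vectors, and the codebooks below are all hypothetical stand-ins, and the sketch assumes a generic VQ-VAE-style nearest-neighbour lookup with one codebook per component.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def tokenize_clip(
    components: Dict[str, List[Vector]],
    codebooks: Dict[str, List[Vector]],
) -> Dict[str, List[int]]:
    """Quantize each component with its own codebook (assumed design),
    producing one distinct group of discrete tokens per component."""
    return {name: quantize(feats, codebooks[name])
            for name, feats in components.items()}


# Toy features standing in for encoder outputs (hypothetical values).
components = {
    "motion": [[0.9, 0.1], [0.1, 0.8]],
    "scene": [[0.0, 0.0]],
    "object": [[1.0, 1.0], [0.2, 0.1]],
}
# Toy codebooks with two entries each (hypothetical values).
codebooks = {
    "motion": [[0.0, 1.0], [1.0, 0.0]],
    "scene": [[0.0, 0.0], [1.0, 1.0]],
    "object": [[0.0, 0.0], [1.0, 1.0]],
}

tokens = tokenize_clip(components, codebooks)
print(tokens)
```

In a two-stage setup of the kind the abstract describes, token groups like these would be what a second-stage sequence model consumes and predicts; how MOSO actually encodes, quantizes, and merges the components is specified in the paper itself, not in this sketch.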

Comments:	Accepted by CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.03684 [cs.CV]
	(or arXiv:2303.03684v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.03684

Submission history

From: Mingzhen Sun
[v1] Tue, 7 Mar 2023 06:54:48 UTC (2,645 KB)
[v2] Thu, 16 Mar 2023 08:41:44 UTC (8,546 KB)
