Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos

Tan, Zhiyu; Wang, Junyan; Yang, Hao; Qin, Luozheng; Chen, Hesen; Zhou, Qiang; Li, Hao


arXiv:2502.21314 (cs)

[Submitted on 28 Feb 2025]



Abstract: Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
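The two-stage shape of the curation pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` type, the score dimension names, the thresholds, the `recaption` callable (a stand-in for a vision-language model), and the `min_words` check are all hypothetical, chosen only to show a coarse quality filter followed by a finer caption-refinement pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Clip:
    """A candidate video clip with hypothetical per-dimension quality scores."""
    path: str
    caption: str
    scores: Dict[str, float] = field(default_factory=dict)


def coarse_filter(clips: List[Clip], thresholds: Dict[str, float]) -> List[Clip]:
    """Coarse stage: keep clips that meet every per-dimension threshold.

    A clip with no score for a required dimension is dropped.
    """
    missing = float("-inf")
    return [
        c for c in clips
        if all(c.scores.get(dim, missing) >= t for dim, t in thresholds.items())
    ]


def fine_refine(
    clips: List[Clip],
    recaption: Callable[[Clip], str],
    min_words: int = 8,
) -> List[Clip]:
    """Fine stage: rewrite each caption and drop clips whose new caption is too short.

    `recaption` stands in for a vision-language model; the word-count check
    is an assumed proxy for semantic richness, not a method from the paper.
    """
    refined = []
    for c in clips:
        new_caption = recaption(c)
        if len(new_caption.split()) >= min_words:
            refined.append(Clip(c.path, new_caption, dict(c.scores)))
    return refined


def curate(
    clips: List[Clip],
    thresholds: Dict[str, float],
    recaption: Callable[[Clip], str],
    min_words: int = 8,
) -> List[Clip]:
    """Run the coarse filter first, then the fine refinement on the survivors."""
    return fine_refine(coarse_filter(clips, thresholds), recaption, min_words)
```

Running the cheap filter first means the expensive captioning model is only called on clips that already pass the quality thresholds, which is one plausible reason for a coarse-to-fine ordering.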

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.21314 [cs.CV]
	(or arXiv:2502.21314v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.21314

Submission history

From: Junyan Wang
[v1] Fri, 28 Feb 2025 18:56:35 UTC (16,614 KB)
