VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Couairon, Paul; Rambour, Clément; Haugeard, Jean-Emmanuel; Thome, Nicolas


arXiv:2306.08707 (cs)

[Submitted on 14 Jun 2023 (v1), last revised 2 Apr 2024 (this version, v4)]

Abstract: Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, existing diffusion-based video editing approaches cannot offer precise control over generated content while maintaining temporal consistency in long videos. Atlas-based methods, on the other hand, provide strong temporal consistency but are costly to use for editing and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method that fulfills temporal smoothness by design. To grant precise user control over generated content, we use conditional information extracted from off-the-shelf panoptic segmenters and edge detectors to guide the diffusion sampling process. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset on semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and multiple compatible edits can be generated from a single text prompt. Project web-page at this https URL
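
To make the atlas idea in the abstract concrete, here is a toy NumPy sketch of why editing a shared atlas gives temporally consistent frames by construction. It is not the authors' implementation: the integer UV lookup, the hand-drawn rectangular mask (standing in for a panoptic segmentation mask), and the color-inversion `edit_fn` (standing in for a conditioned diffusion edit) are all simplifying assumptions, and every function name is hypothetical.

```python
import numpy as np

def render_frames(atlas, uv):
    """Sample an atlas (Ha, Wa, 3) at integer coords uv (T, H, W, 2) -> frames (T, H, W, 3)."""
    return atlas[uv[..., 0], uv[..., 1]]

def edit_atlas(atlas, mask, edit_fn):
    """Apply edit_fn to the atlas and keep its result only inside the boolean mask."""
    edited = edit_fn(atlas)
    return np.where(mask[..., None], edited, atlas)

def make_toy_video(num_frames=4, size=8, atlas_size=16):
    """Build a random atlas and a UV map in which frame t views it shifted by t columns."""
    rng = np.random.default_rng(0)
    atlas = rng.random((atlas_size, atlas_size, 3))
    rows, cols = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    # Toy camera pan: each frame looks at the atlas through a window moved by t pixels.
    uv = np.stack([np.stack([rows, cols + t], axis=-1) for t in range(num_frames)])
    return atlas, uv

atlas, uv = make_toy_video()

# Assumed region of interest in atlas space (stand-in for a segmentation mask).
mask = np.zeros(atlas.shape[:2], dtype=bool)
mask[2:6, 4:9] = True

# Assumed edit: invert colors (stand-in for a text-conditioned diffusion edit).
edited_atlas = edit_atlas(atlas, mask, lambda a: 1.0 - a)

original = render_frames(atlas, uv)
edited = render_frames(edited_atlas, uv)

# The mask as seen from each frame, obtained with the same UV lookup.
frame_mask = mask[uv[..., 0], uv[..., 1]]
```

Because every frame reads from the same edited atlas, a given atlas point has the same color in all frames that see it, and pixels mapped outside the mask are left untouched. In this sketch the edit runs once on the atlas rather than once per frame, which is the general reason atlas editing can be cheap at edit time.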

Comments:	TMLR 2024. Project web-page at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08707 [cs.CV]
	(or arXiv:2306.08707v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.08707

Submission history

From: Paul Couairon [view email]
[v1] Wed, 14 Jun 2023 19:15:49 UTC (44,279 KB)
[v2] Fri, 8 Dec 2023 15:37:48 UTC (40,518 KB)
[v3] Fri, 15 Dec 2023 23:54:57 UTC (44,280 KB)
[v4] Tue, 2 Apr 2024 11:08:12 UTC (44,477 KB)
