Video Depth without Video Models

Ke, Bingxin; Narnhofer, Dominik; Huang, Shengyu; Ke, Lei; Peters, Torben; Fragkiadaki, Katerina; Obukhov, Anton; Schindler, Konrad

arXiv:2411.19189 (cs)

[Submitted on 28 Nov 2024 (v1), last revised 17 Mar 2025 (this version, v2)]


Abstract: Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: this http URL.
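The two ingredients named in the abstract can be illustrated with a small sketch. This is not the RollingDepth implementation: the abstract does not give the registration objective, so the code below assumes each snippet's depth is known only up to a scale and a shift, and it substitutes ordinary least squares for the robust optimization the authors describe. The function names, the dilation values, and the flattened-array layout are all hypothetical choices made for illustration.

```python
import numpy as np
from itertools import combinations


def sample_snippets(num_frames, snippet_len=3, dilations=(1, 2)):
    """Return tuples of frame indices; each dilation acts as a different frame rate."""
    snippets = []
    for gap in dilations:
        span = (snippet_len - 1) * gap
        for start in range(num_frames - span):
            snippets.append(tuple(start + k * gap for k in range(snippet_len)))
    return snippets


def register_snippets(snippets, snippet_depths):
    """Find per-snippet scale and shift so that overlapping frames agree.

    snippet_depths[i][k] is the flattened depth map that snippet i predicts
    for frame snippets[i][k]. Snippet 0 is held fixed (scale 1, shift 0) to
    remove the global ambiguity. Assumption: plain least squares, not the
    robust objective mentioned in the abstract.
    """
    n = len(snippets)
    rows = []
    for i, j in combinations(range(n), 2):
        for f in set(snippets[i]) & set(snippets[j]):
            di = snippet_depths[i][snippets[i].index(f)]
            dj = snippet_depths[j][snippets[j].index(f)]
            # Residual per pixel: (s_i * di + t_i) - (s_j * dj + t_j) = 0
            block = np.zeros((di.size, 2 * n))
            block[:, 2 * i] = di
            block[:, 2 * i + 1] = 1.0
            block[:, 2 * j] = -dj
            block[:, 2 * j + 1] = -1.0
            rows.append(block)
    A = np.vstack(rows)
    # Move the fixed snippet-0 terms (s_0 = 1, t_0 = 0) to the right-hand side.
    b = -A[:, 0]
    params, *_ = np.linalg.lstsq(A[:, 2:], b, rcond=None)
    params = np.concatenate([[1.0, 0.0], params])
    return params[0::2], params[1::2]


def merge_snippets(num_frames, snippets, snippet_depths, scales, shifts):
    """Average the aligned snippet predictions that cover each frame."""
    num_pixels = snippet_depths[0][0].size
    total = np.zeros((num_frames, num_pixels))
    count = np.zeros((num_frames, 1))
    for i, frames in enumerate(snippets):
        for k, f in enumerate(frames):
            total[f] += scales[i] * snippet_depths[i][k] + shifts[i]
            count[f] += 1
    return total / count
```

With six frames and dilations (1, 2), `sample_snippets` yields four consecutive triplets and two triplets that skip every other frame; the shared frames between them are what tie all snippets to a common scale and shift. The sketch requires that every frame is covered by at least one snippet and that the snippets are connected through shared frames.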

Comments:	Project page: this http URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2411.19189 [cs.CV]
	(or arXiv:2411.19189v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.19189

Submission history

From: Bingxin Ke [view email]
[v1] Thu, 28 Nov 2024 14:50:14 UTC (14,809 KB)
[v2] Mon, 17 Mar 2025 12:43:52 UTC (14,812 KB)
