ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

Athar, Ali; Deng, Xueqing; Chen, Liang-Chieh

arXiv:2412.09754 (cs)

[Submitted on 12 Dec 2024 (v1), last revised 17 Dec 2024 (this version, v2)]

Abstract: Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. The project page is at this https URL

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.09754 [cs.CV]
	(or arXiv:2412.09754v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.09754

Submission history

From: Ali Athar
[v1] Thu, 12 Dec 2024 23:10:54 UTC (1,659 KB)
[v2] Tue, 17 Dec 2024 21:14:50 UTC (1,663 KB)
