VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Saravanan, Darshana; Singh, Darshan; Gupta, Varun; Khan, Zeeshan; Gandhi, Vineet; Tapaswi, Makarand

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.10889 (cs)

[Submitted on 16 Jun 2024]


Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video-language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

Comments:	26 pages, 17 figures, 3 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.10889 [cs.CV]
	(or arXiv:2406.10889v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.10889

Submission history

From: Varun Gupta
[v1] Sun, 16 Jun 2024 10:42:21 UTC (4,457 KB)
