Fine-grained Video-Text Retrieval: A New Benchmark and Method

Xu, Yifan; Li, Xinhao; Yang, Yichun; Huang, Rui; Wang, Limin

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.00513 (cs)

[Submitted on 31 Dec 2024]

Abstract: The ability to perceive fine-grained spatial and temporal information is crucial for video-language retrieval. However, existing video retrieval benchmarks, such as MSRVTT and MSVD, fail to effectively evaluate the fine-grained retrieval ability of video-language models (VLMs) because they lack detailed annotations. To address this problem, we present FIBER, a FIne-grained BEnchmark for text-to-video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, FIBER provides detailed human-annotated spatial and temporal annotations for each video, making it possible to independently evaluate the spatial and temporal bias of VLMs on the video retrieval task. In addition, we employ a text embedding method to unlock the fine-grained video-language understanding capability of Multimodal Large Language Models (MLLMs). Surprisingly, the experimental results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks and has a stronger capability for fine-grained representation with lower spatial-temporal bias. Project page: this https URL.
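
For readers unfamiliar with embedding-based text-to-video retrieval, the following is a minimal sketch of how such a system is commonly scored. It is not the paper's evaluation code: the abstract does not name its metrics, so Recall@K, cosine similarity, the function names, and the toy three-dimensional embeddings below are all assumptions chosen for illustration. Caption i is assumed to be paired with video i.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(text_emb: Sequence[float],
                video_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs: Sequence[Sequence[float]],
                video_embs: Sequence[Sequence[float]],
                k: int) -> float:
    """Fraction of captions whose paired video (same index) ranks in the top k."""
    hits = sum(
        1
        for i, text_emb in enumerate(text_embs)
        if i in rank_videos(text_emb, video_embs)[:k]
    )
    return hits / len(text_embs)


# Hypothetical toy embeddings: three videos and their three paired captions.
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"R@{k} = {recall_at_k(text_embs, video_embs, k):.3f}")
```

In this toy setup the second caption sits slightly closer to the first video than to its own, so it counts as a miss at K=1 and a hit at K=2. A benchmark with separate spatial and temporal annotations could apply the same ranking procedure to each annotation type independently, though how FIBER does so is not specified in the abstract.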

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00513 [cs.CV]
	(or arXiv:2501.00513v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.00513

Submission history

From: Yifan Xu
[v1] Tue, 31 Dec 2024 15:53:50 UTC (3,669 KB)
