VidLA: Video-Language Alignment at Scale

Rizve, Mamshad Nayeem; Fei, Fan; Unnikrishnan, Jayakrishnan; Tran, Son; Yao, Benjamin Z.; Zeng, Belinda; Shah, Mubarak; Chilimbi, Trishul

arXiv:2403.14870 (cs)

[Submitted on 21 Mar 2024]

Abstract: In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome this, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets, which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
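As a rough illustration of how a two-tower model is typically used for the retrieval benchmarks mentioned in the abstract, the sketch below ranks videos against a text query by cosine similarity between independently computed embeddings. It is not the authors' implementation: the function names (`l2_normalize`, `cosine_similarity`, `rank_videos`) are hypothetical, the embeddings are hand-written toy vectors standing in for the outputs of the video and text towers, and the abstract does not state which similarity measure VidLA uses, so cosine similarity is an assumption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed values): in a real system each vector would come
# from the video tower or the text tower, computed independently.
video_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
text_embedding = [0.1, 0.9, 0.0]

ranking = rank_videos(text_embedding, video_embeddings)
print(ranking)
```

Because the two towers never exchange information before the similarity step, video embeddings can be computed once and reused for every text query, which is the usual practical argument for this design in retrieval.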

Comments:	Accepted to CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2403.14870 [cs.CV]
	(or arXiv:2403.14870v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.14870

Submission history

From: Mamshad Nayeem Rizve
[v1] Thu, 21 Mar 2024 22:36:24 UTC (1,352 KB)
