Audio-visual training for improved grounding in video-text LLMs

Sagare, Shivprasad; S, Hemachandran; Sarabhai, Kinshuk; Ullegaddi, Prashant; SA, Rajeshkumar

arXiv:2407.15046 (cs)

[Submitted on 21 Jul 2024]

Authors: Shivprasad Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar SA

Abstract: Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2407.15046 [cs.CV]
	(or arXiv:2407.15046v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15046

Submission history

From: Shivprasad Sagare Mr
[v1] Sun, 21 Jul 2024 03:59:14 UTC (340 KB)
