Aligned Better, Listen Better for Audio-Visual Large Language Models

Guo, Yuxin; Ma, Shuailei; Ma, Shijie; Bao, Xiaoyi; Xie, Chen-Wei; Zheng, Kecheng; Weng, Tingyu; Sun, Siyang; Zheng, Yun; Zou, Wei

arXiv:2504.02061 (cs)

[Submitted on 2 Apr 2025]

Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies information complementary to vision. On the other hand, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exploit audio information poorly, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM named Dolphin. Aligning the audio and visual modalities concurrently in both the temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves remarkable performance in audio-visual understanding but also mitigates potential hallucinations.
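
The abstract names two concrete structures without detailing them: the (video, audio, question, answer) data tuple and audio-visual interleaved merging. The following Python sketch is one plausible reading, not the authors' implementation. It assumes that interleaved merging means placing the audio tokens of each time segment directly after the visual tokens of the same segment. The names `AVSample` and `interleave_by_segment` are hypothetical, and strings stand in for token embeddings.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AVSample:
    """Hypothetical container mirroring the (video, audio, question, answer) tuple."""
    video_path: str
    audio_path: str
    question: str
    answer: str


def interleave_by_segment(
    visual: Sequence[Sequence[str]],
    audio: Sequence[Sequence[str]],
) -> List[str]:
    """Merge per-segment token lists so each segment's visual tokens are
    followed by the audio tokens from the same time span.

    This ordering is an assumption made for illustration; the abstract
    does not specify how the merging is done.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio must cover the same number of segments")
    merged: List[str] = []
    for visual_tokens, audio_tokens in zip(visual, audio):
        merged.extend(visual_tokens)  # visual tokens for this segment
        merged.extend(audio_tokens)   # audio tokens for the same segment
    return merged
```

For example, two segments with visual tokens `[["v0a", "v0b"], ["v1a"]]` and audio tokens `[["a0"], ["a1a", "a1b"]]` would merge into a single sequence in which tokens from the same time span sit next to each other, rather than all visual tokens preceding all audio tokens.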

Comments:	Accepted to ICLR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2504.02061 [cs.CV]
	(or arXiv:2504.02061v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.02061

Submission history

From: Yuxin Guo
[v1] Wed, 2 Apr 2025 18:47:09 UTC (4,047 KB)
