Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Sun, Guangzhi; Yu, Wenyi; Tang, Changli; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Zhang, Chao

arXiv:2310.05863 (eess)

[Submitted on 9 Oct 2023 (v1), last revised 10 Oct 2023 (this version, v2)]

Abstract: Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams, which is challenging but necessary for LLMs to understand general video inputs, remains under-explored. To this end, this paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module that better captures the causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrated remarkable video comprehension and reasoning abilities on tasks not previously demonstrated by other multimodal LLMs. An interactive demo of FAVOR is available at this https URL, and the training code and model checkpoints will be released soon.
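
The abstract does not spell out the causal attention module, so the following is only a generic illustration of what "causal" attention over frames usually means: each frame may attend to itself and to earlier frames, never to later ones. It is a minimal single-head sketch in pure Python, assuming the frames themselves act as queries, keys and values with no learned projections. The function name `causal_self_attention` and the toy inputs are hypothetical, and this is not the FAVOR implementation.

```python
import math


def causal_self_attention(frames):
    """Single-head causal self-attention over a list of frame vectors.

    Assumption for brevity: queries, keys and values are the frame
    vectors themselves (no learned projection matrices).
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: frame t may only attend to frames 0..t.
        visible = frames[: t + 1]
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in visible
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output for frame t is the weighted sum of the visible frames.
        outputs.append([
            sum(w * frame[d] for w, frame in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs


# Toy stand-ins for fused audio-visual frame features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = causal_self_attention(frames)
```

Because of the mask, replacing the last frame leaves the outputs for the earlier frames unchanged, which is the property that distinguishes causal attention from full bidirectional attention.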

Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Cite as:	arXiv:2310.05863 [eess.AS]
	(or arXiv:2310.05863v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2310.05863

Submission history

From: Guangzhi Sun
[v1] Mon, 9 Oct 2023 17:00:20 UTC (6,510 KB)
[v2] Tue, 10 Oct 2023 05:30:49 UTC (6,510 KB)
