BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval

Lu, Zhenyu; Sethi, Lakshay

arXiv:2408.10383 (cs)

[Submitted on 19 Aug 2024]

Abstract: Previous methods for audio-image matching generally fall into one of two categories: pipeline models or end-to-end models. Pipeline models first transcribe speech and then encode the resulting text; end-to-end models encode speech directly. Generally, pipeline models outperform end-to-end models, but the intermediate transcription necessarily discards some potentially useful non-textual information. In addition to textual information, speech can convey details such as accent, mood, and emphasis, which should be effectively captured in the encoded representation. In this paper, we investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance. We thoroughly analyze and compare end-to-end models, pipeline models, and our proposed dual-channel model for robust audio-image retrieval on a variety of datasets. Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism, and a bifurcated design.

Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2408.10383 [cs.SD]
	(or arXiv:2408.10383v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2408.10383

Submission history

From: Zhenyu Lu
[v1] Mon, 19 Aug 2024 19:56:10 UTC (18,726 KB)
