CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Ghosh, Sreyan; Seth, Ashish; Kumar, Sonal; Tyagi, Utkarsh; Evuru, Chandra Kiran; Ramaneswaran, S.; Sakshi, S.; Nieto, Oriol; Duraiswami, Ramani; Manocha, Dinesh

arXiv:2310.08753 (cs)

[Submitted on 12 Oct 2023 (v1), last revised 30 Jul 2024 (this version, v4)]

Abstract: A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
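
The matching protocol described in the abstract (two audio-caption pairs per instance, with the model asked to match each audio to its caption) can be illustrated with a minimal sketch. The abstract does not spell out the exact metric, so the Winoground-style text/audio/group scoring below is an assumption, and the function names `score_instance` and `benchmark_accuracy` are hypothetical. Here `sim[i][j]` stands for whatever similarity score an ALM assigns to audio `i` and caption `j`, with pair `(i, i)` being the correct match.

```python
from typing import Dict, Sequence

Matrix = Sequence[Sequence[float]]  # 2x2 audio-by-caption similarity scores


def score_instance(sim: Matrix) -> Dict[str, bool]:
    """Score one instance made of two audio-caption pairs (assumed metric)."""
    # For each audio, the correct caption must score higher than the other one.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # For each caption, the correct audio must score higher than the other one.
    audio_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # The instance is fully solved only if both directions are correct.
    return {"text": text_ok, "audio": audio_ok, "group": text_ok and audio_ok}


def benchmark_accuracy(sims: Sequence[Matrix]) -> Dict[str, float]:
    """Average the per-instance outcomes over a list of instances."""
    if not sims:
        raise ValueError("need at least one instance")
    totals = {"text": 0, "audio": 0, "group": 0}
    for sim in sims:
        for name, ok in score_instance(sim).items():
            totals[name] += int(ok)
    return {name: count / len(sims) for name, count in totals.items()}


if __name__ == "__main__":
    # Made-up similarity values, for illustration only.
    toy = [
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.5, 0.6], [0.4, 0.7]],
    ]
    print(benchmark_accuracy(toy))
```

In practice the similarity values would come from the model under test, for example the cosine similarity between its audio and text embeddings; the toy numbers above are not results from the paper.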

Comments:	ICLR 2024. Project Page: this https URL
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2310.08753 [cs.SD]
	(or arXiv:2310.08753v4 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2310.08753

Submission history

From: Sreyan Ghosh
[v1] Thu, 12 Oct 2023 22:43:38 UTC (35,389 KB)
[v2] Wed, 22 May 2024 20:52:02 UTC (43,176 KB)
[v3] Tue, 18 Jun 2024 18:03:28 UTC (43,176 KB)
[v4] Tue, 30 Jul 2024 18:58:01 UTC (43,176 KB)
