Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

Chowdhury, Sanjoy; Gani, Hanan; Anand, Nishit; Nag, Sayan; Gao, Ruohan; Elhoseiny, Mohamed; Khan, Salman; Manocha, Dinesh

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2503.23219 (eess)

[Submitted on 29 Mar 2025]



Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.
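The "100% relative improvement" figure in the abstract is a ratio against the baseline score, not a gain in absolute percentage points. The sketch below shows the usual arithmetic for that metric. The function name and the example accuracies are hypothetical and chosen only for illustration; they are not results from the paper, and the abstract does not state which metric or baseline the figure refers to.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical accuracies (not taken from the paper): a model that goes
# from 20% to 40% accuracy has doubled its score.
gain = relative_improvement(baseline=20.0, improved=40.0)
print(f"relative improvement: {gain:.1f}%")
```

Under this definition, a 100% relative improvement means the score doubled, which for a weak baseline can still correspond to a modest absolute gain.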

Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2503.23219 [eess.AS]
	(or arXiv:2503.23219v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2503.23219

Submission history

From: Sanjoy Chowdhury
[v1] Sat, 29 Mar 2025 20:42:29 UTC (29,212 KB)
