Improving Rare-Word Recognition of Whisper in Zero-Shot Settings

Jogi, Yash; Aggarwal, Vaibhav; Nair, Shabari S; Verma, Yash; Kubba, Aayush

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2502.11572 (eess)

[Submitted on 17 Feb 2025 (v1), last revised 18 Feb 2025 (this version, v2)]

Abstract: Whisper, despite being trained on 680K hours of web-scale audio data, has difficulty recognising rare words such as domain-specific terms. One solution is contextual biasing through prompting. To improve on this method, we propose a supervised learning strategy to fine-tune Whisper for contextual biasing instruction. We demonstrate that, using only 670 hours of the Common Voice English set for fine-tuning, our model generalises to 11 diverse open-source English datasets, achieving a 45.6% improvement in recognition of rare words and a 60.8% improvement in recognition of words unseen during fine-tuning over the baseline method. Surprisingly, our model's contextual biasing ability generalises even to languages unseen during fine-tuning.
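
As a rough illustration of contextual biasing through prompting, the sketch below builds a prompt string from a list of rare words and scores how many of those words a transcript recovers. The abstract specifies neither the prompt format nor the metric, so the function names, the comma-separated format, the `max_words` cap, and the recall definition are all assumptions made for illustration, not the authors' method. A string of this kind could, for example, be passed as the `initial_prompt` argument of `transcribe` in the openai-whisper package.

```python
import re
from collections import Counter


def build_bias_prompt(bias_words, max_words=50):
    """Join de-duplicated bias words into one comma-separated prompt string.

    The format and the max_words cap are illustrative assumptions.
    """
    seen = set()
    unique = []
    for word in bias_words:
        key = word.lower()
        if key not in seen:  # keep the first spelling of each word
            seen.add(key)
            unique.append(word)
    return ", ".join(unique[:max_words])


def _tokens(text):
    # Lowercase and keep alphanumeric runs (plus apostrophes).
    return re.findall(r"[a-z0-9']+", text.lower())


def rare_word_recall(reference, hypothesis, bias_words):
    """Share of bias-word occurrences in the reference found in the hypothesis.

    Handles single-word bias terms only (hypothetical metric).
    """
    targets = {w.lower() for w in bias_words}
    ref_counts = Counter(t for t in _tokens(reference) if t in targets)
    hyp_counts = Counter(t for t in _tokens(hypothesis) if t in targets)
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0  # no bias words in the reference, so nothing was missed
    hits = sum(min(n, hyp_counts[w]) for w, n in ref_counts.items())
    return hits / total


if __name__ == "__main__":
    bias = ["tachycardia", "Metoprolol", "tachycardia"]
    reference = "The patient has tachycardia and takes Metoprolol."
    unbiased = "The patient has tacky cardia and takes metoprolol."
    print(build_bias_prompt(bias))
    print(rare_word_recall(reference, unbiased, bias))
    print(rare_word_recall(reference, reference, bias))
```

Scoring only the words on the bias list keeps the measure focused on rare terms, which an overall word error rate would largely average away.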

Comments:	Accepted at IEEE SLT 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2502.11572 [eess.AS]
	(or arXiv:2502.11572v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2502.11572

Submission history

From: Yash Jogi
[v1] Mon, 17 Feb 2025 09:06:34 UTC (773 KB)
[v2] Tue, 18 Feb 2025 05:46:11 UTC (773 KB)
