On Robustness to Missing Video for Audiovisual Speech Recognition

Chang, Oscar; Braga, Otavio; Liao, Hank; Serdyuk, Dmitriy; Siohan, Olivier

arXiv:2312.10088 (eess)

[Submitted on 13 Dec 2023 (v1), last revised 19 Dec 2023 (this version, v2)]

Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g., the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model below that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness, such as dropout, fall short.
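
The robustness requirement stated in the abstract can be read as a per-condition inequality: with video missing, the audiovisual model should do no worse than the audio-only model. The following is a minimal sketch of that check, assuming word error rate (WER, lower is better) as the metric. The function name, the test-suite names, and all numbers are hypothetical illustrations, not the paper's framework or results.

```python
from typing import Dict


def robustness_report(
    av_wer_by_suite: Dict[str, float],
    audio_only_wer: float,
    tolerance: float = 0.0,
) -> Dict[str, bool]:
    """Return, per test suite, whether the audiovisual WER stays at or below
    the audio-only WER (plus an optional tolerance)."""
    return {
        suite: wer <= audio_only_wer + tolerance
        for suite, wer in av_wer_by_suite.items()
    }


# Hypothetical WERs (in percent) for one acoustic noise condition.
audio_only_wer = 10.0
av_wer_by_suite = {
    "full_video": 8.5,        # all video frames present
    "drop_50_percent": 9.6,   # half of the video frames missing
    "drop_all": 10.4,         # video entirely missing
}

report = robustness_report(av_wer_by_suite, audio_only_wer)
failing_suites = [suite for suite, ok in report.items() if not ok]
print(report)
print("Suites violating the criterion:", failing_suites)
```

In this made-up example the model would count as non-robust, because on one suite it falls behind the audio-only baseline. A claim of robustness would need the check to pass on every test suite and noise condition considered.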

Subjects:	Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2312.10088 [eess.AS]
	(or arXiv:2312.10088v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2312.10088

Submission history

From: Oscar Chang
[v1] Wed, 13 Dec 2023 05:32:52 UTC (1,406 KB)
[v2] Tue, 19 Dec 2023 01:44:13 UTC (1,133 KB)
