Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Zhang, Zi-Qiang; Zhang, Jie; Zhang, Jian-Shu; Wu, Ming-Hui; Fang, Xin; Dai, Li-Rong

arXiv:2202.07428 (eess)

[Submitted on 15 Feb 2022 (v1), last revised 10 Jul 2022 (this version, v2)]

Abstract: With advances in self-supervised learning for the audio and visual modalities, it has become possible to learn robust audio-visual speech representations. Such representations should benefit audio-visual speech recognition (AVSR), since multi-modal inputs in principle carry richer information than either modality alone. In this paper, we build on existing self-supervised representation learning methods for the audio modality and propose an audio-visual representation learning approach. The approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. It can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The pre-trained model is evaluated on speech recognition and lipreading tasks using one or two modalities, and the results show its superiority.
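
As a rough illustration of the masking ideas mentioned in the abstract (not the authors' implementation), the sketch below shows one way random time-span masking and modality masking could be applied to frame-aligned features before they reach a fusion module. The function names `mask_spans` and `fuse_inputs`, the use of zero vectors for a masked modality, per-frame concatenation as the fusion input, and all parameters are assumptions made for this example; the paper may use different choices, such as learned mask embeddings.

```python
import random


def mask_spans(num_frames, span_len, mask_prob, rng):
    """Return a boolean list marking frames covered by randomly placed spans.

    Hypothetical helper: each frame starts a masked span of `span_len`
    frames with probability `mask_prob`; spans may overlap.
    """
    masked = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_len, num_frames)):
                masked[t] = True
    return masked


def fuse_inputs(audio, visual, drop=None):
    """Concatenate per-frame audio and visual features.

    Assumes both sequences are already aligned to the same frame rate.
    If `drop` is "audio" or "visual", that modality is replaced by zeros
    (an assumption for this sketch), so the same fusion input layout can
    serve single-modal tasks.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual sequences must be frame-aligned")
    fused = []
    for a, v in zip(audio, visual):
        if drop == "audio":
            a = [0.0] * len(a)
        elif drop == "visual":
            v = [0.0] * len(v)
        fused.append(list(a) + list(v))
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [[rng.random() for _ in range(4)] for _ in range(8)]
    visual = [[rng.random() for _ in range(3)] for _ in range(8)]
    lipreading_input = fuse_inputs(audio, visual, drop="audio")
    span_mask = mask_spans(len(audio), span_len=2, mask_prob=0.2, rng=rng)
    print(len(lipreading_input), len(lipreading_input[0]), sum(span_mask))
```

Under these assumptions, calling `fuse_inputs(audio, visual, drop="audio")` yields a visual-only (lipreading) input with the same shape as the audio-visual one, which is the property that lets one pre-trained fusion module serve one or two modalities.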

Comments:	5 pages
Subjects:	Image and Video Processing (eess.IV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2202.07428 [eess.IV]
	(or arXiv:2202.07428v2 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2202.07428

Submission history

From: Ziqiang Zhang
[v1] Tue, 15 Feb 2022 14:15:58 UTC (620 KB)
[v2] Sun, 10 Jul 2022 08:31:58 UTC (623 KB)
