Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Mabrouk, Hadeel; Abugabal, Omar; Sakr, Nourhan; Eraqi, Hesham M.

Computer Science > Multimedia

arXiv:2207.05692 (cs)

[Submitted on 5 Jun 2022]

Title:Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Authors:Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr, Hesham M. Eraqi

View PDF

Abstract:In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regards to visual speech recognition systems due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) to their systems; ii) leveraging audio data during training visual models, a feat which has not been utilized in prior word-based work; iii) proposing the Gaussian-shaped averaging in frame-level KD, as an efficient technique that aids the model in distilling knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lip-reading, as we demonstrate a noticeable improvement in performance, setting a new benchmark equals to 88.64% on the LRW dataset.

Comments:	arXiv admin note: text overlap with arXiv:2108.03543
Subjects:	Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2207.05692 [cs.MM]
	(or arXiv:2207.05692v1 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2207.05692

Submission history

From: Hesham Mohamed Eraqi [view email]
[v1] Sun, 5 Jun 2022 15:47:54 UTC (369 KB)

Computer Science > Multimedia

Title:Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Multimedia

Title:Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators