Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Wu, Linzhi; Zhang, Xingyu; Zhang, Yakun; Zheng, Changyan; Liu, Tiejun; Xie, Liang; Yan, Ye; Yin, Erwei

arXiv:2403.16071 (cs)

[Submitted on 24 Mar 2024 (v1), last revised 2 May 2024 (this version, v2)]

Abstract: Lip reading, the process of interpreting silent speech from visual lip movements, has attracted growing attention for its wide range of practical applications. Deep learning approaches have greatly improved lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging because of inter-speaker variability: a well-trained lip reading system may perform poorly on a previously unseen speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers so that the model does not overfit to specific speakers. In this work, building on a hybrid CTC/attention architecture, we address both the input visual clues and the latent representations. For the input, we propose to exploit lip landmark-guided fine-grained visual clues instead of the frequently used mouth-cropped images, diminishing speaker-specific appearance characteristics. For the latent representations, we propose a max-min mutual information regularization approach to capture speaker-insensitive features. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.

Comments:	To appear in LREC-COLING 2024
Subjects:	Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2403.16071 [cs.AI]
	(or arXiv:2403.16071v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2403.16071
Journal reference:	The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Submission history

From: Linzhi Wu
[v1] Sun, 24 Mar 2024 09:18:21 UTC (1,153 KB)
[v2] Thu, 2 May 2024 08:53:35 UTC (337 KB)
