Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

Chung, Soo-Whan; Kang, Hong Goo; Chung, Joon Son

doi:10.21437/Interspeech.2020-1113

arXiv:2004.14326 (cs)

[Submitted on 29 Apr 2020 (v1), last revised 6 May 2020 (this version, v2)]

Abstract: The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
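
The training strategy described in the abstract can be pictured as a sum of two terms: one that matches features across modalities and one that keeps features apart within each modality. The abstract gives no loss formulas, so the sketch below is an illustrative assumption and not the authors' implementation. The function names, the cosine-similarity scoring, the softmax matching term, the hinged pairwise penalty, and the `weight` parameter are all hypothetical choices. Features are assumed to be non-zero vectors, with `video[i]` and `audio[i]` forming a matching pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_modal_loss(video, audio):
    """Assumed cross-modal term: softmax cross-entropy in which video[i]
    should score highest against its paired audio[i]."""
    n = len(video)
    total = 0.0
    for i in range(n):
        scores = [cosine(video[i], a) for a in audio]
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total / n


def intra_modal_penalty(features):
    """Assumed within-modality term: mean positive cosine similarity
    between distinct features of the same modality."""
    n = len(features)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    overlap = sum(max(0.0, cosine(features[i], features[j])) for i, j in pairs)
    return overlap / len(pairs)


def combined_loss(video, audio, weight=1.0):
    """Cross-modal matching plus a weighted separation penalty per modality."""
    separation = intra_modal_penalty(video) + intra_modal_penalty(audio)
    return cross_modal_loss(video, audio) + weight * separation
```

In this toy form, correctly paired and well-separated features give a lower `combined_loss` than swapped pairs or features that collapse onto the same direction, which is the qualitative behaviour the abstract describes.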

Comments:	Under submission as a conference paper
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2004.14326 [cs.SD]
	(or arXiv:2004.14326v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2004.14326
Related DOI:	https://doi.org/10.21437/Interspeech.2020-1113

Submission history

From: Joon Son Chung
[v1] Wed, 29 Apr 2020 16:51:50 UTC (1,836 KB)
[v2] Wed, 6 May 2020 14:56:36 UTC (1,837 KB)
