Identify Speakers in Cocktail Parties with End-to-End Attention

Zhu, Junzhe; Hasegawa-Johnson, Mark; Sari, Leda


arXiv:2005.11408 (eess)

[Submitted on 22 May 2020 (v1), last revised 9 Aug 2020 (this version, v2)]



Abstract: In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn spectrogram masks that are optimized for the purpose of speaker identification, while residual forward connections permit dilated convolution with a sufficiently large context window to guarantee correct streaming across syllable boundaries. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes all speakers in three-speaker scenarios with 81.2% accuracy.
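The pooling step named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the array shapes, the function names `pool_speaker_predictions` and `multilabel_bce`, the toy logits, and the use of a sigmoid binary cross-entropy loss are all assumptions made for illustration. The sketch only shows why taking a maximum over the channel axis makes the prediction independent of which extracted channel a given speaker ends up in.

```python
import numpy as np


def pool_speaker_predictions(channel_logits):
    """Max-pool per-channel speaker logits over the channel axis.

    channel_logits: shape (num_channels, num_speakers), one row of speaker
    scores per extracted source channel (hypothetical layout).
    Returns an array of shape (num_speakers,).
    """
    return np.max(channel_logits, axis=0)


def multilabel_bce(pooled_logits, target):
    """Sigmoid binary cross-entropy against a multi-hot target.

    The loss choice is an assumption; the abstract does not name one.
    Uses the stable identity BCE(x, t) = log(1 + exp(x)) - t * x.
    """
    softplus = np.logaddexp(0.0, pooled_logits)
    return float(np.mean(softplus - target * pooled_logits))


# Toy example: 2 extracted channels, 4 candidate speakers (made-up numbers).
channel_logits = np.array([
    [4.0, -3.0, -2.0, -5.0],   # channel 0 favors speaker 0
    [-4.0, -1.0, 3.0, -6.0],   # channel 1 favors speaker 2
])
target = np.array([1.0, 0.0, 1.0, 0.0])  # speakers 0 and 2 are present

pooled = pool_speaker_predictions(channel_logits)
predicted = np.argsort(pooled)[-2:]       # indices of the two highest scores
loss = multilabel_bce(pooled, target)
```

Swapping the two rows of `channel_logits` leaves `pooled` unchanged, so a loss computed on the pooled scores does not require assigning speakers to channels in any fixed order.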

Comments:	Accepted by Interspeech 2020 for presentation
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2005.11408 [eess.AS]
	(or arXiv:2005.11408v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2005.11408

Submission history

From: Junzhe Zhu
[v1] Fri, 22 May 2020 22:15:16 UTC (5,703 KB)
[v2] Sun, 9 Aug 2020 09:24:35 UTC (5,703 KB)
