TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Boeddeker, Christoph; Subramanian, Aswin Shanmugam; Wichern, Gordon; Haeb-Umbach, Reinhold; Le Roux, Jonathan

arXiv:2303.03849v2 (eess)

[Submitted on 7 Mar 2023 (v1), revised 8 Mar 2023 (this version, v2), latest version 1 Jan 2024 (v3)]

Abstract: Since diarization and source separation of meeting data are closely related tasks, we propose an approach that performs both jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at time-frequency resolution. These estimates act as masks for source extraction, either by direct masking or by beamforming. The technique applies to both single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER.
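The masking route mentioned in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names `extract_by_masking` and `frame_activity`, the averaging-plus-threshold rule for collapsing a mask to frame-level activity, and the toy values are all assumptions made for illustration. The sketch only shows how a per-speaker time-frequency mask can serve both purposes at once: multiplied with the mixture spectrogram it extracts a source, and reduced over frequency it gives a diarization-style activity track.

```python
from typing import List

Spectrogram = List[List[complex]]  # indexed as [frame][frequency bin]
Mask = List[List[float]]           # same shape, values in [0, 1]


def extract_by_masking(mixture: Spectrogram, mask: Mask) -> Spectrogram:
    """Multiply each time-frequency bin of the mixture by the speaker's mask."""
    return [
        [m * x for m, x in zip(mask_frame, mix_frame)]
        for mask_frame, mix_frame in zip(mask, mixture)
    ]


def frame_activity(mask: Mask, threshold: float = 0.5) -> List[bool]:
    """Collapse a time-frequency mask to frame-level speaker activity.

    Assumption: a frame counts as active when the mask averaged over
    frequency exceeds `threshold`; the paper may use a different rule.
    """
    return [sum(frame) / len(frame) > threshold for frame in mask]


# Toy input: 3 frames, 2 frequency bins, 2 speakers (hypothetical values).
mixture = [[1 + 1j, 2 + 0j], [0 + 2j, 1 - 1j], [3 + 0j, 0 + 1j]]
masks = [
    [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]],  # speaker 0: active early
    [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]],  # speaker 1: active late
]

# One separated spectrogram and one activity track per speaker.
separated = [extract_by_masking(mixture, m) for m in masks]
activity = [frame_activity(m, threshold=0.25) for m in masks]
```

In this toy input the middle frame is overlapped speech, so both speakers are marked active there and each receives half of the mixture energy in every bin. The beamforming route would instead use the masks to estimate spatial statistics from multi-channel input, which this single-channel sketch does not cover.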

Comments:	Submitted to IEEE/ACM TASLP
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2303.03849 [eess.AS]
	(or arXiv:2303.03849v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2303.03849

Submission history

From: Christoph Boeddeker
[v1] Tue, 7 Mar 2023 12:31:18 UTC (1,885 KB)
[v2] Wed, 8 Mar 2023 12:55:42 UTC (881 KB)
[v3] Mon, 1 Jan 2024 14:33:15 UTC (854 KB)
