Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Zheng, Xianrui; Zhang, Chao; Woodland, Philip C.

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2207.03852 (eess)

[Submitted on 8 Jul 2022]

Abstract: Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, a tandem multitask training (TMT) method is proposed to fine-tune W2V2 so that a single model performs both speaker diarisation and automatic speech recognition (ASR). Speaker diarisation requires voice activity detection (VAD) and speaker classification (SC), while ASR is trained with connectionist temporal classification (CTC). The multitask framework implements VAD, SC, and ASR using an early, a middle, and a late layer of W2V2 respectively, which matches the order of the pipeline: segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the augmented multi-party interaction (AMI) dataset showed that assigning VAD, SC, and ASR to progressively later W2V2 layers not only saves computational cost but also reduces diarisation error rates (DERs). Joint fine-tuning of VAD, SC, and ASR yielded 16%/17% relative reductions in DER with manual/automatic segmentation respectively, and consistent reductions in speaker-attributed word error rate, compared to the baseline with separately fine-tuned models.
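The layer-tapping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `TappedEncoder`, `TASK_LAYER`, and the layer indices 2, 6, and 12 are hypothetical, and each toy layer simply adds 1 to its input instead of running a W2V2 transformer block. The sketch only shows why reading VAD from an early layer is cheaper than running the full stack, and how one forward pass can serve all three tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical tap points: the abstract only says "early", "middle", "late".
TASK_LAYER: Dict[str, int] = {"vad": 2, "sc": 6, "asr": 12}


@dataclass
class TappedEncoder:
    """Toy stand-in for a layered encoder such as W2V2 (not the real model)."""

    layers: List[Callable[[Vector], Vector]]

    def forward_until(self, x: Vector, last_layer: int) -> Dict[int, Vector]:
        """Run layers 1..last_layer and return every intermediate output."""
        if not 1 <= last_layer <= len(self.layers):
            raise ValueError("last_layer out of range")
        outputs: Dict[int, Vector] = {}
        h = x
        for idx, layer in enumerate(self.layers[:last_layer], start=1):
            h = layer(h)
            outputs[idx] = h
        return outputs


def make_counting_layers(n: int, counter: Dict[str, int]) -> List[Callable[[Vector], Vector]]:
    """Build n toy layers that add 1.0 per element and count how often they run."""

    def layer(h: Vector) -> Vector:
        counter["calls"] += 1
        return [v + 1.0 for v in h]

    return [layer] * n


def run_task(encoder: TappedEncoder, x: Vector, task: str) -> Vector:
    """Compute only as many layers as the given task needs."""
    tap = TASK_LAYER[task]
    return encoder.forward_until(x, tap)[tap]


def run_all_tasks(encoder: TappedEncoder, x: Vector) -> Dict[str, Vector]:
    """One pass up to the deepest tap; each task reads its own layer."""
    deepest = max(TASK_LAYER.values())
    outputs = encoder.forward_until(x, deepest)
    return {task: outputs[tap] for task, tap in TASK_LAYER.items()}


if __name__ == "__main__":
    counter = {"calls": 0}
    encoder = TappedEncoder(make_counting_layers(12, counter))
    run_task(encoder, [0.0], "vad")
    print("layers run for VAD only:", counter["calls"])
```

In a real system each tap would feed a task-specific output head, and the per-task losses would be combined during joint fine-tuning; how they are combined is not stated in the abstract, so it is left out of this sketch.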

Comments:	To appear in Interspeech 2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2207.03852 [eess.AS]
	(or arXiv:2207.03852v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2207.03852

Submission history

From: Xianrui Zheng
[v1] Fri, 8 Jul 2022 12:06:52 UTC (129 KB)
