A framework of text-dependent speaker verification for Chinese numerical string corpus

Zheng, Litong; Hong, Feng; Xu, Weijie; Zheng, Wan

arXiv:2405.07029v2 (cs)

[Submitted on 11 May 2024 (v1), last revised 21 May 2024 (this version, v2)]

Abstract: The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss comprising a text classification loss, a connectionist temporal classification (CTC) loss and a decoder loss. In the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method improves the equal error rate (EER) by 49.2% on Hi-Mia and by 75.0% on SHAL.
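
The multi-scale pooling idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the SWASP and ASP outputs are combined, so the window length, hop size, averaging of per-window statistics, and final concatenation below are all assumptions, and the names `attentive_stats_pooling`, `multi_scale_pooling`, `w` and `v` are hypothetical.

```python
import numpy as np

def attentive_stats_pooling(frames, w, v):
    """Attention-weighted mean and standard deviation over time.

    frames: (T, D) frame-level features; w: (D, A) and v: (A,) are
    hypothetical attention parameters (random here, learned in practice).
    """
    scores = np.tanh(frames @ w) @ v               # (T,) one score per frame
    scores = scores - scores.max()                 # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    mean = alpha @ frames                          # (D,) weighted mean
    var = alpha @ (frames ** 2) - mean ** 2        # weighted variance
    std = np.sqrt(np.clip(var, 1e-8, None))        # clip avoids sqrt of tiny negatives
    return np.concatenate([mean, std])             # (2 * D,)

def multi_scale_pooling(frames, w, v, window=50, hop=25):
    """Concatenate utterance-level ASP with averaged sliding-window ASP.

    ASSUMPTION: averaging the per-window statistics and concatenating them
    with the global statistics is an illustrative choice only.
    """
    global_stats = attentive_stats_pooling(frames, w, v)
    starts = range(0, max(len(frames) - window, 0) + 1, hop)
    local = [attentive_stats_pooling(frames[s:s + window], w, v) for s in starts]
    local_stats = np.mean(local, axis=0)
    return np.concatenate([global_stats, local_stats])  # (4 * D,)

rng = np.random.default_rng(0)
T, D, A = 200, 8, 4                     # frames, feature dim, attention dim
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, A))
v = rng.standard_normal(A)
embedding = multi_scale_pooling(frames, w, v)
print(embedding.shape)
```

The global branch summarises the whole utterance, while the windowed branch computes the same statistics over short overlapping segments, which is one plausible way to make the pooled representation less sensitive to pauses and rhythm changes in a spoken digit string.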

Comments:	arXiv admin note: text overlap with arXiv:2312.01645
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.07029 [cs.SD]
	(or arXiv:2405.07029v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2405.07029

Submission history

From: Litong Zheng
[v1] Sat, 11 May 2024 15:02:06 UTC (615 KB)
[v2] Tue, 21 May 2024 04:44:59 UTC (615 KB)
