Raw-x-vector: Multi-scale Time Domain Speaker Embedding Network

Zhu, Ge; Jiang, Fei; Duan, Zhiyao

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2010.12951v2 (eess)

[Submitted on 24 Oct 2020 (v1), revised 25 Feb 2021 (this version, v2), latest version 9 Jun 2021 (v3)]

Title: Raw-x-vector: Multi-scale Time Domain Speaker Embedding Network

Authors: Ge Zhu, Fei Jiang, Zhiyao Duan


Abstract: State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies of speech utterances as input features. Because deep neural networks can learn representations from raw data, recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a new speaker embedding called raw-x-vector for speaker verification in the time domain, combining a multi-scale waveform encoder with an x-vector network architecture. We show that the proposed approach outperforms existing raw-waveform-based speaker verification systems by a large margin. We also show that the proposed multi-scale encoder improves over single-scale encoders, both in the proposed system and in another state-of-the-art raw-waveform-based speaker verification system. A further analysis of the learned filters shows that the multi-scale encoder focuses on different frequency bands at its different scales, while yielding a flatter overall frequency response than any of its single-scale counterparts.
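As a rough illustration of the multi-scale idea mentioned in the abstract, the sketch below runs several banks of 1-D filters with different kernel lengths over the same waveform, using a shared stride, and concatenates the resulting channels frame by frame. It is a minimal sketch under stated assumptions, not the authors' encoder: the kernel lengths, the number of filters per scale, the stride, the ReLU, the zero-padding of the tail, and all function names are hypothetical, the filters are random rather than learned, and the x-vector network that would consume these features is omitted.

```python
import random


def conv1d(signal, kernel, stride, num_frames):
    """Strided 1-D correlation; zero-pads the tail so each scale yields num_frames outputs."""
    needed = (num_frames - 1) * stride + len(kernel)
    padded = list(signal) + [0.0] * max(0, needed - len(signal))
    return [
        sum(k * padded[t * stride + i] for i, k in enumerate(kernel))
        for t in range(num_frames)
    ]


def multi_scale_encode(signal, filter_banks, stride):
    """Apply every bank (one kernel length per bank) and concatenate channels per frame."""
    num_frames = -(-len(signal) // stride)  # ceiling division
    channels = []
    for bank in filter_banks:
        for kernel in bank:
            out = conv1d(signal, kernel, stride, num_frames)
            channels.append([max(0.0, v) for v in out])  # ReLU (an assumption)
    # Transpose so the result is a list of frames, each holding all channel activations.
    return [list(frame) for frame in zip(*channels)]


def random_bank(num_filters, kernel_len, rng):
    """Random stand-in for a learned filter bank (hypothetical helper)."""
    return [[rng.uniform(-1, 1) for _ in range(kernel_len)] for _ in range(num_filters)]


rng = random.Random(0)
KERNEL_LENGTHS = (4, 16, 64)  # hypothetical short, medium, and long scales
banks = [random_bank(4, k, rng) for k in KERNEL_LENGTHS]
waveform = [rng.uniform(-1, 1) for _ in range(160)]  # stand-in for raw audio samples
features = multi_scale_encode(waveform, banks, stride=8)
```

With these illustrative settings, every frame carries channels from all three scales side by side, so a downstream network sees short-window and long-window views of the same stretch of signal at once.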

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2010.12951 [eess.AS]
	(or arXiv:2010.12951v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2010.12951

Submission history

From: Ge Zhu
[v1] Sat, 24 Oct 2020 18:44:00 UTC (100 KB)
[v2] Thu, 25 Feb 2021 22:44:31 UTC (310 KB)
[v3] Wed, 9 Jun 2021 02:17:50 UTC (226 KB)
