Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

Hu, Shoukang; Xie, Xurong; Liu, Shansong; Yu, Jianwei; Ye, Zi; Geng, Mengzhe; Liu, Xunying; Meng, Helen

arXiv:2012.04494 (eess)

[Submitted on 8 Dec 2020 (v1), last revised 10 May 2021 (this version, v3)]

Abstract: Discriminative training techniques define state-of-the-art performance for automatic speech recognition systems. However, they are inherently prone to overfitting, leading to poor generalization performance when training data is limited. To address this issue, this paper presents a full Bayesian framework to account for model uncertainty in sequence discriminative training of factored TDNN acoustic models. Several Bayesian learning based TDNN variant systems are proposed to model the uncertainty over weight parameters and choices of hidden activation functions, or the hidden layer outputs. Efficient variational inference approaches using as few as a single parameter sample keep their computational cost in both training and evaluation comparable to that of the baseline TDNN systems. Statistically significant word error rate (WER) reductions of 0.4%-1.8% absolute (5%-11% relative) were obtained over a state-of-the-art LF-MMI factored TDNN baseline system trained on the 900-hour speed-perturbed Switchboard corpus. The baseline uses multiple regularization methods, including F-smoothing, L2 norm penalty, natural gradient, model averaging and dropout, in addition to i-Vector plus learning hidden unit contribution (LHUC) based speaker adaptation and RNNLM rescoring. Consistent performance improvements were also obtained on a 450-hour HKUST conversational Mandarin telephone speech recognition task. On a third, cross-domain adaptation task requiring rapid porting of a system trained on 1000 hours of LibriSpeech data to a small DementiaBank elderly speech corpus, the proposed Bayesian TDNN LF-MMI systems outperformed the baseline system using direct weight fine-tuning by up to 2.5% absolute WER reduction.
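
As a rough illustration of what "variational inference with a single parameter sample" can look like, the sketch below draws one reparameterized weight sample for a single linear unit and computes the KL term that would regularize it. It is not the authors' implementation. The Gaussian posterior, the standard normal prior, and the names `sample_weights`, `kl_to_standard_normal` and `bayesian_linear` are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
import random


def sample_weights(mu, log_sigma, rng):
    """Draw one weight sample w = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: a factorized Gaussian variational posterior per weight.
    """
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent weights.

    Assumption: a standard normal prior, chosen only to keep the sketch short.
    """
    return sum(
        0.5 * (math.exp(2.0 * s) + m * m - 1.0) - s
        for m, s in zip(mu, log_sigma)
    )


def bayesian_linear(x, mu, log_sigma, rng):
    """Forward pass of a one-output linear unit using a single weight sample."""
    w = sample_weights(mu, log_sigma, rng)
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical toy values, not taken from the paper.
rng = random.Random(0)
mu = [0.5, -0.2, 0.1]
log_sigma = [-2.0, -2.0, -2.0]
x = [1.0, 2.0, 3.0]

y = bayesian_linear(x, mu, log_sigma, rng)   # one stochastic forward pass
kl = kl_to_standard_normal(mu, log_sigma)    # penalty added to the training loss
```

Because only one sample is drawn per forward pass, the cost is close to that of a deterministic layer of the same size, which is the property the abstract highlights.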

Comments:	Published in TASLP
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2012.04494 [eess.AS]
	(or arXiv:2012.04494v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2012.04494

Submission history

From: Shoukang Hu
[v1] Tue, 8 Dec 2020 15:32:21 UTC (1,841 KB)
[v2] Tue, 15 Dec 2020 04:43:54 UTC (1,841 KB)
[v3] Mon, 10 May 2021 06:47:07 UTC (975 KB)
