Integrating Lattice-Free MMI into End-to-End Speech Recognition

Tian, Jinchuan; Yu, Jianwei; Weng, Chao; Zou, Yuexian; Yu, Dong

doi:10.1109/TASLP.2022.3198555

arXiv:2203.15614 (cs)

[Submitted on 29 Mar 2022 (v1), last revised 23 Aug 2022 (this version, v3)]

Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayes risk (MBR) criterion, one of the discriminative criteria, into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is used only in system training, which creates a mismatch between training and decoding; and the on-the-fly decoding process in MBR-based methods requires pre-trained models and slows training. To this end, novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method maintains consistency between training and decoding, eschews the on-the-fly decoding process, and trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and on datasets ranging from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on the Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.

Comments:	in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2203.15614 [cs.CL]
	(or arXiv:2203.15614v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.15614
Related DOI:	https://doi.org/10.1109/TASLP.2022.3198555

Submission history

From: Jinchuan Tian
[v1] Tue, 29 Mar 2022 14:32:46 UTC (867 KB)
[v2] Sat, 2 Apr 2022 03:47:59 UTC (868 KB)
[v3] Tue, 23 Aug 2022 03:37:24 UTC (1,532 KB)
