A review of on-device fully neural end-to-end automatic speech recognition algorithms

Kim, Chanwoo; Gowda, Dhananjaya; Lee, Dongsoo; Kim, Jiyeon; Kumar, Ankur; Kim, Sungsoo; Garg, Abhinav; Han, Changwoo

arXiv:2012.07974 (cs)

[Submitted on 14 Dec 2020 (v1), last revised 27 Aug 2021 (this version, v3)]

Abstract: In this paper, we review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications. Conventional speech recognition systems comprise a large number of discrete components, such as an acoustic model, a language model, a pronunciation model, a text normalizer, an inverse text normalizer, and a decoder based on a Weighted Finite State Transducer (WFST). To obtain sufficiently high recognition accuracy with such conventional systems, a very large language model (up to 100 GB) is usually needed. The corresponding WFST therefore becomes enormous, which prohibits on-device implementation. Recently, fully neural end-to-end speech recognition algorithms have been proposed. Examples include systems based on Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder (AED) models, Monotonic Chunk-wise Attention (MoChA), and transformer-based architectures. These fully neural systems require much smaller memory footprints than conventional algorithms, which makes their on-device implementation feasible. We review such end-to-end speech recognition models and extensively discuss their structures, performance, and advantages compared to conventional algorithms.

Comments:	Accepted as an invited paper to the Asilomar Conference on Signals, Systems, and Computers 2020. Figures were slightly updated in Aug. 2021.
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2012.07974 [cs.LG]
	(or arXiv:2012.07974v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2012.07974

Submission history

From: Chanwoo Kim
[v1] Mon, 14 Dec 2020 22:18:08 UTC (125 KB)
[v2] Sat, 19 Dec 2020 08:27:51 UTC (112 KB)
[v3] Fri, 27 Aug 2021 15:13:47 UTC (112 KB)
