On decoder-only architecture for speech-to-text and large language model integration

Wu, Jian; Gaur, Yashesh; Chen, Zhuo; Zhou, Long; Zhu, Yimeng; Wang, Tianrui; Li, Jinyu; Liu, Shujie; Ren, Bo; Liu, Linquan; Wu, Yu

arXiv:2307.03917 (eess)

[Submitted on 8 Jul 2023 (v1), last revised 2 Oct 2023 (this version, v3)]

Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture has not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
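
The abstract does not say how CTC is used to compress the acoustic features, so the following is only an illustrative sketch of one common scheme, not the authors' implementation. It assumes that each frame has a per-frame argmax CTC label, that runs of frames sharing a label are averaged into one vector, and that runs labeled blank are dropped. The function name `ctc_compress`, the blank index `0`, and the toy inputs are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_compress(frames, label_ids, blank=BLANK):
    """Average consecutive frames that share a CTC label; drop blank runs.

    frames: list of equal-length feature vectors (lists of floats), one per frame.
    label_ids: per-frame argmax CTC label ids, same length as frames.
    Returns a shorter list of feature vectors, one per non-blank run.
    """
    if len(frames) != len(label_ids):
        raise ValueError("frames and label_ids must have the same length")
    compressed = []
    start = 0
    for label, run in groupby(label_ids):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if label == blank:
            continue  # blank frames carry no token and are discarded
        dim = len(segment[0])
        # Element-wise mean over the frames in this run.
        compressed.append(
            [sum(frame[d] for frame in segment) / length for d in range(dim)]
        )
    return compressed


# Toy example: six 2-dimensional frames collapse to two vectors.
frames = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0], [4.0, 4.0], [6.0, 8.0], [2.0, 3.0]]
labels = [5, 5, 0, 7, 7, 7]
print(ctc_compress(frames, labels))  # [[2.0, 1.0], [4.0, 5.0]]
```

In a setup like the one the abstract describes, the shortened sequence would then pass through an audio encoder that projects it into the LLM's embedding space. That projection step is not shown here.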

Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2307.03917 [eess.AS]
	(or arXiv:2307.03917v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2307.03917

Submission history

From: Zhuo Chen
[v1] Sat, 8 Jul 2023 06:47:58 UTC (242 KB)
[v2] Fri, 14 Jul 2023 23:37:43 UTC (242 KB)
[v3] Mon, 2 Oct 2023 06:57:19 UTC (240 KB)
