Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

Liu, Henglyu; Chen, Andong; Chen, Kehai; Bai, Xuefeng; Zhong, Meizhi; Qiu, Yuan; Zhang, Min

arXiv:2503.10211 (cs)

[Submitted on 13 Mar 2025]

Abstract: Recent advances in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within the model's internal representations. To address this limitation, we propose Adaptive Inner Speech-Text Alignment (AI-STA), a method that bridges the modality gap by explicitly aligning speech and text representations at selected layers within the LLM. To achieve this, we use optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we use cross-modal retrieval to identify the layers best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
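The two ingredients named in the abstract, an OT-based discrepancy between speech and text representations and a cross-modal retrieval score per layer, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not specify the cost function, the marginals, the solver, or how retrieval scores map to a layer choice. The cosine cost, uniform marginals, Sinkhorn iterations, and the function names `cosine_cost`, `sinkhorn_cost`, and `retrieval_accuracy` are all assumptions made for illustration.

```python
import numpy as np


def cosine_cost(speech, text):
    """Pairwise cosine distance between rows of speech (n, d) and text (m, d)."""
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - s @ t.T


def sinkhorn_cost(speech, text, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost between two representation sequences.

    Assumptions (not taken from the paper): cosine cost, uniform
    marginals over frames and tokens, plain Sinkhorn scaling.
    Returns the transport cost and the transport plan.
    """
    C = cosine_cost(speech, text)
    n, m = C.shape
    a = np.full(n, 1.0 / n)   # uniform mass on speech frames
    b = np.full(m, 1.0 / m)   # uniform mass on text tokens
    K = np.exp(-C / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum()), P


def retrieval_accuracy(speech_vecs, text_vecs):
    """Top-1 speech-to-text retrieval accuracy for paired, pooled vectors.

    Row i of each array is assumed to come from the same utterance.
    """
    sim = 1.0 - cosine_cost(speech_vecs, text_vecs)
    hits = sim.argmax(axis=1) == np.arange(len(speech_vecs))
    return float(hits.mean())
```

In a setup like the one the abstract describes, `sinkhorn_cost` would be evaluated on hidden states taken from one layer for a speech input and its transcript, and `retrieval_accuracy` would be computed per layer on pooled hidden states to compare layers. How those per-layer scores translate into the selected layers is left open here, since the abstract does not state it.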

Comments:	12 pages, 7 figures
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2503.10211 [cs.CL]
	(or arXiv:2503.10211v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.10211

Submission history

From: HengLyu Liu
[v1] Thu, 13 Mar 2025 09:54:35 UTC (4,739 KB)
