Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Kim, Jungeun; Jeon, Hyeongwoo; Bae, Jongseong; Kim, Ha Young

arXiv:2411.16789 (cs)

[Submitted on 25 Nov 2024]

Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2411.16789 [cs.CV]
	(or arXiv:2411.16789v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.16789

Submission history

From: Jungeun Kim
[v1] Mon, 25 Nov 2024 09:01:41 UTC (4,292 KB)
