EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

Zhang, Xiangyue; Li, Jianfang; Zhang, Jiaxu; Ren, Jianqiang; Bo, Liefeng; Tu, Zhigang

arXiv:2504.09209 (cs)

[Submitted on 12 Apr 2025 (v1), last revised 15 Apr 2025 (this version, v2)]

Abstract: Masked modeling frameworks have shown promise in co-speech motion generation. However, they struggle to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. Both low-level and high-level speech features are projected into this space, yielding a motion-aligned speech representation through learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.
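
The selective masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that frame-level attention scores come from interactions between motion keys and speech queries and that high-scoring frames are preferred for masking. The scaled dot product, the softmax over frames, the averaging over queries, the top-k selection, and every name below (`frame_attention_scores`, `select_masked_frames`, `mask_ratio`) are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention_scores(speech_queries, motion_keys):
    """Per-frame score: attention over frames, averaged across speech queries.

    Assumption: scaled dot-product attention with a softmax over frames.
    The paper may compute the scores differently.
    """
    dim = len(motion_keys[0])
    scale = 1.0 / math.sqrt(dim)
    scores = [0.0] * len(motion_keys)
    for query in speech_queries:
        logits = [scale * sum(q * k for q, k in zip(query, key))
                  for key in motion_keys]
        for t, weight in enumerate(softmax(logits)):
            scores[t] += weight / len(speech_queries)
    return scores


def select_masked_frames(scores, mask_ratio):
    """Boolean mask marking the highest-scoring frames (hypothetical top-k rule)."""
    num_masked = max(1, round(mask_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    chosen = set(ranked[:num_masked])
    return [t in chosen for t in range(len(scores))]


# Toy data: 4 motion frames and 2 speech queries, all 2-dimensional.
motion_keys = [[0.1, 0.0], [2.0, 1.5], [0.0, 0.2], [1.8, 1.9]]
speech_queries = [[1.0, 1.0], [0.5, 1.5]]

scores = frame_attention_scores(speech_queries, motion_keys)
mask = select_masked_frames(scores, mask_ratio=0.5)
print(mask)  # [False, True, False, True]
```

In this toy case frames 1 and 3 align most strongly with both queries, so they are the ones selected for masking. A random masking baseline would instead pick frames without regard to the speech signal.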

Comments:	12 pages, 12 figures
Subjects:	Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Cite as:	arXiv:2504.09209 [cs.GR]
	(or arXiv:2504.09209v2 [cs.GR] for this version)
	https://doi.org/10.48550/arXiv.2504.09209

Submission history

From: Xiangyue Zhang
[v1] Sat, 12 Apr 2025 13:30:16 UTC (21,422 KB)
[v2] Tue, 15 Apr 2025 15:41:20 UTC (28,477 KB)
