VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

Meng, Ming; Mu, Ke; Zhu, Yonggui; Zhu, Zhe; Sun, Haoyu; Yan, Heyang; Fan, Zhaoxin

Abstract: Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which feeds style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive-attention pooling layer, to encode diverse StyleCLIPS representations and manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention and injects the fused features into a cross-conditional autoregressive model, which modulates 3D human gesture generation based on the audio input and the stylistic cues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance: this https URL.
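As a rough illustration of two operations the abstract names, additive-attention pooling and cross-attention fusion, the following NumPy sketch may help. It is not the authors' implementation: the function names, all dimensions, the random weights, and the choice of audio frames as queries with style tokens as keys and values are assumptions made only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention_pool(h, W, b, v):
    """Pool a token sequence h of shape (T, d) into one vector of shape (d,).

    Each token gets a score v . tanh(W h_t + b); the scores are normalised
    with a softmax and used as weights for a weighted sum of the tokens.
    """
    scores = np.tanh(h @ W + b) @ v      # (T,)
    weights = softmax(scores)            # (T,), sums to 1
    return weights @ h, weights

def cross_attention(audio, style, Wq, Wk, Wv):
    """Scaled dot-product cross-attention.

    Assumption for this sketch: audio frames act as queries, and style
    tokens act as keys and values.
    """
    q = audio @ Wq                       # (Ta, dk)
    k = style @ Wk                       # (Ts, dk)
    val = style @ Wv                     # (Ts, dk)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Ta, Ts)
    return attn @ val, attn

# Hypothetical sizes: 6 style tokens of width 16, 20 audio frames of 13 MFCCs.
rng = np.random.default_rng(0)
style_tokens = rng.normal(size=(6, 16))
mfcc = rng.normal(size=(20, 13))

# Random stand-ins for learned parameters.
W, b, v = rng.normal(size=(16, 16)), np.zeros(16), rng.normal(size=16)
Wq = rng.normal(size=(13, 8))
Wk = rng.normal(size=(16, 8))
Wv = rng.normal(size=(16, 8))

pooled, pool_w = additive_attention_pool(style_tokens, W, b, v)
fused, attn = cross_attention(mfcc, style_tokens, Wq, Wk, Wv)

print(pooled.shape)  # one style vector summarising all style tokens
print(fused.shape)   # one style-conditioned feature per audio frame
```

In this sketch the pooled vector summarises a whole style sequence, while the cross-attention output keeps one feature per audio frame, which is the shape a frame-wise autoregressive predictor would consume. How the paper actually wires these two steps together is not specified in the abstract.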

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.10729 [cs.CV]
	(or arXiv:2502.10729v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.10729

