Understanding Co-speech Gestures in-the-wild

Hegde, Sindhu B; Prajwal, K R; Kwon, Taein; Zisserman, Andrew

arXiv:2503.22668 (cs)

[Submitted on 28 Mar 2025]

Abstract: Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal representation (speech, text, and video-gesture) to solve these tasks. By combining a global phrase contrastive loss with a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: this https URL
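
The abstract does not specify how the global phrase contrastive loss is formulated. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between a batch of gesture embeddings and the embeddings of their paired phrases. It is not the authors' implementation: the function names, the use of cosine similarity, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(rows_a, rows_b):
    """Return the matrix of cosine similarities between every pair (a_i, b_j)."""
    a = [l2_normalize(v) for v in rows_a]
    b = [l2_normalize(v) for v in rows_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(gesture_emb, phrase_emb, temperature=0.07):
    """Generic InfoNCE-style loss (hypothetical stand-in, not the paper's code).

    Row i of gesture_emb and row i of phrase_emb are assumed to be a matched
    pair; every other row in the batch acts as a negative.
    """
    sim = cosine_similarity_matrix(gesture_emb, phrase_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_transposed = [list(col) for col in zip(*logits)]
    gesture_to_phrase = diagonal_cross_entropy(logits)
    phrase_to_gesture = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (gesture_to_phrase + phrase_to_gesture)


if __name__ == "__main__":
    # Toy embeddings: two gesture clips and two phrases in a 2-D space.
    gestures = [[1.0, 0.0], [0.0, 1.0]]
    matched = symmetric_contrastive_loss(gestures, gestures)
    mismatched = symmetric_contrastive_loss(gestures, gestures[::-1])
    print(matched < mismatched)
```

In this toy setup the loss is lower when each gesture embedding lines up with its own phrase than when the pairs are swapped, which is the behaviour a contrastive objective of this kind is meant to encourage.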

Comments:	Main paper - 11 pages, 4 figures, Supplementary - 5 pages, 4 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.22668 [cs.CV]
	(or arXiv:2503.22668v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.22668

Submission history

From: Sindhu Hegde [view email]
[v1] Fri, 28 Mar 2025 17:55:52 UTC (3,117 KB)
