Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration

Ding, Pengxiang; Ma, Jianfei; Tong, Xinyang; Zou, Binghong; Luo, Xinxin; Fan, Yiguo; Wang, Ting; Lu, Hongchao; Mo, Panzhong; Liu, Jinxin; Wang, Yuefan; Zhou, Huaicheng; Feng, Wenshuo; Liu, Jiacheng; Huang, Siteng; Wang, Donglin

arXiv:2502.14795 (cs)

[Submitted on 20 Feb 2025]

Abstract: This paper addresses the limitations of current humanoid robot control frameworks, which rely primarily on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Built upon whole-body control architectures, Humanoid-VLA performs object interaction and environment exploration tasks with enhanced contextual awareness in extensive experiments, demonstrating a more human-like capacity for adaptive and intelligent engagement.
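
The abstract does not say how the question-answer pairs are built from raw motion. As a minimal illustration only, and not the authors' procedure, one way to derive a pseudo-annotation from a motion sequence alone is to hide a span of discretized motion tokens and ask for the missing part. Everything named here is an assumption made for the sketch: the integer motion tokens, the `<mask>` placeholder, the question wording, and the helper `make_infill_qa`.

```python
from typing import List, Tuple

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def make_infill_qa(tokens: List[int], start: int, length: int) -> Tuple[str, str]:
    """Build one (question, answer) pair by hiding a contiguous span of motion tokens.

    The answer comes from the motion sequence itself, so no human label is needed.
    """
    if length <= 0 or start < 0 or start + length > len(tokens):
        raise ValueError("masked span must lie inside the sequence")
    # Replace the chosen span with mask placeholders to form the question.
    masked = [str(t) for t in tokens]
    masked[start:start + length] = [MASK] * length
    question = "Fill in the masked motion tokens: " + " ".join(masked)
    # The hidden tokens are the supervision target.
    answer = " ".join(str(t) for t in tokens[start:start + length])
    return question, answer


# Example with an assumed toy sequence of six motion tokens.
question, answer = make_infill_qa([12, 7, 7, 31, 5, 18], start=2, length=3)
```

Because the target is recovered from the sequence itself, a construction of this kind needs no textual description, which is the property the abstract attributes to its self-supervised augmentation.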

Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.14795 [cs.RO]
	(or arXiv:2502.14795v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2502.14795

Submission history

From: Pengxiang Ding
[v1] Thu, 20 Feb 2025 18:17:11 UTC (2,995 KB)
