PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Zhang, Kaidong; Ren, Pengzhen; Lin, Bingqian; Lin, Junfan; Ma, Shikui; Xu, Hang; Liang, Xiaodan

arXiv:2410.10394 (cs)

[Submitted on 14 Oct 2024 (v1), last revised 16 Oct 2024 (this version, v2)]

Abstract: Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish a variety of complex manipulation tasks. Previous works fit the data trivially, without revealing the relation between instructions and low-level executable actions. Such models are prone to memorizing superficial patterns in the data instead of acquiring transferable knowledge, and are therefore fragile under dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE), which runs different modules of the model at different execution frequencies, thereby reducing computational redundancy and improving execution efficiency. PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE increases 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
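
The idea of running different modules at different execution frequencies can be illustrated with a small sketch. This is not the authors' AHE implementation: it is a single-threaded, multi-rate approximation in which each stage refreshes its cached output only every `period` steps and downstream stages reuse the most recent cached value. The names `Module` and `run_multirate`, the three stage names, and the periods 10, 5, and 1 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Module:
    """A hypothetical pipeline stage that is re-run only every `period` steps."""
    name: str
    period: int
    fn: Callable[[int, Dict[str, object]], object]
    calls: int = 0


def run_multirate(modules: List[Module], num_steps: int) -> Dict[str, object]:
    """Run stages in order; each refreshes its cached output every `period` steps."""
    cache: Dict[str, object] = {}
    for step in range(num_steps):
        for m in modules:
            if step % m.period == 0:
                # Slow stages run rarely; fast stages read their cached results.
                cache[m.name] = m.fn(step, cache)
                m.calls += 1
    return cache


# Placeholder stages; the periods are illustrative, not values from the paper.
modules = [
    Module("primitive", period=10, fn=lambda t, c: f"primitive@{t}"),
    Module("waypoint", period=5, fn=lambda t, c: f"waypoint@{t} from {c['primitive']}"),
    Module("action", period=1, fn=lambda t, c: f"action@{t} toward {c['waypoint']}"),
]

final = run_multirate(modules, num_steps=20)
call_counts = {m.name: m.calls for m in modules}
print(call_counts)
print(final["action"])
```

Over 20 steps the slowest stage in this sketch runs twice while the fastest runs at every step, which is the kind of redundancy reduction the abstract attributes to using different execution frequencies. A genuinely asynchronous executor would run the stages concurrently rather than in one loop.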

Comments:	Accepted to NeurIPS 2024
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2410.10394 [cs.RO]
	(or arXiv:2410.10394v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2410.10394

Submission history

From: Kaidong Zhang
[v1] Mon, 14 Oct 2024 11:30:18 UTC (5,871 KB)
[v2] Wed, 16 Oct 2024 08:20:44 UTC (5,871 KB)