SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Qi, Zekun; Zhang, Wenyao; Ding, Yufei; Dong, Runpei; Yu, Xinqiang; Li, Jingwen; Xu, Lingyun; Li, Baoyu; He, Xialin; Fan, Guofan; Zhang, Jiazhao; He, Jiawei; Gu, Jiayuan; Jin, Xin; Ma, Kaisheng; Zhang, Zhizheng; Wang, He; Yi, Li

Computer Science > Robotics

arXiv:2502.13143 (cs)

[Submitted on 18 Feb 2025]

Abstract: Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have improved the ability of vision-language models (VLMs) to perceive object locations and positional relationships, they still cannot precisely understand object orientations, a key requirement for fine-grained manipulation tasks. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
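
As an informal illustration of the semantic orientation idea described in the abstract, the sketch below pairs a natural-language phrase with a direction vector and checks an orientational constraint by angle. It is not the authors' implementation: the names SemanticOrientation, angle_between_deg, and satisfies, the 15-degree tolerance, the example vectors, and the choice to express directions in a shared scene frame are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length direction")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


@dataclass(frozen=True)
class SemanticOrientation:
    """Hypothetical container: a language phrase tied to a direction."""
    phrase: str       # e.g. "plug-in" or "handle"
    direction: Vec3   # assumed to be expressed in a shared scene frame


def angle_between_deg(a: Vec3, b: Vec3) -> float:
    """Angle in degrees between two directions."""
    a, b = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding error
    return math.degrees(math.acos(dot))


def satisfies(orientation: SemanticOrientation, target: Vec3,
              tol_deg: float = 15.0) -> bool:
    """True if the named direction points within tol_deg of target."""
    return angle_between_deg(orientation.direction, target) <= tol_deg


# Illustrative values only; they are not taken from the paper or dataset.
usb = SemanticOrientation("plug-in", normalize((1.0, 0.0, 0.0)))
knife = SemanticOrientation("handle", normalize((0.0, -1.0, 0.0)))

usb_faces_port = satisfies(usb, (1.0, 0.0, 0.1))     # small deviation
knife_handle_up = satisfies(knife, (0.0, 0.0, 1.0))  # perpendicular
```

In this sketch an instruction such as "point the plug-in direction toward the port" reduces to an angular tolerance check on one named direction, which is one simple way to read "orientational constraint"; the paper's system may formulate it differently.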

Comments:	Project page: this https URL
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.13143 [cs.RO]
	(or arXiv:2502.13143v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2502.13143

Submission history

From: Zekun Qi [view email]
[v1] Tue, 18 Feb 2025 18:59:02 UTC (33,379 KB)
