Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

Guo, Hao; Zhu, Jianfei; Fan, Wei; Yi, Chunzhi; Jiang, Feng

arXiv:2503.19240 (cs)

[Submitted on 25 Mar 2025]

Abstract: Referring expression comprehension (REC) aims to localize objects based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, which hinders their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references improve localization performance, revealing that a single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Cite as:	arXiv:2503.19240 [cs.CV]
	(or arXiv:2503.19240v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.19240

Submission history

From: Hao Guo [view email]
[v1] Tue, 25 Mar 2025 00:59:58 UTC (7,764 KB)
