Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Liu, Xiaoyang; Wen, Boran; Liu, Xinpeng; Zhou, Zizheng; Fan, Hongwei; Lu, Cewu; Ma, Lizhuang; Chen, Yulong; Li, Yong-Lu

arXiv:2412.19542 (cs)

[Submitted on 27 Dec 2024]

Abstract: Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims to detect HOIs in videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse: they usually provide only a limited set of predefined object classes. Therefore, we introduce a new open-world benchmark, Grounding Interacted Objects (GIO), which includes 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, we propose an object grounding task that expects vision systems to discover interacted objects. Although today's detectors and grounding methods have achieved great success, they perform unsatisfactorily in localizing the diverse and rare objects in GIO. This reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority over current baselines in extensive experiments. Data and code will be publicly available at this https URL.

Comments:	To be published in the Proceedings of AAAI 2025. The first three authors contributed equally. Project: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.19542 [cs.CV]
	(or arXiv:2412.19542v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.19542

Submission history

From: Xiaoyang Liu
[v1] Fri, 27 Dec 2024 09:08:46 UTC (27,886 KB)
