Spatio-Temporal Context Prompting for Zero-Shot Action Detection

Huang, Wei-Jhe; Chen, Min-Hung; Lai, Shang-Hong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.15996 (cs)

[Submitted on 28 Aug 2024 (v1), last revised 29 Aug 2024 (this version, v2)]

Abstract: Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and they generalize poorly to unseen action categories. In this paper, we adapt pretrained image-language models to detect unseen actions. To this end, we propose a method that effectively leverages the rich knowledge of visual-language models to perform Person-Context Interaction. Our Context Prompting module uses contextual information to prompt the labels, yielding more representative text features. Moreover, to address the challenge of recognizing distinct actions performed by multiple people at the same timestamp, we design the Interest Token Spotting mechanism, which employs pretrained visual knowledge to find each person's interest context tokens; these tokens are then used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on the J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found at this https URL.
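
To make the idea of per-person prompting concrete, the following is a toy sketch only, not the authors' implementation. It assumes that person features, context tokens, and label text features already live in one shared embedding space, and it replaces the paper's learned modules with two simple stand-ins: cosine-similarity top-k selection for Interest Token Spotting, and a weighted average for Context Prompting. All function names, the `weight` parameter, and the example vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def spot_interest_tokens(person, context_tokens, k):
    """Stand-in for Interest Token Spotting (assumption): return the indices
    of the k context tokens most similar to the person feature."""
    ranked = sorted(
        range(len(context_tokens)),
        key=lambda i: cosine(person, context_tokens[i]),
        reverse=True,
    )
    return ranked[:k]


def prompt_text_feature(text_feature, interest_tokens, weight=0.5):
    """Stand-in for Context Prompting (assumption): blend a label's text
    feature with the mean of the selected interest tokens."""
    mean = [sum(col) / len(interest_tokens) for col in zip(*interest_tokens)]
    return [(1 - weight) * t + weight * m for t, m in zip(text_feature, mean)]


def classify(person, context_tokens, label_features, k=2):
    """Score every label against the person feature after prompting the
    label's text feature with that person's own interest tokens."""
    idx = spot_interest_tokens(person, context_tokens, k)
    interest = [context_tokens[i] for i in idx]
    scores = {
        label: cosine(person, prompt_text_feature(feat, interest))
        for label, feat in label_features.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical 3-dimensional features, for illustration only.
person = [1.0, 0.0, 0.0]
context_tokens = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]
label_features = {"ride bike": [1.0, 0.0, 0.0], "swim": [0.0, 1.0, 0.0]}

predicted, scores = classify(person, context_tokens, label_features)
print(predicted, scores)
```

Because the interest tokens are selected per person, two people in the same frame get different prompted text features for the same label set, which is the property the abstract attributes to Interest Token Spotting. No training is involved in this sketch, so unseen labels only need a text feature to be scored.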

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.15996 [cs.CV]
	(or arXiv:2408.15996v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.15996

Submission history

From: Wei-Jhe Huang [view email]
[v1] Wed, 28 Aug 2024 17:59:05 UTC (2,798 KB)
[v2] Thu, 29 Aug 2024 06:54:11 UTC (2,798 KB)
