OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Chen, Tongjia; Yu, Hongshan; Yang, Zhengeng; Li, Zechuan; Sun, Wei; Chen, Chen

arXiv:2312.00096 (cs)

[Submitted on 30 Nov 2023 (v1), last revised 28 Mar 2024 (this version, v2)]

Abstract: Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which formulates video recognition as solving for the optimal matching flow between frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
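
The matching step described in the abstract can be illustrated with a small sketch. The abstract does not say how the matching flow is computed, so the version below assumes an entropic-regularized optimal transport problem solved with Sinkhorn iterations, uniform mass on frames and descriptors, and cosine similarity as the affinity. The function names (`cosine`, `sinkhorn_plan`, `ot_score`) and the parameters `eps` and `n_iters` are hypothetical and are not taken from the paper or its code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate transport plan between uniform marginals (assumed setup).

    cost[i][j] is the cost of matching frame i to descriptor j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n          # uniform mass over frames (assumption)
    b = [1.0 / m] * m          # uniform mass over descriptors (assumption)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_score(frames, descriptors, eps=0.1):
    """Video-to-class score: similarities weighted by the transport plan."""
    sim = [[cosine(f, d) for d in descriptors] for f in frames]
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn_plan(cost, eps)
    return sum(plan[i][j] * sim[i][j]
               for i in range(len(frames)) for j in range(len(descriptors)))

if __name__ == "__main__":
    # Toy 2-D embeddings standing in for frame and descriptor features.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    class_a = [[1.0, 0.0], [0.0, 1.0]]      # descriptors aligned with the frames
    class_b = [[-1.0, 0.0], [0.0, -1.0]]    # descriptors opposed to the frames
    scores = {"a": ot_score(frames, class_a), "b": ot_score(frames, class_b)}
    print(max(scores, key=scores.get))
```

Under these assumptions, a video would be assigned to the class whose descriptor set yields the highest plan-weighted similarity, so descriptors that match only some frames still contribute in proportion to the mass routed to them.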

Comments:	Technical report. Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.00096 [cs.CV]
	(or arXiv:2312.00096v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.00096

Submission history

From: Tongjia Chen
[v1] Thu, 30 Nov 2023 13:32:43 UTC (4,716 KB)
[v2] Thu, 28 Mar 2024 08:25:27 UTC (5,380 KB)
