Dynamic Scene Understanding from Vision-Language Representations

Pruss, Shahaf; Alper, Morris; Averbuch-Elor, Hadar

arXiv:2501.11653 (cs)

[Submitted on 20 Jan 2025]

Abstract: Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner (as predicting and parsing structured text, or by directly concatenating representations to the input of existing models), we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
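
To make the idea of "predicting and parsing structured text" concrete, here is a minimal sketch of the parsing half only. The abstract does not specify the text format, so the `verb: ... | role: value` layout and the function name `parse_situation` are hypothetical, chosen purely for illustration; this is not the authors' implementation.

```python
def parse_situation(text: str) -> dict:
    """Parse a hypothetical structured-text prediction into a frame.

    Assumed (not from the paper) input format:
        "verb: riding | agent: woman | vehicle: horse"
    """
    fields = {}
    for part in text.split("|"):
        key, sep, value = part.partition(":")
        if not sep:
            continue  # skip fragments that have no "key: value" shape
        fields[key.strip().lower()] = value.strip()
    # Separate the verb from the remaining semantic roles.
    verb = fields.pop("verb", None)
    return {"verb": verb, "roles": fields}


if __name__ == "__main__":
    print(parse_situation("verb: riding | agent: woman | vehicle: horse"))
```

Under these assumptions, a decoder that emits such a string could be evaluated against Situation Recognition annotations by comparing the parsed verb and role fillers, with malformed fragments simply ignored.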

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.11653 [cs.CV]
	(or arXiv:2501.11653v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.11653

Submission history

From: Shahaf Pruss
[v1] Mon, 20 Jan 2025 18:33:46 UTC (23,647 KB)
