Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Zhang, Yiming; Zhao, Zhuokai; Chen, Zhaorun; Ding, Zenghui; Yang, Xianjun; Sun, Yining

arXiv:2411.14401 (cs)

[Submitted on 21 Nov 2024]

Abstract: Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, which achieves superior performance compared to both fine-tuned and training-free methods and sets a new state-of-the-art for zero-shot video understanding.
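
The abstract names bipartite token merging but does not give the procedure. As a rough illustration only, the following is a minimal sketch of one common bipartite merging scheme: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the `r` best-matched tokens are averaged into their partners. The function name `bipartite_merge`, the parameter `r`, the use of cosine similarity, and the plain averaging are all assumptions made for this sketch; it is not DYTO's implementation and may differ from it in every detail.

```python
import numpy as np


def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Hypothetical sketch: merge r tokens of an (n, d) array, returning (n - r, d).

    Not the paper's code. Assumes cosine similarity and unweighted averaging.
    """
    # Split tokens into two alternating sets A (even rows) and B (odd rows).
    a, b = tokens[0::2], tokens[1::2]
    r = max(0, min(r, a.shape[0]))
    if r == 0 or b.shape[0] == 0:
        return tokens.copy()

    # Cosine similarity between every token in A and every token in B.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T  # shape (|A|, |B|)

    # Each A token proposes one edge to its most similar B token.
    best_match = sim.argmax(axis=1)
    best_score = sim.max(axis=1)

    # Keep the r strongest edges; the remaining A tokens stay unmerged.
    order = np.argsort(-best_score, kind="stable")
    merged_idx = order[:r]
    kept_idx = np.sort(order[r:])

    # Average each merged A token into its matched B token.
    sums = b.astype(float).copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_match[merged_idx], a[merged_idx])
    np.add.at(counts, best_match[merged_idx], 1)
    merged_b = sums / counts[:, None]

    # Output: unmerged A tokens followed by the (possibly updated) B tokens.
    return np.concatenate([a[kept_idx], merged_b], axis=0)
```

Calling `bipartite_merge(x, r)` on an `(n, d)` array removes at most `len(x[0::2])` tokens in one pass. Note that this sketch does not preserve the original token order; a method that depends on positional order would need to track indices as well.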

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2411.14401 [cs.CV]
	(or arXiv:2411.14401v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.14401

Submission history

From: Zhuokai Zhao [view email]
[v1] Thu, 21 Nov 2024 18:30:11 UTC (9,422 KB)
