MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Lee, Sangho; Chun, Il Yong; Park, Hogun

arXiv:2501.18269 (cs)

[Submitted on 30 Jan 2025]

Abstract: Multi-modal transformers are rapidly gaining attention in video captioning. Existing multi-modal video captioning methods typically extract a fixed number of frames per video, which raises critical challenges. When too few frames are extracted, frames carrying information essential for caption generation may be missed. Conversely, when too many frames are extracted, many of them are consecutive, which can make the visual tokens extracted from them redundant. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, which has two main functions: (1) selecting a caption generation module of an appropriate size based on the visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
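The abstract does not say how the module is selected or how the token subsets are built, so the following is only a toy sketch of the general idea and not the authors' method. It assumes one feature vector per frame, treats a frame as redundant when its cosine similarity to the last kept frame reaches a threshold, and picks the smallest module whose capacity covers the kept frames. The function names, the threshold of 0.95, and the capacities (2, 4, 8) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_feats, threshold=0.95):
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: redundancy is approximated by cosine similarity >= threshold.
    """
    kept = [0] if frame_feats else []
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i], frame_feats[kept[-1]]) < threshold:
            kept.append(i)
    return kept


def choose_module(num_kept, capacities=(2, 4, 8)):
    """Pick the smallest hypothetical module capacity that fits, else the largest."""
    for cap in capacities:
        if num_kept <= cap:
            return cap
    return capacities[-1]


def build_subset(kept, capacity):
    """Trim the kept indices to the module capacity by uniform subsampling."""
    if len(kept) <= capacity:
        return list(kept)
    step = len(kept) / capacity
    return [kept[int(i * step)] for i in range(capacity)]


if __name__ == "__main__":
    # Five toy frames: frames 1 and 3 nearly duplicate their predecessors.
    feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
    kept = select_frames(feats)
    capacity = choose_module(len(kept))
    print(kept, capacity, build_subset(kept, capacity))
```

In this sketch a video with few distinct frames is routed to a small module and a video with many distinct frames to a larger one, which mirrors the stated goal of matching the number of frames to each video. The adaptive attention masking scheme is not sketched here because the abstract gives no detail on it.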

Comments:	Accepted to the AAAI 2025 Main Technical Track. This is an extended version of the original submission.
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.18269 [cs.CV]
	(or arXiv:2501.18269v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.18269

Submission history

From: Hogun Park
[v1] Thu, 30 Jan 2025 11:10:18 UTC (10,390 KB)
