Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

Rawal, Ishaan Singh; Matyasko, Alexander; Jaiswal, Shantanu; Fernando, Basura; Tan, Cheston

arXiv:2306.08889 (cs)

[Submitted on 15 Jun 2023 (v1), last revised 7 Jun 2024 (this version, v3)]

Abstract: While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture the rich multimodal structures and dynamics of video and text? Or do they achieve high scores by exploiting biases and spurious features? To provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct combined dataset-model representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less expressive replacement for self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations and without any finetuning. Our findings raise doubts about the ability of current models to learn highly coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models in learning highly coupled multimodal representations, a capability that current datasets do not evaluate (project page: this https URL).
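
The abstract does not spell out how QUAG impairs fusion, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that the attention matrix of a fusion layer can be split into four quadrants (video-video, video-text, text-video, text-text), that video tokens precede text tokens, and that "quadrant average" means replacing each row's values inside a chosen quadrant with their mean. The function name `quadrant_average` and its parameters are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quadrant_average(attn: Matrix, n_video: int, quadrant: Tuple[str, str]) -> Matrix:
    """Return a copy of `attn` with one quadrant averaged row by row.

    attn:     square attention matrix; rows are queries, columns are keys.
              Assumed layout: the first `n_video` tokens are video, the rest text.
    quadrant: (query_modality, key_modality), each either "video" or "text".
    """
    n = len(attn)
    spans = {"video": range(0, n_video), "text": range(n_video, n)}
    rows, cols = spans[quadrant[0]], spans[quadrant[1]]

    out = [row[:] for row in attn]  # leave the input untouched
    if len(cols) == 0:
        return out  # nothing to average if the key modality has no tokens

    for i in rows:
        # Replace every entry of row i inside the quadrant with the row's
        # mean over that quadrant, which erases token-level structure there.
        mean = sum(attn[i][j] for j in cols) / len(cols)
        for j in cols:
            out[i][j] = mean
    return out


# Toy example: 2 video tokens followed by 2 text tokens.
attn = [
    [0.50, 0.10, 0.30, 0.10],
    [0.20, 0.20, 0.40, 0.20],
    [0.10, 0.30, 0.20, 0.40],
    [0.25, 0.25, 0.25, 0.25],
]
# Impair what video queries can read from individual text keys.
impaired = quadrant_average(attn, n_video=2, quadrant=("video", "text"))
```

Under these assumptions, averaging leaves each row's total attention mass unchanged, so the impaired matrix is still a valid attention distribution; only the distinction between individual tokens inside the chosen quadrant is removed. Comparing task accuracy before and after such an intervention is one way to read the abstract's phrase "impairing modality fusion".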

Comments:	Accepted at ICML 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.08889 [cs.CV]
	(or arXiv:2306.08889v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.08889

Submission history

From: Ishaan Singh Rawal
[v1] Thu, 15 Jun 2023 06:45:46 UTC (6,712 KB)
[v2] Sat, 30 Sep 2023 08:10:26 UTC (11,197 KB)
[v3] Fri, 7 Jun 2024 05:45:02 UTC (14,000 KB)
