CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets

Agrawal, Tanay; Guermal, Mohammed; Balazia, Michal; Bremond, Francois

arXiv:2501.03332 (cs)

[Submitted on 6 Jan 2025]

Abstract: Challenges in cross-learning include inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by two transfer learning techniques from NLP, adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient because the backbone and other plugins do not need to be finetuned along with these additions. Comparative and ablation studies on three datasets, Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5, show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input, and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to, and in some cases better than, the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.
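The efficiency claim in the abstract rests on a ratio: the parameters of the added plugins divided by the parameters of the frozen backbone. The sketch below illustrates that arithmetic only. It assumes a generic bottleneck adapter (one down-projection and one up-projection per layer), which is a stand-in and not the paper's multi-head vision adapter or cross-attention adapter. All function names and dimensions are hypothetical and are not taken from the paper, so the resulting fraction will not match the reported 12.8% or 22.3%.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def transformer_layer_params(d_model, d_ff):
    """Parameter count of one standard transformer encoder layer (assumed layout)."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V and output projections
    mlp = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * 2 * d_model  # two LayerNorms, each with scale and shift
    return attention + mlp + layer_norms


def bottleneck_adapter_params(d_model, d_bottleneck):
    """Parameter count of a generic bottleneck adapter (assumption, not CM3T's block)."""
    down = linear_params(d_model, d_bottleneck)
    up = linear_params(d_bottleneck, d_model)
    return down + up


def trainable_fraction(num_layers, d_model, d_ff, d_bottleneck):
    """Trainable adapter parameters relative to the frozen backbone parameters."""
    frozen = num_layers * transformer_layer_params(d_model, d_ff)
    trainable = num_layers * bottleneck_adapter_params(d_model, d_bottleneck)
    return trainable / frozen


if __name__ == "__main__":
    # Hypothetical ViT-Base-like sizes, chosen only for illustration.
    fraction = trainable_fraction(num_layers=12, d_model=768, d_ff=3072, d_bottleneck=64)
    print(f"trainable / backbone = {fraction:.2%}")
```

Because the backbone stays frozen, only the adapter weights need gradients and optimizer state, which is where the training savings come from. Changing `d_bottleneck` shows how the trainable fraction scales with adapter width.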

Comments:	Preprint. Final paper accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, February, 2025. 10 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
MSC classes:	68T05, 68T10
ACM classes:	I.5
Cite as:	arXiv:2501.03332 [cs.CV]
	(or arXiv:2501.03332v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.03332

Submission history

From: Michal Balazia
[v1] Mon, 6 Jan 2025 19:01:10 UTC (196 KB)
