Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport

Aslam, Muhammad Haseeb; Zeeshan, Muhammad Osama; Belharbi, Soufiane; Pedersoli, Marco; Koerich, Alessandro; Bacon, Simon; Granger, Eric

arXiv:2401.15489v1 (cs)

[Submitted on 27 Jan 2024 (this version), latest version 29 Apr 2024 (v3)]

Abstract: Multimodal affect recognition models have reached remarkable performance in the lab environment due to their ability to model complementary and redundant semantic information. However, these models struggle in the wild, mainly because of the unavailability or poor quality of the modalities used for training. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information (PI) enables deep learning (DL) models to exploit data from additional modalities that are only available during training. State-of-the-art knowledge distillation (KD) methods have been proposed to distill multiple teacher models (each trained on one modality) into a common student model. These privileged KD methods typically rely on point-to-point matching and have no explicit mechanism to capture the structural information in the teacher representation space formed by introducing the privileged modality. We argue that encoding this same structure in the student space may lead to enhanced student performance. This paper introduces a new structural KD mechanism based on optimal transport (OT), where entropy-regularized OT distills the structural dark knowledge. The proposed Privileged KD with OT (PKDOT) method captures the local structures in the multimodal teacher representation by calculating a cosine similarity matrix and selects the top-k anchors to allow for sparse OT solutions, resulting in a more stable distillation process. Experiments were performed on two different problems: pain estimation on the Biovid dataset (ordinal classification) and arousal-valence prediction on the Affwild2 dataset (regression). Results show that the proposed method can outperform state-of-the-art privileged KD methods on these problems. The diversity of modalities and fusion architectures used in the experiments indicates that the proposed PKDOT method is modality- and model-agnostic.
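To make the ingredients named in the abstract more concrete (a cosine similarity matrix, top-k anchors, and entropy-regularized OT), here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how anchors are chosen or how the transport cost is built, so the anchor rule (samples with the largest total similarity), the squared-distance cost, the uniform marginals, and all function names and default values below are assumptions made purely for illustration.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (shape: n x d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T


def top_k_anchor_indices(sim, k):
    """Assumed anchor rule: the k samples most similar to the whole batch."""
    scores = sim.sum(axis=1)
    return np.argsort(scores)[::-1][:k]


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform histograms (Sinkhorn)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)           # uniform source marginal (assumption)
    b = np.full(m, 1.0 / m)           # uniform target marginal (assumption)
    kernel = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)          # rescale rows toward marginal a
        v = b / (kernel.T @ u)        # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def structural_ot_loss(teacher_feats, student_feats, k=4, eps=0.1):
    """Hypothetical structural loss: OT cost between anchor-based descriptors."""
    sim_t = cosine_similarity_matrix(teacher_feats)
    sim_s = cosine_similarity_matrix(student_feats)
    anchors = top_k_anchor_indices(sim_t, k)
    # Describe every sample by its similarity to the k teacher-chosen anchors.
    desc_t = sim_t[:, anchors]
    desc_s = sim_s[:, anchors]
    # Squared Euclidean cost between student and teacher descriptors.
    diff = desc_s[:, None, :] - desc_t[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / (cost.max() + 1e-12)  # scale to [0, 1] to keep exp() stable
    plan = sinkhorn_plan(cost, eps=eps)
    return float((plan * cost).sum()), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 16))  # e.g. multimodal teacher embeddings
    student = rng.normal(size=(8, 6))   # e.g. single-modality student embeddings
    loss, plan = structural_ot_loss(teacher, student, k=3)
    print(loss, plan.shape)
```

Because the comparison is made between batch-level similarity structures rather than raw features, the teacher and student embeddings in this sketch may have different dimensionalities. In an actual training loop the same computation would be written in an automatic-differentiation framework so that the loss can be backpropagated to the student.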

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2401.15489 [cs.CV]
	(or arXiv:2401.15489v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.15489

Submission history

From: Muhammad Haseeb Aslam
[v1] Sat, 27 Jan 2024 19:44:15 UTC (20,902 KB)
[v2] Thu, 25 Apr 2024 20:17:08 UTC (23,155 KB)
[v3] Mon, 29 Apr 2024 01:01:35 UTC (23,155 KB)