Text- and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild

Richet, Nicolas; Belharbi, Soufiane; Aslam, Haseeb; Schadt, Meike Emilie; González-González, Manuela; Cortal, Gustave; Koerich, Alessandro Lameiras; Pedersoli, Marco; Finkel, Alain; Bacon, Simon; Granger, Eric

arXiv:2407.12927 (cs)

[Submitted on 17 Jul 2024]

Abstract: Systems for multimodal Emotion Recognition (ER) commonly rely on features extracted from different modalities (e.g., visual, audio, and textual) to predict the seven basic emotions. However, compound emotions often occur in real-world scenarios and are more difficult to predict. Compound multimodal ER becomes more challenging in videos due to the added uncertainty of diverse modalities.
In addition, standard feature-based models may not fully capture the complex and subtle cues needed to understand compound emotions.
Since relevant cues can be extracted in the form of text, we advocate for textualizing all modalities, such as visual and audio, to harness the capacity of large language models (LLMs). These models may understand the complex interaction between modalities and the subtleties of complex emotions. Although training an LLM requires large-scale datasets, the many recently released pre-trained LLMs, such as BERT and LLaMA, can be easily fine-tuned for downstream tasks like compound ER.
This paper compares two multimodal modeling approaches for compound ER in videos: standard feature-based vs. text-based. Experiments were conducted on the challenging C-EXPR-DB dataset for compound ER, and contrasted with results on the MELD dataset for basic ER.
Our code is available
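
As an illustration of what "textualizing all modalities" could look like in practice, the following is a minimal Python sketch, not the authors' implementation. It assumes that some upstream models have already turned the visual and audio streams into short descriptive phrases. The `ClipCues` class, its field names, the `textualize` function, and the output template are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class ClipCues:
    """Hypothetical per-clip cues; the field names are illustrative only."""
    visual: list[str]   # assumed: facial descriptors from some visual model
    audio: list[str]    # assumed: vocal descriptors from some audio model
    transcript: str     # assumed: speech transcript of the clip


def textualize(cues: ClipCues) -> str:
    """Merge all modalities into a single text input for a language model."""
    visual = ", ".join(cues.visual) if cues.visual else "none detected"
    audio = ", ".join(cues.audio) if cues.audio else "none detected"
    return (
        f"Visual cues: {visual}. "
        f"Audio cues: {audio}. "
        f'Transcript: "{cues.transcript}"'
    )


# Made-up example values, used only to show the shape of the output.
example = ClipCues(
    visual=["raised eyebrows", "open mouth"],
    audio=["high pitch", "fast speech"],
    transcript="I can't believe you did that!",
)
prompt = textualize(example)
print(prompt)
```

Under these assumptions, the resulting string could serve as the input to a fine-tuned text classifier, whereas a feature-based model would instead consume per-modality feature vectors.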

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.12927 [cs.CV]
	(or arXiv:2407.12927v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.12927

Submission history

From: Soufiane Belharbi [view email]
[v1] Wed, 17 Jul 2024 18:01:25 UTC (316 KB)
