Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

Belharbi, Soufiane; Pedersoli, Marco; Koerich, Alessandro Lameiras; Bacon, Simon; Granger, Eric

arXiv:2402.00281v2 (cs)

[Submitted on 1 Feb 2024 (v1), revised 2 Feb 2024 (this version, v2), latest version 14 May 2024 (v5)]

Abstract: While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook that associates a set of spatial action units (AUs) with each facial expression. In this paper, we follow the same expert footsteps and propose a learning strategy that explicitly incorporates spatial AU cues into the classifier's training to build a deep interpretable model. In particular, using this AU codebook, the input image's expression label, and facial landmarks, a single AU heatmap is built to indicate the most discriminative regions of interest in the image with respect to the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU maps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the experts' decision process. This is achieved using only the image class expression as supervision and without any extra manual annotations. Moreover, our method is generic: it can be applied to any CNN- or transformer-based deep classifier without architectural changes or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and we show that our training technique improves CAM interpretability.
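The training signal described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the heatmap is approximated as a sum of Gaussians at hypothetical landmark positions, the alignment term is taken to be one minus cosine similarity (the abstract only says the features are "correlated" with the AU map), and the names `gaussian_heatmap`, `composite_loss`, and the weight `lam` are invented here. Pure Python is used so the sketch runs without dependencies.

```python
import math


def gaussian_heatmap(centers, height, width, sigma=2.0):
    """Sum of Gaussians at (row, col) centers, scaled so the peak is 1.

    `centers` stands in for landmark positions of the AUs that a codebook
    would associate with the image's expression label (hypothetical input).
    """
    heat = [
        [
            sum(
                math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
                for cr, cc in centers
            )
            for c in range(width)
        ]
        for r in range(height)
    ]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 2D maps, flattened row by row."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b + eps)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def composite_loss(logits, label, attention, au_map, lam=1.0):
    """Classification loss plus an assumed attention/AU-map alignment penalty."""
    alignment = 1.0 - cosine_similarity(attention, au_map)
    return cross_entropy(logits, label) + lam * alignment


# Toy usage: one AU region at the center of a 5x5 feature map.
au_map = gaussian_heatmap([(2, 2)], height=5, width=5)
off_target = gaussian_heatmap([(0, 0)], height=5, width=5)
loss_aligned = composite_loss([2.0, 0.0], 0, au_map, au_map)
loss_misaligned = composite_loss([2.0, 0.0], 0, off_target, au_map)
```

In this toy setup, attention that matches the AU map adds almost nothing to the classification loss, while attention centered elsewhere is penalized, which is the behavior the abstract attributes to its composite loss. In a real classifier the attention map would come from a layer's spatial features and the loss would be computed in a differentiable framework.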

Comments:	11
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2402.00281 [cs.CV]
	(or arXiv:2402.00281v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.00281

Submission history

From: Soufiane Belharbi
[v1] Thu, 1 Feb 2024 02:13:49 UTC (11,455 KB)
[v2] Fri, 2 Feb 2024 02:56:43 UTC (11,455 KB)
[v3] Thu, 25 Apr 2024 16:55:46 UTC (11,455 KB)
[v4] Mon, 13 May 2024 14:54:17 UTC (11,455 KB)
[v5] Tue, 14 May 2024 12:26:54 UTC (11,455 KB)
