A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition

Shen, Yaomin; Lin, Xiaojian; Fan, Wei

arXiv:2503.19474 (cs)

[Submitted on 25 Mar 2025 (v1), last revised 2 Apr 2025 (this version, v2)]

Abstract: In multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches struggle to adequately capture the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions produced by a large language model. Comprehensive experiments indicate that A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.
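The abstract does not give the loss used in the Triplet Contrastive Learning pipeline. As a rough illustration only, the sketch below shows a generic triplet margin loss, in which a fused multimodal embedding is pulled toward the embedding of its matching label description and pushed away from a non-matching one. The function names, the cosine distance, the margin value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The margin of 0.2 is an arbitrary illustrative value.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical toy embeddings (not real model outputs).
fused = [1.0, 0.0]           # fused multimodal representation
matching_label = [1.0, 0.0]  # embedding of the correct label description
other_label = [0.0, 1.0]     # embedding of a different label description

# Zero loss when the fused embedding already sits on the matching label.
loss_good = triplet_loss(fused, matching_label, other_label)
# Positive loss when the roles of the two labels are swapped.
loss_bad = triplet_loss(fused, other_label, matching_label)
```

In this toy setting the loss vanishes once the matching description is closer to the fused embedding than the non-matching one by at least the margin, which is the general behavior a triplet objective is meant to encourage.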

Comments:	Accepted by ICME2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.19474 [cs.CV]
	(or arXiv:2503.19474v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.19474

Submission history

From: Yaomin Shen
[v1] Tue, 25 Mar 2025 09:09:30 UTC (2,589 KB)
[v2] Wed, 2 Apr 2025 03:33:40 UTC (2,589 KB)
