MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Choi, Suhwan; Kim, Kyu Won; Kang, Myungjoo

arXiv:2501.01094 (cs)

[Submitted on 2 Jan 2025]

Abstract: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset into IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on continuous valence (emotional positivity) and arousal (emotional intensity) values. Because similarity scores are computed from the valence-arousal values across modalities, this continuous matching score allows image-music pairs to be sampled randomly during training. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
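
The idea of scoring randomly sampled image-music pairs from their valence-arousal values can be illustrated with a minimal sketch. The abstract does not state the exact scoring formula, so everything below is an assumption made for illustration: valence and arousal are taken to lie in [0, 1], the score is one minus the normalized Euclidean distance between the two points, and the names `va_similarity` and `sample_pairs` are hypothetical rather than taken from the paper.

```python
import math
import random

# Assumption: valence and arousal are each scaled to [0, 1], so the largest
# possible distance between two points is the diagonal of the unit square.
MAX_DIST = math.sqrt(2.0)


def va_similarity(va_a, va_b):
    """Hypothetical matching score: 1.0 for identical (valence, arousal)
    points, falling linearly to 0.0 at opposite corners of the unit square."""
    return 1.0 - math.dist(va_a, va_b) / MAX_DIST


def sample_pairs(image_va, music_va, n_pairs, seed=0):
    """Draw random image-music index pairs and attach a continuous score.

    Because the score is computed from the labels themselves, any random
    pair yields a usable soft training target; no curated positive or
    negative pairs are required in this sketch.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(image_va))
        j = rng.randrange(len(music_va))
        pairs.append((i, j, va_similarity(image_va[i], music_va[j])))
    return pairs


# Toy (valence, arousal) labels, invented for the example.
image_va = [(0.9, 0.8), (0.2, 0.3), (0.5, 0.5)]
music_va = [(0.85, 0.75), (0.1, 0.9)]

for i, j, score in sample_pairs(image_va, music_va, n_pairs=4):
    print(f"image {i} / music {j}: score {score:.3f}")
```

A distance-based score is only one plausible choice; any monotone function of the gap between the two valence-arousal points would serve the same role of turning continuous labels into a graded target for arbitrary pairs.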

Comments:	Paper accepted at the Artificial Intelligence for Music workshop at AAAI 2025
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2501.01094 [cs.SD]
	(or arXiv:2501.01094v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2501.01094

Submission history

From: Suhwan Choi
[v1] Thu, 2 Jan 2025 06:36:09 UTC (8,587 KB)
