Vision-to-Music Generation: A Survey

Wang, Zhaokai; Bao, Chenxi; Zhuo, Le; Han, Jingrui; Yue, Yang; Tang, Yihong; Huang, Victor Shea-Jay; Liao, Yue

arXiv:2503.21254 (cs)

[Submitted on 27 Mar 2025]

Abstract: Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence with broad application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still at a preliminary stage, owing to the complex internal structure of music and the difficulty of modeling its dynamic relationships with video. Existing surveys focus on general music generation without a comprehensive discussion of vision-to-music. In this paper, we systematically review research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types (general videos, human movement videos, and images) and two output types (symbolic music and audio music). We then summarize the existing methodologies for vision-to-music generation from an architectural perspective. We also provide a detailed review of common datasets and evaluation metrics. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and in the broader field of multimodal generation, in both academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2503.21254 [cs.CV]
	(or arXiv:2503.21254v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.21254

Submission history

From: Zhaokai Wang
[v1] Thu, 27 Mar 2025 08:21:54 UTC (1,623 KB)
