CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries

Liu, Shudong; Jin, Yiqiao; Li, Cheng; Wong, Derek F.; Wen, Qingsong; Sun, Lichao; Chen, Haipeng; Xie, Xing; Wang, Jindong

Computer Science > Artificial Intelligence

arXiv:2501.01282 (cs)

[Submitted on 2 Jan 2025]

Abstract: Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. We then propose CultureVLM, a series of VLMs fine-tuned on our dataset that achieve significant performance improvements in cultural understanding. Our evaluation of 16 models reveals significant disparities, with stronger performance on Western concepts and weaker results in African and Asian contexts. Fine-tuning on CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work can lay the foundation for more equitable and culturally aware multimodal AI systems.

Comments:	Technical report; 26 pages
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.01282 [cs.AI]
	(or arXiv:2501.01282v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2501.01282

Submission history

From: Shudong Liu
[v1] Thu, 2 Jan 2025 14:42:37 UTC (6,545 KB)
