Beyond Aesthetics: Cultural Competence in Text-to-Image Models

Kannen, Nithish; Ahmad, Arif; Andreetto, Marco; Prabhakaran, Vinodkumar; Prabhu, Utsav; Dieng, Adji Bousso; Bhattacharyya, Pushpak; Dave, Shachi

arXiv:2407.06863 (cs)

[Submitted on 9 Jul 2024 (v1), last revised 20 Jan 2025 (this version, v6)]

Abstract: Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of cultural competence. In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and large language models to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions and along 3 concepts: cuisine, landmarks, and art. CUBE consists of 1) CUBE-1K, a set of high-quality prompts that enable the evaluation of cultural awareness, and 2) CUBE-CSpace, a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity. We also introduce cultural diversity as a novel T2I evaluation component, leveraging quality-weighted Vendi score. Our evaluations reveal significant gaps in the cultural awareness of existing models across countries and provide valuable insights into the cultural diversity of T2I outputs for under-specified prompts. Our methodology is extendable to other cultural regions and concepts, and can facilitate the development of T2I models that better cater to the global population.
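
The abstract names the quality-weighted Vendi score as the basis of the diversity measure but does not spell it out. The following is a minimal sketch, assuming the commonly used definitions: the Vendi score as the exponential of the Shannon entropy of the eigenvalues of a similarity matrix divided by the sample count, and the quality-weighted variant as the mean per-sample quality multiplied by that score. The function names, the choice of kernel, and the quality values are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np


def vendi_score(K):
    """Vendi score of an n x n similarity matrix K (assumed PSD, K[i, i] == 1).

    Computed as exp of the Shannon entropy of the eigenvalues of K / n.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop (numerically) zero eigenvalues, using the convention 0 * log 0 = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def quality_weighted_vendi_score(K, quality):
    """Mean per-sample quality times the Vendi score (assumed definition)."""
    quality = np.asarray(quality, dtype=float)
    return float(quality.mean()) * vendi_score(K)


# Hypothetical inputs: four mutually dissimilar samples vs. four identical ones.
distinct = np.eye(4)
identical = np.ones((4, 4))
quality = [0.5, 0.5, 0.5, 0.5]  # illustrative quality scores in [0, 1]

score_distinct = vendi_score(distinct)
score_identical = vendi_score(identical)
score_weighted = quality_weighted_vendi_score(distinct, quality)
```

Under these assumptions the score behaves like an effective count of distinct items: it reaches n when all samples are mutually dissimilar and falls to 1 when they are all the same, and the quality weighting scales it down when the samples themselves are poor.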

Comments:	NeurIPS 2024 camera-ready version
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.06863 [cs.CV]
	(or arXiv:2407.06863v6 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.06863

Submission history

From: Nithish Kannen
[v1] Tue, 9 Jul 2024 13:50:43 UTC (30,258 KB)
[v2] Thu, 11 Jul 2024 17:57:37 UTC (30,258 KB)
[v3] Wed, 24 Jul 2024 18:09:48 UTC (30,258 KB)
[v4] Sun, 4 Aug 2024 08:28:25 UTC (30,259 KB)
[v5] Thu, 7 Nov 2024 20:26:21 UTC (30,259 KB)
[v6] Mon, 20 Jan 2025 11:02:52 UTC (30,326 KB)
