The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

Bai, Longju; Borah, Angana; Ignat, Oana; Mihalcea, Rada

arXiv:2411.11758 (cs)

[Submitted on 18 Nov 2024]

Abstract: Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2411.11758 [cs.CV]
	(or arXiv:2411.11758v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.11758

Submission history

From: Oana Ignat
[v1] Mon, 18 Nov 2024 17:37:10 UTC (19,016 KB)
