Evaluating Compositional Scene Understanding in Multimodal Generative Models

Fu, Shuhao; Lee, Andrew Jun; Wang, Anna; Momennejad, Ida; Bihl, Trevor; Lu, Hongjing; Webb, Taylor W.

arXiv:2503.23125 (cs)

[Submitted on 29 Mar 2025]

Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.23125 [cs.CV]
	(or arXiv:2503.23125v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.23125

Submission history

From: Taylor Webb [view email]
[v1] Sat, 29 Mar 2025 15:34:43 UTC (5,374 KB)
