GalleryGPT: Analyzing Paintings with Large Multimodal Models

Bin, Yi; Shi, Wenhao; Ding, Yujuan; Hu, Zhiqiang; Wang, Zheng; Yang, Yang; Ng, See-Kiong; Shen, Heng Tao

doi:10.1145/3664647.3681656

Abstract: Artwork analysis is an important and fundamental skill for art appreciation, one that can enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging because of their subjective nature, diverse interpretations, and complex visual elements, and it requires expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model ability, previous work on automatic artwork analysis has mainly focused on classification, retrieval, and other simple tasks, which is far from the goal of AI. To facilitate research progress, in this paper we go a step further and compose comprehensive analyses, inspired by the remarkable perception and generation abilities of large multimodal models. Specifically, we first propose the task of composing paragraph-level analysis for artworks (paintings, in this paper), focusing only on visual characteristics to form a more comprehensive understanding of artworks. To support research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a large multimodal model for composing painting analyses, dubbed GalleryGPT, which is slightly modified from the LLaVA architecture and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its strong ability in art analysis and generalization. The code and model are available at: this https URL.

Comments:	Accepted as Oral Presentation at ACM Multimedia 2024
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2408.00491 [cs.CL]
	(or arXiv:2408.00491v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.00491
Related DOI:	https://doi.org/10.1145/3664647.3681656

