LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Wang, Jingyi; Ju, Jianzhong; Luan, Jian; Deng, Zhidong

arXiv:2408.16224 (cs)

[Submitted on 29 Aug 2024 (v1), last revised 30 Aug 2024 (this version, v2)]

Abstract: Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.16224 [cs.CV]
	(or arXiv:2408.16224v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.16224

Submission history

From: Jingyi Wang
[v1] Thu, 29 Aug 2024 02:43:20 UTC (1,290 KB)
[v2] Fri, 30 Aug 2024 02:49:40 UTC (1,291 KB)
