LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Xu, Ruyi; Yao, Yuan; Guo, Zonghao; Cui, Junbo; Ni, Zanlin; Ge, Chunjiang; Chua, Tat-Seng; Liu, Zhiyuan; Sun, Maosong; Huang, Gao

arXiv:2403.11703 (cs)

[Submitted on 18 Mar 2024]

Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times larger resolution (i.e., 672x1088) using only 94% inference computation, and achieves a 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at this https URL.
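The image modularization idea in component (1) can be illustrated with a small sketch. The abstract does not specify how slices are chosen, so everything below is an assumption made for illustration: the helper names `choose_grid` and `slice_boxes` are hypothetical, the slice count is estimated from the ratio of image area to the 336x336 encoder input, and the grid is picked by matching the image's aspect ratio in log space. It is not presented as the authors' implementation.

```python
import math


def choose_grid(width, height, base=336):
    """Pick a (cols, rows) slice grid for an image (hypothetical helper).

    Assumption: the ideal slice count is the image area divided by the
    encoder's native input area, and neighbouring counts are also tried
    so that a prime count does not force a degenerate 1 x N grid.
    """
    ideal = math.ceil((width * height) / (base * base))
    target = math.log(width / height)
    best = None
    for count in (ideal - 1, ideal, ideal + 1):
        if count < 1:
            continue
        for cols in range(1, count + 1):
            if count % cols:
                continue
            rows = count // cols
            # Distance between the image aspect ratio and the grid aspect ratio.
            score = abs(target - math.log(cols / rows))
            if best is None or score < best[0]:
                best = (score, cols, rows)
    return best[1], best[2]


def slice_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) pixel boxes covering the image."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


# The 672x1088 example from the abstract, read here as width x height.
# Its area is roughly 6.5 times that of a single 336x336 input.
cols, rows = choose_grid(672, 1088)
boxes = slice_boxes(672, 1088, cols, rows)
```

Under these assumptions the 672x1088 image is split into a 2x3 grid of slices, each close to the encoder's native 336x336 input, so no slice needs heavy padding or distortion before encoding.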

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.11703 [cs.CV]
	(or arXiv:2403.11703v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.11703

Submission history

From: Yuan Yao
[v1] Mon, 18 Mar 2024 12:04:11 UTC (2,221 KB)
