InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Zhang, Pan; Dong, Xiaoyi; Zang, Yuhang; Cao, Yuhang; Qian, Rui; Chen, Lin; Guo, Qipeng; Duan, Haodong; Wang, Bin; Ouyang, Linke; Zhang, Songyang; Zhang, Wenwei; Li, Yining; Gao, Yang; Sun, Peng; Zhang, Xinyue; Li, Wei; Li, Jingwen; Wang, Wenhai; Yan, Hang; He, Conghui; Zhang, Xingcheng; Chen, Kai; Dai, Jifeng; Qiao, Yu; Lin, Dahua; Wang, Jiaqi

arXiv:2407.03320 (cs)

[Submitted on 3 Jul 2024]

Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at this https URL.
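
The abstract does not say how the RoPE extrapolation from 24K to 96K is carried out. As a minimal sketch only, assuming plain rotary position embeddings with the commonly used base of 10000 (the function names and the base are illustrative assumptions, not taken from the IXC-2.5 code), the following shows the property that makes evaluating RoPE at unseen positions plausible: the query-key score depends only on the relative offset between positions, not on their absolute values.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    Hypothetical helper: standard RoPE with an assumed base of 10000.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base^(-i/dim); the angle grows linearly with position.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def attention_score(query, key, q_pos, k_pos):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    q_rot = rope_rotate(query, q_pos)
    k_rot = rope_rotate(key, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    # Same offset of 10 tokens, once well inside 24K and once near 96K.
    print(attention_score(q, k, 100, 90))
    print(attention_score(q, k, 96000, 95990))
```

Both printed scores agree up to floating-point error because each rotation pair contributes a term that depends only on the position difference. Whether a trained model handles the larger relative offsets that appear in a 96K context is an empirical question that this sketch does not address.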

Comments:	Technical Report. this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2407.03320 [cs.CV]
	(or arXiv:2407.03320v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.03320

Submission history

From: Jiaqi Wang
[v1] Wed, 3 Jul 2024 17:59:21 UTC (6,106 KB)
