The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Lei, Weixian; Wang, Jiacong; Wang, Haochen; Li, Xiangtai; Liew, Jun Hao; Feng, Jiashi; Huang, Zilong

arXiv:2504.10462 (cs)

[Submitted on 14 Apr 2025]

Abstract: This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pre-trained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at this https URL.
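
The abstract names mix-attention but does not define the mask. As a rough illustration only, the sketch below assumes one common reading: image-patch tokens attend bidirectionally within their own image block, while every other query-key pair follows the usual causal rule. The function name `mixed_attention_mask` and the `is_image` flag list are hypothetical and are not taken from the SAIL code.

```python
def mixed_attention_mask(is_image):
    """Return mask[i][j] == True when query token i may attend to key token j.

    is_image: list of bools, True for image-patch tokens, False for text tokens.
    Assumption (not stated in the abstract): each contiguous run of image
    tokens forms one block with full attention inside it; all other pairs
    are causal (a token sees itself and earlier tokens only).
    """
    n = len(is_image)

    # Give each contiguous run of image tokens a block id; text tokens keep None.
    block = [None] * n
    current = -1
    for i, img in enumerate(is_image):
        if img:
            if i == 0 or not is_image[i - 1]:
                current += 1
            block[i] = current

    # Start from a causal (lower-triangular) mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up full attention between tokens of the same image block.
    for i in range(n):
        for j in range(n):
            if block[i] is not None and block[i] == block[j]:
                mask[i][j] = True
    return mask


# Example sequence: one text token, three image patches, one text token.
example = mixed_attention_mask([False, True, True, True, False])
```

In this example the first image patch can attend to the last patch of the same image, but not to the text token that follows it, and the final text token can attend to the whole sequence.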

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.10462 [cs.CV]
	(or arXiv:2504.10462v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.10462

Submission history

From: Weixian Lei
[v1] Mon, 14 Apr 2025 17:50:20 UTC (2,306 KB)
