LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Azadani, Mozhgan Nasr; Riddell, James; Sedwards, Sean; Czarnecki, Krzysztof

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.06986 (cs)

[Submitted on 13 Jan 2025]

Abstract: Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
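
The per-tile interleaving mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interleave_tile_tokens`, the use of plain Python lists as stand-ins for token embeddings, and the tile-level ordering (all tokens of a tile from the first encoder, then all tokens of the same tile from the second) are assumptions made here for illustration. The sketch also assumes each branch has already been projected into the language model's embedding space, which is one reading of "post-adaptation fusion".

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")  # stand-in for a token embedding (hypothetical)


def interleave_tile_tokens(
    tokens_a: Sequence[Sequence[T]],
    tokens_b: Sequence[Sequence[T]],
) -> List[T]:
    """Fuse two encoder branches tile by tile.

    tokens_a[i] and tokens_b[i] hold the (already projected) visual tokens
    that encoder A and encoder B produced for tile i. The output sequence is
    [A tokens of tile 0, B tokens of tile 0, A tokens of tile 1, ...].
    This ordering is an assumption, not taken from the paper.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must cover the same number of tiles")

    fused: List[T] = []
    for tile_a, tile_b in zip(tokens_a, tokens_b):
        fused.extend(tile_a)  # tokens from the first encoder for this tile
        fused.extend(tile_b)  # tokens from the second encoder for this tile
    return fused


# Two tiles; string labels stand in for embedding vectors.
branch_a = [["a0", "a1"], ["a2", "a3"]]
branch_b = [["b0"], ["b1"]]
fused_sequence = interleave_tile_tokens(branch_a, branch_b)
```

Under these assumptions the two branches may contribute different numbers of tokens per tile, and the fused sequence keeps the tokens of each tile adjacent to one another.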

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2501.06986 [cs.CV]
	(or arXiv:2501.06986v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.06986

Submission history

From: Mozhgan Nasr Azadani
[v1] Mon, 13 Jan 2025 00:29:55 UTC (10,684 KB)
