SmolVLM: Redefining small and efficient multimodal models

Marafioti, Andrés; Zohar, Orr; Farré, Miquel; Noyan, Merve; Bakouch, Elie; Cuenca, Pedro; Zakka, Cyril; Allal, Loubna Ben; Lozhkov, Anton; Tazi, Nouamane; Srivastav, Vaibhav; Lochner, Joshua; Larcher, Hugo; Morlon, Mathieu; Tunstall, Lewis; von Werra, Leandro; Wolf, Thomas

arXiv:2504.05299 (cs)

[Submitted on 7 Apr 2025]

Abstract: Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications.
We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.
Our smallest model, SmolVLM-256M, uses less than 1 GB of GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs that consume twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities.
Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2504.05299 [cs.AI]
(or arXiv:2504.05299v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2504.05299

Submission history

From: Orr Zohar
[v1] Mon, 7 Apr 2025 17:58:57 UTC (4,802 KB)
