Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks

Tan, Hao-Zhe; Zhou, Zhi; Guo, Lan-Zhe; Li, Yu-Feng

Abstract: Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-source VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging, since no single VLM achieves promising performance on all downstream tasks and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm for selecting and reusing VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: model labeling, which assigns labels to each VLM to describe its specialty and utility; model selection, which matches the requirements of the target task with the model labels; and model reuse, which applies the selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task and its capability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, comprising 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.
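
The three modules named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the label format (a per-concept score for each model), the mean-score selection rule, the uniform averaging ensemble, and every name below (`MODEL_LABELS`, `select_models`, `ensemble_predict`, the `vlm_*` models and their numbers) are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical "model labels": a score per semantic concept for each VLM,
# computed once and independently of any target task (made-up numbers).
MODEL_LABELS = {
    "vlm_a": {"cat": 0.90, "dog": 0.60, "car": 0.40},
    "vlm_b": {"cat": 0.50, "dog": 0.85, "car": 0.70},
    "vlm_c": {"cat": 0.70, "dog": 0.65, "car": 0.95},
}


def select_models(task_classes, labels, k=2):
    """Model selection (assumed rule): keep the k models with the highest
    mean label score over the classes the target task cares about."""
    mean_score = {
        name: sum(scores.get(c, 0.0) for c in task_classes) / len(task_classes)
        for name, scores in labels.items()
    }
    return sorted(mean_score, key=mean_score.get, reverse=True)[:k]


def ensemble_predict(per_model_probs, selected):
    """Model reuse (assumed rule): average the class probabilities of the
    selected models and return the class with the highest average."""
    totals = defaultdict(float)
    for name in selected:
        for cls, prob in per_model_probs[name].items():
            totals[cls] += prob / len(selected)
    return max(totals, key=totals.get)


# Usage: a target task with classes "dog" and "car".
chosen = select_models(["dog", "car"], MODEL_LABELS, k=2)
probs = {
    "vlm_a": {"dog": 0.9, "car": 0.1},
    "vlm_b": {"dog": 0.4, "car": 0.6},
    "vlm_c": {"dog": 0.3, "car": 0.7},
}
prediction = ensemble_predict(probs, chosen)
```

Because the labels are built before any target task is known, adding a new candidate VLM in this sketch only means adding one more entry to `MODEL_LABELS`; selection and reuse are unchanged.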

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.18271 [cs.LG]
	(or arXiv:2501.18271v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.18271


