ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

Wu, Yuanchen; Du, Junlong; Yan, Ke; Ding, Shouhong; Li, Xiaoqiang

arXiv:2504.00691 (cs)

[Submitted on 1 Apr 2025]

Abstract: Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows low-contributing experts to be detached to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.
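
To make the routing idea concrete, here is a minimal sketch of one plausible reading of token-aware gating with a residual transfer. It is not the authors' implementation. The linear-plus-softmax gate, the function names (`gate_weights`, `residual_transfer`), and the requirement that expert tokens are already aligned with the CLIP tokens in count and dimension are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(token, gate_matrix, active):
    """Per-token routing weights over the active experts only.

    Assumption: the gate is a single linear map (one row of gate_matrix
    per expert) followed by a softmax.
    """
    scores = [sum(t * w for t, w in zip(token, gate_matrix[k])) for k in active]
    return dict(zip(active, softmax(scores)))

def residual_transfer(clip_tokens, expert_tokens, gate_matrix, active=None):
    """Add a gated mixture of expert tokens to the frozen CLIP tokens.

    clip_tokens:   N tokens, each a list of D floats
    expert_tokens: K experts, each holding N tokens of D floats
                   (assumed already projected to the CLIP token space)
    active:        indices of experts kept at inference; dropping an index
                   models detaching a low-contributing expert
    """
    if active is None:
        active = list(range(len(expert_tokens)))
    if not active:
        # No experts left: the residual path vanishes and the
        # original CLIP tokens pass through unchanged.
        return [list(tok) for tok in clip_tokens]
    fused = []
    for i, tok in enumerate(clip_tokens):
        weights = gate_weights(tok, gate_matrix, active)
        mixed = [
            sum(weights[k] * expert_tokens[k][i][d] for k in active)
            for d in range(len(tok))
        ]
        # Residual form: the CLIP token is kept and expert knowledge is added.
        fused.append([c + m for c, m in zip(tok, mixed)])
    return fused
```

Because expert knowledge enters only as an additive term in this sketch, removing every expert returns the CLIP tokens untouched, which is one way to picture how a residual design could keep the base tokens usable while experts are detached.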

Comments:	Accepted to ICLR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.00691 [cs.CV]
	(or arXiv:2504.00691v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.00691

Submission history

From: Yuanchen Wu
[v1] Tue, 1 Apr 2025 12:02:40 UTC (4,514 KB)
