Language Models as Black-Box Optimizers for Vision-Language Models

Liu, Shihong; Lin, Zhiqiu; Yu, Samuel; Lee, Ryan; Ling, Tiffany; Pathak, Deepak; Ramanan, Deva

arXiv:2309.05950 (cs)

[Submitted on 12 Sep 2023 (v1), last revised 14 May 2024 (this version, v5)]

Abstract: Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize the state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
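
The hill-climbing procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. All names here (hill_climb_prompts, toy_score, make_toy_proposer, TARGET_WORDS, VOCAB) are hypothetical. The toy scorer stands in for measuring a VLM's few-shot accuracy with a given prompt, and the toy proposer stands in for a chat-based LLM that is shown the best and worst prompts so far and asked for a better one.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (prompt template, score)


def hill_climb_prompts(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[Scored], List[Scored]], str],
    n_iters: int = 10,
    k: int = 2,
) -> Scored:
    """Keep a pool of scored prompts. Each round, pass the top-k (positive)
    and bottom-k (negative) prompts to the proposer as feedback, then score
    its new candidate and add it to the pool."""
    pool = [(p, score_fn(p)) for p in initial_prompts]
    for _ in range(n_iters):
        ranked = sorted(pool, key=lambda ps: ps[1], reverse=True)
        positives, negatives = ranked[:k], ranked[-k:]
        candidate = propose_fn(positives, negatives)
        # Skip candidates that are already in the pool.
        if all(candidate != p for p, _ in pool):
            pool.append((candidate, score_fn(candidate)))
    return max(pool, key=lambda ps: ps[1])


# --- Toy stand-ins (assumptions, for illustration only) ---
TARGET_WORDS = {"photo", "clear", "centered"}
VOCAB = ["photo", "blurry", "clear", "sketch", "centered"]


def toy_score(prompt: str) -> float:
    """Stand-in for VLM accuracy: fraction of target words in the prompt."""
    words = set(prompt.split())
    return len(words & TARGET_WORDS) / len(TARGET_WORDS)


def make_toy_proposer(seed: int = 0):
    """Stand-in for a chat LLM: extends the best prompt with a word that
    does not appear only in the negative prompts."""
    rng = random.Random(seed)

    def propose(positives: List[Scored], negatives: List[Scored]) -> str:
        best = positives[0][0]
        pos_words = {w for p, _ in positives for w in p.split()}
        avoid = {w for p, _ in negatives for w in p.split()} - pos_words
        choices = [w for w in VOCAB if w not in avoid and w not in best.split()]
        if not choices:
            return best  # nothing new to try
        return f"{rng.choice(choices)} {best}"

    return propose


initial = ["a photo of a {}", "a blurry sketch of a {}"]
best_prompt, best_score = hill_climb_prompts(
    initial, toy_score, make_toy_proposer(seed=0), n_iters=10, k=2
)
print(best_prompt, best_score)
```

Because the pool never discards prompts, the returned score cannot fall below the best initial score. In a real setting, score_fn would query the black-box VLM on a small labeled set and propose_fn would wrap a chat LLM call, both of which are outside the scope of this sketch.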

Comments:	Published at CVPR 2024. Project site: this https URL
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2309.05950 [cs.CL]
	(or arXiv:2309.05950v5 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.05950

Submission history

From: Zhiqiu Lin
[v1] Tue, 12 Sep 2023 04:03:41 UTC (313 KB)
[v2] Mon, 25 Sep 2023 04:35:32 UTC (1,295 KB)
[v3] Thu, 30 Nov 2023 10:35:40 UTC (78,254 KB)
[v4] Tue, 7 May 2024 07:47:38 UTC (36,463 KB)
[v5] Tue, 14 May 2024 03:20:42 UTC (36,816 KB)
