CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Sain, Aneeshan; Bhunia, Ayan Kumar; Chowdhury, Pinaki Nath; Koley, Subhadeep; Xiang, Tao; Song, Yi-Zhe

arXiv:2303.13440 (cs)

[Submitted on 23 Mar 2023 (v1), last revised 28 Mar 2023 (this version, v3)]

Abstract: In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that merely by factoring in sketch-specific prompts, we already obtain a category-level ZS-SBIR system that outperforms all prior art by a large margin (24.8%), a strong testament to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: this https URL
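The abstract names two building blocks without giving details: a standalone triplet loss and a patch shuffling step applied to sketch-photo pairs. The following is a minimal illustrative sketch of what those two ideas could look like, not the authors' implementation. Everything concrete in it is an assumption: the function names (`split_into_patches`, `assemble`, `shuffle_patches`, `triplet_loss`), the square n x n patch grid, the margin value of 0.2, and the choice to apply one shared permutation to both the sketch and its paired photo. The additional regularisation loss mentioned in the abstract is not shown, since the abstract does not specify its form.

```python
import random


def split_into_patches(image, n):
    """Split a square image (list of rows) into an n x n grid of patches, row-major."""
    size = len(image)
    if size % n != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square with side divisible by n")
    p = size // n  # side length of one patch
    patches = []
    for gy in range(n):
        for gx in range(n):
            rows = image[gy * p:(gy + 1) * p]
            patches.append([row[gx * p:(gx + 1) * p] for row in rows])
    return patches


def assemble(patches, n):
    """Inverse of split_into_patches: tile row-major patches back into one image."""
    p = len(patches[0])
    image = []
    for gy in range(n):
        for r in range(p):
            row = []
            for gx in range(n):
                row.extend(patches[gy * n + gx][r])
            image.append(row)
    return image


def shuffle_patches(image, n, perm):
    """Rearrange the patch grid so that output cell i holds input patch perm[i]."""
    patches = split_into_patches(image, n)
    return assemble([patches[i] for i in perm], n)


def triplet_loss(d_pos, d_neg, margin=0.2):
    """Standard hinge-style triplet loss on two distances (margin is an assumed value)."""
    return max(0.0, margin + d_pos - d_neg)


# Assumed usage: the same permutation is applied to a sketch and its paired photo,
# so the two shuffled images keep patch-level correspondence with each other.
n = 2
perm = list(range(n * n))
random.Random(0).shuffle(perm)
sketch = [[r * 4 + c for c in range(4)] for r in range(4)]
photo = [[100 + r * 4 + c for c in range(4)] for r in range(4)]
shuffled_sketch = shuffle_patches(sketch, n, perm)
shuffled_photo = shuffle_patches(photo, n, perm)
```

In a real pipeline the images would be tensors and the distances would come from CLIP embeddings; plain lists and floats are used here only to keep the sketch dependency-free.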

Comments:	Accepted in CVPR 2023. Project page available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.13440 [cs.CV]
	(or arXiv:2303.13440v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.13440

Submission history

From: Aneeshan Sain
[v1] Thu, 23 Mar 2023 17:02:00 UTC (25,832 KB)
[v2] Fri, 24 Mar 2023 03:05:23 UTC (9,200 KB)
[v3] Tue, 28 Mar 2023 02:40:58 UTC (9,200 KB)
