HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval

Chen, Feilong; Chen, Xiuyi; Shi, Jiaxin; Zhang, Duzhen; Chang, Jianlong; Tian, Qi

arXiv:2205.12105 (cs)

[Submitted on 24 May 2022 (v1), last revised 31 May 2022 (this version, v2)]

Abstract: In the past few years, vision-language pre-training (VLP) has brought cross-modal retrieval into a new era. However, because of its latency and computation demands, VLP is difficult to deploy in a real-time online retrieval system. To address this problem, this paper proposes Hierarchical Vision-Language Pre-Training (HiVLP) for fast image-text retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective that uses representations of different dimensions for coarse-to-fine ITR: a low-dimensional representation for large-scale coarse retrieval and a high-dimensional representation for small-scale fine retrieval. We evaluate HiVLP on two popular image-text retrieval benchmarks, Flickr30K and COCO. Extensive experiments demonstrate that HiVLP not only offers fast inference but also scales easily to large-scale ITR scenarios. In detail, across different candidate scenarios HiVLP is 1,427 to 120,649 times faster than the fusion-based model UNITER and 2 to 5 times faster than the fastest embedding-based model, LightningDOT. It also improves AR by about 4.9 on COCO and 3.8 on Flickr30K over LightningDOT, and achieves performance comparable to the state-of-the-art (SOTA) fusion-based model METER.
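
The coarse-to-fine idea in the abstract can be illustrated with a minimal two-stage search. This is only a sketch, not the authors' implementation: it assumes each candidate already has a precomputed low-dimensional and high-dimensional embedding, that similarity is a plain inner product, and the names `coarse_to_fine_search`, `shortlist_size`, and `top_k` are hypothetical.

```python
import heapq
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_search(
    query_low: Sequence[float],
    query_high: Sequence[float],
    cand_low: Sequence[Sequence[float]],
    cand_high: Sequence[Sequence[float]],
    shortlist_size: int,
    top_k: int,
) -> List[int]:
    """Return indices of the top_k candidates (hypothetical helper).

    Assumption: cand_low[i] and cand_high[i] describe the same candidate.
    """
    # Stage 1 (coarse): score every candidate with the cheap low-dim vectors.
    coarse_scores = [dot(query_low, c) for c in cand_low]
    shortlist = heapq.nlargest(
        shortlist_size, range(len(cand_low)), key=coarse_scores.__getitem__
    )
    # Stage 2 (fine): re-score only the shortlist with the high-dim vectors.
    ranked = sorted(
        shortlist, key=lambda i: dot(query_high, cand_high[i]), reverse=True
    )
    return ranked[:top_k]


# Toy data (made up for illustration): 4 candidates, 2-dim and 4-dim embeddings.
cand_low = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
cand_high = [
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.6, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
]
result = coarse_to_fine_search(
    [1.0, 0.0], [0.8, 0.0, 0.6, 0.0], cand_low, cand_high,
    shortlist_size=2, top_k=2,
)
print(result)  # [1, 0]
```

In this toy case the coarse stage prefers candidate 0, but the fine stage promotes candidate 1, which shows why the expensive high-dimensional comparison is reserved for a small shortlist.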

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2205.12105 [cs.CV] (or arXiv:2205.12105v2 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2205.12105

Submission history

From: Feilong Chen
[v1] Tue, 24 May 2022 14:32:57 UTC (18,181 KB)
[v2] Tue, 31 May 2022 08:14:53 UTC (18,181 KB)
