Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

Yang, Shuyu; Zhou, Yinan; Wang, Yaxiong; Wu, Yujiao; Zhu, Li; Zheng, Zhedong

arXiv:2306.02898v3 (cs)

[Submitted on 5 Jun 2023 (v1), revised 11 Aug 2023 (this version, v3), latest version 14 Aug 2023 (v4)]

Abstract: In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of pre-training on both attribute recognition and image-text matching tasks at the same time. In particular, MALS contains 1,510,330 image-text pairs, which is about 37.5 times larger than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) The attribute prompt learning stream uses attribute prompts for image-attribute alignment, which enhances text matching learning. (2) The text matching learning stream facilitates representation learning on fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS, achieving state-of-the-art retrieval performance via APTM on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.96%, +7.68%, and +16.95% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
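The abstract reports results as Recall@1. As a rough illustration of that metric only, the sketch below computes Recall@K for text-to-image person retrieval from a query-by-gallery similarity matrix. It is a minimal sketch, not the authors' evaluation code: the function name `recall_at_k`, the toy similarity scores, and the identity labels are all hypothetical, and it assumes a query counts as a hit when any of its top-K ranked images shares the query's person identity.

```python
def recall_at_k(similarity, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k ranked images contain
    at least one image with the same person identity.

    similarity[i][j] is an assumed similarity score between text
    query i and gallery image j (higher means more similar).
    """
    hits = 0
    for row, qid in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top_k_ids = [gallery_ids[j] for j in ranked[:k]]
        if qid in top_k_ids:
            hits += 1
    return hits / len(query_ids)


# Toy data (hypothetical): 3 text queries, 3 gallery images.
similarity = [
    [0.9, 0.1, 0.3],  # query 0 ranks image 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks image 2 first (wrong), image 1 second
    [0.5, 0.6, 0.1],  # query 2 ranks images 1 and 0 above its match
]
query_ids = [0, 1, 2]    # person identity described by each query
gallery_ids = [0, 1, 2]  # person identity shown in each gallery image

r1 = recall_at_k(similarity, query_ids, gallery_ids, k=1)
r2 = recall_at_k(similarity, query_ids, gallery_ids, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

On this toy data, one of three queries is matched at rank 1 and two of three within the top 2. A gain such as the reported +6.96% would correspond to a difference between two such Recall@1 values, expressed in percentage points.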

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2306.02898 [cs.CV]
	(or arXiv:2306.02898v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.02898

Submission history

From: Shuyu Yang
[v1] Mon, 5 Jun 2023 14:06:24 UTC (5,531 KB)
[v2] Tue, 6 Jun 2023 06:42:56 UTC (5,178 KB)
[v3] Fri, 11 Aug 2023 11:13:08 UTC (5,177 KB)
[v4] Mon, 14 Aug 2023 07:37:27 UTC (5,177 KB)
