Z-GMOT: Zero-shot Generic Multiple Object Tracking

Tran, Kim Hoang; Nguyen, Tien-Phat; Dinh, Anh Duy Le; Nguyen, Pha; Phan, Thinh; Luu, Khoa; Adjeroh, Donald; Le, Ngan Hoang

arXiv:2305.17648v1 (cs)

[Submitted on 28 May 2023 (this version), latest version 13 Jun 2024 (v4)]

Abstract: Despite significant progress in recent years, Multi-Object Tracking (MOT) approaches still suffer from several limitations, including their reliance on prior knowledge of tracking targets, which necessitates the costly annotation of large labeled datasets. As a result, existing MOT methods are limited to a small set of predefined categories and struggle with unseen objects in the real world. To address these issues, Generic Multiple Object Tracking (GMOT) has been proposed, which requires less prior information about the targets. However, all existing GMOT approaches follow a one-shot paradigm, relying mainly on the initial bounding box, and thus struggle to handle variations in viewpoint, lighting, occlusion, scale, and similar factors. In this paper, we introduce a novel approach to address the limitations of existing MOT and GMOT methods. Specifically, we propose a zero-shot GMOT (Z-GMOT) algorithm that can track never-seen object categories with zero training examples, without the need for predefined categories or an initial bounding box. To achieve this, we propose iGLIP, an improved version of Grounded Language-Image Pre-training (GLIP), which can detect unseen objects while minimizing false positives. We evaluate Z-GMOT thoroughly on the GMOT-40 dataset and on the AnimalTrack and DanceTrack test sets. The results demonstrate a significant improvement over existing methods. For instance, on the GMOT-40 dataset, Z-GMOT outperforms one-shot GMOT with OC-SORT by 27.79 HOTA points and 44.37 MOTA points. On the AnimalTrack dataset, it surpasses fully supervised methods with DeepSORT by 12.55 HOTA points and 8.97 MOTA points. To facilitate further research, we will make our code and models publicly available upon acceptance of this paper.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.17648 [cs.CV]
	(or arXiv:2305.17648v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.17648

Submission history

From: Kim Tran
[v1] Sun, 28 May 2023 06:44:33 UTC (23,668 KB)
[v2] Mon, 21 Aug 2023 18:13:41 UTC (23,668 KB)
[v3] Mon, 15 Apr 2024 09:31:17 UTC (24,543 KB)
[v4] Thu, 13 Jun 2024 14:58:23 UTC (24,543 KB)
