STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans

Verma, Shashikant; Katti, Harish; Debnath, Soumyaratna; Swamy, Yamuna; Raman, Shanmuganathan

arXiv:2503.13344v1 (cs)

[Submitted on 17 Mar 2025 (this version), latest version 20 Mar 2025 (v2)]

Abstract: We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state initialized through a pre-trained detector or manual initialization in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn't rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.
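
The abstract names Gaussian maps and offset maps but does not describe how the GMSP and OMRA modules work internally. As a rough illustration only, the sketch below shows the generic pose-estimation pattern those names suggest: a coarse Gaussian heatmap locates a keypoint on a low-resolution grid, and a per-cell offset recovers its position in input pixels. It is not the authors' implementation; the function names, grid size, stride, and sigma are all assumptions made for this example.

```python
import numpy as np

def gaussian_map(height, width, center_xy, sigma=1.5):
    """Render a 2D Gaussian peaked at center_xy = (x, y) on a height x width grid."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_keypoint(heatmap, offset_map, stride=4):
    """Take the heatmap peak and refine it with the offset stored at that cell.

    offset_map has shape (2, H, W) and holds (dx, dy) in input pixels.
    """
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dx, dy = offset_map[:, row, col]
    return col * stride + dx, row * stride + dy

# Hypothetical setup: a keypoint at (x=41.0, y=22.5) in a 64x64 input,
# represented on a 16x16 grid (stride 4). All values are illustrative.
stride = 4
true_xy = np.array([41.0, 22.5])
cell_xy = np.floor(true_xy / stride)  # coarse grid cell containing the keypoint

heatmap = gaussian_map(16, 16, cell_xy)
offsets = np.zeros((2, 16, 16))
offsets[:, int(cell_xy[1]), int(cell_xy[0])] = true_xy - cell_xy * stride

x, y = decode_keypoint(heatmap, offsets, stride)
```

In a learned model both maps would be network outputs rather than constructed from a known keypoint; here they are built by hand so that the decoding step can be checked in isolation.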

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.13344 [cs.CV]
	(or arXiv:2503.13344v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.13344

Submission history

From: Shashikant Verma [view email]
[v1] Mon, 17 Mar 2025 16:22:00 UTC (36,804 KB)
[v2] Thu, 20 Mar 2025 10:11:27 UTC (36,805 KB)
