HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

Yuan, Kun; Srivastav, Vinkle; Navab, Nassir; Padoy, Nicolas

Computer Science > Computer Vision and Pattern Recognition

arXiv:2405.10075 (cs)

[Submitted on 16 May 2024]

Abstract: Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
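The abstract does not give the loss, the architecture, or the prompts, so the following is only a minimal sketch of the general idea under stated assumptions: a symmetric InfoNCE-style contrastive loss applied per hierarchy level (clip, phase, video), each with its own pair of projection matrices standing in for the "separate embedding spaces", followed by zero-shot phase prediction by cosine similarity against text-prompt embeddings. All function names, the use of linear projection heads, and the temperature value are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of video_emb is the positive for row i of text_emb."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

def hierarchical_loss(video_feats, text_feats, heads, temperature=0.07):
    """Sum the contrastive loss over levels.

    Assumption: each level has its own (video, text) projection matrices on top
    of shared features, so each level is trained in a separate embedding space.
    """
    total = 0.0
    for level, (w_video, w_text) in heads.items():
        total += info_nce(video_feats[level] @ w_video,
                          text_feats[level] @ w_text,
                          temperature)
    return total

def zero_shot_phase(frame_emb, phase_prompt_emb):
    """Assign each frame the phase whose text-prompt embedding is most similar."""
    sims = l2_normalize(frame_emb) @ l2_normalize(phase_prompt_emb).T
    return sims.argmax(axis=1)
```

In this sketch, zero-shot recognition needs no labeled frames: the phase names are written as text prompts, embedded once, and compared with each frame embedding.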

Comments:	Accepted by MICCAI2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2405.10075 [cs.CV]
	(or arXiv:2405.10075v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.10075

Submission history

From: Kun Yuan
[v1] Thu, 16 May 2024 13:14:43 UTC (902 KB)
