Slot-BERT: Self-supervised Object Discovery in Surgical Video

Liao, Guiqiu; Jogan, Matjaz; Hussing, Marcel; Nakahashi, Kenta; Yasufuku, Kazuhiro; Madani, Amin; Eaton, Eric; Hashimoto, Daniel A.

arXiv:2501.12477 (eess)

[Submitted on 21 Jan 2025 (v1), last revised 27 Jan 2025 (this version, v2)]

Abstract: Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle to maintain the long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training, achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
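
The abstract says the slot contrastive loss improves disentanglement "by enhancing slot orthogonality" but does not give its form. As a rough illustration of what an orthogonality-encouraging term can look like, the sketch below computes the mean squared cosine similarity over all pairs of slot vectors. This is an assumed, generic penalty and not the loss defined in the paper; the function name `slot_orthogonality_penalty` and the example vectors are hypothetical.

```python
import math


def slot_orthogonality_penalty(slots):
    """Mean squared cosine similarity over all distinct slot pairs.

    Illustrative stand-in only (not the Slot-BERT loss): the value is 0.0
    when all slots are mutually orthogonal and approaches 1.0 as the
    slots become parallel, i.e. redundant.
    `slots` is a list of at least two non-zero vectors of equal length.
    """
    if len(slots) < 2:
        raise ValueError("need at least two slots")

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("slot vectors must be non-zero")
        return [x / norm for x in v]

    unit = [normalize(v) for v in slots]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos  # squared so anti-parallel slots are penalized too
            pairs += 1
    return total / pairs


# Hypothetical slot vectors for one frame.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5]]
redundant = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

print(slot_orthogonality_penalty(orthogonal))
print(slot_orthogonality_penalty(redundant))
```

In this toy setting the first call yields zero because no two slots overlap, while the second is about one third because one of the three pairs is parallel. A training objective would add a weighted term of this kind to the reconstruction loss; the weighting and the exact contrastive formulation are left to the paper.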

Subjects:	Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.12477 [eess.IV]
	(or arXiv:2501.12477v2 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2501.12477

Submission history

From: Guiqiu Liao
[v1] Tue, 21 Jan 2025 19:59:22 UTC (4,328 KB)
[v2] Mon, 27 Jan 2025 19:53:35 UTC (4,328 KB)
