Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis

Demir, Kubilay Can; Rodriguez, Belen Lojo; Weise, Tobias; Maier, Andreas; Yang, Seung Hee

arXiv:2406.14576 (eess)

[Submitted on 17 Jun 2024]

Abstract: To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations. Our method merges speech and image models and uses them separately in different surgical phases. Based on the evaluation of 28 operations, we report a frame-wise accuracy of 92.65 $\pm$ 3.52% and an F1-score of 92.30 $\pm$ 3.82%. Our results show approximately 10% improvement in both metrics over previous work and validate the effectiveness of integrating multimodal data for the surgical phase recognition task. We further investigate the contribution of individual data channels by comparing mono-modal models with multimodal models.
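To make the fusion idea in the abstract concrete, the following is a minimal sketch of a bimodal Gated Multimodal Unit in its commonly published form: each modality is projected through a tanh layer, and a sigmoid gate computed from the concatenated inputs decides, per hidden unit, how much each modality contributes. This is not the authors' implementation. The function name `gmu_fuse`, the feature sizes, and the random weights are assumptions chosen for illustration, and the sketch omits the MS-TCN stage entirely.

```python
import math
import random


def _matvec(weights, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def _sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gmu_fuse(x_speech, x_image, w_speech, w_image, w_gate):
    """Fuse two feature vectors with a bimodal GMU (hypothetical helper).

    h_s = tanh(W_s x_s),  h_i = tanh(W_i x_i)
    z   = sigmoid(W_z [x_s; x_i])
    h   = z * h_s + (1 - z) * h_i
    """
    h_speech = [math.tanh(v) for v in _matvec(w_speech, x_speech)]
    h_image = [math.tanh(v) for v in _matvec(w_image, x_image)]
    # The gate sees both raw inputs, concatenated into one vector.
    gate = [_sigmoid(v) for v in _matvec(w_gate, x_speech + x_image)]
    fused = [z * hs + (1.0 - z) * hi
             for z, hs, hi in zip(gate, h_speech, h_image)]
    return fused, gate


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
speech_dim, image_dim, hidden_dim = 4, 6, 3  # illustrative sizes only
x_speech = [rng.uniform(-1, 1) for _ in range(speech_dim)]
x_image = [rng.uniform(-1, 1) for _ in range(image_dim)]
fused, gate = gmu_fuse(
    x_speech,
    x_image,
    random_matrix(hidden_dim, speech_dim, rng),
    random_matrix(hidden_dim, image_dim, rng),
    random_matrix(hidden_dim, speech_dim + image_dim, rng),
)
```

A gate value near 1 means the fused feature leans on the speech branch for that hidden unit, and a value near 0 means it leans on the image branch. In a trained model this lets the weighting shift from frame to frame, which is one plausible reading of how a model could rely on different modalities in different surgical phases.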

Comments:	5 Pages, Interspeech 2024
Subjects:	Audio and Speech Processing (eess.AS)
MSC classes:	00b20
Cite as:	arXiv:2406.14576 [eess.AS]
	(or arXiv:2406.14576v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2406.14576

Submission history

From: Kubilay Can Demir
[v1] Mon, 17 Jun 2024 12:47:04 UTC (9,489 KB)
