Multimodal Machine Learning Can Predict Videoconference Fluidity and Enjoyment

Chang, Andrew; Akkaraju, Viswadruth; Cogliano, Ray McFadden; Poeppel, David; Freeman, Dustin

arXiv:2501.03190 (cs)

[Submitted on 6 Jan 2025 (v1), last revised 7 Jan 2025 (this version, v2)]

Abstract: Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus and extracted audio embeddings, facial actions, and body motion features to train models that identify low conversational fluidity and low enjoyment and that classify conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. It also contributes to research on videoconferencing user experience by showing that multimodal machine learning can identify rare moments of negative user experience for further study or mitigation.
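The evaluation described in the abstract, ROC-AUC measured on hold-out sessions, can be illustrated with a minimal sketch. This is not the authors' pipeline: the clip records, the session IDs, the scores, and the helper names `roc_auc` and `split_by_session` are all hypothetical, and the data are synthetic. The sketch only shows the two ideas involved: every clip from a held-out session is kept out of training, and ROC-AUC is the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative one.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def split_by_session(clips, held_out_sessions):
    """Put every clip from a held-out session in the test set, all others in train."""
    train = [c for c in clips if c["session"] not in held_out_sessions]
    test = [c for c in clips if c["session"] in held_out_sessions]
    return train, test


# Synthetic stand-in data (assumption): 200 clips spread over 5 sessions,
# with a binary "negative experience" label and a noisy model score.
rng = random.Random(0)
clips = []
for i in range(200):
    label = 1 if i % 4 == 0 else 0
    clips.append({
        "session": f"S{i % 5}",
        "label": label,
        # Positives score higher on average, so AUC should exceed chance.
        "score": rng.gauss(1.0 if label else 0.0, 1.0),
    })

# Hold out one whole session so no test clip shares a session with training clips.
train, test = split_by_session(clips, {"S4"})
auc = roc_auc([c["label"] for c in test], [c["score"] for c in test])
print(f"hold-out session ROC-AUC: {auc:.2f}")
```

Splitting by session rather than by clip matters because clips from the same conversation share speakers and recording conditions; a clip-level split would let that shared context leak into the test set and inflate the score.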

Comments:	ICASSP 2025
Subjects:	Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2501.03190 [cs.LG]
	(or arXiv:2501.03190v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.03190

Submission history

From: Andrew Chang
[v1] Mon, 6 Jan 2025 18:05:35 UTC (3,721 KB)
[v2] Tue, 7 Jan 2025 18:34:22 UTC (3,717 KB)
