Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Wagner, Dominik; Churchill, Alexander; Sigtia, Siddharth; Georgiou, Panayiotis; Mirsamadi, Matt; Mishra, Aarshee; Marchi, Erik

arXiv:2312.03632 (cs)

[Submitted on 6 Dec 2023]

Abstract: Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
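
The abstract reports results as equal error rates. As a rough illustration of that metric only, the sketch below approximates the EER of a binary detector by sweeping a decision threshold over the observed scores and picking the point where the false accept rate and the false reject rate are closest. The function name `equal_error_rate`, the label convention (1 = device-directed, 0 = not device-directed), and the example scores are assumptions made for illustration; this is not the authors' evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: detector outputs, higher means "more likely device-directed".
    labels: 1 for device-directed, 0 otherwise (assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [float("inf")]
    for t in thresholds:
        # False accept rate: negatives scored at or above the threshold.
        far = sum(s >= t for s in negatives) / len(negatives)
        # False reject rate: positives scored below the threshold.
        frr = sum(s < t for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Hypothetical scores: two device-directed and two non-directed utterances.
    print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

With a finite set of scores the two error rates may never be exactly equal, so this sketch returns their midpoint at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead.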

Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2312.03632 [cs.SD]
	(or arXiv:2312.03632v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2312.03632

Submission history

From: Siddharth Sigtia
[v1] Wed, 6 Dec 2023 17:29:03 UTC (988 KB)
