Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Avila, Anderson R.; Rezagholizadeh, Mehdi; Xing, Chao

arXiv:2306.06819 (cs)

[Submitted on 12 Jun 2023 (v1), last revised 13 Jun 2023 (this version, v2)]

Abstract: Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine followed by a natural language understanding (NLU) system. Because this approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both the audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an audio encoder that embeds the speech signal with a text encoder that processes the transcript, followed by a late fusion layer that fuses the audio and text logits. We found that the proposed MLU is robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and robustness is tested using transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLMs' performance across all datasets for the academic ASR engine.
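
The late fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the logits are combined, so the weighted average below, the `audio_weight` parameter, the function names, and the example logit values are all assumptions made for illustration, not the authors' implementation. In practice the logits would come from the audio and text encoders; here they are hard-coded.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(
    audio_logits: Sequence[float],
    text_logits: Sequence[float],
    audio_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Fuse per-class logits from two branches and return class probabilities.

    Assumption: fusion is a convex combination of the two logit vectors.
    """
    if len(audio_logits) != len(text_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = [
        audio_weight * a + (1.0 - audio_weight) * t
        for a, t in zip(audio_logits, text_logits)
    ]
    return softmax(fused)


def predict(probabilities: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Made-up logits for three intent classes. The text branch is imagined to be
# misled by a transcription error, while the audio branch favours class 0.
audio_logits = [2.0, 0.1, -1.0]
text_logits = [0.2, 1.5, -0.5]

fused_probs = late_fusion(audio_logits, text_logits, audio_weight=0.5)
text_only_probs = late_fusion(audio_logits, text_logits, audio_weight=0.0)

print("fused prediction:", predict(fused_probs))
print("text-only prediction:", predict(text_only_probs))
```

With these illustrative values, the text-only branch picks the class favoured by the faulty transcript, while the fused scores follow the audio branch. This is only meant to show why combining the two modalities at the logit level can offset transcript errors.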

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.06819 [cs.CL]
	(or arXiv:2306.06819v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.06819

Submission history

From: Anderson Avila
[v1] Mon, 12 Jun 2023 01:55:53 UTC (232 KB)
[v2] Tue, 13 Jun 2023 15:41:11 UTC (216 KB)
