HeySQuAD: A Spoken Question Answering Dataset

Wu, Yijing; Rallabandi, SaiKrishna; Srinivasamurthy, Ravisutha; Dakle, Parag Pravin; Gon, Alolika; Raghavan, Preethi

arXiv:2304.13689v1 (cs)

[Submitted on 26 Apr 2023 (this version), latest version 27 Feb 2024 (v2)]

Abstract: Human-spoken questions are critical to evaluating the performance of spoken question answering (SQA) systems that serve several real-world use cases, including digital assistants. We present a new large-scale community-shared SQA dataset, HeySQuAD, that consists of 76k human-spoken questions and 97k machine-generated questions, with corresponding textual answers derived from the SQuAD QA dataset. The goal of HeySQuAD is to measure the ability of machines to understand noisy spoken questions and answer them accurately. To this end, we run extensive benchmarks on the human-spoken and machine-generated questions to quantify the differences in noise between the two sources and its subsequent impact on the model and answering accuracy. Importantly, for the task of SQA, where we want to answer human-spoken questions, we observe that training on the transcribed human-spoken questions together with the original SQuAD questions leads to significant improvements (12.51%) over training on only the original SQuAD textual questions.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2304.13689 [cs.CL]
	(or arXiv:2304.13689v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.13689

Submission history

From: Parag Dakle
[v1] Wed, 26 Apr 2023 17:15:39 UTC (51 KB)
[v2] Tue, 27 Feb 2024 13:57:08 UTC (236 KB)
