Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Zezario, Ryandhimas E.; Fu, Szu-Wei; Chen, Fei; Fuh, Chiou-Shann; Wang, Hsin-Min; Tsao, Yu

arXiv:2111.02363 (eess)

[Submitted on 3 Nov 2021 (v1), last revised 19 Dec 2024 (this version, v5)]

Abstract: In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that, in perceptual evaluation of speech quality (PESQ) prediction, MOSA-Net improves the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) compared to Quality-Net, an existing single-task model for PESQ prediction. In short-time objective intelligibility (STOI) prediction, it improves LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model and effectively adapted, with a limited amount of training data, into an assessment model for predicting subjective quality and intelligibility scores. Experimental results show that MOSA-Net improves LCC by 0.018 (0.805 vs 0.787) in mean opinion score (MOS) prediction compared to MOS-SSL, a strong single-task model for MOS prediction. In light of this confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation tests. For example, QIA-SE improves PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.180 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
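The abstract reports prediction accuracy as a linear correlation coefficient (LCC) between predicted and reference scores. As a minimal sketch of how such a figure can be computed, the snippet below implements the standard Pearson correlation with NumPy. The function name `lcc` and the sample score values are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' evaluation code may differ.

```python
import numpy as np


def lcc(predicted, reference):
    """Pearson linear correlation coefficient between two score sequences."""
    p = np.asarray(predicted, dtype=float)
    r = np.asarray(reference, dtype=float)
    if p.shape != r.shape or p.ndim != 1 or p.size < 2:
        raise ValueError("expected two 1-D sequences of equal length >= 2")
    # Center both sequences, then normalize the cross term.
    p_c = p - p.mean()
    r_c = r - r.mean()
    denom = np.sqrt((p_c ** 2).sum() * (r_c ** 2).sum())
    if denom == 0.0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return float((p_c * r_c).sum() / denom)


# Hypothetical PESQ scores for five utterances (illustrative values only,
# not results from the paper).
true_pesq = [1.2, 2.0, 2.7, 3.1, 3.9]
pred_pesq = [1.4, 1.9, 2.6, 3.3, 3.8]
score = lcc(pred_pesq, true_pesq)
print(f"LCC = {score:.3f}")
```

An LCC close to 1 means the predicted scores track the reference scores almost linearly, which is how improvements such as 0.990 vs 0.964 should be read.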

Comments:	Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 31, pp. 54-70, 2023
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2111.02363 [eess.AS]
	(or arXiv:2111.02363v5 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2111.02363

Submission history

From: Ryandhimas Zezario
[v1] Wed, 3 Nov 2021 17:30:43 UTC (2,241 KB)
[v2] Wed, 1 Dec 2021 05:05:27 UTC (2,243 KB)
[v3] Thu, 16 Jun 2022 05:55:34 UTC (2,343 KB)
[v4] Thu, 23 Jun 2022 08:59:36 UTC (2,343 KB)
[v5] Thu, 19 Dec 2024 09:18:11 UTC (4,012 KB)
