Audio and Speech Processing
- [1] arXiv:2407.13782 [pdf, html, other]
Title: Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition
Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu
Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
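The abstract above quotes each error-rate reduction in both absolute and relative terms. As a reading aid, the relationship between the two can be sketched as below. The function names are hypothetical, and the baseline figures the code derives are only implied by the rounded numbers in the abstract; they are not values reported by the authors.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction in percent, given two error rates in percent."""
    return 100.0 * (baseline - improved) / baseline


def implied_baseline(absolute: float, relative: float) -> float:
    """Baseline error rate (percent) implied by an absolute reduction
    (percentage points) and the matching relative reduction (percent)."""
    return absolute / (relative / 100.0)


# (absolute, relative) pairs as quoted in the abstract, in task order.
reported = {
    "UASpeech": (6.53, 24.10),
    "TORGO": (1.90, 23.84),
    "DementiaBank Pitt": (2.04, 10.14),
    "JCCOCC MoCA": (7.97, 31.39),
}

for task, (absolute, relative) in reported.items():
    # Derived, approximate value: the inputs are rounded to two decimals.
    baseline = implied_baseline(absolute, relative)
    print(f"{task}: implied baseline error rate of about {baseline:.2f}%")
```

Because the inputs are rounded, the implied baselines are approximate and should be checked against the paper's tables before being quoted.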
- [2] arXiv:2407.13840 [pdf, html, other]
Title: Semi-Supervised Contrastive Learning of Musical Representations
Comments: Accepted to be published at the Proceedings of the 25th International Society for Music Information Retrieval Conference 2024, Includes non-proceedings appendix
Subjects: Audio and Speech Processing (eess.AS)
Despite the success of contrastive learning in Music Information Retrieval, the inherent ambiguity of contrastive self-supervision presents a challenge. Relying solely on augmentation chains and self-supervised positive sampling strategies can lead to a pretraining objective that does not capture key musical information for downstream tasks. We introduce semi-supervised contrastive learning (SemiSupCon), a simple method for leveraging musically informed labeled data (supervision signals) in the contrastive learning of musical representations. Our approach introduces musically relevant supervision signals into self-supervised contrastive learning by combining supervised and self-supervised contrastive objectives in a simpler framework than previous approaches. This framework improves downstream performance and robustness to audio corruptions on a range of downstream MIR tasks with moderate amounts of labeled data. Our approach enables shaping the learned similarity metric through the choice of labeled data that (1) infuses the representations with musical domain knowledge and (2) improves out-of-domain performance with minimal general downstream performance loss. We show strong transfer learning performance on musically related yet not trivially similar tasks - such as pitch and key estimation. Additionally, our approach shows performance improvement on automatic tagging over self-supervised approaches with only 5% of available labels included in pretraining.
- [3] arXiv:2407.13895 [pdf, html, other]
Title: Improving Robustness and Clinical Applicability of Respiratory Sound Classification via Audio Enhancement
Authors: Jing-Tong Tzeng, Jeng-Lin Li, Huan-Yu Chen, Chun-Hsiang Huang, Chi-Hsin Chen, Cheng-Yi Fan, Edward Pei-Chuan Huang, Chi-Chun Lee
Comments: The following article has been submitted to The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at this https URL
Subjects: Audio and Speech Processing (eess.AS)
Deep learning techniques have shown promising results in the automatic classification of respiratory sounds. However, accurately distinguishing these sounds in real-world noisy conditions poses challenges for clinical deployment. Additionally, predicting signals with only background noise could undermine user trust in the system. In this study, we propose an audio enhancement (AE) pipeline as a pre-processing step before respiratory sound classification, aiming to improve performance in noisy environments. Multiple experiments were conducted using different audio enhancement model structures, demonstrating improved classification performance compared to the baseline method of noise injection data augmentation. Specifically, the integration of the AE pipeline resulted in a 2.59% increase in the ICBHI classification score on the ICBHI respiratory sound dataset and a 2.51% improvement on our recently collected Formosa Archive of Breath Sounds (FABS) in multi-class noisy scenarios. Furthermore, a physician validation study assessed the clinical utility of our system. Quantitative analysis revealed enhancements in efficiency, diagnostic confidence, and trust during model-assisted diagnosis with our system compared to raw noisy recordings. Workflows integrating enhanced audio led to an 11.61% increase in diagnostic sensitivity and facilitated high-confidence diagnoses. Our findings demonstrate that incorporating an audio enhancement algorithm significantly enhances robustness and clinical utility.
- [4] arXiv:2407.14006 [pdf, html, other]
Title: MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
Authors: Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai
Comments: Accepted by INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. We have established a robust baseline, through the prompting mechanism, that can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody with arbitrary text input. The open source MSceneSpeech Dataset and audio samples of our baseline are available at this https URL.
- [5] arXiv:2407.14021 [pdf, html, other]
Title: GE2E-AC: Generalized End-to-End Loss Training for Accent Classification
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
Accent classification (AC) is the task of predicting the accent type of an input utterance, and it can be used as a preliminary step toward accented speech recognition and accent conversion. Existing studies have often achieved such classification by training a neural network model to minimize the classification error of the predicted accent label, which can be obtained as a model output. Since this approach optimizes the entire model only from the perspective of classification loss during training, the model might learn to predict the accent type from irrelevant features, such as individual speaker identity, which are not informative at test time. To address this problem, we propose GE2E-AC, in which we train a model to extract an accent embedding (AE) of an input utterance such that the AEs of the same accent class get closer, instead of directly minimizing the classification loss. We experimentally show the effectiveness of the proposed GE2E-AC, compared to the baseline model trained with the conventional cross-entropy-based loss.
- [6] arXiv:2407.14152 [pdf, html, other]
Title: Wideband Relative Transfer Function (RTF) Estimation Exploiting Frequency Correlations
Comments: Under review at IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
This article focuses on estimating relative transfer functions (RTFs) for beamforming applications. While traditional methods assume that spectra are uncorrelated, this assumption is often violated in practical scenarios due to natural phenomena such as the Doppler effect, artificial manipulations like time-domain windowing, or the non-stationary nature of the signals, as observed in speech. To address this, we propose an RTF estimation technique that leverages spectral and spatial correlations through subspace analysis. To overcome the challenge of estimating second-order spectral statistics for real data, we employ a phase-adjusted estimator originally proposed in the context of engine fault detection. Additionally, we derive Cramér-Rao bounds (CRBs) for the RTF estimation task, providing theoretical insights into the achievable estimation accuracy. The bounds show that channel estimation can be performed more accurately if the noise or the target presents spectral correlations. Experiments on real and synthetic data show that our technique outperforms the narrowband maximum-likelihood estimator when the target exhibits spectral correlations. Although the accuracy of the proposed algorithm is generally close to the bound, there is some room for improvement, especially when noise signals with high spectral correlation are present. While the applications of channel estimation are diverse, we demonstrate the method in the context of array processing for speech.
- [7] arXiv:2407.14172 [pdf, other]
Title: Topology-Independent GEVD-Based Distributed Adaptive Node-Specific Signal Estimation in Ad-Hoc Wireless Acoustic Sensor Networks
Comments: Presented in the 2024 32nd European Signal Processing Conference (EUSIPCO)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
A low-rank approximation-based version of the topology-independent distributed adaptive node-specific signal estimation (TI-DANSE) algorithm is introduced, using a generalized eigenvalue decomposition (GEVD) for application in ad-hoc wireless acoustic sensor networks. This TI-GEVD-DANSE algorithm as well as the original TI-DANSE algorithm exhibit a non-strict convergence, which can lead to numerical instability over time, particularly in scenarios where the estimation of accurate spatial covariance matrices is challenging. An adaptive filter coefficient normalization strategy is proposed to mitigate this issue and enable the stable performance of TI-(GEVD-)DANSE. The method is validated in numerical simulations including dynamic acoustic scenarios, demonstrating the importance of the additional normalization.
- [8] arXiv:2407.14399 [pdf, html, other]
Title: PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese
Comments: This paper was accepted at ISMIR 2024
Subjects: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
The speech domain prevails in the spotlight for several natural language processing (NLP) tasks while the singing domain remains less explored. The culmination of NLP is the speech-to-speech translation (S2ST) task, referring to translation and synthesis of human speech. A disparity between S2ST and the possible adaptation to the singing domain, which we describe as singing-voice to singing-voice translation (SV2SVT), is becoming prominent as the former is progressing ever faster, while the latter is at a standstill. Singing-voice synthesis systems are overcoming the barrier of multi-lingual synthesis, although limited attention has been paid to multi-lingual songwriting and song translation. This paper endeavors to determine what is required for successful SV2SVT and proposes PolySinger (Polyglot Singer): the first system for SV2SVT, performing lyrics translation from English to Japanese. A cascaded approach is proposed to establish a framework with a high degree of control which can potentially diminish the disparity between SV2SVT and S2ST. The performance of PolySinger is evaluated by a mean opinion score test with native Japanese speakers. Results and in-depth discussions with test subjects suggest a solid foundation for SV2SVT, but several shortcomings must be overcome, which are discussed for the future of SV2SVT.
New submissions for Monday, 22 July 2024 (showing 8 of 8 entries)
- [9] arXiv:2407.14056 (cross-list from cs.CL) [pdf, html, other]
Title: Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings
Comments: Accepted at INTERSPEECH 2024. First two authors listed contributed equally
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
- [10] arXiv:2407.14180 (cross-list from cs.CL) [pdf, html, other]
Title: Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis
Comments: Accepted to Interspeech 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcast in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification on those transcriptions. Using the generated LLM annotations, we explore the finetuning of a specialized smaller classification model, to reduce the computational cost. To evaluate the performance of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe representation differences between private and public service channels.
- [11] arXiv:2407.14212 (cross-list from cs.SD) [pdf, html, other]
Title: Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
An increasing number of Chinese people are troubled by different degrees of visual impairment, which has made the modal conversion between a single image or video frame in the visual field and the audio expressing the same information a research hotspot. Deep learning technologies such as OCR+Vocoder and Im2Wav enable English audio synthesis or image-to-sound matching in a self-supervised manner. However, the audio data used for training is limited and English is not universal for visually impaired people with different educational levels. Therefore, to solve the problems of data volume and language applicability and improve the reading efficiency of visually impaired people, an image-to-speech framework based on the Chinese context, CLIP-KNN-Fastspeech2, was constructed. The framework integrates multiple basic models and adopts the strategy of independent pre-training and joint fine-tuning. First, the Chinese CLIP and Fastspeech2 text-to-speech models were pre-trained on two public datasets, MUGE and Baker, respectively, and their convergence was verified. Subsequently, joint fine-tuning was performed using a self-built Braille image dataset. Experimental results on multiple public datasets such as VGGSound, Flickr8k, ImageHear, and the self-built Braille dataset BIT-DP show that the model has improved objective indicators such as BLEU4, FAD (Fréchet Audio Distance), WER (Word Error Rate), and even inference speed. This verifies that the constructed model still has the ability to synthesize high-quality speech under limited data, and also proves the effectiveness of the joint training strategy that integrates multiple basic models.
- [12] arXiv:2407.14295 (cross-list from cs.CL) [pdf, html, other]
Title: CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units
Comments: Accepted to ACL 2024 Student Research Workshop (ACL-SRW 2024)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI's Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into non-English. Translations into low-resource languages also perform worse than even raw code-switched inputs. We find that systems excel at copying English tokens but struggle with non-English tokens, that the off-target problem in monolingual settings is also relevant in code-switching settings, and that models hallucinate in code-switching translation by introducing words absent in both of the original source sentences. CoVoSwitch and code are available at this https URL.
- [13] arXiv:2407.14329 (cross-list from cs.SD) [pdf, html, other]
Title: Efficient Audio Captioning with Encoder-Level Knowledge Distillation
Comments: Interspeech 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Significant improvement has been achieved in automated audio captioning (AAC) with recent models. However, these models have become increasingly large as their performance is enhanced. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in the encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder as compared with the decoder. To this end, we incorporate encoder-level KD loss into training, in addition to the standard supervised loss and sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce situations. By incorporating audio-only data into training in the KD framework, our student model achieves competitive performance, with an inference speed that is 19 times faster. An online demo is available at this https URL.
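The two encoder-level KD objectives named in the abstract above can be illustrated with a minimal NumPy sketch. The abstract does not give the authors' formulation, so this is an assumption-laden illustration only: it assumes pooled student and teacher embeddings of equal dimension, uses an InfoNCE-style loss as the contrastive variant, and the function names and the `temperature` value are hypothetical.

```python
import numpy as np


def mse_kd_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared error between student and teacher embeddings, shape (batch, dim)."""
    return float(np.mean((student - teacher) ** 2))


def contrastive_kd_loss(student: np.ndarray, teacher: np.ndarray,
                        temperature: float = 0.07) -> float:
    """InfoNCE-style loss (an assumed choice): the i-th student embedding should
    match the i-th teacher embedding better than any other teacher in the batch."""
    # L2-normalise so that the dot product is a cosine similarity.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(4, 8))
student_emb = teacher_emb + 0.1 * rng.normal(size=(4, 8))
print(mse_kd_loss(student_emb, teacher_emb))
print(contrastive_kd_loss(student_emb, teacher_emb))
```

One difference the sketch makes visible: the MSE loss asks the student to reproduce the teacher's embedding values, whereas the contrastive loss only asks each student embedding to be closer to its own teacher embedding than to the others in the batch.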
- [14] arXiv:2407.14355 (cross-list from cs.SD) [pdf, html, other]
Title: Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Comments: Interspeech 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage a large language model's domain knowledge to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet (the code is available at this https URL). Our results demonstrate a substantial improvement in zero-shot classification accuracy. Ablation results show robust performance enhancement, regardless of the model architecture.
- [15] arXiv:2407.14358 (cross-list from cs.SD) [pdf, html, other]
Title: Stable Audio Open
Comments: Demo: this https URL Weights: this https URL Code: this https URL. arXiv admin note: text overlap with arXiv:2404.10301
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
- [16] arXiv:2407.14364 (cross-list from cs.SD) [pdf, html, other]
Title: Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio
Comments: Accepted at ISMIR 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectual property rights violations. To tackle this issue, we present the Music Replication Assessment (MiRA) tool: a model-independent open evaluation method based on diverse audio music similarity metrics to assess data replication of the training set. We evaluate the ability of five metrics to identify exact replication, by conducting a controlled replication experiment in different music genres based on synthetic samples. Our results show that the proposed methodology can estimate exact data replication with a proportion higher than 10%. By introducing the MiRA tool, we intend to encourage the open evaluation of music generative models by researchers, developers and users concerning data replication, highlighting the importance of ethical, social, legal and economic consequences of generative AI in the music domain.
Cross submissions for Monday, 22 July 2024 (showing 8 of 8 entries)
- [17] arXiv:2303.17395 (replaced) [pdf, html, other]
Title: WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang
Comments: Accepted to TASLP
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that the WavCaps dataset will facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at this https URL.
- [18] arXiv:2401.11053 (replaced) [pdf, html, other]
Title: StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
Comments: Accepted by ACL2024 (Main)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and limiting their deployment to real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC, facilitating real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech. To address the potential performance degradation from the incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, using a teacher model to summarize the present and future semantic context during training to guide the model's forecasting for missing context; 2) semantic masking strategy, promoting acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.
- [19] arXiv:2403.03611 (replaced) [pdf, html, other]
Title: Comparison Performance of Spectrogram and Scalogram as Input of Acoustic Recognition Task
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Acoustic recognition has emerged as a prominent task in deep learning research, frequently utilizing spectral feature extraction techniques such as the spectrogram from the Short-Time Fourier Transform and the scalogram from the Wavelet Transform. However, there is a notable deficiency in studies that comprehensively discuss the advantages, drawbacks, and performance comparisons of these methods. This paper aims to evaluate the characteristics of these two transforms as input data for acoustic recognition using Convolutional Neural Networks. The performance of the trained models employing both transforms is documented for comparison. Through this analysis, the paper elucidates the advantages and limitations of each method, provides insights into their respective application scenarios, and identifies potential directions for further research.
- [20] arXiv:2406.02560 (replaced) [pdf, html, other]
Title: Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur
Comments: Accepted by ICASSP 2024. Github repo: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation, by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens besides their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
- [21] arXiv:2407.04518 (replaced) [pdf, html, other]
Title: From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano
Comments: Accepted by the 25th International Society for Music Information Retrieval (ISMIR)
Subjects: Audio and Speech Processing (eess.AS)
Our study investigates an approach for understanding musical performances through the lens of audio encoding models, focusing on the domain of solo Western classical piano music. Compared to composition-level attribute understanding such as key or genre, we identify a knowledge gap in performance-level music understanding, and address three critical tasks: expertise ranking, difficulty estimation, and piano technique detection, introducing a comprehensive Pianism-Labelling Dataset (PLD) for this purpose. We leverage pre-trained audio encoders, specifically Jukebox, Audio-MAE, MERT, and DAC, demonstrating varied capabilities in tackling downstream tasks, to explore whether domain-specific fine-tuning enhances capability in capturing performance nuances. Our best approach achieved 93.6% accuracy in expertise ranking, 33.7% in difficulty estimation, and 46.7% in technique detection, with Audio-MAE as the overall most effective encoder. Finally, we conducted a case study on Chopin Piano Competition data using trained models for expertise ranking, which highlights the challenge of accurately assessing top-tier performances.
- [22] arXiv:2407.06800 (replaced) [pdf, html, other]
-
Title: Learn and Don't Forget: Adding a New Language to ASR Foundation Models
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Foundation ASR models often support many languages, e.g. 100 languages in Whisper. However, there has been limited work on integrating an additional, typically low-resource, language while maintaining performance on the original language set. Fine-tuning, while simple, may degrade the accuracy of the original set. We compare three approaches that exploit adaptation parameters: soft language code tuning, which trains only the language code; soft prompt tuning, which trains prepended tokens; and LoRA, where a small set of additional parameters is optimised. Elastic Weight Consolidation (EWC) offers an alternative compromise with the potential to maintain performance in specific target languages. Results show that direct fine-tuning yields the best performance for the new language but degrades existing language capabilities. EWC can address this issue for specific languages. If only adaptation parameters are used, the language capabilities are maintained but at the cost of performance in the new language.
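Two of the techniques named above have compact textbook forms, sketched here with NumPy. This is a generic illustration, not the paper's implementation: the function names, shapes, and the hyperparameters `lam` and `alpha` are hypothetical, and the Fisher values would in practice be estimated from data in the languages to be preserved.

```python
import numpy as np


def ewc_penalty(params: np.ndarray,
                anchor_params: np.ndarray,
                fisher: np.ndarray,
                lam: float) -> float:
    """Standard EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    anchor_params are the weights before adaptation; fisher holds a diagonal
    Fisher information estimate, so important weights are pulled back harder.
    """
    return 0.5 * lam * float(np.sum(fisher * (params - anchor_params) ** 2))


def lora_forward(x: np.ndarray,
                 W: np.ndarray,
                 A: np.ndarray,
                 B: np.ndarray,
                 alpha: float) -> np.ndarray:
    """Linear layer with a low-rank update: x W^T + (alpha / r) * x A^T B^T.

    W (out, in) stays frozen; only A (r, in) and B (out, r) would be trained.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


# EWC: moving a high-Fisher weight costs more than moving a low-Fisher one.
anchor = np.array([0.0, 2.0])
fisher = np.array([4.0, 1.0])
print(ewc_penalty(np.array([1.0, 2.0]), anchor, fisher, lam=0.5))  # 1.0
print(ewc_penalty(np.array([0.0, 3.0]), anchor, fisher, lam=0.5))  # 0.25

# LoRA: with B initialized to zero, the adapted layer matches the frozen one.
x = np.array([[1.0, 2.0]])
W = np.eye(2)
A = np.array([[1.0, 1.0]])
B_zero = np.zeros((2, 1))
print(np.allclose(lora_forward(x, W, A, B_zero, alpha=2.0), x @ W.T))  # True
```

The EWC term is added to the new-language training loss, whereas LoRA leaves the original weights untouched, which matches the trade-off between new-language accuracy and retained capability described in the abstract.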
- [23] arXiv:2307.02146 (replaced) [pdf, html, other]
-
Title: LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
Comments: An extension of our previous work arXiv:2305.16816 [cs.CL]
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training. After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
- [24] arXiv:2309.07566 (replaced) [pdf, html, other]
-
Title: Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Comments: accepted by ACL SRW 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at this http URL.
- [25] arXiv:2311.05550 (replaced) [pdf, html, other]
-
Title: Towards End-to-End Spoken Grammatical Error Correction
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal, and GEC, with the associated concern of propagating errors between these individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared to more standard cascaded approaches on the data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
- [26] arXiv:2406.05806 (replaced) [pdf, html, other]
-
Title: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Comments: In progress
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This research explores how the information in prompts interacts with the high-performing speech recognition model Whisper. We compare its performance when given prompts containing correct information and prompts corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, we raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
- [27] arXiv:2406.14485 (replaced) [pdf, other]
-
Title: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
Authors: Nick Bryan-Kinns, Corey Ford, Shuoyang Zheng, Helen Kennedy, Alan Chamberlain, Makayla Lewis, Drew Hemment, Zijin Li, Qiong Wu, Lanxi Xiao, Gus Xia, Jeba Rezwana, Michael Clemens, Gabriel Vigliensoni
Comments: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. The workshop was held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.