Sound
See recent articles
- [1] arXiv:2406.15647 [pdf, html, other]
-
Title: Generating Music with Structure Using Self-Similarity as AttentionSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Despite the innovations in deep learning and generative AI, creating long term structure as well as the layers of repeated structure common in musical works remains an open challenge in music generation. We propose an attention layer that uses a novel approach applying user-supplied self-similarity matrices to previous time steps, and demonstrate it in our Similarity Incentivized Neural Generator (SING) system, a deep learning autonomous music generation system with two layers. The first is a vanilla Long Short Term Memory layer, and the second is the proposed attention layer. During generation, this attention mechanism imposes a suggested structure from a template piece on the generated music. We train SING on the MAESTRO dataset using a novel variable batching method, and compare its performance to the same model without the attention mechanism. The addition of our proposed attention mechanism significantly improves the network's ability to replicate specific structures, and it performs better on an unseen test set than a model without the attention mechanism.
- [2] arXiv:2406.15751 [pdf, html, other]
-
Title: Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean DataYu-Hua Chen, Woosung Choi, Wei-Hsiang Liao, Marco Martínez-Ramírez, Kin Wai Cheuk, Yuki Mitsufuji, Jyh-Shing Roger Jang, Yi-Hsuan YangComments: Accepted to DAFx 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work done by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based framework. This paper extends their work by using more advanced discriminators in the GAN, and using more unpaired data for training. Specifically, drawing inspiration from recent advancements in neural vocoders, we employ in our GAN-based model for guitar amplifier modeling two sets of discriminators, one based on multi-scale discriminator (MSD) and the other multi-period discriminator (MPD). Moreover, we experiment with adding unprocessed audio signals that do not have the corresponding rendered audio of a target tone to the training data, to see how much the GAN model benefits from the unpaired data. Our experiments show that the proposed two extensions contribute to the modeling of both low-gain and high-gain guitar amplifiers.
- [3] arXiv:2406.15875 [pdf, html, other]
-
Title: AI-based Drone Assisted Human Rescue in Disaster Environments: Challenges and OpportunitiesJournal-ref: Pattern Recognit. Image Anal. 34 (2024)Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
In this survey we are focusing on utilizing drone-based systems for the detection of individuals, particularly by identifying human screams and other distress signals. This study has significant relevance in post-disaster scenarios, including events such as earthquakes, hurricanes, military conflicts, wildfires, and more. These drones are capable of hovering over disaster-stricken areas that may be challenging for rescue teams to access directly. Unmanned aerial vehicles (UAVs), commonly referred to as drones, are frequently deployed for search-and-rescue missions during disaster situations. Typically, drones capture aerial images to assess structural damage and identify the extent of the disaster. They also employ thermal imaging technology to detect body heat signatures, which can help locate individuals. In some cases, larger drones are used to deliver essential supplies to people stranded in isolated disaster-stricken areas. In our discussions, we delve into the unique challenges associated with locating humans through aerial acoustics. The auditory system must distinguish between human cries and sounds that occur naturally, such as animal calls and wind. Additionally, it should be capable of recognizing distinct patterns related to signals like shouting, clapping, or other ways in which people attempt to signal rescue teams. To tackle this challenge, one solution involves harnessing artificial intelligence (AI) to analyze sound frequencies and identify common audio signatures. Deep learning-based networks, such as convolutional neural networks (CNNs), can be trained using these signatures to filter out noise generated by drone motors and other environmental factors. Furthermore, employing signal processing techniques like the direction of arrival (DOA) based on microphone array signals can enhance the precision of tracking the source of human noises.
- [4] arXiv:2406.15885 [pdf, html, other]
-
Title: The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language ModelsComments: Accepted to ACL-Findings 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Benchmark plays a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation over 16 LLMs to evaluate and analyze LLMs' performance in the domain of music. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities. The dataset is available at GitHub\footnote{this https URL} and HuggingFace\footnote{this https URL}.
- [5] arXiv:2406.16000 [pdf, html, other]
-
Title: Predicting Individual Depression Symptoms from Acoustic Features During SpeechSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Current automatic depression detection systems provide predictions directly without relying on the individual symptoms/items of depression as denoted in the clinical depression rating scales. In contrast, clinicians assess each item in the depression rating scale in a clinical setting, thus implicitly providing a more detailed rationale for a depression diagnosis. In this work, we make a first step towards using the acoustic features of speech to predict individual items of the depression rating scale before obtaining the final depression prediction. For this, we use convolutional (CNN) and recurrent (long short-term memory (LSTM)) neural networks. We consider different approaches to learning the temporal context of speech. Further, we analyze two variants of voting schemes for individual item prediction and depression detection. We also include an animated visualization that shows an example of item prediction over time as the speech progresses.
- [6] arXiv:2406.16020 [pdf, html, other]
-
Title: AudioBench: A Universal Benchmark for Audio Large Language ModelsBin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. ChenComments: 20 pages; Preprint; Code: this https URLSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.
- [7] arXiv:2406.16099 [pdf, html, other]
-
Title: Speech Representation Analysis based on Inter- and Intra-Model SimilaritiesComments: 5 pages, Accepted to appear in ICASSP XAI-SA WorkshopSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models varying their training paradigm -- Contrastive (Wav2Vec2.0) and Predictive models (HuBERT); and model sizes (base and large). We explore these models on different levels of localization/distributivity of information including (i) individual neurons; (ii) layer representation; (iii) attention weights and (iv) compare the representations with their finetuned counterparts.Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts\footnote{A concept represents a coherent fragment of knowledge, such as ``a class containing certain objects as elements, where the objects have certain properties. We made the code publicly available for facilitating further research, we publicly released our code.
- [8] arXiv:2406.16148 [pdf, html, other]
-
Title: Towards Open Respiratory Acoustic Foundation Models: Pretraining and BenchmarkingYuwei Zhang, Tong Xia, Jing Han, Yu Wu, Georgios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, Cecilia MascoloSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, 440 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from this https URL.
- [9] arXiv:2406.16155 [pdf, html, other]
-
Title: Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video GenerationComments: Full paper of the homonym paper published in the ICCV 2023 workshop "AV4D: Visual Learning of Sounds in Spaces"Journal-ref: Abstract version published in the ICCV 2023 workshop "AV4D: Visual Learning of Sounds in Spaces"Subjects: Sound (cs.SD); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
Deep generative models have demonstrated the ability to create realistic audiovisual content, sometimes driven by domains of different nature. However, smooth temporal dynamics in video generation is a challenging problem. This work focuses on generic sound-to-video generation and proposes three main features to enhance both image quality and temporal coherency in generative adversarial models: a triple sound routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a novel recurrent and directional convolutional layer for video prediction. Each of the proposed features improves, in both quality and coherency, the baseline neural architecture typically used in the SoTA, with the video prediction layer providing an extra temporal refinement.
- [10] arXiv:2406.16317 [pdf, other]
-
Title: SNR-Progressive Model with Harmonic Compensation for Low-SNR Speech EnhancementSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Despite significant progress made in the last decade, deep neural network (DNN) based speech enhancement (SE) still faces the challenge of notable degradation in the quality of recovered speech under low signal-to-noise ratio (SNR) conditions. In this letter, we propose an SNR-progressive speech enhancement model with harmonic compensation for low-SNR SE. Reliable pitch estimation is obtained from the intermediate output, which has the benefit of retaining more speech components than the coarse estimate while possessing a significant higher SNR than the input noisy speech. An effective harmonic compensation mechanism is introduced for better harmonic recovery. Extensive ex-periments demonstrate the advantage of our proposed model. A multi-modal speech extraction system based on the proposed backbone model ranks first in the ICASSP 2024 MISP Challenge: this https URL.
- [11] arXiv:2406.16808 [pdf, html, other]
-
Title: Exploring the Capability of Mamba in Speech ApplicationsComments: Accepted at Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based models across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Experimental evaluations revealed that Mamba achieves comparable or better performance than Transformer-based models, and demonstrated its efficiency in long-form speech processing.
New submissions for Tuesday, 25 June 2024 (showing 11 of 11 entries )
- [12] arXiv:2406.15487 (cross-list from cs.CL) [pdf, other]
-
Title: Improving Text-To-Audio Models with Synthetic CaptionsZhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan CatanzaroSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged \textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an \textit{audio language model} to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named \texttt{AF-AudioSet}, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new \textit{state-of-the-art}.
- [13] arXiv:2406.15516 (cross-list from cs.CL) [pdf, html, other]
-
Title: System Description for the Displace Speaker Diarization Challenge 2023Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023). We used a combination of VAD for finding segfments with speech, Resnet architecture based CNN for feature extraction from these segments, and spectral clustering for features clustering. Even though it was not trained with using Hindi, the described algorithm achieves the following metrics: DER 27. 1% and DER 27. 4%, on the development and phase-1 evaluation parts of the dataset, respectively.
- [14] arXiv:2406.15537 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain ActivityComments: The first two authors contributed equally to this workSubjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
- [15] arXiv:2406.15668 (cross-list from cs.CL) [pdf, html, other]
-
Title: PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker CharacteristicsComments: 11 pages, 3 figuresSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
As edge-based automatic speech recognition (ASR) technologies become increasingly prevalent for the development of intelligent and personalized assistants, three important challenges must be addressed for these resource-constrained ASR models, i.e., adaptivity, incrementality, and inclusivity. We propose a novel ASR framework, PI-Whisper, in this work and show how it can improve an ASR's recognition capabilities adaptively by identifying different speakers' characteristics in real-time, how such an adaption can be performed incrementally without repetitive retraining, and how it can improve the equity and fairness for diverse speaker groups. More impressively, our proposed PI-Whisper framework attains all of these nice properties while still achieving state-of-the-art accuracy with up to 13.7% reduction of the word error rate (WER) with linear scalability with respect to computing resources.
- [16] arXiv:2406.15723 (cross-list from cs.CL) [pdf, html, other]
-
Title: Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation AssessmentComments: Interspeech 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In automated pronunciation assessment, recent emphasis progressively lies on evaluating multiple aspects to provide enriched feedback. However, acquiring multi-aspect-score labeled data for non-native language learners' speech poses challenges; moreover, it often leads to score-imbalanced distributions. In this paper, we propose two Acoustic Feature Mixup strategies, linearly and non-linearly interpolating with the in-batch averaged feature, to address data scarcity and score-label imbalances. Primarily using goodness-of-pronunciation as an acoustic feature, we tailor mixup designs to suit pronunciation assessment. Further, we integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation. Effective mixing of the acoustic features notably enhances overall scoring performances on the speechocean762 dataset, and detailed analysis highlights our potential to predict unseen distortions.
- [17] arXiv:2406.15725 (cross-list from eess.AS) [pdf, html, other]
-
Title: Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding BoxesComments: DCASE 2024 Challenge Task 4 technical reportSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train MAESTRO labels with coarse temporal resolution, we apply max pooling on prediction for the MAESTRO dataset. Using best ensemble model, we apply self training to obtain pseudo label from DESED weak set, DESED unlabeled set and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post processing for ensemble models on self training and submission models.
- [18] arXiv:2406.15754 (cross-list from cs.CV) [pdf, html, other]
-
Title: Multimodal Segmentation for Vocal Tract ModelingComments: Interspeech 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at \url{this http URL}.
- [19] arXiv:2406.15888 (cross-list from cs.CL) [pdf, html, other]
-
Title: Real-time Speech Summarization for Medical ConversationsComments: Interspeech 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system could enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Secondly, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Thirdly, we are the first to utilize LLM and human annotators collaboratively to create gold standard and synthetic summaries for medical conversation summarization. Finally, we present baseline results of state-of-the-art models on VietMed-Sum. All code, data (English-translated and Vietnamese) and models are available online: this https URL
- [20] arXiv:2406.15897 (cross-list from eess.AS) [pdf, html, other]
-
Title: Fusing Audio and Metadata Embeddings Improves Language-based Audio RetrievalComments: EUSIPCO 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
- [21] arXiv:2406.16120 (cross-list from eess.AS) [pdf, html, other]
-
Title: Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing LossComments: Accepted to INTERSPEECH 2024Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.
- [22] arXiv:2406.16315 (cross-list from eess.AS) [pdf, html, other]
-
Title: Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis FrameworkComments: INTERSPEECH 2024 acceptedSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
We propose a data cleansing method that utilizes a neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization model (EEND) for singer diarization. Our proposed model converts song data with choral singing which is commonly contained in popular music and unsuitable for generating a simulated dataset to the solo singing data. This cleansing is based on NANSY++, which is a framework trained to reconstruct an input non-overlapped audio signal. We exploit the pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio. This cleansing process mitigates the mislabeling of choral singing to solo singing and helps the effective training of EEND models even when the majority of available song data contains choral singing sections. We experimentally evaluated the EEND model trained with a dataset using our proposed method using annotated popular duet songs. As a result, our proposed method improved 14.8 points in diarization error rate.
- [23] arXiv:2406.16716 (cross-list from eess.AS) [pdf, html, other]
-
Title: One-Class Learning with Adaptive Centroid Shift for Audio Deepfake DetectionComments: Accepted by Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Sound (cs.SD)
As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.
- [24] arXiv:2406.16751 (cross-list from cs.CL) [pdf, html, other]
-
Title: Towards Zero-Shot Text-To-Speech for Arabic DialectsSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS\footnote{this https URL}\footnote{this https URL}\footnote{this https URL} model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.
Cross submissions for Tuesday, 25 June 2024 (showing 13 of 13 entries )
- [25] arXiv:2308.01187 (replaced) [pdf, html, other]
-
Title: Music De-limiter Networks via Sample-wise Gain InversionComments: Results corrected as some bugs were found in the previous codes and dataset. Presented at WASPAA 2023Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The loudness war, an ongoing phenomenon in the music industry characterized by the increasing final loudness of music while reducing its dynamic range, has been a controversial topic for decades. Music mastering engineers have used limiters to heavily compress and make music louder, which can induce ear fatigue and hearing loss in listeners. In this paper, we introduce music de-limiter networks that estimate uncompressed music from heavily compressed signals. Inspired by the principle of a limiter, which performs sample-wise gain reduction of a given signal, we propose the framework of sample-wise gain inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k segments created by applying a commercial limiter plug-in for training real-world friendly de-limiter networks. Our proposed de-limiter network achieves excellent performance with a scale-invariant source-to-distortion ratio (SI-SDR) of 24.0 dB in reconstructing musdb-HQ from musdb-XL data, a limiter-applied version of musdb-HQ. The training data, codes, and model weights are available in our repository (this https URL).
- [26] arXiv:2309.10740 (replaced) [pdf, html, other]
-
Title: ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency DistillationSubjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by proposing "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
- [27] arXiv:2309.13920 (replaced) [pdf, other]
-
Title: Real-Time Emergency Vehicle Detection using Mel Spectrograms and Regular ExpressionsComments: in Spanish languageJournal-ref: Revista Electro, Vol. 45, pp. 184-189, 2023Subjects: Sound (cs.SD); Formal Languages and Automata Theory (cs.FL); Symbolic Computation (cs.SC); Audio and Speech Processing (eess.AS)
In emergency situations, the high-speed movement of an ambulance through the city streets can be hindered by vehicular traffic. This work presents a method for detecting emergency vehicle sirens in real time. To obtain the audio fingerprint of a Hi-Lo siren, DSP and signal symbolization techniques were applied, which were contrasted against an audio classifier based on a deep neural network, using the same 280 audios of ambient sounds and 52 Hi-Lo siren audios dataset. In both methods, some classification accuracy metrics were evaluated based on its confusion matrix, resulting in the DSP algorithm having a slightly lower accuracy than the DNN model, however, it offers a self-explanatory, adjustable, portable, high performance and lower energy and consumption that makes it a more viable lower cost ADAS implementation to identify Hi-Lo sirens in real time.
- [28] arXiv:2402.04356 (replaced) [pdf, html, other]
-
Title: Bidirectional Autoregressive Diffusion Model for Dance GenerationSubjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create entire motion sequences directly and unidirectionally, lacking focus on the motion with local and bidirectional enhancement. When choreographing high-quality dance movements, people need to take into account not only the musical context but also the nearby music-aligned dance motions. To authentically capture human behavior, we propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions. To make the generated dance motion smoother, a local information decoder is built for local motion enhancement. The proposed framework is able to generate new motions based on the input conditions and nearby motions, which foresees individual motion slices iteratively and consolidates all predictions. To further refine the synchronicity between the generated dance and the beat, the beat information is incorporated as an input to generate better music-aligned dance movements. Experimental results demonstrate that the proposed model achieves state-of-the-art performance compared to existing unidirectional approaches on the prominent benchmark for music-to-dance generation.
- [29] arXiv:2403.09598 (replaced) [pdf, html, other]
-
Title: Mixture of Mixups for Multi-label Classification of Rare Anuran SoundsSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Multi-label imbalanced classification poses a significant challenge in machine learning, particularly evident in bioacoustics where animal sounds often co-occur, and certain sounds are much less frequent than others. This paper focuses on the specific case of classifying anuran species sounds using the dataset AnuraSet, that contains both class imbalance and multi-label examples. To address these challenges, we introduce Mixture of Mixups (Mix2), a framework that leverages mixing regularization methods Mixup, Manifold Mixup, and MultiMix. Experimental results show that these methods, individually, may lead to suboptimal results; however, when applied randomly, with one selected at each training iteration, they prove effective in addressing the mentioned challenges, particularly for rare classes with few occurrences. Further analysis reveals that Mix2 is also proficient in classifying sounds across various levels of class co-occurrences.
- [30] arXiv:2406.08336 (replaced) [pdf, html, other]
-
Title: CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech ReconstructionComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (i) a multi-modal content encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual inputs; (ii) a speaker codec encoder to extract and normalize the speaker-aware codecs from the dysarthric speech, in order to provide original timbre and normal prosody; (iii) a codec language model based speech decoder to reconstruct the speech based on the extracted phoneme embeddings and normalized codecs. Evaluations on the commonly used UASpeech corpus show that our proposed model can achieve significant improvements in terms of speaker similarity and prosody naturalness.
- [31] arXiv:2406.12721 (replaced) [pdf, other]
-
Title: Sound event detection based on auxiliary decoder and maximum probability aggregation for DCASE Challenge 2024 Task 4Comments: DCASE 2024 challenge Task4, 4 pagesSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this report, we propose three novel methods for developing a sound event detection (SED) model for the DCASE 2024 Challenge Task 4. First, we propose an auxiliary decoder attached to the final convolutional block to improve feature extraction capabilities while reducing dependency on embeddings from pre-trained large models. The proposed auxiliary decoder operates independently from the main decoder, enhancing performance of the convolutional block during the initial training stages by assigning a different weight strategy between main and auxiliary decoder losses. Next, to address the time interval issue between the DESED and MAESTRO datasets, we propose maximum probability aggregation (MPA) during the training step. The proposed MPA method enables the model's output to be aligned with soft labels of 1 s in the MAESTRO dataset. Finally, we propose a multi-channel input feature that employs various versions of logmel and MFCC features to generate time-frequency pattern. The experimental results demonstrate the efficacy of these proposed methods in a view of improving SED performance by achieving a balanced enhancement across different datasets and label types. Ultimately, this approach presents a significant step forward in developing more robust and flexible SED models
- [32] arXiv:2406.13431 (replaced) [pdf, html, other]
-
Title: Children's Speech Recognition through Discrete Token EnhancementComments: Accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
- [33] arXiv:2406.14485 (replaced) [pdf, other]
-
Title: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)Nick Bryan-Kinns, Corey Ford, Shuoyang Zheng, Helen Kennedy, Alan Chamberlain, Makayla Lewis, Drew Hemment, Zijin Li, Qiong Wu, Lanxi Xiao, Gus Xia, Jeba Rezwana, Michael Clemens, Gabriel VigliensoniSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.