Electrical Engineering and Systems Science
Showing new listings for Tuesday, 1 April 2025
- [1] arXiv:2503.22687
Title: Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, which improves recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal approaches. In this paper, the proposed Qieemo framework effectively utilizes a pretrained automatic speech recognition (ASR) model backbone, which contains naturally frame-aligned textual and emotional features, to achieve precise emotion classification based solely on the audio modality. Furthermore, we design a multimodal fusion (MMF) module and a cross-modal attention (CMA) module to fuse the phonetic posteriorgram (PPG) and emotional features extracted by the ASR encoder, improving recognition accuracy. Experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms the benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9%, respectively.
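As a rough illustration of what a cross-modal attention step does, the sketch below computes single-head scaled dot-product attention in which queries come from one feature stream (labeled PPG here) and keys and values from another (labeled emotional features). The abstract does not specify the CMA module, so the absence of learned projections, the toy feature values, and the function names are assumptions for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: T_q x d list (e.g. per-frame PPG features, hypothetical)
    keys, values: T_k x d list (e.g. per-frame emotional features, hypothetical)
    Returns T_q rows, each a softmax-weighted mix of the value rows.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

ppg = [[1.0, 0.0], [0.0, 1.0]]
emo = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(ppg, emo, emo)
print(len(fused), len(fused[0]))  # 2 2
```

Because every output row is a convex combination of value rows, the fused features stay within the range of the emotional features they attend to.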
- [2] arXiv:2503.22692
Title: Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA
Comments: 14 pages, 4 Figures, 4 Tables, Under review by Journal of Aerospace Information Systems
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Transcription of aviation communications has several applications, ranging from helping air traffic controllers identify read-back errors to supporting search and rescue operations. Recent advances in artificial intelligence have provided unprecedented opportunities for improving aviation communication transcription. OpenAI's Whisper is one of the leading automatic speech recognition models; however, fine-tuning Whisper for aviation communication transcription is computationally expensive. This paper therefore uses a parameter-efficient fine-tuning method, Low-Rank Adaptation (LoRA), to fine-tune distil-Whisper, a more computationally efficient version of Whisper. For fine-tuning, we used the Air Traffic Control Corpus from the Linguistic Data Consortium, which contains approximately 70 hours of controller and pilot transmissions near three major US airports. The objective was to reduce the word error rate and thereby improve transcription accuracy. Starting from an initial set of LoRA hyperparameters (Alpha = 64 and Rank = 32), we performed a grid search with 5-fold cross-validation to find the best combination of distil-Whisper hyperparameters. We then tuned the LoRA hyperparameters, achieving an average word error rate of 3.86% across the five folds. This result highlights the model's potential for use in the cockpit.
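The core LoRA idea can be sketched in a few lines: a frozen weight matrix W (d × k) is augmented with a trainable low-rank product B·A, scaled by alpha / rank, so only r·(d + k) parameters are trained instead of d·k. This is the generic LoRA update, not the paper's training code; the toy matrices and the 1024 × 1024 layer size are hypothetical, and only Alpha = 64 and Rank = 32 come from the abstract.

```python
def matmul(a, b):
    # Plain-Python matrix product of nested lists (a: m x n, b: n x p).
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the weight used in the forward pass."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]

def trainable_params(d, k, rank):
    # Full fine-tuning updates d*k entries; LoRA trains B (d x r) and A (r x k).
    return d * k, rank * (d + k)

# Toy example with rank 1 (W is 2 x 3).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [0.0]]          # d x r
a = [[0.0, 0.0, 0.5]]       # r x k
print(lora_effective_weight(w, b, a, alpha=2, rank=1))
# [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# Hypothetical 1024 x 1024 projection with the abstract's Rank = 32.
print(trainable_params(1024, 1024, 32))  # (1048576, 65536)
```

With Alpha = 64 and Rank = 32 the scale factor is 2, and for the hypothetical square layer above LoRA trains 1/16 of the parameters that full fine-tuning would.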
- [3] arXiv:2503.22703
Title: Audio Compression using Periodic Gabor with Biorthogonal Exchange: Implementation Using the Zak Transform
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
An efficient new approach to signal compression is presented, based on a novel variation of the Gabor basis set. Following earlier work by Shimshovitz and Tannor, we convolve the conventional Gabor functions with Dirichlet functions to obtain a Periodic Gabor (PG) basis set. The PG basis is exact for continuous functions that are periodic and band-limited. Owing to the orthonormality of the Dirichlet functions, calculating the PG coefficients becomes trivial and numerically stable, but the resulting representation does not allow compression. Large compression factors are achieved by exchanging the PG basis with its biorthogonal basis, thereby using the localized PG basis to calculate the coefficients (PGB). Here we implement the PGB formalism using the Fast Zak Transform and obtain very high efficiency in terms of both CPU time and memory. We compare the method with state-of-the-art Short-Time Fourier Transform (STFT) and Discrete Wavelet Transform (DWT) methods on a variety of audio files, including music and speech samples. In all cases tested, our scheme far surpasses the STFT, and in most cases it outperforms the DWT.
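The Zak transform used in this implementation can be sketched for a discrete signal of length N = L·M using one common convention, Z[n][k] = Σ_m x[n + mL]·exp(-2πi·mk/M): a length-M DFT along each decimated (polyphase) row. The naive DFT below stands in for the FFT that a fast implementation would use; the convention, function names, and toy signal are assumptions, and the PGB coefficient computation itself is not reproduced.

```python
import cmath

def zak(x, L):
    """Discrete Zak transform: Z[n][k] = sum_m x[n + m*L] * exp(-2j*pi*m*k/M)."""
    N = len(x)
    assert N % L == 0, "signal length must be a multiple of L"
    M = N // L
    return [[sum(x[n + m * L] * cmath.exp(-2j * cmath.pi * m * k / M) for m in range(M))
             for k in range(M)] for n in range(L)]

def inverse_zak(Z, L):
    """Invert the transform: x[n + m*L] = (1/M) * sum_k Z[n][k] * exp(2j*pi*m*k/M)."""
    M = len(Z[0])
    x = [0j] * (L * M)
    for n in range(L):
        for m in range(M):
            x[n + m * L] = sum(Z[n][k] * cmath.exp(2j * cmath.pi * m * k / M)
                               for k in range(M)) / M
    return x

signal = [0.0, 1.0, 0.5, -0.5, 2.0, 0.0, -1.0, 0.25]   # N = 8, hypothetical
Z = zak(signal, L=4)                                   # 4 x 2 time-frequency grid
recovered = inverse_zak(Z, L=4)
print(max(abs(r - s) for r, s in zip(recovered, signal)) < 1e-12)  # True
```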
- [4] arXiv:2503.22705
Title: Enhancing nonnative speech perception and production through an AI-powered application
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
While research on using Artificial Intelligence (AI) applications to enhance foreign language pronunciation is expanding, it has primarily focused on aspects such as comprehensibility and intelligibility, largely neglecting the improvement of individual speech sounds in both perception and production. This study seeks to address this gap by examining the impact of training with an AI-powered mobile application on nonnative sound perception and production. Participants completed a pretest assessing their ability to discriminate the second-language English heed-hid contrast and to produce these vowels in sentence contexts. The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice. The posttest mirrored the pretest to measure changes in performance. The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention; however, participants did not achieve native-like competence. These findings highlight the effectiveness of AI-powered applications in facilitating speech acquisition and support their potential use for personalized, interactive pronunciation training beyond the classroom.
- [5] arXiv:2503.22713
Title: Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study
Comments: 19 pages, 8 figures
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Spectrograms are pivotal in time-frequency signal analysis and are widely used in audio processing and computational neuroscience. Chirp-like patterns in electroencephalogram (EEG) spectrograms (marked by linear or exponential frequency sweeps) are key biomarkers of seizure dynamics, but automated tools for their detection, localization, and feature extraction are lacking. This study bridges this gap by fine-tuning a Vision Transformer (ViT) model on synthetic spectrograms, using Low-Rank Adaptation (LoRA) to boost adaptability. We generated 100,000 synthetic spectrograms with known chirp parameters, creating the first large-scale benchmark for chirp localization. These spectrograms mimic neural chirps using linear or exponential frequency sweeps, Gaussian noise, and smoothing. A ViT model, adapted for regression, predicted the chirp parameters. LoRA fine-tuned the attention layers, enabling efficient updates to the pre-trained backbone. Training used MSE loss and the AdamW optimizer, with a learning rate scheduler and early stopping to curb overfitting. Only three features were targeted: Chirp Start Time (Onset Time), Chirp Start Frequency (Onset Frequency), and Chirp End Frequency (Offset Frequency). Performance was evaluated via the Pearson correlation between predicted and actual labels. Results showed strong alignment: a correlation of 0.9841 for chirp start time, with stable inference times (137 to 140 s) and minimal bias in error distributions. This approach offers a tool for chirp analysis in EEG time-frequency representations, filling a critical methodological void.
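Two ingredients named here are easy to sketch: the instantaneous frequency of a linear or exponential sweep from an onset to an offset frequency, and the Pearson correlation used to score predicted against true labels. The parameter names and toy label values are hypothetical, and this is not the paper's spectrogram generator.

```python
import math

def chirp_frequency(t, t0, t1, f0, f1, kind="linear"):
    """Instantaneous frequency (Hz) at time t of a sweep from f0 at t0 to f1 at t1."""
    u = (t - t0) / (t1 - t0)          # normalized position in the sweep, 0..1
    if kind == "linear":
        return f0 + (f1 - f0) * u
    if kind == "exponential":
        return f0 * (f1 / f0) ** u    # requires f0, f1 > 0
    raise ValueError(kind)

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(chirp_frequency(1.5, 1.0, 2.0, 4.0, 16.0, "exponential"))  # 8.0
true_onsets = [0.5, 1.0, 1.5, 2.0]      # hypothetical labels
pred_onsets = [0.55, 0.95, 1.6, 1.9]    # hypothetical predictions
print(round(pearson(true_onsets, pred_onsets), 4))
```

Halfway through an exponential sweep the frequency is the geometric mean of the endpoints (8 Hz between 4 and 16 Hz), whereas a linear sweep sits at the arithmetic mean (10 Hz).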
- [6] arXiv:2503.22773
Title: Congenital Heart Disease Classification Using Phonocardiograms: A Scalable Screening Tool for Diverse Environments
Authors: Abdul Jabbar, Ethan Grooby, Jack Crozier, Alexander Gallon, Vivian Pham, Khawza I Ahmad, Md Hassanuzzaman, Raqibul Mostafa, Ahsan H. Khandoker, Faezeh Marzbanrad
Comments: 12 pages, 6 figures
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Congenital heart disease (CHD) is a critical condition that demands early detection, particularly in infancy and childhood. This study presents a deep learning model designed to detect CHD using phonocardiogram (PCG) signals, with a focus on its application in global health. We evaluated our model on several datasets, including the primary dataset from Bangladesh, on which it achieved an accuracy of 94.1%, a sensitivity of 92.7%, and a specificity of 96.3%. The model also demonstrated robust performance on the public PhysioNet Challenge 2022 and 2016 datasets, underscoring its generalizability to diverse populations and data sources. We assessed the performance of the algorithm for single and multiple auscultation sites on the chest, demonstrating that the model maintains over 85% accuracy even when using a single location. Furthermore, our algorithm achieved an accuracy of 80% on low-quality recordings that cardiologists deemed non-diagnostic. This research suggests that an AI-driven digital stethoscope could serve as a cost-effective screening tool for CHD in resource-limited settings, enhancing clinical decision support and ultimately improving patient outcomes.
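For reference, the three screening metrics reported here follow directly from a binary confusion matrix, with CHD as the positive class. The counts below are hypothetical and are not taken from the paper's datasets.

```python
def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on CHD cases) and specificity (recall on normals)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical confusion-matrix counts, not the paper's data.
acc, sens, spec = screening_metrics(tp=90, fn=10, tn=95, fp=5)
print(acc, sens, spec)  # 0.925 0.9 0.95
```

Sensitivity measures how many CHD cases are caught, which matters most for a screening tool, while specificity limits unnecessary referrals.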
- [7] arXiv:2503.22829
Title: Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach
Authors: Zhen Lin, Hongyu Yuan, Richard Barcus, Qing Lyu, Sucheta Chakravarty, Megan E. Lipford, Carol A. Shively, Suzanne Craft, Mohammad Kawas, Jeongchul Kim, Christopher T. Whitlow
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Non-human primates (NHPs) serve as critical models for understanding human brain function and neurological disorders due to their close evolutionary relationship with humans. Accurate brain tissue segmentation in NHPs is critical for understanding neurological disorders, but it is challenging due to the scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain, the limited resolution of available imaging data, and the anatomical differences between human and NHP brains. To address these challenges, we propose a novel approach that uses STU-Net with transfer learning to leverage knowledge from human brain MRI data and enhance segmentation accuracy in NHP brain MRI, particularly when training data is limited. The combination of STU-Net and transfer learning effectively delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains. Notably, our method improved the segmentation of small subcortical structures such as the putamen and thalamus, which are challenging to resolve with limited spatial resolution and tissue contrast, and achieved a DSC of over 0.88, an IoU of over 0.8, and an HD95 under 7. This study introduces a robust method for multi-class brain tissue segmentation in NHPs, potentially accelerating research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.
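The overlap metrics quoted here can be computed from the sets of voxels assigned to a structure. The sketch below uses toy one-dimensional "voxels" for a hypothetical label; for a single structure the two metrics are linked by Dice = 2·IoU / (1 + IoU). HD95 (a boundary-distance metric) is not covered.

```python
def dice_and_iou(pred, truth):
    """Dice similarity coefficient and intersection-over-union for one label.

    pred, truth: sets of voxel coordinates assigned to the structure.
    """
    inter = len(pred & truth)
    dice = 2 * inter / (len(pred) + len(truth))
    iou = inter / len(pred | truth)
    return dice, iou

# Toy 1-D "voxels" for a hypothetical thalamus label.
pred = {1, 2, 3, 4, 5}
truth = {2, 3, 4, 5, 6}
dice, iou = dice_and_iou(pred, truth)
print(dice, round(iou, 4))            # 0.8 0.6667
print(round(2 * iou / (1 + iou), 4))  # 0.8
```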
- [8] arXiv:2503.22830
Title: A Multiple Artificial Potential Functions Approach for Collision Avoidance in UAV Systems
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Collision avoidance is a widely studied problem in robotics, particularly in unmanned aerial vehicle (UAV) applications. The main challenges in this area include hardware limitations, the need for rapid response, and the uncertainty associated with obstacle detection. Artificial potential functions (APOFs) are a prominent method for addressing these challenges. However, existing solutions lack assurances regarding closed-loop stability and may result in chattering effects. Motivated by this, we propose a control method for static obstacle avoidance based on multiple artificial potential functions (MAPOFs). We derive tuning conditions on the control parameters that ensure the stability of the final position. The stability proof is established by analyzing the closed-loop system using tools from hybrid systems theory. Furthermore, we validate the performance of the MAPOF control through simulations, showcasing its effectiveness in avoiding static obstacles.
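For readers unfamiliar with artificial potential functions, the sketch below shows the classical single-potential formulation: a quadratic attractive potential toward the goal plus a Khatib-style repulsive potential that is active only within an influence radius ρ0, with the negative gradient used as a velocity command. This is not the paper's MAPOF hybrid scheme; the gains, radius, step size, and coordinates are hypothetical.

```python
import math

def apf_velocity(p, goal, obstacle, k_att=1.0, k_rep=1.0, rho0=2.0):
    """Negative gradient of U = 0.5*k_att*|p-goal|^2 + 0.5*k_rep*(1/d - 1/rho0)^2 (d < rho0)."""
    vx = -k_att * (p[0] - goal[0])
    vy = -k_att * (p[1] - goal[1])
    dx, dy = p[0] - obstacle[0], p[1] - obstacle[1]
    d = math.hypot(dx, dy)
    if d < rho0:
        # Repulsive term points away from the obstacle and grows as d -> 0.
        mag = k_rep * (1.0 / d - 1.0 / rho0) / d ** 2
        vx += mag * dx / d
        vy += mag * dy / d
    return vx, vy

# Forward-Euler simulation of a point robot passing near one static obstacle.
p, min_dist = (0.0, 0.0), float("inf")
for _ in range(2000):
    vx, vy = apf_velocity(p, goal=(10.0, 0.0), obstacle=(5.0, 1.5))
    p = (p[0] + 0.01 * vx, p[1] + 0.01 * vy)
    min_dist = min(min_dist, math.hypot(p[0] - 5.0, p[1] - 1.5))
print(p, min_dist)
```

The switch at d = ρ0 is where single-potential designs can chatter, which is the kind of behavior the abstract's hybrid-systems analysis addresses.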
- [9] arXiv:2503.22837
Title: A Cooperative Compliance Control Framework for Socially Optimal Mixed Traffic Routing
Subjects: Systems and Control (eess.SY)
In mixed traffic environments, where Connected and Autonomous Vehicles (CAVs) coexist with potentially non-cooperative Human-Driven Vehicles (HDVs), the self-centered behavior of human drivers may compromise the efficiency, optimality, and safety of the overall traffic network. In this paper, we propose a Cooperative Compliance Control (CCC) framework for mixed traffic routing, where a Social Planner (SP) optimizes vehicle routes for system-wide optimality while a compliance controller incentivizes human drivers to align their behavior with route guidance from the SP through a "refundable toll" scheme. A key challenge arises from the heterogeneous and unknown response models of different human driver types to these tolls, which make it difficult to design a proper controller and achieve desired compliance probabilities over the traffic network. To address this challenge, we employ Control Lyapunov Functions (CLFs) to adaptively correct (learn) crucial components of our compliance probability model online, construct data-driven feedback controllers, and demonstrate that we can achieve the desired compliance probability for HDVs, thereby contributing to the social optimality of the traffic network.
- [10] arXiv:2503.22850
Title: Passivity, No-Regret, and Convergent Learning in Contractive Games
Subjects: Systems and Control (eess.SY)
We investigate the interplay between passivity, no-regret, and convergence in contractive games for various learning dynamic models and their higher-order variants. Our setting is continuous time. Building on prior work for replicator dynamics, we show that if a learning dynamic satisfies a passivity condition between the payoff vector and the difference between its evolving strategy and any fixed strategy, then it achieves finite regret. We then establish that the passivity condition holds for various learning dynamics and their higher-order variants. Consequently, the higher-order variants can achieve convergence to Nash equilibrium in cases where their standard-order counterparts cannot, while maintaining a finite regret property. We provide numerical examples illustrating the lack of finite regret in different evolutionary dynamic models that violate the passivity property. We also examine the fragility of the finite regret property in the case of perturbed learning dynamics. Continuing with passivity, we establish another connection between finite regret and passivity, this time through the related equilibrium-independent passivity property. Finally, we present a passivity-based classification of dynamic models according to the passivity notions they satisfy, namely incremental passivity, $\delta$-passivity, and equilibrium-independent passivity. This classification provides a framework for analyzing the convergence of learning dynamic models in contractive games.
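A small numerical check of the finite-regret property for replicator dynamics, the starting point this abstract builds on, follows. Started from the uniform strategy, continuous-time replicator dynamics coincide with the softmax (exponential-weights) map of cumulative payoffs, and their regret against any fixed strategy is bounded by log n. The sketch integrates this with forward Euler under an arbitrary, hypothetical payoff stream. It illustrates the bound only and is not the paper's passivity analysis.

```python
import math

def softmax(y):
    # Numerically stable softmax.
    m = max(y)
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]

def replicator_regret(payoff, n, T=10.0, dt=1e-3):
    """Regret of x(t) = softmax(cumulative payoff) versus the best fixed action in hindsight."""
    Y = [0.0] * n          # cumulative payoff of each pure action
    earned = 0.0           # running integral of x(t) . p(t)
    for k in range(int(round(T / dt))):
        p = payoff(k * dt)
        x = softmax(Y)
        earned += dt * sum(xi * pi for xi, pi in zip(x, p))
        Y = [yi + dt * pi for yi, pi in zip(Y, p)]
    return max(Y) - earned

def payoff(t):
    # Hypothetical time-varying payoff vector for three actions.
    return [math.sin(t), math.cos(t), 0.5]

r = replicator_regret(payoff, n=3)
print(r, math.log(3))  # regret stays below log(3), up to a small discretization error
```

The bound does not grow with the horizon T, which is what "finite regret" means in this continuous-time setting.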
- [11] arXiv:2503.22855
Title: Sensorless Field Oriented Control of CSI-Fed PMSM Drives Used in Submersible Pumps
Comments: accepted at APEC 2025 conference
Subjects: Systems and Control (eess.SY)
This paper proposes a practical startup strategy for current source inverter (CSI)-fed Permanent Magnet Synchronous Motor (PMSM) drives in submersible pump applications, focusing on ensuring a seamless shift to sensorless field-oriented control (FOC). The method effectively manages the transition to sensorless operation without requiring precise current or alignment error calculations, thereby simplifying implementation. By addressing speed and current oscillations directly during the startup and transition stages, the approach significantly enhances overall system stability and responsiveness. Validation through simulation and experimental testing demonstrates the strategy's success in maintaining low oscillation levels across various operating conditions, confirming its reliability for high-performance industrial applications.
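Field-oriented control, including the sensorless variant discussed here, expresses the stator currents in a rotor-aligned d-q frame through the Clarke and Park transforms. In sensorless operation the angle θ comes from an estimator rather than from an encoder. The sketch below uses the amplitude-invariant form of both transforms. The current amplitude and angle are hypothetical, and the paper's startup and transition strategy is not modeled.

```python
import math

def clarke(ia, ib, ic):
    """Amplitude-invariant Clarke transform: abc -> stationary alpha-beta frame."""
    alpha = (2.0 / 3.0) * (ia - 0.5 * ib - 0.5 * ic)
    beta = (ib - ic) / math.sqrt(3.0)
    return alpha, beta

def park(alpha, beta, theta):
    """Park transform: alpha-beta -> rotor-aligned d-q frame at electrical angle theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q

# Balanced 10 A currents aligned with the rotor angle map to d = 10, q = 0.
theta = 0.7
ia = 10 * math.cos(theta)
ib = 10 * math.cos(theta - 2 * math.pi / 3)
ic = 10 * math.cos(theta + 2 * math.pi / 3)
d, q = park(*clarke(ia, ib, ic), theta)
print(round(d, 6), round(abs(q), 6))  # 10.0 0.0
```

An error in the estimated θ rotates this d-q picture and leaks current between axes, which is one reason the hand-over from startup to sensorless operation matters.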
- [12] arXiv:2503.22860
Title: MCRB for Parameter Estimation from One-Bit Quantized and Oversampled Measurements
Subjects: Signal Processing (eess.SP)
One-bit quantization has garnered significant attention in recent years for various signal processing and communication applications. Estimating model parameters from one-bit quantized data can be challenging, particularly when the quantization process is explicitly accounted for in the estimator. In many cases, the estimator disregards quantization effects, leading to model misspecification. Consequently, estimation errors arise from both quantization and misspecification. Traditional performance bounds, such as the Cramér-Rao bound (CRB), fail to capture the impact of misspecification on estimation performance. To address this limitation, we derive the misspecified CRB (MCRB) for parameter estimation in a quantized data model consisting of a signal component in additive Gaussian noise. We apply this bound to direction-of-arrival estimation using quantized measurements from a sensor array and to frequency estimation with oversampled quantized data. Simulations show that the mean squared error of the misspecified maximum-likelihood estimator asymptotically achieves the MCRB. Our results demonstrate that, unlike in finely quantized scenarios, oversampling can significantly enhance estimation performance in the presence of misspecified one-bit quantized measurements.
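The effect of ignoring quantization can be seen in a much simpler setting than the paper's DOA and frequency problems: estimating a constant θ from y_n = sign(θ + w_n), where w_n is Gaussian noise with known standard deviation σ. An estimator that treats the one-bit samples as unquantized (the sample mean, which is the ML estimate under the wrong model y = θ + w) is biased. Inverting E[y] = 2Φ(θ/σ) - 1 instead recovers θ. The values of θ, σ, N, and the seed are hypothetical. The sketch illustrates misspecification only and does not compute the MCRB.

```python
import random
from statistics import NormalDist

def one_bit_samples(theta, sigma, n, seed=0):
    """One-bit quantized measurements y = sign(theta + w), w ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [1.0 if theta + rng.gauss(0.0, sigma) >= 0 else -1.0 for _ in range(n)]

theta, sigma = 0.5, 1.0
y = one_bit_samples(theta, sigma, n=200_000)
ybar = sum(y) / len(y)

# Misspecified estimator: ignores quantization and averages the +/-1 samples.
naive = ybar
# Quantization-aware estimator: invert E[y] = 2*Phi(theta/sigma) - 1.
aware = sigma * NormalDist().inv_cdf((1.0 + ybar) / 2.0)
print(round(naive, 3), round(aware, 3))  # naive lands near 0.383, aware near 0.5
```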
- [13] arXiv:2503.22867
Title: Markov Potential Game Construction and Multi-Agent Reinforcement Learning with Applications to Autonomous Driving
Subjects: Systems and Control (eess.SY)
Markov games (MGs) serve as the mathematical foundation for multi-agent reinforcement learning (MARL), enabling self-interested agents to learn their optimal policies while interacting with others in a shared environment. However, due to the complexities of an MG problem, seeking (Markov perfect) Nash equilibrium (NE) is often very challenging for a general-sum MG. Markov potential games (MPGs), which are a special class of MGs, have appealing properties such as guaranteed existence of pure NEs and guaranteed convergence of gradient play algorithms, thereby leading to desirable properties for many MARL algorithms in their NE-seeking processes. However, the question of how to construct MPGs has been open. This paper provides sufficient conditions on the reward design and on the Markov decision process (MDP), under which an MG is an MPG. Numerical results on autonomous driving applications are reported.
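The defining property that MPGs extend to the Markov setting can be checked directly in a one-shot, two-player game. Such a game is an exact potential game if some function Φ exists for which every unilateral deviation changes the deviating player's payoff by exactly the change in Φ. The sketch below checks this property for finite games. The example (two drivers choosing between two congestible lanes, with its Rosenthal potential) and all names are hypothetical and are not the paper's construction.

```python
from itertools import product

def is_exact_potential(u1, u2, phi, tol=1e-9):
    """u1, u2, phi: dicts mapping a joint action (a1, a2) to a float."""
    actions1 = sorted({a for a, _ in u1})
    actions2 = sorted({b for _, b in u1})
    for a, b in product(actions1, actions2):
        for a_dev in actions1:   # player 1 deviates unilaterally
            if abs((u1[(a_dev, b)] - u1[(a, b)]) - (phi[(a_dev, b)] - phi[(a, b)])) > tol:
                return False
        for b_dev in actions2:   # player 2 deviates unilaterally
            if abs((u2[(a, b_dev)] - u2[(a, b)]) - (phi[(a, b_dev)] - phi[(a, b)])) > tol:
                return False
    return True

def payoff(a, b, player):
    # Two drivers pick lane 0 or 1; payoff = -(number of drivers in my lane).
    mine = a if player == 1 else b
    return -float([a, b].count(mine))

joint = list(product([0, 1], repeat=2))
u1 = {(a, b): payoff(a, b, 1) for a, b in joint}
u2 = {(a, b): payoff(a, b, 2) for a, b in joint}
# Rosenthal potential: -(sum over lanes of 1 + 2 + ... + load).
phi = {(a, b): -sum(n * (n + 1) / 2 for n in ([a, b].count(0), [a, b].count(1)))
       for a, b in joint}
print(is_exact_potential(u1, u2, phi))  # True
```

In such a game, any sequence of strict unilateral improvements strictly increases Φ. This is the mechanism behind the guaranteed existence of pure NEs and the convergence of gradient play mentioned in the abstract.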
- [14] arXiv:2503.22870
Title: Attitude Synchronization for Multi-Agent Systems on SO(3) Using Vector Measurements
Subjects: Systems and Control (eess.SY)
In this paper, we address the problem of leaderless attitude synchronization for a group of rigid body systems evolving on SO(3), relying on local measurements of some inertial (unit-length) vectors. The interaction graph among agents is assumed to be undirected, acyclic, and connected. We first present a distributed attitude synchronization scheme designed at the kinematic level of SO(3), followed by an extended scheme designed at the dynamic level. Both schemes are supported by a rigorous stability analysis, which establishes their almost global asymptotic stability properties. Finally, numerical simulations demonstrate the effectiveness of both distributed attitude synchronization schemes.
- [15] arXiv:2503.22889
Title: CLuP-Based Dual-Deconvolution in Automotive ISAC Scenarios
Comments: 6 pages, 4 figures
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Accurate target parameter estimation of range, velocity, and angle is essential for vehicle safety in advanced driver assistance systems (ADAS) and autonomous vehicles. To enable spectrum sharing, ADAS may employ integrated sensing and communications (ISAC). This paper examines a dual-deconvolution automotive ISAC scenario where the radar waveform is known but the propagation channel is not, while in the communications domain, the channel is known but the transmitted message is not. Conventional maximum likelihood (ML) estimation of automotive target parameters is computationally demanding. To address this, we propose a low-complexity approach using the controlled loosening-up (CLuP) algorithm, which employs iterative refinement for efficient separation and estimation of radar targets. We achieve this through a nuclear norm restriction that stabilizes the problem. Numerical experiments demonstrate the robustness of this approach in high-mobility, noisy automotive environments, highlighting CLuP's potential as a scalable, real-time solution for ISAC in future vehicular networks.
- [16] arXiv:2503.22992
Title: Evaluation of Remote Driver Performance in Urban Environment Operational Design Domains
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Remote driving has emerged as a solution for enabling human intervention in scenarios where Automated Driving Systems (ADS) face challenges, particularly in urban Operational Design Domains (ODDs). This study evaluates the performance of Remote Drivers (RDs) of passenger cars in a representative urban ODD in Las Vegas, focusing on the influence of cumulative driving experience and targeted training approaches. Using performance metrics such as efficiency, braking, acceleration, and steering, the study shows that driving experience can lead to noticeable improvements in RD performance and that experience of up to 600 km correlates with improved vehicle control. In addition, driving efficiency exhibited a positive trend with increasing kilometers, particularly during the first 300 km of experience, and reached a plateau from 400 km onward within a range of 0.35 to 0.42 km/min in the defined ODD. The research further compares ODD-specific training methods and finds that the detailed ODD training approach attains notable advantages over the other training approaches. The findings underscore the importance of tailored ODD training in enhancing RD performance, safety, and scalability for Remote Driving Systems (RDS) in real-world applications, while identifying opportunities for optimizing training protocols to address both routine and extreme scenarios. The study provides a robust foundation for advancing RDS deployment in urban environments, contributing to the development of scalable and safety-critical remote operation standards.
- [17] arXiv:2503.23004
Title: The trajectoRIR Database: Room Acoustic Recordings Along a Trajectory of Moving Microphones
Comments: 15 pages, 7 figures
Subjects: Audio and Speech Processing (eess.AS)
Data availability is essential for developing acoustic signal processing algorithms, especially data-driven approaches that demand large and diverse training datasets. For this reason, an increasing number of databases have been published in recent years, including either room impulse responses (RIRs) or recordings of moving audio. In this paper, we introduce the trajectoRIR database, an extensive, multi-array collection of both dynamic and stationary acoustic recordings along a controlled trajectory in a room. Specifically, the database features recordings using moving microphones and stationary RIRs spatially sampling the room acoustics along an L-shaped, 3.74-meter-long trajectory. This combination makes trajectoRIR unique and applicable to various tasks, ranging from sound source localization and tracking to spatially dynamic sound field reconstruction and system identification. The recording room has a reverberation time of 0.5 seconds, and the three different microphone configurations employed include a dummy head with additional reference microphones located next to the ears, 3 first-order Ambisonics microphones, two circular arrays of 16 and 4 channels, and a 12-channel linear array. The microphones were moved by a robotic cart traversing a rail at three speeds: 0.2, 0.4, and 0.8 m/s. Audio signals were reproduced using two stationary loudspeakers. The collected database features 8648 stationary RIRs, as well as perfect sweeps, speech, music, and stationary noise recorded during motion. MATLAB and Python scripts are included to access the recorded audio and to retrieve geometrical information.
- [18] arXiv:2503.23010
Title: A Comprehensive Comparison between Terahertz and Optical Wireless Communications
Subjects: Signal Processing (eess.SP)
This paper presents a comprehensive quantitative comparison between Terahertz (THz) communication (TeraCom) and optical wireless communication (OWC) technologies, focusing on both indoor and outdoor environments. We propose a comparison method for TeraCom and vertical-cavity surface-emitting laser (VCSEL)-based OWC in indoor scenarios, incorporating misalignment effects by modeling the THz antenna radiation pattern within a multi-ray THz channel model and using a Gaussian beam model for VCSEL-based OWC. Unified beamwidth parameters allow for a detailed analysis of misalignment impact on both systems. Furthermore, we develop power consumption models for each technology, integrating key parameters such as THz phase noise, VCSEL non-linearities, and photodetector bandwidth-area tradeoffs. These models enable an in-depth analysis of energy efficiency in indoor environments, including multi-transmitter coverage scenarios. For outdoor scenarios, we summarize existing stochastic channel models addressing path loss, pointing errors, and small-scale fading for free space optics (FSO) and THz links. We then apply these models to unmanned aerial vehicle (UAV) applications to assess performance in dynamic conditions. Our results provide critical insights into the suitability of each technology for various deployment scenarios.
- [19] arXiv:2503.23042
Title: MIL vs. Aggregation: Evaluating Patient-Level Survival Prediction Strategies Using Graph-Based Learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Oncologists often rely on a multitude of data, including whole-slide images (WSIs), to guide therapeutic decisions, aiming for the best patient outcome. However, predicting the prognosis of cancer patients can be challenging due to tumor heterogeneity, intra-patient variability, and the complexity of analyzing WSIs. These images are extremely large, containing billions of pixels, which makes direct processing computationally expensive and requires specialized methods to extract relevant information. Additionally, multiple WSIs from the same patient may capture different tumor regions, some more informative than others. This raises a fundamental question: should we use all WSIs to characterize the patient, or should we identify the most representative slide for prognosis? Our work seeks to answer this question by comparing various strategies for predicting survival at the WSI level and at the patient level. The former treats each WSI as an independent sample, mimicking the strategy adopted in other works, while the latter comprises methods that either aggregate the predictions of a patient's WSIs or automatically identify the most relevant slide using multiple-instance learning (MIL). Additionally, we evaluate different Graph Neural Network architectures under these strategies. We conduct our experiments on the MMIST-ccRCC dataset, which comprises patients with clear cell renal cell carcinoma (ccRCC). Our results show that MIL-based selection improves accuracy, suggesting that choosing the most representative slide benefits survival prediction.
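The two patient-level strategies compared here can be contrasted in a few lines. Aggregation combines per-slide risk predictions (averaging is used below), while an attention-style MIL pooling places softmax weights over slides and can concentrate on the most informative one. The risk scores, attention logits, and function names are hypothetical. In the paper, slide representations come from graph neural networks and any MIL weights would be learned rather than fixed.

```python
import math

def mean_aggregation(slide_risks):
    """Aggregation strategy: average the per-WSI risk predictions of one patient."""
    return sum(slide_risks) / len(slide_risks)

def attention_pooling(slide_risks, attention_logits):
    """MIL-style pooling: softmax weights over slides, then a weighted risk."""
    m = max(attention_logits)
    w = [math.exp(a - m) for a in attention_logits]
    total = sum(w)
    w = [v / total for v in w]
    return sum(wi * r for wi, r in zip(w, slide_risks)), w

risks = [0.2, 0.9, 0.3]      # three WSIs of one hypothetical patient
logits = [0.0, 4.0, 0.0]     # the second slide is deemed most relevant
print(mean_aggregation(risks))
risk, weights = attention_pooling(risks, logits)
print(round(risk, 3), [round(v, 3) for v in weights])
```

When one slide dominates the attention weights, the pooled risk approaches that slide's prediction. This behaves like the "most representative slide" selection that the abstract finds beneficial.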
- [20] arXiv:2503.23052
Title: ShiftLIC: Lightweight Learned Image Compression with Spatial-Channel Shift Operations
Subjects: Image and Video Processing (eess.IV)
Learned Image Compression (LIC) has attracted considerable attention due to its outstanding rate-distortion (R-D) performance and flexibility. However, its substantial computational cost poses challenges for practical deployment. The issue of feature redundancy in LIC is rarely addressed; our findings indicate that many features within the LIC backbone network exhibit similarities.
This paper introduces ShiftLIC, a novel and efficient LIC framework that employs parameter-free shift operations to replace large-kernel convolutions, significantly reducing the model's computational burden and parameter count. Specifically, we propose the Spatial Shift Block (SSB), which combines shift operations with small-kernel convolutions to replace large-kernel convolutions. This approach maintains feature extraction efficiency while reducing both computational complexity and model size. To further enhance representation capability in the channel dimension, we propose a channel attention module based on recursive feature fusion, which enhances feature interaction while minimizing computational overhead. Additionally, we introduce an improved entropy model integrated with the SSB module, making the entropy estimation process more lightweight and thereby reducing computational costs throughout the model.
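A parameter-free spatial shift of the kind this block relies on can be sketched by splitting the channels into groups and moving each group one pixel in a different direction, padding vacated positions with zeros. A small-kernel convolution applied after such a shift sees a wider neighborhood at no extra parameter cost. The four-direction grouping, one-pixel shift, and zero padding follow the common shift-operation pattern and are assumptions; they are not necessarily ShiftLIC's exact SSB configuration.

```python
def shift_channel(ch, dy, dx):
    """Shift one H x W channel by (dy, dx) pixels, filling vacated entries with 0."""
    h, w = len(ch), len(ch[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i - dy, j - dx          # source pixel that lands on (i, j)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = ch[si][sj]
    return out

def spatial_shift(features):
    """Split C channels into 4 groups and shift them down, up, right, left by one pixel."""
    directions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    group = max(1, len(features) // 4)
    out = []
    for idx, ch in enumerate(features):
        g = min(idx // group, 3)             # leftover channels join the last group
        out.append(shift_channel(ch, *directions[g]))
    return out

center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
shifted = spatial_shift([center] * 4)
for ch in shifted:
    print(ch)
```

The operation only moves data, so it adds no weights and no multiply-accumulates, which is why it is attractive as a replacement for part of a large-kernel convolution.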
Experimental results demonstrate that ShiftLIC outperforms leading compression methods, such as VVC Intra and GMM, in terms of computational cost, parameter count, and decoding latency. Additionally, ShiftLIC sets a new SOTA benchmark with a BD-rate gain per MACs/pixel of -102.6%, showcasing its potential for practical deployment in resource-constrained environments. The code is released at this https URL.
- [21] arXiv:2503.23055
Title: Advancing THz Radio Map Construction and Obstacle Sensing: An Integrated Generative Framework in ISAC
Authors: Tianyu Hu, Shuai Wang, Yunhang Xie, Lingxiang Li, Zhi Chen, Boyu Ning, Wassim Hamidouche, Lina Bariah, Samson Lasaulce, Merouane Debbah
Subjects: Signal Processing (eess.SP)
Integrated sensing and communication (ISAC) in the terahertz (THz) band enables obstacle detection, which in turn facilitates efficient beam management to mitigate THz signal blockage. Simultaneously, a THz radio map, which captures signal propagation characteristics through the distribution of received signal strength (RSS), is well-suited for sensing, as it inherently contains obstacle-related information and reflects the unique properties of the THz channel. This means that communication-assisted sensing in ISAC can be effectively achieved using a THz radio map. However, constructing a radio map presents significant challenges due to the sparse deployment of THz sensors and their limited ability to accurately measure the RSS distribution, which directly affects obstacle sensing. In this paper, we formulate an integrated problem for the first time, leveraging the mutual enhancement between sensed obstacles and the constructed THz radio maps. To address this challenge while improving generalization, we propose an integration framework based on a conditional generative adversarial network (CGAN), which uncovers the manifold structure of THz radio maps embedded with obstacle information. Furthermore, recognizing the shared environmental semantics across THz radio maps from different beam directions, we introduce a novel voting-based sensing scheme, where obstacles are detected by aggregating votes from THz radio maps generated by the CGAN. Simulation results demonstrate that the proposed framework outperforms non-integrated baselines in both radio map construction and obstacle sensing, achieving up to 44.3% and 90.6% reductions in mean squared error (MSE), respectively, in a real-world scenario. These results validate the effectiveness of the proposed voting-based scheme.
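The voting step can be sketched as a per-cell rule over binary obstacle masks, one per beam direction. Each mask casts a vote for the cells it flags, and a cell is declared occupied when its vote count reaches a threshold. The grid, the threshold, and the way each generated radio map would be turned into a binary mask are all hypothetical, because the abstract does not specify how votes are derived from the CGAN outputs.

```python
def vote_obstacles(masks, threshold):
    """masks: list of equally sized 2-D 0/1 grids, one per beam direction."""
    h, w = len(masks[0]), len(masks[0][0])
    votes = [[sum(m[i][j] for m in masks) for j in range(w)] for i in range(h)]
    return [[1 if votes[i][j] >= threshold else 0 for j in range(w)] for i in range(h)]

# Three hypothetical per-beam masks over a 2 x 3 grid.
masks = [
    [[0, 1, 0], [0, 1, 1]],
    [[0, 1, 0], [1, 1, 0]],
    [[1, 1, 0], [0, 0, 0]],
]
print(vote_obstacles(masks, threshold=2))  # [[0, 1, 0], [0, 1, 0]]
```

Raising the threshold suppresses cells flagged by only one beam direction, which is how aggregating shared environmental evidence across maps can filter out spurious detections.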
- [22] arXiv:2503.23108 [pdf, html, other]
Title: SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System
Authors: Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee
Comments: 19 pages, preprint
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS consists of three components: a speech autoencoder for continuous latent representation, a text-to-latent module that uses flow matching to map text to latents, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: this https URL.
- [23] arXiv:2503.23119 [pdf, html, other]
Title: Channel Coding meets Sequence Design via Machine Learning for Integrated Sensing and Communications
Comments: Submitted to IEEE Communications Letters
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
For integrated sensing and communications, an intriguing question is whether information-bearing channel-coded signals can be reused for sensing, specifically for ranging. This question forces the hitherto non-overlapping fields of channel coding (communications) and sequence design (sensing) to intersect by motivating the design of error-correcting codes that have good autocorrelation properties. In this letter, we demonstrate that machine learning (ML) is well-suited for designing such codes, especially for short block lengths. As an example, for rate 1/2 and block length 32, we show that even an unsophisticated ML code achieves a bit-error rate performance similar to that of a Polar code with the same parameters, but with autocorrelation sidelobes 24 dB lower. While a length-32 Zadoff-Chu (ZC) sequence has zero autocorrelation sidelobes, there are only 16 such sequences; hence, a code rate of 1/2 cannot be realized by using ZC sequences as codewords. ML thus bridges channel coding and sequence design by trading off an ideal autocorrelation function for a large (i.e., rate-dependent) codebook size.
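The codebook-size argument can be checked numerically. At rate 1/2 and block length 32, a code carries 16 information bits and therefore needs 2^16 = 65,536 codewords, whereas the number of admissible length-32 ZC roots u (those with gcd(u, 32) = 1) is only 16. The sketch below assumes the standard even-length ZC form exp(-j*pi*u*n^2/N) and verifies both that count and the zero periodic autocorrelation sidelobes; it does not reproduce the authors' ML code design.

```python
import cmath
import math

def zadoff_chu(u: int, n_len: int) -> list:
    """Zadoff-Chu sequence of even length n_len with root u (gcd(u, n_len) = 1)."""
    # Standard form for even lengths: x[n] = exp(-j*pi*u*n^2/N)
    return [cmath.exp(-1j * math.pi * u * n * n / n_len) for n in range(n_len)]

def periodic_autocorr(x: list, shift: int) -> complex:
    """Periodic (cyclic) autocorrelation of x at the given shift."""
    n_len = len(x)
    return sum(x[n] * x[(n + shift) % n_len].conjugate() for n in range(n_len))

N = 32
roots = [u for u in range(1, N) if math.gcd(u, N) == 1]
print(len(roots))  # 16 admissible roots for N = 32

# Largest sidelobe magnitude over all roots and all nonzero cyclic shifts
worst_sidelobe = max(
    abs(periodic_autocorr(zadoff_chu(u, N), k))
    for u in roots
    for k in range(1, N)
)
print(worst_sidelobe < 1e-9)  # True: zero sidelobes up to rounding error
```

Sixteen sequences can index at most 4 bits per 32-symbol block, far short of the 16 bits that rate 1/2 requires, which is the gap the abstract's trade-off addresses.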
- [24] arXiv:2503.23149 [pdf, html, other]
Title: Towards Interpretable Counterfactual Generation via Multimodal Autoregression
Authors: Chenglong Ma, Yuanfeng Ji, Jin Ye, Lu Zhang, Ying Chen, Tianbin Li, Mingjie Li, Junjun He, Hongming Shan
Subjects: Image and Video Processing (eess.IV)
Counterfactual medical image generation enables clinicians to explore clinical hypotheses, such as predicting disease progression, which facilitates their decision-making. While existing methods can generate visually plausible images from disease progression prompts, they produce silent predictions that lack interpretations verifying how the generation reflects the hypothesized progression, a critical gap for medical applications that require traceable reasoning. In this paper, we propose Interpretable Counterfactual Generation (ICG), a novel task requiring the joint generation of counterfactual images that reflect the clinical hypothesis and interpretation texts that outline the visual changes induced by the hypothesis. To enable ICG, we present ICG-CXR, the first dataset pairing longitudinal medical images with hypothetical progression prompts and textual interpretations. We further introduce ProgEmu, an autoregressive model that unifies the generation of counterfactual images and textual interpretations. We demonstrate the superiority of ProgEmu in generating progression-aligned counterfactuals and interpretations, showing significant potential for enhancing clinical decision support and medical education. Project page: this https URL.
- [25] arXiv:2503.23151 [pdf, html, other]
Title: Improved Motion Plane Adaptive 360-Degree Video Compression Using Affine Motion Models
Journal-ref: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Subjects: Image and Video Processing (eess.IV)
Efficient compression of 360-degree video content requires advanced motion models for inter-frame prediction. The Motion Plane Adaptive (MPA) motion model projects the frames onto multiple perspective planes in 3D space. It improves motion compensation by estimating the motion on those planes with a translational diamond search. In this work, we enhance this motion model with an affine parameterization and motion estimation method, achieving a feasible trade-off between the quality of the reconstructed frames and the computational cost. The affine motion estimation is performed with the inverse compositional Lucas-Kanade algorithm. The proposed method improves motion compensation significantly: the motion-compensated frame has a Weighted-to-Spherically-uniform Peak Signal-to-Noise Ratio (WS-PSNR) about 1.6 dB higher than with the conventional MPA. In a basic video codec, the improved inter prediction can lead to Bjøntegaard Delta (BD) rate savings between 9% and 35%, depending on the block size (BS) and the number of motion parameters.
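To make the quality metric concrete, the sketch below computes WS-PSNR for equirectangular (ERP) frames using the commonly used cosine-of-latitude row weights, which down-weight the oversampled rows near the poles. The ERP layout, the 8-bit peak value of 255, and the function name `ws_psnr` are assumptions; the paper's projection onto motion planes and its affine Lucas-Kanade estimation are not shown.

```python
import math
from typing import List

Image = List[List[float]]  # rows x columns of an equirectangular (ERP) frame

def ws_psnr(ref: Image, test: Image, max_val: float = 255.0) -> float:
    """WS-PSNR for ERP frames with cosine latitude weights per row."""
    height = len(ref)
    num, den = 0.0, 0.0
    for j, (row_ref, row_test) in enumerate(zip(ref, test)):
        # Row weight cos(latitude): rows near the poles are oversampled in ERP
        w = math.cos((j + 0.5 - height / 2) * math.pi / height)
        for a, b in zip(row_ref, row_test):
            num += w * (a - b) ** 2
            den += w
    wmse = num / den  # weighted mean squared error
    return float("inf") if wmse == 0 else 10 * math.log10(max_val ** 2 / wmse)

if __name__ == "__main__":
    ref = [[100.0] * 8 for _ in range(4)]
    noisy = [[101.0] * 8 for _ in range(4)]
    print(round(ws_psnr(ref, noisy), 2))  # 48.13: a uniform error of 1 gives WMSE = 1
```

Because the weights are normalized by their sum, a uniform error yields the same value as ordinary PSNR; the two metrics differ only when errors concentrate at particular latitudes.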
- [26] arXiv:2503.23179 [pdf, html, other]
Title: OncoReg: Medical Image Registration for Oncological Challenges
Authors: Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Zi Li, Tony C. W. Mok, BoWen LI, Christian Staackmann, Christoph Großbröhmer, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich
Comments: 26 pages, 6 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography (CBCT) with standard planning fan-beam CT (FBCT) images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with combinations of methods, particularly in feature extraction, proving most effective.
- [27] arXiv:2503.23187 [pdf, html, other]
Title: Least-Squares Khatri-Rao Factorization of a Polynomial Matrix
Subjects: Signal Processing (eess.SP)
The Khatri-Rao product is extensively used in array processing, tensor decomposition, and multi-way data analysis. Many applications require a least-squares (LS) Khatri-Rao factorization. In broadband sensor array problems, polynomial matrices effectively model frequency-dependent behaviors, necessitating extensions of conventional linear algebra techniques. This paper generalizes LS Khatri-Rao factorization from ordinary to polynomial matrices by applying it to the discrete Fourier transform (DFT) samples of polynomial matrices. Phase coherence across the bin-wise Khatri-Rao factors is ensured via a phase-smoothing algorithm. The proposed method is validated through broadband angle-of-arrival (AoA) estimation for uniform planar arrays (UPAs), where the steering matrix is a polynomial matrix that can be represented as a Khatri-Rao product of the steering matrices in the azimuth and elevation directions.
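For ordinary (non-polynomial) matrices, the LS Khatri-Rao factorization splits into one rank-1 problem per column: reshaping column r of X into a matrix M makes an exact Khatri-Rao column equal to a_r b_r^T, so the dominant singular pair of M gives the LS-optimal factors. The sketch below assumes real-valued entries and uses power iteration in place of a library SVD; the bin-wise DFT processing and phase smoothing for polynomial matrices described in the abstract are not included, and each column's factors are determined only up to a scaling.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

def khatri_rao(a: Matrix, b: Matrix) -> Matrix:
    """Column-wise Kronecker product: column r equals kron(a[:, r], b[:, r])."""
    cols = len(a[0])
    return [[a[i][r] * b[j][r] for r in range(cols)]
            for i in range(len(a)) for j in range(len(b))]

def ls_khatri_rao(x: Matrix, rows_a: int, rows_b: int,
                  iters: int = 100) -> Tuple[Matrix, Matrix]:
    """LS factors A (rows_a x R) and B (rows_b x R) with khatri_rao(A, B) ~ x."""
    cols = len(x[0])
    a = [[0.0] * cols for _ in range(rows_a)]
    b = [[0.0] * cols for _ in range(rows_b)]
    for r in range(cols):
        # Reshape column r into M (rows_a x rows_b), M[i][j] = x[i*rows_b + j][r];
        # an exact Khatri-Rao column would make M the rank-1 matrix a_r b_r^T.
        m = [[x[i * rows_b + j][r] for j in range(rows_b)] for i in range(rows_a)]
        # Power iteration for the dominant singular pair yields the best
        # rank-1 (least-squares) approximation u v^T of M.
        v = [1.0] * rows_b
        u = [0.0] * rows_a
        for _ in range(iters):
            u = [sum(m[i][j] * v[j] for j in range(rows_b)) for i in range(rows_a)]
            norm_u = math.sqrt(sum(t * t for t in u))
            if norm_u == 0.0:
                break  # M maps the starting vector to zero; leave this column unresolved
            u = [t / norm_u for t in u]
            v = [sum(m[i][j] * u[i] for i in range(rows_a)) for j in range(rows_b)]
        for i in range(rows_a):
            a[i][r] = u[i]
        for j in range(rows_b):
            b[j][r] = v[j]
    return a, b

if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
    X = khatri_rao(A, B)
    A_hat, B_hat = ls_khatri_rao(X, rows_a=2, rows_b=3)
    X_hat = khatri_rao(A_hat, B_hat)
    err = max(abs(p - q) for rp, rq in zip(X, X_hat) for p, q in zip(rp, rq))
    print(err < 1e-9)  # True: exact Khatri-Rao data is reconstructed
```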
- [28] arXiv:2503.23218 [pdf, html, other]
Title: Multi-Agent Reinforcement Learning for Graph Discovery in D2D-Enabled Federated Learning
Subjects: Signal Processing (eess.SP)
Augmenting federated learning (FL) with device-to-device (D2D) communications can help improve convergence speed and reduce model bias through local information exchange. However, data privacy concerns, trust constraints between devices, and unreliable wireless channels each pose challenges in finding an effective yet resource-efficient D2D graph structure. In this paper, we develop a decentralized reinforcement learning (RL) method for D2D graph discovery that promotes communication of impactful datapoints over reliable links for multiple learning paradigms, while respecting both data-specific and device-specific trust constraints. An independent RL agent at each device trains a policy to predict the impact of incoming links in a decentralized manner, without exposing local data or incurring significant communication overhead. For supervised settings, the D2D graph aims to improve device-specific label diversity without compromising system-level performance. For semi-supervised settings, we achieve this by incorporating distributed label propagation. For unsupervised settings, we develop a variation-based diversity metric that estimates data diversity in terms of occupied latent space. Numerical experiments on five widely used datasets confirm that the data diversity improvements induced by our method increase convergence speed by up to 3 times while reducing energy consumption by up to 5 times. They also show that our method is resilient to stragglers and to changes in the aggregation interval. Finally, we show that our method offers scalability benefits for larger system sizes without increases in relative overhead, as well as adaptability to various downstream FL architectures and to dynamic wireless environments.
- [29] arXiv:2503.23219 [pdf, html, other]
Title: Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Authors: Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.
- [30] arXiv:2503.23228 [pdf, html, other]
Title: Energy-Aware Lane Planning for Connected Electric Vehicles in Urban Traffic: Design and Vehicle-in-the-Loop Validation
Comments: Submitted to an Invited Session at 2025 IEEE Conference on Decision and Control
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Urban driving with connected and automated vehicles (CAVs) offers potential for energy savings, yet most eco-driving strategies focus solely on longitudinal speed control within a single lane. This neglects the significant impact of lateral decisions, such as lane changes, on overall energy efficiency, especially in environments with traffic signals and heterogeneous traffic flow. To address this gap, we propose a novel energy-aware motion planning framework that jointly optimizes longitudinal speed and lateral lane-change decisions using vehicle-to-infrastructure (V2I) communication. Our approach estimates long-term energy costs using a graph-based approximation and solves short-horizon optimal control problems under traffic constraints. Using a data-driven energy model calibrated to an actual battery electric vehicle, we demonstrate with vehicle-in-the-loop experiments that our method reduces motion energy consumption by up to 24 percent compared to a human driver, highlighting the potential of connectivity-enabled planning for sustainable urban autonomy.
- [31] arXiv:2503.23255 [pdf, html, other]
Title: Iterative VCG-based Mechanism Fosters Cooperation in Multi-Regional Network Design
Subjects: Systems and Control (eess.SY)
Transportation network design often involves multiple stakeholders with diverse priorities. We consider a system with a hierarchical multi-agent structure, featuring self-optimized subnetwork operators at the lower level and a central organization at the upper level. Independent regional planning can lead to inefficiencies due to the lack of coordination, hindering interregional travel and cross-border infrastructure development, while centralized methods may struggle to align local interests and can be impractical to implement. To support decision making for such a system, we introduce an iterative VCG-based mechanism for multi-regional network design that fosters cooperation among subnetwork operators. By leveraging the Vickrey-Clarke-Groves (VCG) mechanism, the framework determines collective investment decisions and the necessary payments from both operators and the central organization to achieve efficient outcomes. A case study on the European Railway System validates the effectiveness of the proposed method, demonstrating significant improvements in overall network performance through enhanced cross-region cooperation.
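The mechanism builds on the textbook one-shot VCG rule, sketched below: choose the option that maximizes total reported value, then charge each agent the externality it imposes on the others (the Clarke pivot payment). The option names, valuations, and the function name `vcg` are hypothetical, and neither the iterative procedure nor the central organization's payments from the abstract are modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: agent -> investment option -> reported value
Valuations = Dict[str, Dict[str, float]]

def vcg(options: List[str], values: Valuations) -> Tuple[str, Dict[str, float]]:
    """One-shot VCG: welfare-maximizing choice plus Clarke pivot payments."""

    def welfare(option: str, exclude: str = "") -> float:
        # Total reported value of `option`, optionally leaving one agent out
        return sum(v[option] for agent, v in values.items() if agent != exclude)

    chosen = max(options, key=welfare)  # efficient (welfare-maximizing) option
    payments: Dict[str, float] = {}
    for agent in values:
        # Externality: how much better the others would do if `agent` were absent
        best_without = max(welfare(o, exclude=agent) for o in options)
        payments[agent] = best_without - welfare(chosen, exclude=agent)
    return chosen, payments

if __name__ == "__main__":
    reported = {
        "region_1": {"link_A": 3, "link_B": 0},
        "region_2": {"link_A": 0, "link_B": 2},
        "region_3": {"link_A": 0, "link_B": 2},
    }
    print(vcg(["link_A", "link_B"], reported))
    # ('link_B', {'region_1': 0, 'region_2': 1, 'region_3': 1})
```

Here region_2 and region_3 each pay 1 because, without either of them, the others would have preferred link_A (value 3 versus 2); this payment rule is what makes truthful reporting a dominant strategy in VCG.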
- [32] arXiv:2503.23263 [pdf, html, other]
Title: A Method for Localization of Cellular Users from Call Detail Records
Comments: 9 pages, 19 figures
Subjects: Signal Processing (eess.SP)
A common problem in justice applications is localizing a user of a cellular network from a call detail record (CDR), which typically reveals only the base station and sector to which the user was connected. This precludes precise estimation of location. Instead, one is limited to estimating a region of plausible locations (RPL) using static information such as sector antenna orientation, beamwidth, and the locations of nearby base stations. In this paper, we propose a method for RPL estimation in which the shape bounding the RPL is derived from a model of the antenna pattern via the Friis Transmission Equation, and the size of the RPL is determined by the mean distance to nearby base stations. The performance of the proposed method is evaluated by "best server" analysis of measurements acquired from drive testing in the vicinity of Winter Garden, Florida, observing three 700 MHz-band LTE cellular networks serving this area. Of the 16 sectors evaluated, the aggregate error rate (i.e., the fraction of users located outside the RPL estimated for the associated sector) is found to be 1.3%, with a worst-case per-sector error rate of about 13.3% and error rates below 1.8% for 13 of the 16 sectors. The principal difficulty is shown to be estimation of the RPL size, which entails a trade-off between minimizing RPL area (yielding the "tightest" localization) and minimizing error rate.
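The shape argument follows from the Friis equation: with P_r proportional to G_t(theta)/d^2, a fixed received-power threshold is reached at a distance proportional to sqrt(G_t(theta)), i.e. to 10^(G_dB(theta)/20). The sketch below assumes a parabolic horizontal pattern of the form -min(12(theta/theta_3dB)^2, A_m) dB, as is common in 3GPP-style sector models, with theta_3dB = 65 degrees and A_m = 20 dB. The boresight radius is an input because the exact rule mapping mean distance to nearby base stations onto RPL size is not given here; all function names are hypothetical.

```python
import math

def sector_gain_db(theta_deg: float, beamwidth_deg: float = 65.0,
                   max_atten_db: float = 20.0) -> float:
    """Relative horizontal antenna gain in dB (parabolic main lobe, floored)."""
    return -min(12.0 * (theta_deg / beamwidth_deg) ** 2, max_atten_db)

def rpl_radius(theta_deg: float, boresight_radius: float,
               beamwidth_deg: float = 65.0) -> float:
    """RPL boundary distance at an azimuth offset theta from boresight.

    Friis: P_r ~ G_t(theta) / d^2, so for a fixed received-power threshold
    the reachable distance scales as sqrt(G_t), i.e. 10^(G_dB / 20).
    """
    return boresight_radius * 10 ** (sector_gain_db(theta_deg, beamwidth_deg) / 20.0)

def inside_rpl(x: float, y: float, boresight_deg: float,
               boresight_radius: float, beamwidth_deg: float = 65.0) -> bool:
    """True if point (x, y), relative to the base station, lies in the RPL.

    Angles are measured counterclockwise from the +x axis (math convention,
    not compass bearing).
    """
    dist = math.hypot(x, y)
    bearing = math.degrees(math.atan2(y, x))
    offset = (bearing - boresight_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return dist <= rpl_radius(offset, boresight_radius, beamwidth_deg)

if __name__ == "__main__":
    print(round(rpl_radius(32.5, 1000.0), 1))  # 707.9: the -3 dB edge shrinks the radius
    print(inside_rpl(500.0, 0.0, boresight_deg=0.0, boresight_radius=1000.0))   # True
    print(inside_rpl(-500.0, 0.0, boresight_deg=0.0, boresight_radius=1000.0))  # False
```

An error rate as defined in the abstract would then be the fraction of measured user positions for which `inside_rpl` returns False for the serving sector.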
- [33] arXiv:2503.23265 [pdf, html, other]
Title: A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representational power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). In many real-world SR applications, high-quality HR training images are not available, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only LR images for model training. For 4x super-resolution, this effectively reduces the available training data to 6.25% (1/4^2 = 1/16) of the HR image pixels, which calls into question the use of a data-hungry transformer model. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods to address the unsupervised LR-only SISR benchmark. We adapt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and demonstrate its effectiveness for both a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far CNN-based) LR-only SISR methods. The code is available on GitHub: this https URL.
- [34] arXiv:2503.23267 [pdf, html, other]
Title: Ensuring Safe and Smooth Control in Safety-Critical Systems via Filtered Control Barrier Functions
Comments: 7 pages, 4 figures
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
In safety-critical control systems, ensuring both system safety and smooth control input variation is essential for theoretical guarantees and practical deployment. Existing Control Barrier Function (CBF) frameworks, especially High-Order CBFs (HOCBFs), effectively enforce safety constraints but often lead to nonsmooth or discontinuous control inputs that can degrade system performance or violate actuator limitations. This paper introduces Filtered Control Barrier Functions (FCBFs), a novel extension of HOCBFs that incorporates an auxiliary dynamic system, referred to as an input regularization filter, to produce Lipschitz continuous control inputs. The proposed framework ensures safety, control bounds, and smoothness simultaneously by integrating FCBFs and HOCBFs within a unified quadratic program (QP). Theoretical guarantees are provided, and simulations on a unicycle model demonstrate the effectiveness of the proposed method compared to standard and smoothness-penalized HOCBF approaches.
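For context, the nonsmooth behavior the abstract refers to already appears in the simplest first-order CBF-QP. The sketch below uses a single integrator x' = u with h(x) = x_max - x, for which the QP has a closed-form solution. It is the standard CBF safety filter, not the paper's FCBF or HOCBF formulation, and all names and parameter values are illustrative.

```python
from typing import List, Tuple

def cbf_qp_input(x: float, u_nom: float, x_max: float, alpha: float) -> float:
    """Closed-form minimizer of (u - u_nom)^2 subject to u <= alpha * (x_max - x)."""
    # For x' = u and h(x) = x_max - x, the CBF condition h' + alpha * h >= 0
    # reduces to u <= alpha * h(x), so the QP simply clips the nominal input.
    return min(u_nom, alpha * (x_max - x))

def simulate(x0: float, u_nom: float, x_max: float, alpha: float,
             dt: float, steps: int) -> Tuple[List[float], List[float]]:
    """Forward-Euler rollout of the CBF-filtered single integrator."""
    xs, us = [x0], []
    for _ in range(steps):
        u = cbf_qp_input(xs[-1], u_nom, x_max, alpha)
        us.append(u)
        xs.append(xs[-1] + dt * u)
    return xs, us

if __name__ == "__main__":
    xs, us = simulate(x0=0.0, u_nom=1.0, x_max=1.0, alpha=2.0, dt=0.01, steps=500)
    # u stays at u_nom until alpha * h(x) < u_nom (here x > 0.5), then follows
    # alpha * h(x): the input is continuous but has a kink where the constraint activates.
    print(max(xs) <= 1.0, us[0], us[-1])
```

With these values the slope of u jumps from 0 to roughly -2 when the constraint becomes active. In this one-constraint case the input stays continuous; with higher-order or multiple constraints the QP solution can also jump, which is the kind of behavior the FCBF input regularization filter is designed to avoid.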
- [35] arXiv:2503.23324 [pdf, html, other]
Title: A Time Splitting Based Optimization Method for Nonlinear MHE
Subjects: Systems and Control (eess.SY)
Moving Horizon Estimation (MHE) is an optimization-based approach designed to estimate the states of dynamic systems within a moving time horizon. Traditional MHE solutions become computationally prohibitive due to the curse of dimensionality arising from increasing problem complexity and a growing time-horizon length. To address this issue, we propose novel computationally efficient algorithms for solving nonlinear MHE problems. Specifically, we first introduce a distributed reformulation utilizing a time-splitting technique. Leveraging this reformulation, we develop the Efficient Gauss-Newton Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) to achieve computational efficiency. Additionally, to accommodate the limited computational capabilities of some sub-problem solvers, we propose the Efficient Sensitivity Assisted ALADIN, which enables sub-problems to be solved inexactly without hindering computational efficiency. Furthermore, for scenarios where sub-problem solvers possess no computational power, we propose a Distributed Sequential Quadratic Programming (SQP) method that relies solely on first- and second-order information of the local objective functions. We demonstrate the performance and advantages of the proposed methods through numerical experiments on a differential-drive robot, a practical nonlinear MHE problem. The results show that the three proposed algorithms achieve computational efficiency while preserving high accuracy, thereby satisfying the real-time requirements of MHE.
- [36] arXiv:2503.23338 [pdf, html, other]
Title: Improving Neonatal Care: An Active Dry-Contact Electrode-based Continuous EEG Monitoring System with Seizure Detection
Authors: Nima L. Wickramasinghe, Dinuka Sandun Udayantha, Akila Abeyratne, Kavindu Weerasinghe, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
Comments: 9 pages, 8 figures, Work is submitted for possible publication in IEEE
Subjects: Signal Processing (eess.SP)
Objective: Neonates are highly susceptible to seizures, which can have severe long-term consequences if undetected and left untreated. Early detection is crucial and typically requires continuous electroencephalography (EEG) monitoring in a hospital setting, which is costly, inconvenient, and requires specialized experts for diagnosis. In this work, we propose a new low-cost active dry-contact electrode-based adjustable EEG headset, a new explainable deep learning model to detect neonatal seizures, and an advanced signal processing algorithm to remove artifacts, thereby addressing the key aspects that lead to the underdiagnosis of neonatal seizures. Methods: EEG signals are acquired through active electrodes and processed using a custom-designed analog front end (AFE) that filters and digitizes the captured EEG signals. The adjustable headset is designed using three-dimensional (3D) printing and laser cutting to fit a wide range of head sizes. A deep learning model is developed to classify seizure and non-seizure epochs in real time. Furthermore, a separate multimodal deep learning model is designed to remove noise artifacts. The device is tested on a pediatric patient with absence seizures in a hospital setting. Simultaneous recordings are captured using both the custom device and the commercial wet-electrode device available in the hospital for comparison. Results: The signals obtained using our custom design and a commercial device show a high correlation (>0.8). Further analysis of signal-to-noise ratio values shows that our device mitigates noise comparably to the commercial device. The proposed deep learning model improves accuracy and recall by 2.76% and 16.33%, respectively, compared to the state-of-the-art. Furthermore, the developed artifact removal algorithm can identify and remove artifacts while keeping seizure patterns intact.
- [37] arXiv:2503.23352 [pdf, html, other]
Title: STAR-RIS-aided NOMA for Secured xURLLC
Subjects: Signal Processing (eess.SP)
Short-packet-based advanced Internet of Things (A-IoT) calls not only for the next generation of ultra-reliable low-latency communications (xURLLC) but also for highly secure communications. In this paper, we address this objective by developing a non-orthogonal multiple access (NOMA) system with an untrusted user. Two key problems arise: (i) the confidential/private message for the far user is exposed to the untrusted near user once successive interference cancellation (SIC) succeeds; and (ii) the restrictive trade-off among reliability, security, and latency poses a great challenge to achieving secure xURLLC. To solve these issues, we introduce a simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS), which provides an additional degree of freedom to enable a secure and fair decoding order and to achieve a desired trade-off among reliability, security, and latency. To fully reveal this trade-off, we characterize reliability and security via decoding error probabilities. A leakage probability minimization problem is formulated to optimize the passive beamforming, power allocation, and blocklength subject to secure SIC order, reliability, and latency constraints. To solve this complex problem, we explore its intrinsic properties and propose an algorithm based on majorization minimization (MM) and alternating optimization (AO). Simulation results validate the proposed approach.
- [38] arXiv:2503.23396 [pdf, other]
Title: Physics-Informed Adaptive Deep Koopman Operator Modeling for Autonomous Vehicle Dynamics
Comments: 21 pages, 9 figures
Subjects: Systems and Control (eess.SY)
The Koopman operator has been recognized as an emerging data-driven modeling method for vehicle dynamics that lifts the original state space into a high-dimensional linear state space. Deep neural networks (DNNs) have been shown to be useful for approximating the Koopman operator. To further improve the accuracy of Koopman operator approximation, this paper introduces a physical loss term inspired by physics-informed neural networks (PINNs), i.e., the acceleration loss between the neural network output and sensor measurements, to improve the efficiency of network learning and its interpretability. Moreover, we use sliding window least squares (SWLS) to update the system matrix and input matrix online in the lifted space, enabling the deep Koopman operator to adapt to the rapid dynamics of autonomous vehicles in real events. Data collection and validation are conducted on a CarSim/Simulink co-simulation platform. Compared with other physics-based and data-driven approaches in various scenarios, the results reveal that the acceleration loss-informed network refines the accuracy of the Koopman operator approximation and endows it with inherent generalization, and that SWLS strengthens the deep Koopman operator's ability to cope with changes in vehicle parameters, road conditions, and rapid maneuvers. This indicates that the proposed physics-informed adaptive deep Koopman operator is a performant and efficient data-driven modeling tool.
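A scalar sketch of the SWLS idea: with a lifted state z and input u, the window estimate of (a, b) in z[k+1] = a*z[k] + b*u[k] solves 2x2 normal equations over the most recent samples only, so the estimate follows a parameter change once the old samples have left the window. The scalar model, window length, input signal, and the names `SlidingWindowLS` and `simulate` are assumptions for illustration; the paper updates full system and input matrices in a lifted space learned by a deep network.

```python
import math
from collections import deque
from typing import Deque, List, Tuple

Sample = Tuple[float, float, float]  # (z_k, u_k, z_{k+1})

class SlidingWindowLS:
    """Sliding-window LS estimate of (a, b) in z[k+1] = a*z[k] + b*u[k]."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) discards the oldest sample once the window is full
        self.samples: Deque[Sample] = deque(maxlen=window)

    def update(self, z: float, u: float, z_next: float) -> None:
        self.samples.append((z, u, z_next))

    def estimate(self) -> Tuple[float, float]:
        # Normal equations [szz szu; szu suu] [a; b] = [szy; suy], solved by Cramer's rule
        szz = sum(z * z for z, _, _ in self.samples)
        szu = sum(z * u for z, u, _ in self.samples)
        suu = sum(u * u for _, u, _ in self.samples)
        szy = sum(z * y for z, _, y in self.samples)
        suy = sum(u * y for _, u, y in self.samples)
        det = szz * suu - szu * szu
        if abs(det) < 1e-12:
            raise ValueError("window data are not sufficiently exciting")
        return (szy * suu - szu * suy) / det, (szz * suy - szu * szy) / det

def simulate(a: float, b: float, z0: float, inputs: List[float]) -> List[Sample]:
    """Noise-free rollout of the scalar linear model."""
    data: List[Sample] = []
    z = z0
    for u in inputs:
        z_next = a * z + b * u
        data.append((z, u, z_next))
        z = z_next
    return data

if __name__ == "__main__":
    est = SlidingWindowLS(window=20)
    inputs = [math.sin(0.7 * k) for k in range(80)]
    phase1 = simulate(0.9, 0.5, 1.0, inputs[:40])  # original dynamics
    phase2 = simulate(0.7, 0.5, phase1[-1][2], inputs[40:])  # parameter change
    for sample in phase1 + phase2:
        est.update(*sample)
    print(est.estimate())  # close to (0.7, 0.5): only post-change samples remain
```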
- [39] arXiv:2503.23416 [pdf, html, other]
Title: Distributed Design of Ultra Large-Scale Control Systems: Progress, Challenges, and Prospects
Comments: (in press)
Journal-ref: Annual Reviews in Control, vol. 59C, pp. 100987, 2025
Subjects: Systems and Control (eess.SY)
The transition from large centralized complex control systems to distributed configurations that rely on a network of a very large number of interconnected simpler subsystems is ongoing and inevitable in many applications. It is attributed to the quest for resilience, flexibility, and scalability in a multitude of engineering fields with far-reaching societal impact. Although many design methods for distributed and decentralized control systems are available, most of them rely on a centralized design procedure requiring some form of global information of the whole system. Clearly, beyond a certain scale of the network, these centralized design procedures for distributed controllers are no longer feasible and we refer to the corresponding systems as ultra large-scale systems (ULSS). For these ULSS, design algorithms are needed that are distributed themselves among the subsystems and are subject to stringent requirements regarding communication, computation, and memory usage of each subsystem. In this paper, a set of requirements is provided that assures a feasible real-time implementation of all phases of a control solution on an ultra large scale. State-of-the-art approaches are reviewed in the light of these requirements and the challenges hampering the development of befitting control algorithms are pinpointed. Comparing the challenges with the current progress leads to the identification and motivation of promising research directions.
- [40] arXiv:2503.23458 [pdf, html, other]
Title: Exact Characterization of Aggregate Flexibility via Generalized Polymatroids
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE)
There is growing interest in utilizing the flexibility in populations of distributed energy resources (DER) to mitigate the intermittency and uncertainty of renewable generation and provide additional grid services. To enable this, aggregators must effectively represent the flexibility in the populations they control to the market or system operator. A key challenge is accurately computing the aggregate flexibility of a population, which can be formally expressed as the Minkowski sum of a collection of polytopes, a problem that is generally computationally intractable. However, the flexibility polytopes of many DERs exhibit structural symmetries that can be exploited for computational efficiency. To this end, we introduce generalized polymatroids, a family of polytopes, into the flexibility aggregation literature. We demonstrate that individual flexibility sets belong to this family, enabling efficient computation of their Minkowski sum. For homogeneous populations of DERs, we further derive simplifications that yield more succinct representations of aggregate flexibility. Additionally, we develop an efficient optimization framework over these sets and propose a vertex-based disaggregation method to allocate aggregate flexibility among individual DERs. Finally, we validate the optimality and computational efficiency of our approach through comparisons with existing methods.
- [41] arXiv:2503.23477 [pdf, html, other]
Title: Loss-aware Pricing Strategies for Peer-to-Peer Energy Trading
Subjects: Systems and Control (eess.SY)
Peer-to-peer (P2P) energy trading may increase efficiency and reduce costs, but it introduces significant challenges for network operators, such as maintaining grid reliability, accounting for network losses, and redistributing costs equitably. We propose a novel loss-aware pricing strategy for P2P energy markets that addresses these challenges while incentivizing participation in the cooperative energy trading market. The problem is formulated as a hierarchical Stackelberg game, in which a grid operator determines network tariffs and prosumers optimize their trades in response to these tariffs, while network constraints are guaranteed to be satisfied. The algorithm is designed to minimize the grid operator's costs and recover them from the trading parties, while also minimizing the total cost of the hubs. The mechanism dynamically adjusts tariffs based on location and network topology, discouraging loss-intensive trades. Finally, the complete framework includes the computation of fair trading prices, ensuring that all market participants benefit equitably. An ADMM-based hyper-gradient descent method is proposed for solving this problem. Extensive numerical simulations on the benchmark IEEE 33-bus system demonstrate significant cost reductions and improved network efficiency through reduced network losses compared to constant-tariff schemes. The results highlight the adaptability and scalability of the proposed mechanism to varying network configurations and sizes, demand profiles, and seasonal conditions.
- [42] arXiv:2503.23518 [pdf, html, other]
Title: Intent-Aware MPC for Aircraft Detect-and-Avoid with Response Delay: A Comparative Study with ACAS Xu
Comments: 8 Pages, 14 Figures, 1 Table
Subjects: Systems and Control (eess.SY)
In this paper, we propose an intent-aware Model Predictive Control (MPC) approach for the remain-well-clear (RWC) functionality of a multi-agent aircraft detect-and-avoid (DAA) system and compare its performance with the standardized Airborne Collision Avoidance System Xu (ACAS Xu). The aircraft system is modeled as a linear system for horizontal maneuvering, with advisories on the rate of turn as the control input. Both deterministic and stochastic time delays are considered to account for the lag between the issuance of control guidance and the response of the aircraft. The ability of the MPC scheme to produce an optimal control profile over the entire horizon is used to mitigate the impact of the delay. We compare the proposed MPC method with ACAS Xu using various evaluation metrics, including loss of DAA well-clear percentage, near mid-air collision percentage, horizontal miss distance, and additional flight distance, across different encounter scenarios. The MPC scheme is shown to outperform ACAS Xu on these metrics in both deterministic and stochastic delay scenarios.
- [43] arXiv:2503.23540 [pdf, html, other]
Title: Zak-OTFS for Mutually Unbiased Sensing and Communication
Comments: 6 pages, 4 figures. Submitted to IEEE-Globecom-2025
Subjects: Signal Processing (eess.SP)
Waveforms with ideal ambiguity functions are fundamental to integrated sensing and communication, to active sensing (radar), and to uplink multiple access. We describe a general method of constructing waveforms using the discrete Zak transform (DZT) to convert sequences of length $MN$ in the time domain to waveforms in the delay-Doppler (DD) domain, each of which is defined by an $M\times N$ quasi-periodic array. The DZT preserves inner products, and we show that phase-coded waveforms used in radar (CAZAC sequences) determine noise-like waveforms in the DD domain, each with a low peak-to-average power ratio (PAPR). In a Zak-OTFS communication system, we show that these waveforms are mutually unbiased with respect to every carrier and use them as spread pilots to integrate sensing and communication. We view each waveform as a linear combination of Zak-OTFS carriers and show that the self-ambiguity function is supported on a discrete line in the integers modulo $MN$. The sidelobes are significantly lower than those of the original CAZAC sequence, and the advantage of discrete support is better localization/resolution in delay and Doppler compared with standard methods based on chirps or tones. We show that the absolute value of the cross-ambiguity function for pairs of waveforms in the same family is small and constant. This property makes the waveforms ideal preambles in the 2-step RACH protocol introduced in 3GPP Release 15 to enable grant-free multiple access. The characteristics of the cross-ambiguity function make it possible to detect multiple preambles simultaneously in the presence of mobility and delay spread.
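To make the DZT step concrete, here is a minimal sketch using one common convention, Z[k][l] = N^(-1/2) * sum_n x[k + nM] * exp(-j2π nl/N) for 0 <= k < M, 0 <= l < N; the paper's normalization and index ordering may differ. It applies the transform to Zadoff-Chu sequences, a standard CAZAC family, and checks numerically that inner products are preserved. The sizes M = 3, N = 4 and the roots 5 and 7 are arbitrary choices for illustration.

```python
import cmath
import math

def zadoff_chu(root, length):
    """Zadoff-Chu (CAZAC) sequence; root must be coprime with length."""
    if length % 2:
        return [cmath.exp(-1j * math.pi * root * n * (n + 1) / length) for n in range(length)]
    return [cmath.exp(-1j * math.pi * root * n * n / length) for n in range(length)]

def dzt(x, M, N):
    """Discrete Zak transform of a length-M*N sequence into an M x N array.

    Row k is a unitary (1/sqrt(N)-scaled) DFT of the decimated sequence
    x[k], x[k+M], ..., x[k+(N-1)M].
    """
    if len(x) != M * N:
        raise ValueError("sequence length must equal M * N")
    scale = 1 / math.sqrt(N)
    return [[scale * sum(x[k + n * M] * cmath.exp(-2j * math.pi * n * l / N) for n in range(N))
             for l in range(N)]
            for k in range(M)]

def inner(a, b):
    """Complex inner product <a, b> of two flat sequences."""
    return sum(u * v.conjugate() for u, v in zip(a, b))

def flatten(array):
    """Row-major flattening of a 2-D list."""
    return [z for row in array for z in row]

M, N = 3, 4
x = zadoff_chu(5, M * N)
y = zadoff_chu(7, M * N)
# Inner products (and hence energies) survive the transform up to rounding error.
print(abs(inner(x, y) - inner(flatten(dzt(x, M, N)), flatten(dzt(y, M, N)))) < 1e-9)  # True
```

Because the decimated rows partition the indices 0, ..., MN - 1 and each row undergoes a unitary DFT, the transform is unitary overall, which is why the inner products match.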
- [44] arXiv:2503.23586 [pdf, html, other]
Title: A first-order DirAC-based parametric Ambisonic coder for immersive communications
Comments: Accepted at ICASSP'25
Subjects: Audio and Speech Processing (eess.AS)
Directional Audio Coding (DirAC) is a proven method for parametrically representing a 3D audio scene in B-format and can reproduce it on arbitrary loudspeaker layouts. Although such a method seems well suited for low-bitrate Ambisonic transmission, little work has been done on the feasibility of building a real system upon it. In this paper, we present a DirAC-based coding scheme for Higher-Order Ambisonics (HOA), developed as part of a standardisation effort to extend the 3GPP EVS codec to immersive communications. Starting from the first-order DirAC model, we show how to reduce the algorithmic delay, the bitrate required for the parameters, and the complexity by performing the full synthesis in the spherical harmonic domain. The evaluation of the proposed technique for coding 3rd-order Ambisonics at bitrates from 32 to 128 kbps shows the relevance of the parametric approach compared with existing solutions.
- [45] arXiv:2503.23658 [pdf, html, other]
Title: Optimizing Age of Information in Networks with Large and Small Updates
Comments: To appear in WiOpt 2025
Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
Modern sensing and monitoring applications typically consist of sources transmitting updates of different sizes, ranging from a few bytes (position, temperature, etc.) to multiple megabytes (images, video frames, LIDAR point scans, etc.). Existing approaches to wireless scheduling for information freshness typically ignore this mix of large and small updates, leading to suboptimal performance. In this paper, we consider a single-hop wireless broadcast network with sources transmitting updates of different sizes to a base station over unreliable links. Some sources send large updates spanning many time slots, while others send small updates spanning only a few time slots. Due to medium access constraints, only one source can transmit to the base station at any given time, which requires careful design of scheduling policies that take the sizes of updates into account. First, we derive a lower bound on the Age of Information (AoI) achievable by any transmission scheduling policy. Second, we develop optimal randomized policies that consider both switching and no-switching during the transmission of large updates. Third, we introduce a novel Lyapunov function and associated analysis to propose an AoI-based Max-Weight policy that has provable constant-factor optimality guarantees. Finally, we evaluate and compare the performance of our proposed scheduling policies through simulations, which show that our Max-Weight policy achieves near-optimal AoI performance.
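The interplay between update size and scheduling can be explored with a small simulator. The sketch below is not the paper's Lyapunov-based Max-Weight policy: it uses a hypothetical greedy index (weight times success probability times current age, divided by update size) and assumes generate-at-will updates, no switching during a large update, and delivery at the end of the update's last slot with probability `probs[i]`, after which that source's age drops to its update size. The function and parameter names are illustrative.

```python
import random

def simulate_aoi(sizes, probs, weights, horizon, seed=0):
    """Time-average AoI (averaged over sources) under a greedy age-based scheduler.

    Illustrative assumptions: source i needs sizes[i] consecutive slots per update
    (no switching mid-update); the update is delivered at the end of its last slot
    with probability probs[i]; on delivery the age resets to sizes[i].
    """
    rng = random.Random(seed)
    n = len(sizes)
    age = [1] * n
    total, slot = 0.0, 0
    while slot < horizon:
        # Hypothetical greedy index: weight * success probability * age per slot of airtime.
        i = max(range(n), key=lambda j: weights[j] * probs[j] * age[j] / sizes[j])
        for k in range(sizes[i]):
            if slot == horizon:
                break
            slot += 1
            delivered = k == sizes[i] - 1 and rng.random() < probs[i]
            age = [sizes[i] if (j == i and delivered) else a + 1 for j, a in enumerate(age)]
            total += sum(age) / n
    return total / slot

print(simulate_aoi([1, 1], [1.0, 1.0], [1, 1], horizon=10))  # 1.5: perfect round-robin
print(simulate_aoi([1, 8], [0.9, 0.6], [1, 1], horizon=10_000))
```

With two single-slot, error-free sources the rule alternates between them, so the ages after each slot are always {1, 2}; mixed sizes show how a long update blocks the channel and inflates the small source's age.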
- [46] arXiv:2503.23663 [pdf, html, other]
Title: Stability and Controllability of Revenue Systems via the Bode Approach
Subjects: Systems and Control (eess.SY)
In online revenue systems, e.g., an advertising system, budget pacing plays a critical role in ensuring that spending aligns with the desired financial objectives. Pacing systems dynamically control the velocity of spending to balance auction intensity, traffic fluctuations, and other stochastic variables. Current industry practices rely heavily on trial-and-error approaches, often leading to inefficiencies and instability. This paper introduces a principled methodology rooted in classical control theory to address these challenges. By modeling the pacing system as a linear time-invariant (LTI) proxy and leveraging Bode-based compensator design techniques, we derive a robust controller that minimizes pacing errors and enhances stability. The proposed methodology is validated through simulation and tested on our in-house auction system, demonstrating superior performance in achieving precise budget allocation while maintaining resilience to traffic and auction dynamics. Our findings bridge the gap between traditional control theory and modern advertising systems in modeling, simulation, and validation, offering a scalable and systematic approach to budget pacing optimization.
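A Bode-based design starts from loop frequency-response quantities such as the gain-crossover frequency and the phase margin. As a minimal sketch, the code below assumes a hypothetical pacing loop L(s) = K / (s(τs + 1)), an integrator for accumulated spend plus a first-order lag, which is not the paper's identified model, and computes its phase margin numerically. The log-scale bisection assumes |L(jω)| decreases monotonically, which holds for this loop.

```python
import cmath
import math

def loop_gain(omega, K=1.0, tau=1.0):
    """Hypothetical pacing loop L(s) = K / (s * (tau*s + 1)) evaluated at s = j*omega."""
    s = 1j * omega
    return K / (s * (tau * s + 1))

def phase_margin(L, lo=1e-4, hi=1e4, iters=200):
    """Find the gain crossover |L(jw)| = 1 by log-bisection; return (w_c, PM in degrees)."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if abs(L(mid)) > 1:
            lo = mid  # gain still above 1: crossover lies at higher frequency
        else:
            hi = mid
    wc = math.sqrt(lo * hi)
    pm = 180 + math.degrees(cmath.phase(L(wc)))
    return wc, pm

wc, pm = phase_margin(loop_gain)
print(round(wc, 3), round(pm, 1))  # 0.786 51.8
```

For K = τ = 1 the crossover satisfies ω⁴ + ω² = 1, so ω_c ≈ 0.786 rad/s and the phase margin is 90° minus atan(ω_c), about 51.8°. A compensator designed with Bode methods would reshape L to hit a target margin.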
- [47] arXiv:2503.23734 [pdf, html, other]
Title: Semantic Packet Aggregation and Repeated Transmission for Text-to-Image Generation
Subjects: Signal Processing (eess.SP)
Text-based communication is expected to be prevalent in 6G applications such as wireless AI-generated content (AIGC). Motivated by this, this paper addresses the challenges of transmitting text prompts over erasure channels for a text-to-image AIGC task by developing the semantic segmentation and repeated transmission (SMART) algorithm. SMART groups words in text prompts into packets, prioritizing the task-specific significance of semantics within these packets, and optimizes the number of repeated transmissions. Simulation results show that SMART achieves higher similarities in received texts and generated images compared to a character-level packetization baseline, while reducing computing latency by orders of magnitude compared to an exhaustive search baseline.
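The benefit of repetition over an erasure channel follows from the fact that a packet sent r times is lost only if all r copies are erased, which happens with probability ε^r. The sketch below is a hypothetical allocation rule, not the SMART algorithm: it assumes each packet carries a semantic importance weight and spends a fixed transmission budget greedily on the packet with the largest marginal gain in expected delivered weight. Greedy is optimal for this simplified objective because each extra copy's gain, w·ε^r·(1 - ε), shrinks as r grows.

```python
import heapq

def allocate_repetitions(weights, erasure_prob, budget):
    """Greedy repetition allocation over an erasure channel (illustrative, not SMART itself).

    Every packet is sent once; each remaining transmission goes to the packet whose
    next copy most increases the expected delivered weight.
    """
    n = len(weights)
    if budget < n:
        raise ValueError("budget must cover one transmission per packet")
    reps = [1] * n
    # Gain of one more copy at r copies: w * eps**r * (1 - eps); negate for a max-heap.
    heap = [(-w * erasure_prob * (1 - erasure_prob), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    for _ in range(budget - n):
        _, i = heapq.heappop(heap)
        reps[i] += 1
        gain = weights[i] * erasure_prob ** reps[i] * (1 - erasure_prob)
        heapq.heappush(heap, (-gain, i))
    return reps

def expected_weight(weights, erasure_prob, reps):
    """Expected total weight of packets with at least one surviving copy."""
    return sum(w * (1 - erasure_prob ** r) for w, r in zip(weights, reps))

reps = allocate_repetitions([3, 1], 0.5, 4)
print(reps, expected_weight([3, 1], 0.5, reps))  # [3, 1] 3.125
```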
- [48] arXiv:2503.23742 [pdf, html, other]
Title: On the Steady-State Distributionally Robust Kalman Filter
Subjects: Systems and Control (eess.SY)
State estimation in the presence of uncertain or data-driven noise distributions remains a critical challenge in control and robotics. Although the Kalman filter is the most popular choice, its performance degrades significantly when distributional mismatches occur, potentially leading to instability or divergence. To address this limitation, we introduce a novel steady-state distributionally robust (DR) Kalman filter that leverages Wasserstein ambiguity sets to explicitly account for uncertainties in both process and measurement noise distributions. Our filter achieves computational efficiency by requiring merely the offline solution of a single convex semidefinite program, which yields a constant DR Kalman gain for robust state estimation under distributional mismatches. Additionally, we derive explicit theoretical conditions on the ambiguity set radius that ensure the asymptotic convergence of the time-varying DR Kalman filter to the proposed steady-state solution. Numerical simulations demonstrate that our approach outperforms existing baseline filters in terms of robustness and accuracy across both Gaussian and non-Gaussian uncertainty scenarios, highlighting its significant potential for real-world control and estimation applications.
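For reference, the nominal steady-state Kalman gain that a DR filter generalizes can be computed by iterating the Riccati recursion to its fixed point. The scalar sketch below covers only this classical, non-robust case; the paper's DR gain comes from a semidefinite program over Wasserstein ambiguity sets, which is not reproduced here. The model x[k+1] = a·x[k] + w, y[k] = c·x[k] + v and the function name are illustrative assumptions.

```python
def steady_state_kalman_gain(a, c, q, r, tol=1e-12, max_iter=10_000):
    """Scalar steady-state Kalman gain via fixed-point iteration of the Riccati recursion.

    Model: x[k+1] = a*x[k] + w, y[k] = c*x[k] + v, with Var(w) = q and Var(v) = r.
    P is the steady-state prior (predicted) error variance.
    """
    P = q
    for _ in range(max_iter):
        P_next = a * a * P - (a * P * c) ** 2 / (c * c * P + r) + q
        if abs(P_next - P) < tol:
            P = P_next
            break
        P = P_next
    K = P * c / (c * c * P + r)  # measurement-update gain
    return K, P

K, P = steady_state_kalman_gain(1.0, 1.0, 1.0, 1.0)
print(round(P, 6), round(K, 6))  # 1.618034 0.618034
```

With a = c = q = r = 1 the fixed point solves P² = P + 1, so P is the golden ratio and K = 1/P; the constant gain is computed once offline, which mirrors the offline structure the abstract describes for the DR case.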
- [49] arXiv:2503.23772 [pdf, html, other]
Title: TransVFC: A Transformable Video Feature Compression Framework for Machines
Comments: This paper is submitted to Elsevier's journal Pattern Recognition
Subjects: Image and Video Processing (eess.IV)
Nowadays, more and more video transmissions primarily serve downstream machine vision tasks rather than human viewers. While widely deployed video coding standards oriented to the Human Visual System (HVS), such as H.265/HEVC and H.264/AVC, are efficient, they are not optimal for Video Coding for Machines (VCM) scenarios, leading to unnecessary bitrate expenditure. Academic and technical exploration within the VCM domain has produced several strategies, yet conspicuous limitations remain in their adaptability to multi-task scenarios. To address this challenge, we propose a Transformable Video Feature Compression (TransVFC) framework. It offers a compress-then-transfer solution and includes a video feature codec and Feature Space Transform (FST) modules. In particular, the codec squeezes the temporal redundancy of video features through a scheme-based inter-prediction module. The codec then implements perception-guided conditional coding to minimize spatial redundancy and help the reconstructed features align with downstream machine tasks. After that, the FST modules transfer the reconstructed features to new feature spaces for diverse downstream tasks. Accommodating a new downstream task requires training only one lightweight FST module, avoiding retraining and redeploying the upstream codec and downstream task networks. Experiments show that TransVFC achieves high rate-task performance for diverse tasks of different granularities. We expect our work to provide valuable insights for video feature compression in multi-task scenarios. The code is available at this https URL.
- [50] arXiv:2503.23783 [pdf, other]
Title: ANNs-SaDE: A Machine-Learning-Based Design Automation Framework for Microwave Branch-Line Couplers
Comments: This paper has been accepted for presentation at ISCAS 2025
Subjects: Signal Processing (eess.SP)
The traditional method for designing branch-line couplers involves a trial-and-error optimization process that requires multiple design iterations through electromagnetic (EM) simulations, which makes it extremely time-consuming and labor-intensive. In this paper, a novel machine-learning-based framework is proposed to tackle this issue. It integrates artificial neural networks with a self-adaptive differential evolution algorithm (ANNs-SaDE). This framework enables the self-adaptive design of various types of microwave branch-line couplers by precisely optimizing essential electrical properties, such as coupling factor, isolation, and phase difference between output ports. The effectiveness of the ANNs-SaDE framework is demonstrated through the design of folded single-stage branch-line couplers and multi-stage wideband branch-line couplers.
- [51] arXiv:2503.23805 [pdf, other]
Title: On the Analysis of Qualitative Nyquist Plots
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)
Nyquist plots are a powerful tool in control and systems engineering. A qualitative sketch often visualizes the frequency response function more clearly than the plots typically produced by computer programs, especially when portions of the Nyquist plot extend to infinity. This letter addresses the graphical analysis of the frequency response function, with the objective of improving the procedure for the qualitative construction of Nyquist plots. Several results, supported by analytical proofs, are derived concerning the low- and high-frequency behavior, which improve the qualitative construction of Nyquist plots in the vicinity of the initial and final points.
- [52] arXiv:2503.23810 [pdf, html, other]
Title: Adaptive Attention-Based Model for 5G Radio-based Outdoor Localization
Comments: 6 pages, 6 figures
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Radio-based localization in dynamic environments, such as urban and vehicular settings, requires systems that can efficiently adapt to varying signal conditions and environmental changes. Factors such as multipath interference and obstructions introduce different levels of complexity that affect the accuracy of the localization. Although generalized models offer broad applicability, they often struggle to capture the nuances of specific environments, leading to suboptimal performance in real-world deployments. In contrast, specialized models can be tailored to particular conditions, enabling more precise localization by effectively handling domain-specific variations and noise patterns. However, deploying multiple specialized models requires an efficient mechanism to select the most appropriate one for a given scenario. In this work, we develop an adaptive localization framework that combines shallow attention-based models with a router/switching mechanism based on a single-layer perceptron (SLP). This enables seamless transitions between specialized localization models optimized for different conditions, balancing accuracy, computational efficiency, and robustness to environmental variations. We design three low-complexity localization models tailored for distinct scenarios, optimized for reduced computational complexity, test time, and model size. The router dynamically selects the most suitable model based on real-time input characteristics. The proposed framework is validated using real-world vehicle localization data collected from a massive MIMO base station (BS), demonstrating its ability to seamlessly adapt to diverse deployment conditions while maintaining high localization accuracy.
- [53] arXiv:2503.23818 [pdf, html, other]
Title: Free Parametrization of L2-bounded State Space Models
Comments: 8 pages
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Structured state-space models (SSMs) have emerged as a powerful architecture in machine learning and control, featuring stacked layers where each consists of a linear time-invariant (LTI) discrete-time system followed by a nonlinearity. While SSMs offer computational efficiency and excel in long-sequence predictions, their widespread adoption in applications like system identification and optimal control is hindered by the challenge of ensuring their stability and robustness properties. We introduce L2RU, a novel parametrization of SSMs that guarantees input-output stability and robustness by enforcing a prescribed L2-bound for all parameter values. This design eliminates the need for complex constraints, allowing unconstrained optimization over L2RUs using standard methods such as gradient descent. Leveraging tools from system theory and convex optimization, we derive a non-conservative parametrization of square discrete-time LTI systems with a specified L2-bound, forming the foundation of the L2RU architecture. Additionally, we enhance its performance with a bespoke initialization strategy optimized for long input sequences. Through a system identification task, we validate L2RU's superior performance, showcasing its potential in learning and control applications.
- [54] arXiv:2503.23827 [pdf, html, other]
Title: Aud-Sur: An Audio Analyzer Assistant for Audio Surveillance Applications
Authors: Phat Lam, Lam Pham, Dat Tran, Alexander Schindler, Silvia Poletti, Marcel Hasenbalg, David Fischinger, Martin Boyer
Comments: Preprint of a conference paper, 8 pages, 9 figures
Subjects: Audio and Speech Processing (eess.AS)
In this paper, we present an audio analyzer assistant tool designed for a wide range of audio-based surveillance applications (this work is part of our DEFAME FAKES and EUCINF projects). The proposed tool, referred to as Aud-Sur, comprises two main phases: Audio Analysis and Audio Retrieval. In the first phase, multiple open-source audio models are leveraged to extract information from an input audio recording uploaded by a user. In the second phase, users interact with the Aud-Sur tool in a natural question-and-answer manner, powered by a large language model (LLM), to retrieve the information extracted from the processed audio file. The Aud-Sur tool was deployed using Docker on a microservices-based architecture. By leveraging open-source audio models for information extraction, an LLM for audio information retrieval, and a microservices-based deployment approach, the proposed Aud-Sur tool offers a highly extensible and adaptable framework that can integrate more audio tasks and be widely shared within the audio community for further development.
- [55] arXiv:2503.23858 [pdf, other]
Title: Incremental capacity-based multi-feature fusion model for predicting state-of-health of lithium-ion batteries
Subjects: Systems and Control (eess.SY)
Lithium-ion batteries have become an indispensable part of human industrial production and daily life. For the safe use, management, and maintenance of lithium-ion batteries, the state of health (SOH) is an important indicator, so SOH estimation is of significant practical value. To predict SOH accurately, this paper proposes a fusion prediction model that combines the particle swarm optimization (PSO) algorithm, a bidirectional long short-term memory network (BiLSTM), and the adaptive boosting (AdaBoost) algorithm. In the proposed model, indirect health indicators (HIs) that characterize battery degradation are obtained via incremental capacity analysis (ICA) and fed into the BiLSTM to extract time-series features, with the BiLSTM parameters optimized by the PSO algorithm. On this basis, the AdaBoost algorithm is applied to reduce the risk of overfitting in the PSO-BiLSTM model. A study based on lithium-ion battery data from the Center for Advanced Life Cycle Engineering (CALCE) shows that the PSO-BiLSTM-AdaBoost model achieves higher accuracy, better robustness, and stronger generalization ability.
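Incremental capacity analysis turns a charging curve Q(V) into dQ/dV, whose peaks shift and shrink as a cell degrades; peak position and peak height are commonly used health indicators. The sketch below computes an IC curve with central differences and extracts its main peak. The synthetic logistic plateau is made up for illustration (it is not CALCE data), and the abstract does not state which HIs the paper extracts, so the choice of peak features here is an assumption. Measured curves usually need smoothing before differentiation.

```python
import math

def incremental_capacity(voltages, capacities):
    """dQ/dV by central differences on a charging curve sampled at increasing voltages."""
    ic = []
    for i in range(1, len(voltages) - 1):
        dq = capacities[i + 1] - capacities[i - 1]
        dv = voltages[i + 1] - voltages[i - 1]
        ic.append((voltages[i], dq / dv))
    return ic

def ic_peak(voltages, capacities):
    """Return (voltage, height) of the largest IC peak, two candidate health indicators."""
    return max(incremental_capacity(voltages, capacities), key=lambda point: point[1])

# Synthetic 2 Ah plateau centred at 3.70 V (illustrative values only).
volts = [3.5 + 0.005 * i for i in range(81)]
caps = [2.0 / (1 + math.exp(-(v - 3.70) / 0.02)) for v in volts]
v_peak, height = ic_peak(volts, caps)
print(round(v_peak, 3))  # 3.7
```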
- [56] arXiv:2503.23873 [pdf, html, other]
Title: Exploring In-Context Learning Capabilities of ChatGPT for Pathological Speech Detection
Comments: submitted to EUSIPCO 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Automatic pathological speech detection approaches have shown promising results, gaining attention as potential diagnostic tools alongside costly traditional methods. While these approaches can achieve high accuracy, their lack of interpretability limits their applicability in clinical practice. In this paper, we investigate the use of multimodal Large Language Models (LLMs), specifically ChatGPT-4o, for automatic pathological speech detection in a few-shot in-context learning setting. Experimental results show that this approach not only delivers promising performance but also provides explanations for its decisions, enhancing model interpretability. To further understand its effectiveness, we conduct an ablation study to analyze the impact of different factors, such as input type and system prompts, on the final results. Our findings highlight the potential of multimodal LLMs for further exploration and advancement in automatic pathological speech detection.
- [57] arXiv:2503.23883 [pdf, html, other]
Title: Algorithm Design and Prototype Validation for Reconfigurable Intelligent Sensing Surface: Forward-Only Transmission
Subjects: Signal Processing (eess.SP)
Sensing-assisted communication schemes have recently garnered significant research attention. In this work, we design a dual-function reconfigurable intelligent surface (RIS), integrating both active and passive elements, referred to as the reconfigurable intelligent sensing surface (RISS), to enhance communication. By leveraging sensing results from the active elements, we propose communication enhancement and robust interference suppression schemes for both near-field and far-field models, implemented through the passive elements. These schemes remove the need for base station (BS) feedback for RISS control, simplifying the communication process by replacing traditional channel state information (CSI) feedback with real-time sensing from the active elements. The proposed schemes are theoretically analyzed and then validated using software-defined radio (SDR). Experimental results demonstrate the effectiveness of the sensing algorithms in real-world scenarios, such as direction of arrival (DOA) estimation and radio frequency (RF) identification recognition. Moreover, the RISS-assisted communication system shows strong performance in communication enhancement and interference suppression, particularly in near-field models.
- [58] arXiv:2503.23885 [pdf, html, other]
Title: Robust Suboptimal Local Basis Function Algorithms for Identification of Nonstationary FIR Systems in Impulsive Noise Environments
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
While local basis function (LBF) estimation algorithms, commonly used for identifying/tracking systems with time-varying parameters, perform well under the assumption of normally distributed measurement noise, their estimates may be far from satisfactory when the noise is impulsive in nature, for example, when measurements are corrupted by outliers. This paper introduces a computationally efficient method to make the LBF estimator robust, enhancing its resistance to impulsive noise. First, the choice of basis functions is optimized based on knowledge of the parameter variation statistics. Then, the parameter tracking algorithm is made robust using the sequential data trimming technique. Finally, it is demonstrated that the proposed algorithm can be tuned online through parallel estimation and leave-one-out cross-validation.
- [59] arXiv:2503.23892 [pdf, html, other]
Title: Surveying Uncertainty Representation: A Unified Model for Cyber-Physical Systems
Subjects: Systems and Control (eess.SY)
Cyber-Physical Systems (CPS) operate in dynamic environments, leading to different types of uncertainty. This work provides a comprehensive review of uncertainty representations and categorizes them based on the dimensions used to represent uncertainty. Through this categorization, key gaps and limitations in existing approaches are identified. To address these issues, a Conceptual Model of Uncertainty Representations in CPS is introduced, integrating and extending existing models. Its applicability is demonstrated through examples from the automotive domain, showing its effectiveness in capturing and structuring uncertainty in real-world scenarios.
- [60] arXiv:2503.23903 [pdf, html, other]
Title: Privacy Preservation for Statistical Input in Dynamical Systems
Subjects: Systems and Control (eess.SY)
This paper addresses the challenge of privacy preservation for statistical inputs in dynamical systems. Motivated by an autonomous building application, we formulate a privacy preservation problem for statistical inputs in linear time-invariant systems. What makes this problem widely applicable is that the inputs, rather than being assumed deterministic, follow a probability distribution, inherently embedding privacy-sensitive information that requires protection. This formulation also presents a technical challenge, as conventional differential privacy mechanisms are not directly applicable. Through rigorous analysis, we develop a strategy to achieve $(0, \delta)$ differential privacy by adding noise. Finally, the effectiveness of our methods is demonstrated by revisiting the autonomous building application.
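The noise-adding idea can be illustrated in the simplest setting: a scalar output with sensitivity Δ released with additive Gaussian noise (this setup is an assumption; the paper treats statistical inputs to an LTI system). $(0, \delta)$-differential privacy is equivalent to requiring that the output distributions for neighboring datasets be within δ in total variation, and for two equal-variance Gaussians whose means differ by Δ that distance is 2Φ(Δ/(2σ)) - 1, where Φ is the standard normal CDF. Solving for σ gives the calibration in the hypothetical helper below.

```python
from statistics import NormalDist

def gaussian_sigma_for_zero_delta(sensitivity, delta):
    """Smallest noise std for which N(m, s^2) and N(m + sensitivity, s^2) are within
    delta in total variation, i.e. (0, delta)-DP for a scalar query."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity / (2 * NormalDist().inv_cdf((1 + delta) / 2))

def total_variation(sensitivity, sigma):
    """TV distance between two equal-variance Gaussians whose means differ by sensitivity."""
    return 2 * NormalDist().cdf(sensitivity / (2 * sigma)) - 1

sigma = gaussian_sigma_for_zero_delta(1.0, 0.1)
print(round(total_variation(1.0, sigma), 6))  # 0.1
```

Unlike the usual (ε, δ) Gaussian mechanism, setting ε = 0 means the whole privacy budget sits in δ, so small δ forces σ to be large relative to Δ.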
- [61] arXiv:2503.23912 [pdf, html, other]
Title: Certified Approximate Reachability (CARe): Formal Error Bounds on Deep Learning of Reachable Sets
Authors: Prashant Solanki, Nikolaus Vertovec, Yannik Schnitzer, Jasper Van Beers, Coen de Visser, Alessandro Abate
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Recent approaches to leveraging deep learning for computing reachable sets of continuous-time dynamical systems have gained popularity over traditional level-set methods, as they overcome the curse of dimensionality. However, as with level-set methods, considerable care needs to be taken in limiting approximation errors, particularly since no guarantees are provided during training on the accuracy of the learned reachable set. To address this limitation, we introduce an epsilon-approximate Hamilton-Jacobi Partial Differential Equation (HJ-PDE), which establishes a relationship between the training loss and the accuracy of the learned set with respect to the true reachable set. To formally certify this approximation, we leverage Satisfiability Modulo Theories (SMT) solvers to bound the residual error of the HJ-based loss function across the domain of interest. Using Counterexample-Guided Inductive Synthesis (CEGIS), we close the loop between learning and verification by fine-tuning the neural network on counterexamples found by the SMT solver, thus improving the accuracy of the learned reachable set. To the best of our knowledge, Certified Approximate Reachability (CARe) is the first approach to provide soundness guarantees on learned reachable sets of continuous dynamical systems.
- [62] arXiv:2503.23926 [pdf, html, other]
Title: Reliable Traffic Monitoring Using Low-Cost Doppler Radar Units
Subjects: Signal Processing (eess.SP)
Road traffic monitoring typically involves the counting and recording of vehicles on public roads over extended periods. The data gathered from such monitoring provides useful information to municipal authorities in urban areas. This paper presents a low-cost, widely deployable sensing subsystem based on Continuous Wave Doppler radar. The proposed system can perform vehicle detection and speed estimation with a total cost of less than 100 USD. The sensing system (including the hardware subsystem and the algorithms) is designed to be placed on the side of the road, allowing for easy deployment and serviceability.
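A CW Doppler radar measures speed through the Doppler shift f_d = 2·v·f_c·cos(θ)/c, where θ is the angle between the beam and the direction of travel, which matters for a unit mounted beside the road. The sketch below estimates f_d as the strongest DFT bin of a synthetic echo and converts it to speed. The 24.125 GHz carrier, sampling rate, and window length are assumptions chosen for illustration, and a naive O(N²) DFT keeps it dependency-free; the abstract does not describe the paper's actual detection and estimation algorithms.

```python
import cmath
import math

C = 299_792_458.0  # speed of light, m/s

def dominant_frequency(samples, fs):
    """Frequency (Hz) of the strongest DFT bin between DC (exclusive) and Nyquist."""
    n = len(samples)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        mag = abs(sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples)))
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * fs / n

def radial_speed(doppler_hz, carrier_hz, angle_deg=0.0):
    """Solve f_d = 2 * v * f_c * cos(theta) / c for the vehicle speed v (m/s)."""
    return doppler_hz * C / (2 * carrier_hz * math.cos(math.radians(angle_deg)))

fs, n, fc = 16_000, 256, 24.125e9  # sample rate, DFT length, assumed 24 GHz carrier
echo = [math.cos(2 * math.pi * 2000 * i / fs) for i in range(n)]  # synthetic 2 kHz beat
fd = dominant_frequency(echo, fs)
print(fd, round(radial_speed(fd, fc) * 3.6, 1))  # 2000.0 44.7  (Hz, km/h)
```

Note that a single real-valued CW channel cannot tell approaching from receding traffic; that needs I/Q sampling or another cue.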
- [63] arXiv:2503.23933 [pdf, other]
Title: PupiNet: Seamless OCT-OCTA Interconversion Through Wavelet-Driven and Multi-Scale Attention Mechanisms
Comments: 8 pages, 4 figures, 5 tables, submitted to the 33rd ACM International Conference on Multimedia (ACM MM 2025)
Subjects: Image and Video Processing (eess.IV)
Optical Coherence Tomography (OCT) and Optical Coherence Tomography Angiography (OCTA) are key diagnostic tools for clinical evaluation and management of retinal diseases. Compared to traditional OCT, OCTA provides richer microvascular information, but its acquisition requires specialized sensors and high-cost equipment, creating significant challenges for the clinical deployment of hardware-dependent OCTA imaging methods. Given the technical complexity of OCTA image acquisition and potential mechanical artifacts, this study proposes a bidirectional image conversion framework called PupiNet, which accurately achieves bidirectional transformation between 3D OCT and 3D OCTA. The generator module of this framework innovatively integrates wavelet transformation and multi-scale attention mechanisms, significantly enhancing image conversion quality. Meanwhile, an Adaptive Discriminator Augmentation (ADA) module has been incorporated into the discriminator to optimize model training stability and convergence efficiency. To ensure clinical accuracy of vascular structures in the converted images, we designed a Vessel Structure Matcher (VSM) supervision module, achieving precise matching of vascular morphology between generated images and target images. Additionally, the Hierarchical Feature Calibration (HFC) module further guarantees high consistency of texture details between generated images and target images across different depth levels. To rigorously validate the clinical effectiveness of the proposed method, we conducted a comprehensive evaluation on a paired OCT-OCTA image dataset containing 300 eyes with various retinal pathologies. Experimental results demonstrate that PupiNet not only reliably achieves high-quality bidirectional transformation between the two modalities but also shows significant advantages in image fidelity, vessel structure preservation, and clinical usability.
- [64] arXiv:2503.23984 [pdf, html, other]
Title: Two-wheel-driven Electric Superbike Powertrain Optimization
Comments: 6 pages, 3 figures, 3 tables
Subjects: Systems and Control (eess.SY)
In this paper, we propose an optimization framework for the powertrain design of a two-wheel-driven electric superbike that minimizes energy consumption. Specifically, we jointly optimize the force distribution between the wheels, the gear ratio, and the rear motor and battery sizing, while explicitly considering vehicle dynamics and performance constraints. First, we present an energy consumption model of the vehicle, including a scalable model of the electric machine based on industry data that accounts for iron, copper, and mechanical losses. Then, we analyze the propulsive blending strategy to distribute the required power to the wheels while considering adherence limits. Finally, we demonstrate the effectiveness of our approach by analyzing the design of a superbike on regulatory driving cycles and a custom high-performance circuit, comparing the force distribution approaches. The results underline the significance of jointly optimizing the powertrain components and the propulsive bias, achieving a reduction of up to 22.36% in energy consumption for the Sport high-performance driving cycle.
- [65] arXiv:2503.24002 [pdf, other]
Title: A Simple BER Expression for FSO Systems with Weak Turbulence and Pointing Errors
Subjects: Signal Processing (eess.SP)
We develop a simple approximation for the average bit error rate (BER) of a free-space optical (FSO) system affected by weak turbulence and pointing errors. Numerical results show that the proposed expression accurately predicts the true BER.
- [66] arXiv:2503.24025 [pdf, html, other]
Title: Consensus on Open Multi-Agent Systems Over Graphs Sampled from Graphons
Comments: 8 pages, 1 figure
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
We show how graphons can be used to model and analyze open multi-agent systems, which are multi-agent systems subject to arrivals and departures, in the specific case of linear consensus. First, we analyze the case of replacements, where, under the assumption of a deterministic interval between two replacements, we derive an upper bound on the expected disagreement. Then, we study the case of arrivals and departures, where we define a process for the evolution of the number of agents that guarantees a minimum and a maximum number of agents. Next, we derive an upper bound on the expected disagreement, and we establish a link with the spectrum of the expected graph used to generate the graph topologies. Finally, for stochastic block model (SBM) graphons, we prove that the spectrum of the expected graph can be computed from a matrix whose dimension depends only on the graphon and is independent of the number of agents.
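A closed-population version of this setting can be simulated in a few lines: each agent draws a latent position uniformly on [0, 1], an SBM graphon maps pairs of positions to edge probabilities, and agents run the linear consensus update x ← x - εLx, where L is the graph Laplacian. The sketch below does not model arrivals, departures, or replacements, which are the paper's focus, and the block sizes, probabilities, and step size are arbitrary. The update is non-expansive when ε ≤ 1/d_max; with 30 nodes d_max ≤ 29, so ε = 0.02 is safe for any sampled graph.

```python
import random

def sample_sbm_graph(n, block_fracs, probs, rng):
    """Sample an n-node graph from a stochastic block model graphon.

    Node i gets latent u_i ~ U[0, 1]; its block follows from the cumulative
    block_fracs, and edge i~j appears with probability probs[block_i][block_j].
    """
    def block(u):
        acc = 0.0
        for b, frac in enumerate(block_fracs):
            acc += frac
            if u < acc:
                return b
        return len(block_fracs) - 1

    blocks = [block(rng.random()) for _ in range(n)]
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[blocks[i]][blocks[j]]:
                adj[i][j] = adj[j][i] = True
    return adj

def consensus_step(x, adj, eps):
    """One step of x <- x - eps * L x; preserves the average on undirected graphs."""
    n = len(x)
    return [x[i] + eps * sum(x[j] - x[i] for j in range(n) if adj[i][j]) for i in range(n)]

def disagreement(x):
    """Squared distance of x from its average (consensus) vector."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)

rng = random.Random(1)
adj = sample_sbm_graph(30, [0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]], rng)
x = [rng.uniform(-1, 1) for _ in range(30)]
for _ in range(50):
    x = consensus_step(x, adj, eps=0.02)
print(round(disagreement(x), 6))
```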
- [67] arXiv:2503.24031 [pdf, html, other]
-
Title: An ANN-Enhanced Approach for Flatness-Based Constrained Control of Nonlinear Systems
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Neural networks have proven to be practical tools that combine synergistically with advanced control techniques. This work analyzes the use of rectified linear unit (ReLU) neural networks to achieve constrained control of differentially flat systems. Specifically, flat systems enjoy feedback linearizability, i.e., they can be linearized by means of a suitable variable transformation. However, the price of linearizing the dynamics is that the constraint descriptions become geometrically distorted. Our results show that, by using neural networks, these constraints can be represented as a union of polytopes, enabling the use of mixed-integer programming tools to guarantee constraint satisfaction. We further analyze how this characterization integrates into efficient frameworks such as control Lyapunov function-based control and model predictive control (MPC). Interestingly, this description also allows us to explicitly compute the solution of the MPC problem for the nonlinear system. Several examples illustrate the effectiveness of our framework.
- [68] arXiv:2503.24063 [pdf, other]
-
Title: A robot-assisted pipeline to rapidly scan 1.7 million historical aerial photographs
Authors: Sheila Masson, Alan Potts, Allan Williams, Steve Berggreen, Kevin McLaren, Sam Martin, Eugenio Noda, Nicklas Nordfors, Nic Ruecroft, Hannah Druckenmiller, Solomon Hsiang, Andreas Madestam, Anna Tompsett
Subjects: Image and Video Processing (eess.IV); General Economics (econ.GN); Systems and Control (eess.SY)
During the 20th Century, aerial surveys captured hundreds of millions of high-resolution photographs of the Earth's surface. These images, the precursors to modern satellite imagery, represent an extraordinary visual record of the environmental and social upheavals of the 20th Century. However, most of these images currently languish in physical archives where retrieval is difficult and costly. Digitization could revolutionize access, but manual scanning is slow and expensive. Here, we describe and validate a novel robot-assisted pipeline that increases worker productivity in scanning 30-fold, applied at scale to digitize an archive of 1.7 million historical aerial photographs from 65 countries.
- [69] arXiv:2503.24085 [pdf, html, other]
-
Title: Unraveling tensor structures in correct-by-design controller synthesis
Subjects: Systems and Control (eess.SY)
Formal safety guarantees for controllers synthesized for stochastic systems can be obtained using correct-by-design approaches. These approaches often rely on abstractions in the form of finite-state Markov decision processes (MDPs). As the state space of these MDPs grows, the curse of dimensionality causes the computational and memory cost of the probabilistic guarantees, which are quantified with dynamic programming, to scale exponentially. In this work, we leverage decoupled dynamics and unravel, via dynamic programming operations, a tree structure in the canonical polyadic decomposition (CPD) of the value functions.
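The simplest case of such structure can be checked directly: for two decoupled, uncontrolled Markov chains and a product-shaped safe set, the finite-horizon safety probability computed by dynamic programming on the joint chain equals the outer product of the per-component values, i.e. a rank-1 CPD. The chains and safe sets below are hypothetical, and the paper's setting with control inputs and scLTL specifications is richer than this rank-1 case.

```python
from itertools import product


def safety_dp(P, safe, horizon):
    """Finite-horizon safety probability on one Markov chain.

    V_N(x) = 1[x in safe];  V_k(x) = 1[x in safe] * sum_y P[x][y] * V_{k+1}(y).
    Returns V_0 as a list indexed by state.
    """
    n = len(P)
    v = [1.0 if s in safe else 0.0 for s in range(n)]
    for _ in range(horizon):
        v = [(1.0 if s in safe else 0.0) * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


def joint_chain(P1, P2):
    """Transition matrix of two decoupled chains on the product state space."""
    states = list(product(range(len(P1)), range(len(P2))))
    P = [[P1[a][c] * P2[b][d] for (c, d) in states] for (a, b) in states]
    return states, P
```

Storing the two factors takes n1 + n2 numbers instead of n1·n2, which is the kind of saving that factored value functions aim for as the number of decoupled components grows.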
For discrete-time stochastic systems with syntactically co-safe linear temporal logic (scLTL) specifications, we provide provable probabilistic safety guarantees and significantly alleviate the computational burden. We provide an initial validation of the theoretical results on several typical case studies and showcase that the uncovered tree structure enables efficient reductions in the computational burden.
- [70] arXiv:2503.24089 [pdf, html, other]
-
Title: Initial State Privacy of Nonlinear Systems on Riemannian Manifolds
Subjects: Systems and Control (eess.SY)
In this paper, we investigate initial state privacy protection for discrete-time nonlinear closed systems. By capturing Riemannian geometric structures inherent in such privacy challenges, we refine the concept of differential privacy through the introduction of an initial state adjacency set based on Riemannian distances. A new differential privacy condition is formulated using incremental output boundedness, enabling the design of time-varying Laplacian noise to achieve specified privacy guarantees. The proposed framework extends beyond initial state protection to also cover system parameter privacy, which is demonstrated as a special application.
- [71] arXiv:2503.24093 [pdf, html, other]
-
Title: Active Reconfigurable Intelligent Surfaces: Circuit Modeling and Reflection Amplification Optimization
Authors: Panagiotis Gavriilidis, Deepak Mishra, Besma Smida, Ertugrul Basar, Chau Yuen, George C. Alexandropoulos
Comments: 16 pages, 8 figures, submitted to an IEEE journal
Subjects: Signal Processing (eess.SP)
Reconfigurable Intelligent Surfaces (RISs) constitute a promising emerging technology that enables wireless systems to control the propagation environment to enhance diverse communication objectives. To mitigate double-fading attenuation in RIS-aided links, the paradigm of active metamaterials capable of amplifying their incident wave has emerged. In this paper, capitalizing on the inherent negative-resistance region of tunnel diodes, we propose their integration into each RIS unit element to enable RISs with reflection amplification entirely in the analog domain. We derive novel realistic phase-amplitude relationships and power constraints specific to this model, addressing gaps in the existing literature where amplitude limits are often chosen arbitrarily. This characterization of our active RIS unit elements is incorporated into two novel optimization frameworks targeting the spectral efficiency maximization of RIS-assisted Multiple-Input-Multiple-Output (MIMO) systems, which are solved via a one-step approach and an iterative Alternating Optimization (AO) method. The former approach is used to initialize the AO framework, enhancing both its performance and convergence. Our numerical investigations emphasize the importance of accurately modeling phase-amplitude dependencies, and provide key insights into the impact of RIS-induced noise as well as the trade-off between available power and the number of active elements.
- [72] arXiv:2503.24104 [pdf, html, other]
-
Title: Application of Battery Storage to Switching Predictive Control of Power Distribution Systems Including Road Heating
Comments: 13 pages, 14 figures
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
A road heating system is an electrical device that promotes snow melting by means of a heating cable buried underground as a thermal source. When road heating is integrated into the power distribution system, the flow of electric power must be optimized by appropriately coordinating distributed power sources and conventional power distribution equipment. In this paper, we extend the power distribution system considered in the authors' previous study to the case where battery storage is installed. As the main result, we propose a predictive switching control that simultaneously reduces distribution loss, attenuates voltage fluctuation, and achieves efficient snow melting. We verify the effectiveness of applying battery storage through numerical simulation.
- [73] arXiv:2503.24105 [pdf, html, other]
-
Title: Data-Driven Distributed Output Synchronization of Heterogeneous Discrete-Time Multi-Agent Systems
Comments: Extended version of the conference paper submitted to 64th IEEE Conference on Decision and Control
Subjects: Systems and Control (eess.SY)
In this paper, we assume that an autonomous exosystem generates a reference output, and we consider the problem of designing a distributed data-driven control law for a family of discrete-time heterogeneous LTI agents, connected through a directed graph, in order to synchronize the agents' outputs to the reference one. The agents of the network are split into two categories: leaders, with direct access to the exosystem output, and followers, which only receive information from their neighbors. All agents aim to achieve output synchronization by means of a state feedback that makes use of their own states as well as of an estimate of the exogenous system state, provided by an internal state observer. This observer has a different structure for leaders and followers. Necessary and sufficient conditions for the existence of a solution are derived first in the model-based setting and then in a data-driven context. An example illustrates both the implementation procedure and the performance of the proposed approach.
- [74] arXiv:2503.24138 [pdf, html, other]
-
Title: AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models
Authors: Uxue Delaquintana-Aramendi, Leire Benito-del-Valle, Aitor Alvarez-Gila, Javier Pascau, Luisa F Sánchez-Peralta, Artzai Picón, J Blas Pagador, Cristina L Saratxaga
Comments: This work has been submitted to the IEEE TMI for possible publication
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In colonoscopy, 80% of missed polyps could be detected with the help of deep learning models. In the search for algorithms capable of addressing this challenge, foundation models emerge as promising candidates. Their zero-shot or few-shot learning capabilities facilitate generalization to new data or tasks without extensive fine-tuning, which is particularly advantageous in the medical imaging domain, where large annotated datasets for traditional training are scarce. In this context, a comprehensive evaluation of foundation models for polyp segmentation was conducted, assessing both detection and delimitation. Three different colonoscopy datasets were employed to compare the performance of five foundation models (DINOv2, YOLO-World, GroundingDINO, SAM, and MedSAM) against two benchmark networks (YOLOv8 and Mask R-CNN). Results show that the success of foundation models in polyp characterization is highly dependent on domain specialization. For optimal performance in medical applications, domain-specific models are essential, and generic models require fine-tuning to achieve effective results. Through this specialization, foundation models demonstrated superior performance compared to state-of-the-art detection and segmentation models, with some models even excelling in zero-shot evaluation, outperforming fine-tuned models on unseen data.
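For the delimitation part of such an evaluation, overlap metrics on binary masks such as the Dice coefficient and intersection-over-union (IoU) are common choices. Which metrics the paper reports is not stated in the abstract, so the sketch below is generic; the function name is hypothetical and masks are given as nested 0/1 lists.

```python
def dice_iou(pred, target):
    """Dice coefficient and IoU of two same-shape binary masks (nested 0/1 lists)."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    p_sum = sum(sum(row) for row in pred)
    t_sum = sum(sum(row) for row in target)
    union = p_sum + t_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * inter / (p_sum + t_sum), inter / union
```

Dice weights the overlap twice relative to the mask sizes, so it is always at least as large as IoU for the same pair of masks.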
- [75] arXiv:2503.24147 [pdf, html, other]
-
Title: Net 3.2 Tbps 225 Gbaud PAM4 O-Band IM/DD 2 km Transmission Using FR8 and DR8 with a CMOS 3 nm SerDes and TFLN Modulators
Authors: Charles St-Arnault, Santiago Bernal, Derek Kita, Ross Dickson, Mariam Yehia Abdelaziz, Aleksandar Nikic, Benton Qiu, Benjamin Krueger, Fabio Pittalà, Christian Reimer, Bruce Beggs, Naim Ben-Hamida, David V. Plant
Subjects: Signal Processing (eess.SP)
We report the first 3.2 and 4.2 Tbps (8 × 225 Gbaud PAM4-8) IM/DD transmission using FR8 and DR8 configurations, with TFLN modulators driven by a 3 nm SerDes, below the HD-FEC threshold.
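As a quick consistency check on the line rates, the gross rate of eight lanes is the lane count times the symbol rate times the bits per symbol. Reading "PAM4-8" as PAM4 for the 3.2 Tbps result and PAM8 for the 4.2 Tbps result is an assumption; the gap between gross and net rates would cover FEC and other overheads, which the abstract does not itemize.

$$R_{\text{gross}}^{\text{PAM4}} = 8 \times 225\ \text{GBd} \times \log_2 4 = 3.6\ \text{Tb/s}, \qquad \frac{3.2}{3.6} \approx 0.89$$

$$R_{\text{gross}}^{\text{PAM8}} = 8 \times 225\ \text{GBd} \times \log_2 8 = 5.4\ \text{Tb/s}, \qquad \frac{4.2}{5.4} \approx 0.78$$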
- [76] arXiv:2503.24152 [pdf, html, other]
-
Title: Quantifying Grid-Forming Behavior: Bridging Device-level Dynamics and System-Level Stability
Subjects: Systems and Control (eess.SY)
Grid-Forming (GFM) technology is considered a promising solution to build power electronics-dominated power systems. However, the impact of GFM converters on the system stability is still unquantified, creating a gap between the system- and device-level perspectives. To fill this gap, at the device-level, we propose a Forming Index to quantify a converter's response to grid voltage variations, providing a characterization of its GFM behavior. At the system-level, a quantitative notion of System Strength is introduced to capture the fundamental requirements for power system formation. Finally, we establish the alignment between device- and system-level metrics by demonstrating that GFM converters provably enhance system strength.
- [77] arXiv:2503.24156 [pdf, html, other]
-
Title: Reinforcing Localization Credibility Through Convex Optimization
Subjects: Signal Processing (eess.SP)
This work proposes a novel approach to reinforcing localization security in wireless networks in the presence of malicious nodes that can manipulate (spoof) radio measurements. It replaces the original measurement model with one containing an auxiliary variance dilation parameter that disguises corrupted radio links as links with large noise variances. This allows the non-convex maximum likelihood estimator (MLE) to be relaxed into a semidefinite programming (SDP) problem by applying the convex-concave programming (CCP) procedure. The proposed SDP solution simultaneously outputs target location and attacker detection estimates, eliminating the need for additional sophisticated detectors. Numerical results corroborate the excellent localization accuracy of the proposed method and show that its detection rates are highly competitive with the state of the art.
- [78] arXiv:2503.24169 [pdf, html, other]
-
Title: Disturbance-adaptive Model Predictive Control for Bounded Average Constraint Violations
Subjects: Systems and Control (eess.SY)
This paper considers stochastic linear time-invariant systems subject to constraints on the average number of state-constraint violations over time, where the disturbance distribution is unknown. We present a novel disturbance-adaptive model predictive control (DAD-MPC) framework that adjusts the disturbance model based on measured constraint violations. Using a robust invariance method, DAD-MPC ensures recursive feasibility and guarantees asymptotic or robust bounds on average constraint violations. Additionally, the bounds hold even with an inaccurate disturbance model, which allows data-driven disturbance quantification methods, such as conformal prediction, to be used. Simulation results demonstrate that the proposed approach outperforms state-of-the-art methods while satisfying average violation constraints.
- [79] arXiv:2503.24240 [pdf, html, other]
-
Title: Analysis of the French system imbalance paving the way for a novel operating reserve sizing approach
Comments: Paper accepted to be presented at the EEM 2025 conference
Subjects: Systems and Control (eess.SY)
This paper examines the relationship between system imbalance and several explanatory variables within the French electricity system. The factors considered include lagged imbalance values, observations of renewable energy sources (RES) generation and consumption, and forecasts for RES generation and consumption. The study analyzes the distribution of system imbalance in relation to these variables. Additionally, an HGBR machine-learning model is employed to assess the predictability of imbalances and the explanatory power of the input variables studied.
The results indicate no clear correlation between RES generation or consumption and the observed imbalances. However, it is possible to predict the imbalance adequately using forecasts available a few hours before real time, along with the lagged values of the imbalance. Predicting the imbalance a day in advance proves to be complex with the variables examined; however, the extreme quantiles of the imbalance used for reserve sizing and contracting can be predicted with sufficient accuracy.
- [80] arXiv:2503.24253 [pdf, html, other]
-
Title: Deep Learning-Based Data Fusion of 6G Sensing and Inertial Information for Target Positioning: Experimental Validation
Subjects: Signal Processing (eess.SP)
The sixth-generation (6G) cellular technology will be deployed with a key feature of Integrated Sensing and Communication (ISAC), allowing the cellular network to map the environment through radar sensing on top of providing communication services. In this regard, the entire network can be considered a sensor with a broader Field of View (FoV) of the environment, assisting in both positioning active targets and detecting passive ones. On the other hand, non-3GPP sensors available on the target can provide additional target-specific information that can be combined with ISAC sensing information to enhance the overall achievable positioning accuracy. In this paper, we first study the achievable accuracy of the ISAC system in positioning a mobile target in an indoor scenario. Second, we study the gain in ISAC positioning accuracy achieved by fusing information from the target's non-3GPP sensors. To this end, we propose a novel deep learning-based data fusion solution that combines information from ISAC and non-3GPP sensors.
We validate our proposed data fusion and positioning solution with a real-world ISAC Proof-of-Concept (PoC) as the wireless infrastructure, an Automated Guided Vehicle (AGV) as the target, and the Inertial Measurement Unit (IMU) sensor on the target as the non-3GPP sensor. The experimental results show that our proposed solution achieves an average positioning error of 3 cm, outperforming the considered baselines.
- [81] arXiv:2503.24314 [pdf, html, other]
-
Title: Impact of Synchronization Offsets and CSI Feedback Delay in Distributed MIMO Systems
Subjects: Signal Processing (eess.SP)
The main challenges of distributed MIMO systems lie in achieving highly accurate synchronization and ensuring the availability of accurate channel state information (CSI) at distributed nodes. This paper analytically examines the effects of synchronization offsets and CSI feedback delays on system capacity, providing insights into how these affect the coherent joint transmission gain. The capacity expressions are first derived under ideal conditions, and the effects of synchronization offsets and feedback delays are subsequently incorporated. This analysis can be applied to any distributed MIMO architecture. A comprehensive study, including system models and simulations evaluating the analytical expressions, is presented to quantify the capacity degradation caused by these factors. This study provides valuable insights into the design and performance of distributed MIMO systems. The analysis shows that time and frequency offsets, along with CSI feedback delay, cause inter-layer interference. Additionally, time offsets result in inter-symbol interference.
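The loss of coherent joint transmission gain caused by residual phase errors can be illustrated with N unit-amplitude signals: perfectly aligned phases give a power gain of N², while i.i.d. Gaussian phase errors with standard deviation σ give an expected gain of N + N(N-1)e^{-σ²}. The sketch below assumes flat, equal-gain channels and models a frequency offset Δf only through the phase drift 2πΔf·t; it does not reproduce the paper's capacity expressions or its inter-layer and inter-symbol interference analysis.

```python
import cmath
import math
import random


def coherent_gain(phases):
    """Power gain |sum_n exp(j*theta_n)|^2 of N unit-amplitude signals."""
    return abs(sum(cmath.exp(1j * th) for th in phases)) ** 2


def cfo_phases(freq_offsets_hz, t_s):
    """Phase drift 2*pi*df*t accumulated by each node's residual frequency offset."""
    return [2.0 * math.pi * df * t_s for df in freq_offsets_hz]


def expected_gain_gaussian(n, sigma):
    """Closed form of E|sum_n exp(j*theta_n)|^2 for i.i.d. theta_n ~ N(0, sigma^2)."""
    return n + n * (n - 1) * math.exp(-sigma ** 2)


def mc_gain_gaussian(n, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    return sum(coherent_gain([rng.gauss(0.0, sigma) for _ in range(n)])
               for _ in range(trials)) / trials
```

For example, two nodes whose relative frequency offset of 500 Hz has accumulated for 1 ms are half a cycle apart and cancel each other completely, which shows why tight synchronization matters for coherent joint transmission.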
- [82] arXiv:2503.24342 [pdf, html, other]
-
Title: Coordinating Distributed Energy Resources with Nodal Pricing in Distribution Networks: a Game-Theoretic Approach
Subjects: Systems and Control (eess.SY)
We propose a real-time nodal pricing mechanism for cost minimization and voltage control in a distribution network with autonomous distributed energy resources and analyze the resulting market using stochastic game theory. Unlike existing methods, the proposed pricing scheme does not require device-aware centralized coordination or communication between prosumers. By developing new sufficient conditions under which a stochastic game is a Markov potential game, we show that the problem of computing an equilibrium for the proposed model is equivalent to solving a single-agent Markov Decision Process. These new conditions are general and may apply to other applications. We compute the equilibrium for an IEEE test system to empirically demonstrate the effectiveness of the pricing policy.
- [83] arXiv:2503.24371 [pdf, html, other]
-
Title: Policy Gradient for LQR with Domain Randomization
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Domain randomization (DR) enables sim-to-real transfer by training controllers on a distribution of simulated environments, with the goal of achieving robust performance in the real world. Although DR is widely used in practice and is often solved using simple policy gradient (PG) methods, understanding of its theoretical guarantees remains limited. Toward addressing this gap, we provide the first convergence analysis of PG methods for domain-randomized linear quadratic regulation (LQR). We show that PG converges globally to the minimizer of a finite-sample approximation of the DR objective under suitable bounds on the heterogeneity of the sampled systems. We also quantify the sample-complexity associated with achieving a small performance gap between the sample-average and population-level objectives. Additionally, we propose and analyze a discount-factor annealing algorithm that obviates the need for an initial jointly stabilizing controller, which may be challenging to find. Empirical results support our theoretical findings and highlight promising directions for future work, including risk-sensitive DR formulations and stochastic PG algorithms.
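A scalar sketch of policy gradient on a finite-sample DR objective, assuming x_{t+1} = a x_t + b u_t with static feedback u_t = -k x_t. For each sampled system with |a - b k| < 1, the infinite-horizon cost is (q + r k²) x_0² / (1 - (a - b k)²), and the DR objective is the average over the samples; plain gradient descent uses the exact derivative of that average. The sampled (a, b) pairs, weights, and step size are hypothetical, and the initial gain k0 = 1 is chosen to stabilize all samples jointly, which is the requirement the paper's discount-factor annealing is designed to remove.

```python
import math


def dr_cost(k, systems, q=1.0, r=1.0, x0=1.0):
    """Average infinite-horizon LQR cost of u = -k x over sampled scalar systems (a, b)."""
    total = 0.0
    for a, b in systems:
        c = a - b * k                      # closed-loop pole
        if abs(c) >= 1.0:
            return math.inf                # k does not stabilize this sample
        total += (q + r * k * k) * x0 * x0 / (1.0 - c * c)
    return total / len(systems)


def dr_grad(k, systems, q=1.0, r=1.0, x0=1.0):
    """Exact derivative of dr_cost with respect to k (quotient rule, dc/dk = -b)."""
    g = 0.0
    for a, b in systems:
        c = a - b * k
        den = 1.0 - c * c
        g += x0 * x0 * (2.0 * r * k * den - 2.0 * b * c * (q + r * k * k)) / den ** 2
    return g / len(systems)


def policy_gradient(k0, systems, step=0.05, iters=2000):
    """Plain gradient descent on the sample-average DR objective."""
    k = k0
    for _ in range(iters):
        k -= step * dr_grad(k, systems)
    return k
```

For a single sample with a = b = q = r = 1, the scalar Riccati equation gives k = (√5 - 1)/2 ≈ 0.618, which this loop recovers; with several samples it converges to the minimizer of the averaged cost instead.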
New submissions (showing 83 of 83 entries)
- [84] arXiv:2503.22711 (cross-list from cs.SD) [pdf, html, other]
-
Title: Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
Comments: 11 pages, 5 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce label uncertainty due to variation in grader opinion. Grader variation is addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as captured by grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observed that focusing on overall test-set performance can be deceiving, as it fails to reveal the models' generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where, instead of using the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
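The contrast between consensus targets and distribution-based targets can be sketched for categorical labels: grader votes become either a one-hot consensus target or a normalized vote distribution, either of which can be scored with cross-entropy, and an n-best check asks whether a label is among the top-k predictions. The emotion classes, votes, and function names are hypothetical; for categorical votes the probability density reduces to a probability mass function.

```python
import math
from collections import Counter

CLASSES = ["angry", "happy", "neutral", "sad"]  # hypothetical emotion set


def vote_distribution(votes, classes=CLASSES):
    """Soft target: fraction of graders who chose each class."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def consensus_target(votes, classes=CLASSES):
    """Hard target: one-hot vector of the most-voted class (first one on ties)."""
    dist = vote_distribution(votes, classes)
    best = max(range(len(classes)), key=lambda i: dist[i])
    return [1.0 if i == best else 0.0 for i in range(len(classes))]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


def top_k(probs, k, classes=CLASSES):
    """The k most probable classes, for n-best evaluation."""
    order = sorted(range(len(classes)), key=lambda i: probs[i], reverse=True)
    return [classes[i] for i in order[:k]]
```

With votes such as two "happy", one "neutral", and one "sad", the consensus target discards the minority votes entirely, while the soft target keeps them as partial credit for the second and third choices.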
- [85] arXiv:2503.22712 (cross-list from cs.SD) [pdf, html, other]
-
Title: Risk-Calibrated Affective Speech Recognition via Conformal Coverage Guarantees: A Stochastic Calibrative Framework for Emergent Uncertainty Quantification
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Traffic safety challenges arising from extreme driver emotions highlight the urgent need for reliable emotion recognition systems. Traditional deep learning approaches to speech emotion recognition suffer from overfitting and poorly calibrated confidence estimates. We propose a framework integrating Conformal Prediction (CP) and Risk Control, using Mel-spectrogram features processed through a pre-trained convolutional neural network. Our key innovation is a nonconformity score that heuristically measures how closely a classifier's predictions align with given inputs. Using calibration samples, we compute this score and derive a statistically rigorous threshold based on a user-specified risk level $\alpha$, constructing prediction sets with provable coverage guarantees ($\geq 1-\alpha$). The Risk Control framework enables task-specific adaptation through customizable loss functions, dynamically adjusting prediction set sizes while maintaining coverage guarantees. Cross-dataset experiments on IEMOCAP and TESS demonstrate 1) strict coverage guarantees and 2) a significant negative correlation between the Average Prediction Set Size (APSS) and $\alpha$, revealing reduced model uncertainty under high-risk conditions. We further propose APSS as a novel metric for evaluating classification uncertainty. This approach enhances the reliability of speech emotion recognition, with direct applications in intelligent transportation systems and real-time emotion monitoring.
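The calibration step follows the split conformal recipe. A minimal sketch, assuming the common nonconformity score "1 minus the predicted probability of the true class" (the paper's own score may differ): calibration scores are sorted, the threshold is the ⌈(n+1)(1-α)⌉-th smallest score, and the prediction set keeps every class whose score does not exceed it. Under exchangeability this gives coverage of at least 1-α; the Risk Control extension with task-specific losses is not shown, and the function names are hypothetical.

```python
import math


def nonconformity(probs, label):
    """Assumed score: 1 - predicted probability of the given class."""
    return 1.0 - probs[label]


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    scores = sorted(nonconformity(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every class is included
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """All classes whose nonconformity score is at most the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
```

Smaller α raises the rank used for the threshold, which enlarges the prediction sets; this is the mechanism behind the negative correlation between APSS and α described above.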
- [86] arXiv:2503.22721 (cross-list from cs.LG) [pdf, html, other]
-
Title: PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
The increasing penetration of renewable energy sources introduces significant variability and uncertainty in modern power systems, making accurate state prediction critical for reliable grid operation. Conventional forecasting methods often neglect the power grid's inherent topology, limiting their ability to capture complex spatiotemporal dependencies. This paper proposes a topology-aware Graph Neural Network (GNN) framework for predicting power system states under high renewable integration. We construct a graph-based representation of the power network, modeling buses and transmission lines as nodes and edges, and introduce a specialized GNN architecture that integrates GraphSAGE convolutions with Gated Recurrent Units (GRUs) to model both spatial and temporal correlations in system dynamics. The model is trained and evaluated on the NREL 118 test system using realistic, time-synchronous renewable generation profiles. Our results show that the proposed GNN outperforms baseline approaches including fully connected neural networks, linear regression, and rolling mean models, achieving substantial improvements in predictive accuracy. The GNN achieves average RMSEs of 0.13 to 0.17 across all predicted variables and demonstrates consistent performance across spatial locations and operational conditions. These results highlight the potential of topology-aware learning for scalable and robust power system forecasting in future grids with high renewable penetration.
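The spatial half of such an architecture can be illustrated by a single GraphSAGE layer with mean aggregation, in which each bus combines its own features with the average of its neighbors' features: h_v' = ReLU(W_self h_v + W_neigh · mean_{u ∈ N(v)} h_u). The weights and the three-bus line graph are hypothetical, and the GRU that models temporal correlations is omitted.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sage_mean_layer(features, neighbors, W_self, W_neigh):
    """One GraphSAGE layer with mean aggregation over each node's neighbors."""
    dim = len(next(iter(features.values())))
    out = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            agg = [sum(features[u][d] for u in nbrs) / len(nbrs) for d in range(dim)]
        else:
            agg = [0.0] * dim  # isolated node: no neighbor message
        combined = [a + b for a, b in zip(matvec(W_self, h_v), matvec(W_neigh, agg))]
        out[v] = relu(combined)
    return out
```

Because the layer only looks at each node's adjacency list, the same weights apply to any bus regardless of network size, which is what makes this kind of layer attractive for grid topologies.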
- [87] arXiv:2503.22728 (cross-list from cs.SD) [pdf, html, other]
-
Title: Dual Audio-Centric Modality Coupling for Talking Head Generation
Comments: 9 pages, 4 figures
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.
- [88] arXiv:2503.22757 (cross-list from cs.RO) [pdf, html, other]
-
Title: Strategies for decentralised UAV-based collisions monitoring in rugby
Comments: Submitted for publication in an IEEE publication
Subjects: Robotics (cs.RO); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
Recent advancements in unmanned aerial vehicle (UAV) technology have opened new avenues for dynamic data collection in challenging environments, such as sports fields during fast-paced sports action. For the purposes of monitoring sport events for dangerous injuries, we envision a coordinated UAV fleet designed to capture high-quality, multi-view video footage of collision events in real-time. The extracted video data is crucial for analyzing athletes' motions and investigating the probability of sports-related traumatic brain injuries (TBI) during impacts. This research implemented a UAV fleet system on the NetLogo platform, utilizing custom collision detection algorithms to compare against traditional TV-coverage strategies. Our system supports decentralized data capture and autonomous processing, providing resilience in the rapidly evolving dynamics of sports collisions.
The collaboration algorithm integrates both shared and local data to generate multi-step analyses aimed at determining the efficacy of custom methods in enhancing the accuracy of TBI prediction models. Missions are simulated in real-time within a two-dimensional model, focusing on the strategic capture of collision events that could lead to TBI, while considering operational constraints such as rapid UAV maneuvering and optimal positioning. Preliminary results from the NetLogo simulations suggest that custom collision detection methods offer superior performance over standard TV-coverage strategies by enabling more precise and timely data capture. This comparative analysis highlights the advantages of tailored algorithmic approaches in critical sports safety applications.
- [89] arXiv:2503.22849 (cross-list from math.OC) [pdf, html, other]
-
Title: Distances between finite-horizon linear behaviors
Comments: IEEE Control Systems Letters / 64th IEEE Conference on Decision and Control
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
The paper introduces a class of distances for linear behaviors over finite time horizons. These distances allow for comparisons between finite-horizon linear behaviors represented by matrices of possibly different dimensions. They remain invariant under coordinate changes, rotations, and permutations, ensuring independence from input-output partitions. Moreover, they naturally encode complexity-misfit trade-offs for Linear Time-Invariant (LTI) behaviors, providing a principled solution to a longstanding puzzle in behavioral systems theory. The resulting framework characterizes modeling as a minimum distance problem, identifying the Most Powerful Unfalsified Model (MPUM) as optimal among all systems unfalsified by a given dataset.
- [90] arXiv:2503.22896 (cross-list from math.AP) [pdf, html, other]
-
Title: Representation and Stability Analysis of 1D PDEs with Periodic Boundary Conditions
Subjects: Analysis of PDEs (math.AP); Systems and Control (eess.SY); Optimization and Control (math.OC)
Periodic boundary conditions are frequently used to model processes on large or infinite domains using PDEs on finite intervals, assuming solutions within the interval to extend periodically to the larger domain. However, stability analysis of PDEs with periodic boundary conditions is complicated by underlying uniform solutions admitted by these conditions, potentially giving rise to non-isolated equilibria. To resolve this issue, in this paper, it is shown how such underlying solutions for linear, second-order, 1D PDEs with periodic as well as more general boundary conditions can be modeled separately using the Partial Integral Equation (PIE) representation. In particular, it is first shown how any vector-valued function satisfying such boundary conditions is uniquely defined by its second-order derivative and some uniform or affine function, parameterized by auxiliary variables in $\mathbb{R}^{m}$. An equivalent representation of linear PDEs is then derived as a PIE, explicitly defining the dynamics of both the second-order derivative and auxiliary variables. Finally, a stability test for the PIE representation is formulated as a linear operator inequality, which can be solved using semidefinite programming. The proposed methodology is applied to two PDE examples, demonstrating that stability can be verified with tight bounds on the rate of exponential decay.
- [91] arXiv:2503.22911 (cross-list from physics.med-ph) [pdf, other]
Title: Development of a Miniaturized, Automated, and Cost-Effective Device for Enzyme-Linked Immunosorbent Assay
Comments: References on page 12, before tables and figures
Subjects: Medical Physics (physics.med-ph); Systems and Control (eess.SY)
In this work, a miniaturized, automated, and cost-effective ELISA device is designed and implemented without conventional techniques such as pipetting or microfluidic valve technologies. The device measures 24 cm x 19 cm x 14 cm and weighs less than 3 kg. The total hardware cost of the device is estimated at approximately $1,200, which can be further reduced through optimization during scale-up production. 3D-printed disposable parts, including the reagent reservoir disk and the microfluidic connector, have also been developed. IL-6 is used as a model system to demonstrate how the device performs an ELISA measurement. The cost per test is estimated to be less than ten dollars. The compactness, automated operation, and cost-effectiveness of this ELISA device make it suitable for point-of-care applications in resource-limited regions.
- [92] arXiv:2503.22928 (cross-list from math.OC) [pdf, html, other]
Title: Optimal Control of an Epidemic with Intervention Design
Comments: For code and computational details in Python, please refer to this https URL%20With%20Intervention/Epidemic.ipynb
Subjects: Optimization and Control (math.OC); Theoretical Economics (econ.TH); Systems and Control (eess.SY)
In this paper, I propose a controlled SEIR model that advances epidemic management through optimal control theory. I improve the traditional framework by incorporating practical intervention constraints and economic considerations. Approaching this problem with modern methods from the calculus of variations, I first conduct a rigorous mathematical analysis of the controlled system. Then, I formulate an infinite-time-horizon control problem and investigate its mathematical connections with its finite-horizon counterpart, setting the stage for applying the Hamiltonian procedure.
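To make "controlled SEIR model" concrete, here is a minimal sketch assuming the common formulation in which an intervention intensity u(t) in [0, 1] scales the transmission rate by (1 - u(t)). The parameter values, the forward-Euler integrator, and the quadratic effort penalty in `running_cost` are illustrative assumptions, not the paper's model or objective.

```python
from typing import Callable, List, Tuple

# (S, E, I, R) expressed as fractions of a fixed total population.
State = Tuple[float, float, float, float]


def controlled_seir(
    beta: float,
    sigma: float,
    gamma: float,
    control: Callable[[float], float],
    x0: State,
    t_end: float,
    dt: float = 0.1,
) -> List[State]:
    """Forward-Euler simulation of an SEIR model in which an intervention
    u(t) in [0, 1] scales the transmission rate by (1 - u(t))."""
    s, e, i, r = x0
    trajectory = [x0]
    for k in range(int(round(t_end / dt))):
        u = min(max(control(k * dt), 0.0), 1.0)  # keep the control admissible
        new_infections = (1.0 - u) * beta * s * i
        ds = -new_infections
        de = new_infections - sigma * e
        di = sigma * e - gamma * i
        dr = gamma * i
        s, e, i, r = s + dt * ds, e + dt * de, i + dt * di, r + dt * dr
        trajectory.append((s, e, i, r))
    return trajectory


def running_cost(trajectory: List[State], control: Callable[[float], float],
                 dt: float, effort_weight: float) -> float:
    """Left-Riemann approximation of J = integral of (I(t) + w * u(t)^2) dt,
    a hypothetical stand-in for an economic objective."""
    return sum(
        (state[2] + effort_weight * control(k * dt) ** 2) * dt
        for k, state in enumerate(trajectory[:-1])
    )
```

For example, with beta = 0.5, sigma = 0.2, gamma = 0.1, a constant u = 0.5 halves the effective reproduction number and lowers the infection peak relative to u = 0; an optimal-control formulation instead chooses u(t) to trade that reduction against the effort cost.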
- [93] arXiv:2503.22969 (cross-list from math.OC) [pdf, html, other]
Title: An Adaptive Collaborative Neurodynamic Approach to Compute Nash Equilibrium in Normal-Form Games
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
The Nash Equilibrium (NE), one of the most elegant and fundamental concepts in game theory, plays a crucial role in various fields, including engineering and computer science. However, efficiently computing an NE in normal-form games remains a significant challenge, particularly for large-scale problems. In contrast to widely applied simplicial and homotopy methods, this paper designs a novel Adaptive Collaborative Neurodynamic Approach (ACNA), which for the first time guarantees both exact and global NE computation for general $N$-player normal-form games with mixed strategies, where the payoff functions are non-convex and the pseudo-gradient is non-monotone. Additionally, by leveraging the adaptive penalty method, the ACNA ensures that its state enters the constraint set in finite time, which avoids both the second-order sufficiency conditions required by Lagrangian methods and the computationally complicated penalty parameter estimation needed by exact penalty methods. Furthermore, by incorporating the particle swarm algorithm, it is demonstrated that the ACNA achieves global convergence to an exact NE with probability one. Finally, a simulation is conducted to validate the effectiveness of the proposed approach.
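What "exact NE" means can be checked directly: a mixed profile is an equilibrium exactly when no player gains by deviating unilaterally to a pure strategy. The sketch below computes that gap for the two-player (bimatrix) special case; it only verifies a candidate profile and does not implement the ACNA dynamics or handle N > 2 players.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def _row_payoffs(a: Matrix, y: Sequence[float]) -> List[float]:
    """Payoff of each pure row strategy against the column player's mix y."""
    return [sum(a_ij * y_j for a_ij, y_j in zip(row, y)) for row in a]


def _col_payoffs(b: Matrix, x: Sequence[float]) -> List[float]:
    """Payoff of each pure column strategy against the row player's mix x."""
    return [sum(x[i] * b[i][j] for i in range(len(x))) for j in range(len(b[0]))]


def nash_gap(a: Matrix, b: Matrix, x: Sequence[float], y: Sequence[float]) -> float:
    """Largest gain either player can get by a unilateral deviation.

    The gap is zero exactly when (x, y) is a mixed-strategy Nash equilibrium
    of the bimatrix game (a, b), and positive otherwise."""
    row = _row_payoffs(a, y)
    col = _col_payoffs(b, x)
    value_row = sum(xi * ri for xi, ri in zip(x, row))
    value_col = sum(yj * cj for yj, cj in zip(y, col))
    return max(max(row) - value_row, max(col) - value_col)
```

In matching pennies, the uniform profile has gap 0, while a row player who always plays the first action leaves the column player a gain of 1.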
- [94] arXiv:2503.23102 (cross-list from cs.LG) [pdf, html, other]
Title: The geomagnetic storm and Kp prediction using Wasserstein transformer
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Mathematical Physics (math-ph)
Accurate forecasting of geomagnetic activity is important. In this work, we present a novel multimodal Transformer-based framework for predicting the planetary Kp index 3 and 5 days ahead by integrating heterogeneous data sources, including satellite measurements, solar images, and Kp time series. A key innovation is the incorporation of the Wasserstein distance into the Transformer and the loss function to align the probability distributions across modalities. Comparative experiments against the NOAA model demonstrate the framework's performance, accurately capturing both the quiet and storm phases of geomagnetic activity. This study underscores the potential of integrating machine learning techniques with traditional models for improved real-time forecasting.
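For readers unfamiliar with the Wasserstein distance, its one-dimensional empirical form is simple enough to write out. The sketch assumes two equally weighted samples of the same size; how the paper embeds the distance in the Transformer and loss (and any differentiable approximation it may use) is not reproduced here.

```python
from typing import Sequence


def wasserstein_1d(u: Sequence[float], v: Sequence[float]) -> float:
    """Wasserstein-1 distance between two empirical distributions on the real
    line with the same number of equally weighted samples.

    In 1D the optimal transport plan matches sorted samples, so W1 reduces to
    the mean absolute difference of the order statistics."""
    if len(u) != len(v) or not u:
        raise ValueError("expected two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)
```

Because it compares whole distributions rather than matched pairs, W1 is 0 for two samples that are permutations of each other, e.g. predicted and observed Kp values listed in a different order.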
- [95] arXiv:2503.23103 (cross-list from cs.IT) [pdf, html, other]
Title: Towards Secure Semantic Communications in the Presence of Intelligent Eavesdroppers
Subjects: Information Theory (cs.IT); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Semantic communication (SemCom) has emerged as a promising paradigm for enhancing communication efficiency in sixth-generation (6G) networks. However, the broadcast nature of wireless channels makes SemCom systems vulnerable to eavesdropping, which poses a serious threat to data privacy. Therefore, we investigate secure SemCom systems that preserve data privacy in the presence of eavesdroppers. Specifically, we first explore a scenario where eavesdroppers are intelligent and can exploit semantic information to reconstruct the transmitted data using advanced artificial intelligence (AI) techniques. To characterize this threat, we introduce novel eavesdropping attack strategies that utilize model inversion attacks and generative AI (GenAI) models. These strategies effectively reconstruct transmitted private data processed by the semantic encoder in both glass-box and closed-box settings. Existing defense mechanisms against eavesdropping often cause significant distortions in the data reconstructed by eavesdroppers, potentially arousing their suspicion. To address this, we propose a semantic covert communication approach that leverages an invertible neural network (INN)-based signal steganography module. This module covertly embeds the channel input signal of a private sample into that of a non-sensitive host sample, thereby misleading eavesdroppers. Without access to this module, eavesdroppers can only extract host-related information and remain unaware of the hidden private content. We conduct extensive simulations under various channel conditions in image transmission tasks. Numerical results show that while conventional eavesdropping strategies achieve a success rate of over 80% in reconstructing private information, the proposed semantic covert communication reduces the eavesdropping success rate to 0%.
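Invertibility is what allows a legitimate receiver holding the INN module to recover the embedded signal. A minimal sketch of the standard affine coupling layer (the building block of RealNVP-style INNs) shows why the inverse is exact even when the conditioning functions are arbitrary. The conditioners `_scale` and `_shift` are hypothetical stand-ins for small neural networks, and nothing here reproduces the paper's host/private embedding.

```python
import math
from typing import List, Sequence, Tuple


def _scale(x1: Sequence[float]) -> float:
    # Hypothetical conditioner; in an INN this would be a learned network.
    return 0.5 * math.tanh(sum(x1))


def _shift(x1: Sequence[float]) -> float:
    # Hypothetical conditioner for the additive part.
    return 0.1 * sum(v * v for v in x1)


def coupling_forward(x1: Sequence[float], x2: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Affine coupling: x1 passes through unchanged and conditions an
    invertible elementwise affine map applied to x2."""
    s, t = _scale(x1), _shift(x1)
    return list(x1), [v * math.exp(s) + t for v in x2]


def coupling_inverse(y1: Sequence[float], y2: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Exact inverse: y1 equals x1, so s and t can be recomputed and undone."""
    s, t = _scale(y1), _shift(y1)
    return list(y1), [(v - t) * math.exp(-s) for v in y2]
```

The conditioners never need to be inverted themselves, which is why arbitrary networks can be used inside them; stacking such layers with alternating splits gives a fully invertible network.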
- [96] arXiv:2503.23128 (cross-list from cs.SD) [pdf, html, other]
Title: CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining
Comments: Accepted by ICME2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.
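Cross-modal contrastive frameworks of this kind are commonly trained with an InfoNCE-style loss, in which each music clip must score its own description higher than the other descriptions in the batch. The sketch below shows the one-directional (music-to-text) version with cosine similarity and a temperature `tau`. The exact loss, temperature, and encoders used by the paper are not stated in the abstract, so treat this as a generic illustration.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(music: List[Sequence[float]], text: List[Sequence[float]], tau: float = 0.07) -> float:
    """Music-to-text InfoNCE loss: music[i] should be closest to text[i]
    among all text embeddings in the batch (other rows act as negatives)."""
    total = 0.0
    for i, m in enumerate(music):
        logits = [_cosine(m, t) / tau for t in text]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[i]
    return total / len(music)
```

A symmetric variant averages this with the text-to-music direction; either way, correctly paired embeddings drive the loss toward zero.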
- [97] arXiv:2503.23148 (cross-list from physics.optics) [pdf, html, other]
Title: Reducing Artifacts in Grating Interferometry Using Multiple Harmonics and Phase Step Corrections
Authors: Hunter C. Meyer, Conner B. Dooley, Victoria L. Fontenot, Kyungmin Ham, Leslie G. Butler, Alexandra Noel, Joyoni Dey
Subjects: Optics (physics.optics); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
X-ray interferometry is an emerging imaging modality with a wide variety of potential applications, including clinical lung and breast imaging as well as non-destructive testing, such as additive manufacturing and porosimetry. A grating interferometer uses a diffraction grating to produce a periodic interference pattern and measures how a patient or sample perturbs the pattern, producing three distinct images that highlight X-ray absorption, refraction, and small-angle scattering, known as the transmission, differential-phase, and dark-field images, respectively. Artifacts unique to X-ray interferometry are introduced when the fringe pattern is assumed to be perfectly sinusoidal and the phase steps are assumed to be evenly spaced. Inaccuracies in grating position, coupled with multi-harmonic fringes, lead to remnant oscillations and phase wraparound artifacts. We have developed an image recovery algorithm that uses additional harmonics, direct relative phase fitting, and phase step corrections to prevent these artifacts. The direct relative phase fitting removes the phase wraparound artifact. Correcting the phase step positions and introducing the additional harmonic removes the grating remnant artifact present in the transmission, differential-phase, and dark-field images. By modifying existing algorithms, the fit to the fringe pattern is greatly improved and artifacts are minimized, as we demonstrate by imaging several samples, including PMMA microspheres, ex vivo formalin-fixed mouse lungs, and porous alumina.
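The idea of fitting more than one harmonic at known, unevenly spaced phase steps can be sketched as an ordinary least-squares problem: model the per-pixel stepping curve as I(p) = a0 + sum_h [a_h cos(hp) + b_h sin(hp)] and solve for the coefficients. This is a generic multi-harmonic fit under the assumption that the corrected step positions are already known; it is not the authors' algorithm, which also estimates the step corrections and fits the relative phase directly.

```python
import math
from typing import List, Sequence, Tuple


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def _basis(p: float, harmonics: int) -> List[float]:
    # Design-matrix row: [1, cos p, sin p, cos 2p, sin 2p, ...]
    row = [1.0]
    for h in range(1, harmonics + 1):
        row += [math.cos(h * p), math.sin(h * p)]
    return row


def fit_fringe(phases: Sequence[float], intensities: Sequence[float], harmonics: int = 2) -> List[float]:
    """Least-squares fit via the normal equations at the measured (possibly
    unevenly spaced) phase-step positions."""
    rows = [_basis(p, harmonics) for p in phases]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, intensities)) for i in range(k)]
    return _solve(ata, aty)


def fringe_parameters(coeffs: Sequence[float]) -> Tuple[float, float, float]:
    """Mean, first-harmonic phase and visibility, using I = a0 + A1 cos(p + phi) + ...,
    so that a1 = A1 cos(phi) and b1 = -A1 sin(phi)."""
    a0, a1, b1 = coeffs[0], coeffs[1], coeffs[2]
    return a0, math.atan2(-b1, a1), math.hypot(a1, b1) / a0
```

Applying the fit to sample and reference scans, the usual image definitions are transmission = mean_sample / mean_ref, differential phase = phi_sample - phi_ref (wrapped), and dark field = visibility_sample / visibility_ref.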
- [98] arXiv:2503.23258 (cross-list from cs.SD) [pdf, html, other]
Title: Joint Source-Environment Adaptation of Data-Driven Underwater Acoustic Source Ranging Based on Model Uncertainty
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Adapting pre-trained deep learning models to new and unknown environments is a difficult challenge in underwater acoustic localization. We show that although the performance of pre-trained models suffers from mismatch between the training and test data, they generally exhibit a higher "implied uncertainty" in environments with more mismatch. Leveraging this notion of implied uncertainty, we partition the test samples into more certain and less certain sets and implement an estimation method that uses the certain samples to improve the labeling of the uncertain samples, which helps to adapt the model. We use an efficient method to quantify model prediction uncertainty and an innovative approach to adapt a pre-trained model to unseen underwater environments at test time. This eliminates the need for labeled data from the target environment or the original training data. The adaptation is enhanced by integrating an independent estimate based on the received signal energy. We validate the approach extensively using real experimental data, as well as synthetic data consisting of model-generated signals with real ocean noise. The results demonstrate significant improvements in model prediction accuracy, underscoring the potential of the method to enhance underwater acoustic localization in diverse, noisy, and unknown environments.
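The partition into "more certain" and "less certain" test samples can be illustrated by ranking predictions by entropy. This sketch assumes the model outputs a probability distribution over discretized range bins and uses predictive entropy as the uncertainty score. The abstract does not say which uncertainty measure the paper uses, so entropy and the 50% split are stand-ins.

```python
import math
from typing import List, Sequence, Tuple


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_uncertainty(
    prob_rows: List[Sequence[float]], certain_fraction: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Return (certain, uncertain) sample indices, ranking samples by entropy."""
    order = sorted(range(len(prob_rows)), key=lambda k: predictive_entropy(prob_rows[k]))
    cut = int(round(certain_fraction * len(prob_rows)))
    return order[:cut], order[cut:]
```

Peaked distributions (low entropy) land in the certain set, whose estimates could then serve as pseudo-labels for the uncertain set.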
- [99] arXiv:2503.23260 (cross-list from cs.SD) [pdf, html, other]
Title: Mismatch-Robust Underwater Acoustic Localization Using A Differentiable Modular Forward Model
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
In this paper, we study underwater acoustic localization in the presence of environmental mismatch. Specifically, we use a pre-trained neural network model of acoustic wave propagation within a gradient-based optimization framework to estimate the source location. To alleviate the effect of mismatch between the training and test data, we simultaneously optimize over the network weights at inference time and provide conditions under which this method is effective. Moreover, we introduce a physics-inspired modularity in the forward model that enables us to learn the path lengths of the multipath structure through end-to-end training without access to the specific path labels. We investigate the validity of the assumptions in a simple yet illustrative environment model.
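The core loop of localization through a differentiable forward model (predict observables from a candidate source position, then descend the misfit gradient) can be shown with a toy two-path model. The sketch below replaces the paper's pre-trained network with a hypothetical analytic model: an isovelocity half-space in which the direct and surface-reflected path lengths follow from the image-source construction. Only the source range is estimated; the network-weight updates described in the abstract are omitted.

```python
import math
from typing import Sequence, Tuple


def path_lengths(r: float, zs: float, zr: float) -> Tuple[float, float]:
    """Direct and surface-reflected path lengths (m) for source depth zs,
    receiver depth zr and horizontal range r (hypothetical toy model)."""
    direct = math.hypot(r, zs - zr)
    surface = math.hypot(r, zs + zr)  # image source mirrored about the sea surface
    return direct, surface


def estimate_range(observed: Sequence[float], zs: float, zr: float,
                   r0: float, lr: float = 0.1, iters: int = 500) -> float:
    """Gradient descent on the squared misfit between observed and modeled path lengths."""
    r = r0
    for _ in range(iters):
        d, s = path_lengths(r, zs, zr)
        # d/dr hypot(r, h) = r / hypot(r, h)
        grad = 2.0 * (d - observed[0]) * r / d + 2.0 * (s - observed[1]) * r / s
        r -= lr * grad
    return r
```

With a neural forward model the gradient would come from automatic differentiation instead of the closed-form derivative used here.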
- [100] arXiv:2503.23262 (cross-list from cs.SD) [pdf, html, other]
Title: Joint Source-Environment Adaptation for Deep Learning-Based Underwater Acoustic Source Ranging
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
In this paper, we propose a method to adapt a pre-trained deep-learning-based model for underwater acoustic localization to a new environment. We use unsupervised domain adaptation to improve the generalization performance of the model; that is, we fine-tune the pre-trained network parameters with an unsupervised loss, without access to any labels from the target environment or to any data used to pre-train the model. The method improves the pre-trained model's prediction by coupling it with a nearly independent estimate based on the received signal energy (which depends on the source). We show the effectiveness of this approach on Bellhop-generated data in an environment similar to that of the SWellEx-96 experiment, contaminated with real ocean noise from the KAM11 experiment.
- [101] arXiv:2503.23377 (cross-list from cs.CV) [pdf, html, other]
Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Comments: Work in progress. Homepage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. We also devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at this https URL.
- [102] arXiv:2503.23387 (cross-list from cs.SD) [pdf, other]
Title: HearFit+: Personalized Fitness Monitoring via Audio Signals on Smart Speakers
Comments: IEEE Transactions on Mobile Computing (Volume: 22, Issue: 5, 01 May 2023)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Fitness training can help strengthen muscles, increase resistance to disease, and improve body shape. Many people now choose to exercise at home or in the office rather than at the gym due to lack of time. However, it is difficult for them to achieve good fitness results without professional guidance. Motivated by this, we propose HearFit+, the first personalized fitness monitoring system using smart speakers at home or in the office. We explore the feasibility of using acoustic sensing to monitor fitness. We design a fitness detection method based on Doppler shift and adopt short-time energy to segment fitness actions. Based on deep learning, HearFit+ performs fitness classification and user identification simultaneously. Combined with incremental learning, it allows users to easily add new actions. We design four evaluation metrics (duration, intensity, continuity, and smoothness) to help users improve their fitness results. In extensive experiments covering over 9,000 actions of 10 fitness types from 12 volunteers, HearFit+ achieves an average accuracy of 96.13% for fitness classification and 91% for user identification. All volunteers confirm that HearFit+ can help improve the fitness effect in various environments.
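Short-time energy segmentation is straightforward to sketch: compute the energy of successive frames and report runs of frames above a threshold as candidate actions. The frame length, hop, and threshold below are hypothetical, and the input is a generic 1D signal; HearFit+ presumably applies this to features derived from the Doppler-shifted echo, which is not reproduced here.

```python
from typing import List, Optional, Sequence, Tuple


def short_time_energy(signal: Sequence[float], frame: int, hop: int) -> List[float]:
    """Sum of squared samples in each analysis frame."""
    return [
        sum(x * x for x in signal[start:start + frame])
        for start in range(0, len(signal) - frame + 1, hop)
    ]


def segment_actions(energies: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Half-open [start, end) frame ranges whose energy stays at or above threshold."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for k, e in enumerate(energies):
        if e >= threshold and start is None:
            start = k
        elif e < threshold and start is not None:
            segments.append((start, k))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments
```

In practice the threshold would be calibrated against the background energy, and very short segments would be merged or discarded.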
- [103] arXiv:2503.23391 (cross-list from cs.SD) [pdf, other]
Title: HearSmoking: Smoking Detection in Driving Environment via Acoustic Sensing on Smartphones
Comments: IEEE Transactions on Mobile Computing (Volume: 21, Issue: 8, 01 August 2022)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Driving safety has drawn much public attention in recent years due to the fast-growing number of cars. Smoking is one of the threats to driving safety but is often ignored by drivers. Existing smoking detection approaches are either contact-based or need additional devices. This motivates us to explore the practicability of using smartphones to detect smoking events in a driving environment. In this paper, we propose a cigarette smoking detection system, named HearSmoking, which uses only the acoustic sensors on smartphones to improve driving safety. After investigating typical smoking habits of drivers, including hand movement and chest fluctuation, we design an acoustic signal to be emitted by the speaker and received by the microphone. We calculate the Relative Correlation Coefficient of the received signals to obtain movement patterns of the hands and chest. The processed data are fed into a trained Convolutional Neural Network to classify hand movements. We also design a method to detect respiration at the same time. To improve system performance, we further analyse the periodicity of the composite smoking motion. In extensive experiments in real driving environments, HearSmoking detects smoking events in real time with an average total accuracy of 93.44 percent.
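Periodicity analysis of a repeated motion is often done with the sample autocorrelation: the lag with the strongest self-similarity estimates the repetition period. The sketch below is a generic stand-in under that assumption. The abstract does not specify HearSmoking's periodicity method or its Relative Correlation Coefficient, and the lag bounds are hypothetical.

```python
from typing import Sequence


def autocorrelation(x: Sequence[float], lag: int) -> float:
    """Normalized (biased) sample autocorrelation at a given lag."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def estimate_period(x: Sequence[float], min_lag: int, max_lag: int) -> int:
    """Lag (in samples) with the highest autocorrelation within [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(x, lag))
```

Bounding the lag search to physically plausible periods (e.g. the expected interval between puffs) avoids picking multiples of the true period.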
- [104] arXiv:2503.23393 (cross-list from cs.SD) [pdf, other]
Title: D3-Guard: Acoustic-based Drowsy Driving Detection Using Smartphones
Comments: IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Since the number of cars has grown rapidly in recent years, driving safety draws more and more public attention. Drowsy driving is one of the biggest threats to driving safety. Therefore, a simple but robust system that can detect drowsy driving with commercial off-the-shelf devices (such as smartphones) is needed. With this motivation, we explore the feasibility of using only the acoustic sensors embedded in smartphones to detect drowsy driving. We first study the characteristics of drowsy driving and find unique patterns of Doppler shift caused by three typical drowsy behaviors, i.e., nodding, yawning, and operating the steering wheel. We then validate these findings through empirical analysis of driving data collected in real driving environments. We further propose a real-time Drowsy Driving Detection system (D3-Guard) based on the audio devices embedded in smartphones. To improve the performance of our system, we adopt an effective feature extraction method based on an undersampling technique and FFT, and carefully design a high-accuracy detector based on LSTM networks for the early detection of drowsy driving. In extensive experiments with 5 volunteer drivers in real driving environments, our system distinguishes drowsy driving actions in real time with an average total accuracy of 93.31%. Over 80% of drowsy driving actions can be detected within the first 70% of the action duration.
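The size of the Doppler patterns involved follows from the two-way reflection geometry: a tone at f0 reflected by a body part moving toward the phone at speed v returns shifted by roughly 2 v f0 / c. The sketch computes that shift and locates the dominant frequency of a signal with a naive DFT. The 20 kHz tone, 0.5 m/s speed, and 343 m/s speed of sound are illustrative assumptions, and the paper's undersampling step is not reproduced; a real implementation would use an FFT library.

```python
import cmath
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 degrees C (assumed)


def doppler_shift(f0: float, v: float, c: float = SPEED_OF_SOUND) -> float:
    """Frequency shift (Hz) of a tone at f0 reflected by a body moving toward
    the phone at speed v; the two-way path gives about 2*v*f0/c when v << c."""
    return 2.0 * v * f0 / c


def dominant_frequency(samples: Sequence[float], fs: float) -> float:
    """Frequency of the strongest non-DC bin of a naive O(n^2) DFT."""
    n = len(samples)

    def magnitude(k: int) -> float:
        return abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))

    best = max(range(1, n // 2), key=magnitude)
    return best * fs / n
```

At 20 kHz, a nod at 0.5 m/s shifts the echo by only about 58 Hz, which is why fine frequency resolution around the carrier matters for this kind of sensing.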
- [105] arXiv:2503.23395 (cross-list from cs.SD) [pdf, html, other]
Title: Scaling Auditory Cognition via Test-Time Compute in Audio Language Models
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs with diverse auditory cognitive scenes is difficult due to the limited datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing these TTC methods to improve the auditory capabilities of Audio LLMs. This study aims to address these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing their capabilities using TTC approaches. We have investigated five different Audio LLMs for auditory cognition using a self-collected database and have proposed five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that the performance of Audio LLMs decreases in more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance cognitive auditory capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.
- [106] arXiv:2503.23439 (cross-list from cs.CL) [pdf, html, other]
Title: Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Comments: Preprint
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD), the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects non-speaking units in real time on local devices, and a high-performance Wav2vec-based model running on the server, which handles the more challenging task of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computation low. Datasets and code will be made available after review.
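The efficiency argument rests on a cascade: the cheap local model screens every frame, and the expensive server model runs only when the local model reports a non-speaking unit. The sketch below shows that gating with hypothetical callables `local_is_pause` and `server_is_turn_end` standing in for the GRU and Wav2vec models. The exact escalation rule used by SpeculativeETD may differ.

```python
from typing import Callable, List, Sequence, Tuple


def speculative_etd(
    frames: Sequence[object],
    local_is_pause: Callable[[object], bool],
    server_is_turn_end: Callable[[object], bool],
) -> Tuple[List[bool], int]:
    """Two-stage end-turn detection.

    A cheap on-device detector screens every frame; only frames it flags as
    non-speaking are escalated to the expensive server model, which decides
    whether the pause is a real turn end. Returns per-frame decisions and the
    number of server calls."""
    decisions: List[bool] = []
    server_calls = 0
    for frame in frames:
        if not local_is_pause(frame):
            decisions.append(False)  # user still speaking: cannot be a turn end
            continue
        server_calls += 1
        decisions.append(server_is_turn_end(frame))
    return decisions, server_calls
```

Since speech frames dominate a typical conversation, most frames never reach the server, which is where the computational savings come from.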
- [107] arXiv:2503.23446 (cross-list from cs.NI) [pdf, html, other]
Title: Semantic Communication for the Internet of Space: New Architecture, Challenges, and Future Vision
Comments: 9 pages, 6 figures
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Signal Processing (eess.SP)
The expansion of sixth-generation (6G) wireless networks into space introduces technical challenges that conventional bit-oriented communication approaches cannot efficiently address, including intermittent connectivity, severe latency, limited bandwidth, and constrained onboard resources. To overcome these limitations, semantic communication has emerged as a transformative paradigm, shifting the communication focus from transmitting raw data to delivering context-aware, mission-relevant information. In this article, we propose a semantic communication architecture explicitly tailored for the 6G Internet of Space (IoS), integrating multi-modal semantic processing, AI-driven semantic encoding and decoding, and adaptive transmission mechanisms optimized for space environments. The effectiveness of our proposed framework is demonstrated through a representative deep-space scenario involving semantic-based monitoring of Mars dust storms. Finally, we outline open research challenges and discuss future directions toward realizing practical semantic-enabled IoS systems.
- [108] arXiv:2503.23470 (cross-list from cs.SD) [pdf, html, other]
Title: Evaluation of the Pronunciation of Tajweed Rules Based on DNN as a Step Towards Interactive Recitation Learning
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Proper recitation of the Quran, adhering to the rules of Tajweed, is crucial for preventing mistakes during recitation and requires significant effort to master. Traditional methods of teaching these rules are limited by the availability of qualified instructors and time constraints. Automatic evaluation of recitation can address these challenges by providing prompt feedback and supporting independent practice. This study focuses on developing a deep learning model to classify three Tajweed rules - separate stretching (Al Mad), tight noon (Ghunnah), and hide (Ikhfaa) - using the publicly available QDAT dataset, which contains over 1,500 audio recordings. The input data consisted of audio recordings from this dataset, transformed into normalized mel-spectrograms. For classification, the EfficientNet-B0 architecture was used, enhanced with a Squeeze-and-Excitation attention mechanism. The developed model achieved accuracy rates of 95.35%, 99.34%, and 97.01% for the respective rules. An analysis of the learning curves confirmed the model's robustness and absence of overfitting. The proposed approach demonstrates high efficiency and paves the way for developing interactive educational systems for Tajweed study.
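The mel-spectrogram input rests on the mel scale, which spaces filters evenly in perceived pitch rather than in hertz. The sketch uses the common HTK-style formula and a per-spectrogram z-score as one plausible reading of "normalized". The abstract specifies neither the mel variant, the number of bands, nor the normalization, so all of these choices are assumptions.

```python
import math
from typing import List, Sequence


def hz_to_mel(f: float) -> float:
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Centers (Hz) of n_mels triangular filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_mels)]


def standardize(values: Sequence[float]) -> List[float]:
    """Zero-mean, unit-variance normalization of log-mel values (one common choice)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]
```

The centers crowd toward low frequencies, giving finer resolution where much of the voiced speech energy lies; libraries such as librosa provide full filterbank implementations.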
- [109] arXiv:2503.23507 (cross-list from cs.CV) [pdf, html, other]
Title: Federated Self-Supervised Learning for One-Shot Cross-Modal and Cross-Imaging Technique Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
Decentralized federated learning enables learning of data representations from multiple sources without compromising the privacy of the clients. In applications like medical image segmentation, where obtaining a large annotated dataset from a single source is a challenging problem, federated self-supervised learning can help. In this work, we push the limits further by exploring a federated self-supervised one-shot segmentation task, representing an even more data-scarce scenario. We adopt a pre-existing self-supervised few-shot segmentation framework, CoWPro, and adapt it to the federated learning scenario. To the best of our knowledge, this work is the first to attempt a self-supervised few-shot segmentation task in the federated learning domain. Moreover, we consider clients whose data come from different modalities and imaging techniques, such as MR or CT, which makes the problem even harder. Additionally, we reinforce the baseline CoWPro method with a fused Dice loss, which yields a considerable performance improvement over the baseline. Finally, we evaluate this novel framework on a completely unseen held-out part of the local client dataset. We observe that the proposed framework achieves performance on par with or better than the FedAvg version of the CoWPro framework on the held-out validation dataset.
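Two standard ingredients mentioned here are easy to state in code: FedAvg aggregation, which averages client parameters weighted by local dataset size, and the soft Dice loss used for segmentation. Parameters are flattened lists, which is an assumption for illustration. The abstract does not say what the "fused" Dice loss combines, so only the plain soft Dice loss is shown.

```python
from typing import List, Sequence


def fedavg(client_weights: List[Sequence[float]], client_sizes: Sequence[int]) -> List[float]:
    """FedAvg aggregation: average client parameter vectors weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def soft_dice_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """1 - soft Dice coefficient between predicted foreground probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)
```

Weighting by dataset size keeps a small client, such as one with few CT volumes, from pulling the global model as strongly as a large one.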
- [110] arXiv:2503.23561 (cross-list from cs.LG) [pdf, html, other]
Title: Bridging conformal prediction and scenario optimization
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Conformal prediction and scenario optimization constitute two important classes of statistical learning frameworks to certify decisions made using data. They have found numerous applications in control theory, machine learning, and robotics. Despite intense research in both areas, and apparently similar results, a clear connection between these two frameworks has not been established. By focusing on so-called vanilla conformal prediction, we show rigorously how to choose appropriate score functions and set-predictor maps to recover well-known bounds on the probability of constraint violation associated with scenario programs. We also show how to treat the ranking of nonconformity scores as a one-dimensional scenario program with discarded constraints, and use this connection to recover vanilla conformal prediction guarantees on the validity of the set predictor. Building on the main developments of the scenario approach, we further show how calibration-conditional conformal prediction can be analyzed through this lens. Our results establish a theoretical bridge between conformal prediction and scenario optimization.
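Vanilla (split) conformal prediction reduces to one ranking step. Sort the n calibration nonconformity scores and keep the ceil((n + 1)(1 - alpha))-th smallest as the threshold. Under exchangeability, the resulting set then covers a fresh label with probability at least 1 - alpha. Keeping that order statistic amounts to discarding the largest scores above it, which is the ranking step the abstract reinterprets as a one-dimensional scenario program with discarded constraints. The absolute-residual score in `interval` is one common choice, assumed for illustration.

```python
import math
from typing import Sequence, Tuple


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration
    nonconformity score (infinite if that rank exceeds n)."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def interval(prediction: float, threshold: float) -> Tuple[float, float]:
    """Prediction set for absolute-residual scores |y - prediction|."""
    return prediction - threshold, prediction + threshold
```

With 19 calibration scores and alpha = 0.1, the 18th smallest score becomes the threshold. With alpha = 0.01 the required rank exceeds n, so the only valid set is unbounded.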
- [111] arXiv:2503.23600 (cross-list from math.OC) [pdf, html, other]
Title: Online Convex Optimization and Integral Quadratic Constraints: A new approach to regret analysis
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
We propose a novel approach for analyzing the dynamic regret of first-order constrained online convex optimization (OCO) algorithms for strongly convex and Lipschitz-smooth objectives. Crucially, we provide a general analysis that is applicable to a wide range of first-order algorithms that can be expressed as an interconnection of a linear dynamical system in feedback with a first-order oracle. By leveraging Integral Quadratic Constraints (IQCs), we derive a semidefinite program which, when feasible, provides a regret guarantee for the online algorithm. For this purpose, we introduce variational IQCs as the generalization of IQCs to time-varying monotone operators. Our bounds capture the temporal rate of change of the problem in the form of the path length of the time-varying minimizer and the objective function variation. In contrast to standard results in OCO, our results require neither gradient boundedness nor a bounded feasible set. Numerical analyses showcase the ability of the approach to capture the dependence of the regret on the function class condition number.
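The quantities in these bounds, dynamic regret and the path length of the time-varying minimizer, can be computed directly on a toy problem. The sketch runs online gradient descent on scalar quadratics f_t(x) = (mu/2)(x - theta_t)^2. It is an unconstrained one-dimensional illustration chosen for transparency, not the paper's IQC-based analysis or a constrained algorithm.

```python
from typing import Sequence, Tuple


def ogd_dynamic_regret(minimizers: Sequence[float], mu: float, step: float,
                       x0: float) -> Tuple[float, float]:
    """Online gradient descent on f_t(x) = (mu/2)(x - theta_t)^2.

    Returns (dynamic regret, path length): the regret sums f_t(x_t) - f_t(theta_t)
    (each f_t has minimum value 0), and the path length sums |theta_t - theta_{t-1}|."""
    x = x0
    regret = 0.0
    path = 0.0
    for t, theta in enumerate(minimizers):
        regret += 0.5 * mu * (x - theta) ** 2  # loss incurred before the gradient is revealed
        x -= step * mu * (x - theta)           # gradient step on f_t
        if t > 0:
            path += abs(theta - minimizers[t - 1])
    return regret, path
```

With step = 1/mu the iterate lands on each minimizer one round late, so a static minimizer costs only the first round, while a minimizer drifting by one unit per round adds mu/2 of regret per round. This is the growth with path length that dynamic-regret bounds capture.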
- [112] arXiv:2503.23641 (cross-list from math.OC) [pdf, html, other]
Title: Remarks on the Polyak-Lojasiewicz inequality and the convergence of gradient systems
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
This work explores generalizations of the Polyak-Lojasiewicz inequality (PLI) and their implications for the convergence behavior of gradient flows in optimization problems. Motivated by the continuous-time linear quadratic regulator (CT-LQR) policy optimization problem -- where only a weaker version of the PLI is characterized in the literature -- this work shows that while weaker conditions are sufficient for global convergence to, and optimality of the set of critical points of the cost function, the "profile" of the gradient flow solution can change significantly depending on which "flavor" of inequality the cost satisfies. After a general theoretical analysis, we focus on fitting the CT-LQR policy optimization problem to the proposed framework, showing that, in fact, it can never satisfy a PLI in its strongest form. We follow up our analysis with a brief discussion on the difference between continuous- and discrete-time LQR policy optimization, and end the paper with some intuition on the extension of this framework to optimization problems with L1 regularization and solved through proximal gradient flows.
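For reference, the strongest (standard) form of the PLI and the exponential rate it gives for the gradient flow $\dot{x} = -\nabla f(x)$ follow in one line. This is the textbook argument, included for context; it says nothing about the weaker variants the paper studies.

```latex
% Polyak-Lojasiewicz inequality with constant \mu > 0 and f^* = \inf f:
\tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr) \quad \forall x.

% Along the gradient flow \dot{x} = -\nabla f(x):
\frac{d}{dt}\bigl(f(x(t)) - f^{*}\bigr) = \nabla f(x)^{\top}\dot{x} = -\|\nabla f(x(t))\|^{2}
\;\le\; -2\mu\,\bigl(f(x(t)) - f^{*}\bigr),

% so, by Gronwall's inequality,
f(x(t)) - f^{*} \;\le\; e^{-2\mu t}\,\bigl(f(x(0)) - f^{*}\bigr).
```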
- [113] arXiv:2503.23731 (cross-list from cs.CV) [pdf, other]
Title: Investigation of intelligent barbell squat coaching system based on computer vision and machine learning
Authors: Yinq-Rong Chern, Yuhao Lee, Hsiao-Ching Lin, Guan-Ting Chen, Ying-Hsien Chen, Fu-Sung Lin, Chih-Yao Chuang, Jenn-Jier James Lien, Chih-Hsien Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Purpose: Research has shown that strength training can reduce the incidence of chronic diseases and physical deterioration at any age. A movement diagnostic system is therefore valuable for people who train alone. Hence, this study developed an artificial intelligence and computer vision-based barbell squat coaching system with a real-time mode that diagnoses issues and provides feedback immediately after each squat. In addition, a replay mode allows users to review their previous squats and the associated comments. Initially, four primary characteristics of the barbell squat were identified: body joint angles, dorsiflexion, the ratio of knee-to-hip movement, and barbell stability. Methods: We collected 8,151 squats from 77 participants and categorized each as either a good squat or one of six issues. We then trained the diagnosis models with three machine-learning architectures. Furthermore, this research applied the SHapley Additive exPlanations (SHAP) method for feature selection to improve the accuracy of issue prediction and reduce computation time. Results: The F1 scores for the six issues reached 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%. Each squat diagnosis took less than 0.5 seconds. Finally, this study examined the efficacy of the proposed system with two groups of participants trained with and without the system. Participants trained with the system exhibited substantial improvements in their squat technique, as assessed both by the system itself and by a professional weightlifting coach. Conclusion: This comprehensive study integrates artificial intelligence, computer vision, and multivariable processing technologies to build a real-time, user-friendly barbell squat feedback and training system.
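Body joint angles are the first feature listed above. The helper below computes such an angle from 2D keypoints. The abstract does not say which pose estimator or coordinate convention the system uses, so the (x, y) keypoint format here is an assumption.

```python
import math


def joint_angle(a, b, c):
    """Angle ABC in degrees at vertex b, from 2D keypoints given as (x, y).

    For the knee angle, pass (hip, knee, ankle).
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp floating-point rounding
    return math.degrees(math.acos(cos_theta))


hip, knee, ankle = (0.0, 1.0), (0.0, 0.5), (0.0, 0.0)
straight_leg = joint_angle(hip, knee, ankle)  # collinear keypoints give 180 degrees
```

Per-frame angles like this could be tracked over a repetition, for example to find the lowest point of the squat.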
- [114] arXiv:2503.23762 (cross-list from cs.SD) [pdf, html, other]
Title: UniSep: Universal Target Audio Separation with Language Models at Scale
Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Weiqin Li, Dan Luo, Guangzhi Li, Shan Yang, Zhiyong Wu, Helen Meng, Xixin Wu
Comments: Accepted by ICME 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
We propose Universal target audio Separation (UniSep), which addresses separation of arbitrary mixtures of different types of audio. Unlike previous studies, UniSep is not restricted to particular source domains or to a fixed number of sources. We formulate the separation task as a sequence-to-sequence problem and use a large language model (LLM) to model the audio sequence in a discrete latent space, leveraging the ability of LLMs to handle complex audio mixtures given large-scale data. Moreover, we propose a novel pre-training strategy that uses audio-only data, which reduces the effort of large-scale data simulation and enhances the ability of LLMs to understand the consistency and correlation of information within audio sequences. We also demonstrate the effectiveness of scaling datasets in an audio separation task: we use large-scale data (36.5k hours), including speech, music, and sound, to train a universal target audio separation model that is not limited to a specific domain. Experiments show that UniSep achieves competitive subjective and objective evaluation results compared with single-task models.
- [115] arXiv:2503.23795 (cross-list from cs.RO) [pdf, other]
Title: Trajectory Planning for Automated Driving using Target Funnels
Comments: accepted to European Control Conference 2025 (ECC25)
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Self-driving vehicles rely on sensory input to monitor their surroundings and continuously adapt to the most likely future road course. Predictive trajectory planning is based on snapshots of the (uncertain) road course as a key input. With noisy perception data, estimates of the road course can vary significantly, leading to indecisive and erratic steering behavior. To overcome this issue, this paper introduces a predictive trajectory planning algorithm with a novel objective function: instead of targeting a single reference trajectory based on the most likely road course, the planner tracks a series of target reference sets, called a target funnel. The proposed planning algorithm integrates probabilistic information about the road course, and thus implicitly considers regular updates to road perception. Our solution is assessed in a case study using real driving data collected from a prototype vehicle. The results demonstrate that the algorithm maintains tracking accuracy and substantially reduces undesirable steering commands in the presence of noisy road perception, achieving a 56% reduction in input costs compared to a certainty equivalent formulation.
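The difference between tracking a single reference and tracking a set can be illustrated with a toy cost. The abstract does not give the paper's objective. This sketch simply assumes a squared distance from the predicted lateral position to an interval [lo, hi] per step, plus an input penalty. The weights and names are hypothetical.

```python
def distance_to_interval(y, lo, hi):
    """Distance from a scalar lateral position y to the target set [lo, hi]."""
    if y < lo:
        return lo - y
    if y > hi:
        return y - hi
    return 0.0


def funnel_cost(predicted, funnel, inputs, input_weight=0.1):
    """Hypothetical horizon cost: zero tracking penalty anywhere inside the funnel.

    predicted: predicted lateral positions, one per step
    funnel:    list of (lo, hi) target sets, one per step
    inputs:    steering inputs over the horizon
    """
    tracking = sum(distance_to_interval(y, lo, hi) ** 2
                   for y, (lo, hi) in zip(predicted, funnel))
    effort = input_weight * sum(u * u for u in inputs)
    return tracking + effort


def point_reference_cost(predicted, reference, inputs, input_weight=0.1):
    """Classical cost: each target set collapses to a single reference point."""
    return funnel_cost(predicted, [(r, r) for r in reference], inputs, input_weight)
```

A set-valued target leaves the optimizer free to keep steering inputs small while the predicted path stays inside the funnel. A point reference penalizes every deviation, so noise in the estimated road course turns directly into steering effort.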
- [116] arXiv:2503.23832 (cross-list from cs.LG) [pdf, html, other]
Title: An extrapolated and provably convergent algorithm for nonlinear matrix decomposition with the ReLU function
Comments: 27 pages. Codes and data available from this https URL
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optimization and Control (math.OC); Machine Learning (stat.ML)
Nonlinear matrix decomposition (NMD) with the ReLU function, denoted ReLU-NMD, is the following problem: given a sparse, nonnegative matrix $X$ and a factorization rank $r$, identify a rank-$r$ matrix $\Theta$ such that $X\approx \max(0,\Theta)$. This decomposition finds application in data compression, matrix completion with entries missing not at random, and manifold learning. The standard ReLU-NMD model minimizes the least squares error, that is, $\|X - \max(0,\Theta)\|_F^2$. The corresponding optimization problem is nondifferentiable and highly nonconvex. This motivated Saul to propose an alternative model, Latent-ReLU-NMD, where a latent variable $Z$ is introduced and satisfies $\max(0,Z)=X$ while minimizing $\|Z - \Theta\|_F^2$ (``A nonlinear matrix decomposition for mining the zeros of sparse data'', SIAM J. Math. Data Sci., 2022). Our first contribution is to show that the two formulations may yield different low-rank solutions $\Theta$; in particular, we show that Latent-ReLU-NMD can be ill-posed when ReLU-NMD is not, meaning that there are instances in which the infimum of Latent-ReLU-NMD is not attained while that of ReLU-NMD is. We also consider another alternative model, called 3B-ReLU-NMD, which parameterizes $\Theta=WH$, where $W$ has $r$ columns and $H$ has $r$ rows, allowing one to get rid of the rank constraint in Latent-ReLU-NMD. Our second contribution is to prove the convergence of a block coordinate descent (BCD) applied to 3B-ReLU-NMD and referred to as BCD-NMD. Our third contribution is a novel extrapolated variant of BCD-NMD, dubbed eBCD-NMD, which we prove is also convergent under mild assumptions. We illustrate the significant acceleration effect of eBCD-NMD compared to BCD-NMD, and also show that eBCD-NMD performs well against the state of the art on synthetic and real-world data sets.
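Below is a minimal sketch of block coordinate descent on the latent formulation with Θ = WH, restricted to rank r = 1 so that it fits in plain Python. It follows only the model stated above, min ||Z - WH||_F^2 subject to max(0, Z) = X. It is not the authors' released code, and it omits the extrapolation step of eBCD-NMD. The Z-update has a closed form: Z = X where X > 0, and Z = min(0, Θ) elsewhere.

```python
def update_latent(X, Theta):
    """Closed-form Z-update: keep X where X > 0; elsewhere Z = min(0, Theta)."""
    return [[x if x > 0 else min(0.0, t) for x, t in zip(xrow, trow)]
            for xrow, trow in zip(X, Theta)]


def outer(w, h):
    return [[wi * hj for hj in h] for wi in w]


def frob_sq(A, B):
    return sum((a - b) ** 2 for arow, brow in zip(A, B) for a, b in zip(arow, brow))


def bcd_nmd_rank1(X, w, h, iters=50):
    """BCD on ||Z - w h^T||_F^2 subject to max(0, Z) = X, with rank r = 1."""
    history = []
    for _ in range(iters):
        Z = update_latent(X, outer(w, h))
        # Exact least-squares update of w with h fixed, then of h with w fixed
        # (the small floor only guards against a degenerate all-zero factor).
        hh = max(sum(v * v for v in h), 1e-12)
        w = [sum(z * v for z, v in zip(zrow, h)) / hh for zrow in Z]
        ww = max(sum(v * v for v in w), 1e-12)
        h = [sum(w[i] * Z[i][j] for i in range(len(w))) / ww for j in range(len(h))]
        history.append(frob_sq(Z, outer(w, h)))
    return Z, w, h, history


X = [[1.0, 2.0, 0.0], [0.0, 0.0, 1.0], [2.0, 4.0, 0.0]]
Z, w, h, history = bcd_nmd_rank1(X, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

Each block update exactly minimizes the objective over its own block, so the recorded objective values are nonincreasing. This monotonicity is the starting point for convergence arguments about BCD.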
- [117] arXiv:2503.23890 (cross-list from cs.RO) [pdf, html, other]
Title: Less is More: Contextual Sampling for Nonlinear Data-Enabled Predictive Control
Comments: Submitted to IROS 2025 on March 1st
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Data-enabled Predictive Control (DeePC) is a powerful data-driven approach for predictive control that does not require an explicit system model. However, its high computational cost limits its applicability to real-time robotic systems. For robotic applications such as motion planning and trajectory tracking, real-time control is crucial. To ensure predictive accuracy, nonlinear DeePC relies either on large datasets or on learning the nonlinearities, both of which lead to high computational complexity. This work introduces contextual sampling, a novel data selection strategy that handles nonlinearities in DeePC by dynamically selecting the most relevant data at each time step. By reducing the dataset size while preserving prediction accuracy, our method improves the computational efficiency of DeePC for real-time robotic applications. We validate our approach on autonomous vehicle motion planning. For a dataset size of 100 sub-trajectories, contextual sampling DeePC reduces tracking error by 53.2% compared to Leverage Score sampling. Additionally, contextual sampling reduces the maximum computation time by 87.2% compared to using the full dataset of 491 sub-trajectories, while achieving comparable tracking performance. These results highlight the potential of contextual sampling to enable real-time, data-driven control for robotic systems.
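The core idea of selecting the most relevant data at each time step can be sketched as a nearest-neighbour lookup. The abstract does not state the relevance criterion. This version assumes each sub-trajectory is tagged with a context vector (for example, the state at its start) and ranks sub-trajectories by Euclidean distance to the current context. The selected ones become the columns of a reduced data matrix.

```python
import math


def contextual_sample(dataset, current_context, k):
    """Pick the k sub-trajectories whose context is closest to the current one.

    dataset: list of (context_vector, sub_trajectory) pairs (hypothetical format)
    Returns the selected sub-trajectories, nearest first.
    """
    scored = sorted(dataset, key=lambda item: math.dist(item[0], current_context))
    return [traj for _, traj in scored[:k]]


def data_matrix(columns):
    """Stack equal-length sub-trajectories as the columns of a data matrix."""
    return [list(row) for row in zip(*columns)]


dataset = [((0.0, 0.0), [0.0, 0.1]), ((1.0, 1.0), [1.0, 1.1]), ((5.0, 5.0), [5.0, 5.1])]
selected = contextual_sample(dataset, (0.9, 0.9), 2)
H = data_matrix(selected)
```

Because the DeePC optimization scales with the number of columns, re-selecting a small, locally relevant subset at each step is what trades dataset size against computation time.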
- [118] arXiv:2503.23893 (cross-list from cs.LG) [pdf, other]
Title: DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models
Comments: 28 pages, 18 figures, preprint under review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Renewable energy resources depend strongly on local and large-scale weather conditions. Skillful subseasonal to seasonal (S2S) forecasts, covering horizons from beyond two weeks up to two months, can offer significant socioeconomic advantages to the energy sector. This study aims to enhance wind speed predictions using a diffusion model with classifier-free guidance to downscale S2S forecasts of surface wind speed. We propose DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times. Leveraging weather priors as guidance for the generative process of diffusion models, we adopt the perspective of conditional probabilities on sampling super-resolved S2S forecasts. We aim to directly estimate the density associated with the target S2S forecasts at different spatial resolutions and lead times without auto-regression or sequence prediction, resulting in an efficient and flexible model. Synthetic experiments were designed to super-resolve wind speed S2S forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) from a coarse resolution to the finer resolution of ERA5 reanalysis data, which serves as a high-resolution target. The innovative aspect of DiffScale lies in its flexibility to downscale by arbitrary scaling factors, enabling it to generalize across various grid resolutions and lead times without retraining the model while correcting model errors, which makes it a versatile tool for improving S2S wind speed forecasts. We achieve a significant improvement in prediction quality, outperforming baselines up to week 3.
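Classifier-free guidance combines two noise predictions from the same network, one with and one without the conditioning input. The usual form is ε̃ = (1 + w) ε(z, c) - w ε(z). The abstract does not give DiffScale's exact parameterization. In the sketch below, the conditioning is assumed to be the coarse S2S forecast, and plain lists stand in for tensors.

```python
def cfg_noise_estimate(eps_cond, eps_uncond, guidance_weight):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond.

    eps_cond / eps_uncond: noise predictions with and without conditioning
    (hypothetically, the coarse forecast); lists of floats of equal length.
    guidance_weight w = 0 recovers the purely conditional prediction.
    """
    w = guidance_weight
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]
```

Larger w pushes samples further toward the conditioning signal at the cost of diversity. The guided estimate replaces the plain conditional prediction at every denoising step.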
- [119] arXiv:2503.23922 (cross-list from math.OC) [pdf, html, other]
Title: Distributionally Robust Model Order Reduction for Linear Systems
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
In this paper, we investigate distributionally robust model order reduction for linear, discrete-time, time-invariant systems. The external input is assumed to follow an uncertain distribution within a Wasserstein ambiguity set. We begin by considering the case where the distribution is certain and formulate an optimization problem to obtain the reduced model. When the distribution is uncertain, the interaction between the reduced-order model and the distribution is modeled by a Stackelberg game. To ensure solvability, we first introduce the Gelbrich distance and demonstrate that the Stackelberg game within a Wasserstein ambiguity set is equivalent to that within a Gelbrich ambiguity set. Then, we propose a nested optimization problem to solve the Stackelberg game. Furthermore, the nested optimization problem is relaxed into a nested convex optimization problem, ensuring computational feasibility. Finally, a simulation is presented to illustrate the effectiveness of the proposed method.
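For reference, the Gelbrich distance between two distributions depends only on their means m_1, m_2 and covariances Σ_1, Σ_2. It lower-bounds the 2-Wasserstein distance between any two distributions with these moments, with equality when both are Gaussian. That property is what makes it a tractable surrogate for a Wasserstein ambiguity set:

```latex
\mathrm{G}\big((m_1,\Sigma_1),(m_2,\Sigma_2)\big)
  = \sqrt{\|m_1 - m_2\|_2^2
    + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
      - 2\big(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\big)^{1/2}\Big)},
\qquad
W_2(\mathbb{P}_1,\mathbb{P}_2) \ \ge\ \mathrm{G}\big((m_1,\Sigma_1),(m_2,\Sigma_2)\big).
```

In the scalar case with standard deviations σ_1 and σ_2, this reduces to sqrt((m_1 - m_2)^2 + (σ_1 - σ_2)^2).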
- [120] arXiv:2503.24086 (cross-list from math.OC) [pdf, html, other]
Title: Distributed AC Optimal Power Flow: A Scalable Solution for Large-Scale Problems
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper introduces a novel distributed optimization framework for large-scale AC Optimal Power Flow (OPF) problems, offering both theoretical convergence guarantees and rapid convergence in practice. By integrating smoothing techniques and the Schur complement, the proposed approach addresses the scalability challenges and reduces communication overhead in distributed AC OPF. Additionally, optimal network decomposition enables efficient parallel processing under the single program multiple data (SPMD) paradigm. Extensive simulations on large-scale benchmarks across various operating scenarios indicate that the proposed framework outperforms the state-of-the-art centralized solver IPOPT on modest hardware. This paves the way for more scalable and efficient distributed optimization in future power system applications.
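The abstract does not detail how the Schur complement is used. The generic pattern behind it in distributed solvers is to eliminate each region's local variables so that only a small system in the coupling variables remains. The toy below applies that idea to an "arrow" linear system with scalar local blocks a_i, coupling column b and a single coupling variable y. All names and the system itself are illustrative assumptions, not the paper's formulation.

```python
def solve_arrow_system(a, b, f, c, g):
    """Solve [[diag(a), b], [b^T, c]] [x; y] = [f; g] by eliminating the local x_i.

    Each region i owns (a_i, b_i, f_i) and computes its two Schur contributions
    independently; only these scalars have to be communicated.
    """
    # Local, parallelisable contributions: b_i^2 / a_i and b_i * f_i / a_i
    contrib_s = [bi * bi / ai for ai, bi in zip(a, b)]
    contrib_r = [bi * fi / ai for ai, bi, fi in zip(a, b, f)]
    s = c - sum(contrib_s)          # Schur complement of diag(a)
    y = (g - sum(contrib_r)) / s    # coupling variable
    x = [(fi - bi * y) / ai for ai, bi, fi in zip(a, b, f)]  # local back-substitution
    return x, y


a, b, f, c, g = [2.0, 4.0], [1.0, 1.0], [3.0, 5.0], 5.0, 7.0
x, y = solve_arrow_system(a, b, f, c, g)
```

Communication is reduced to one aggregated scalar pair per region, and the back-substitution runs in parallel. This is the same structure that a single program, multiple data (SPMD) implementation can exploit.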
- [121] arXiv:2503.24151 (cross-list from math.OC) [pdf, html, other]
Title: Robust Feedback Optimization with Model Uncertainty: A Regularization Approach
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Feedback optimization optimizes the steady state of a dynamical system by implementing optimization iterations in closed loop with the plant. It relies on online measurements and limited model information, namely, the input-output sensitivity. In practice, various issues including inaccurate modeling, lack of observation, or changing conditions can lead to sensitivity mismatches, causing closed-loop sub-optimality or even instability. To handle such uncertainties, we pursue robust feedback optimization, where we optimize the closed-loop performance against all possible sensitivities lying in specific uncertainty sets. We provide tractable reformulations for the corresponding min-max problems via regularizations and characterize the online closed-loop performance through the tracking error in case of time-varying optimal solutions. Simulations on a distribution grid illustrate the effectiveness of our robust feedback optimization controller in addressing sensitivity mismatches in a non-stationary environment.
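The sketch below shows why sensitivity mismatch matters. It assumes a hypothetical static scalar plant y = h_true·u + d and cost Φ = u²/2 + (y - r)²/2. The controller measures y online but builds its gradient with a model sensitivity h_model. The robust, regularized reformulation from the paper is not reproduced here.

```python
def feedback_optimization(h_true, h_model, d=1.0, r=0.0, eta=0.1, steps=200):
    """Gradient-based feedback optimization on the static plant y = h_true*u + d.

    Cost: 0.5*u**2 + 0.5*(y - r)**2. The controller measures y but uses the
    (possibly wrong) sensitivity h_model in its gradient.
    """
    u = 0.0
    for _ in range(steps):
        y = h_true * u + d             # online measurement from the plant
        grad = u + h_model * (y - r)   # model-based steady-state gradient
        u -= eta * grad
    return u


def true_optimum(h_true, d=1.0, r=0.0):
    return -h_true * (d - r) / (1.0 + h_true ** 2)
```

With h_model = h_true, the loop converges to the true optimum. A moderate mismatch converges to the fixed point u = -h_model(d - r)/(1 + h_model·h_true), which is sub-optimal. A mismatch with h_model·h_true < -1 makes the closed loop unstable. These are the failure modes the abstract describes.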
- [122] arXiv:2503.24159 (cross-list from math.OC) [pdf, html, other]
Title: A system level approach to generalised feedback Nash equilibrium seeking in partially-observed games
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This work proposes an algorithm for seeking generalised feedback Nash equilibria (GFNE) in noncooperative dynamic games. The focus is on cyber-physical systems with dynamics which are linear, stochastic, potentially unstable, and partially observed. We employ System Level Synthesis (SLS) to reformulate the problem as the search for an equilibrium profile of closed-loop responses to noise, which can then be used to reconstruct a stabilising output-feedback policy. Under this setup, we leverage monotone operator theory to design a GFNE-seeking algorithm capable of enforcing closed-loop stability, operational constraints, and communication constraints on the control policies. This algorithm is amenable to numerical implementation, and we provide conditions for its convergence. We demonstrate our approach in a simulated experiment on the noncooperative stabilisation of a decentralised power grid.
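The monotone-operator viewpoint can be illustrated on a much simpler static problem than the SLS setting. The sketch below uses a hypothetical two-player quadratic game whose pseudo-gradient is strongly monotone. It applies a projected pseudo-gradient iteration, where each player takes a gradient step on its own cost and projects onto its own box. With only local constraints this finds a standard Nash equilibrium. A generalised equilibrium with shared constraints would also need dual variables, which are omitted here.

```python
def pseudo_gradient(u1, u2):
    """Stacked partial gradients of the players' own costs.

    J1(u1, u2) = (u1 - 1)**2 + 0.5*u1*u2   (player 1)
    J2(u1, u2) = (u2 + 1)**2 + 0.5*u1*u2   (player 2)
    """
    return 2.0 * (u1 - 1.0) + 0.5 * u2, 2.0 * (u2 + 1.0) + 0.5 * u1


def clip(v, lo, hi):
    return min(max(v, lo), hi)


def nash_seek(step=0.2, box=(-2.0, 2.0), iters=300):
    u1 = u2 = 0.0
    for _ in range(iters):
        g1, g2 = pseudo_gradient(u1, u2)
        # Projected pseudo-gradient (forward-backward) step: each player uses
        # only its own partial gradient and its own local constraint set.
        u1, u2 = clip(u1 - step * g1, *box), clip(u2 - step * g2, *box)
    return u1, u2


equilibrium = nash_seek()
```

The Jacobian of the pseudo-gradient, [[2, 0.5], [0.5, 2]], is positive definite. This strong monotonicity makes the projected iteration a contraction for small steps. Here it converges to (4/3, -4/3), which lies inside the box.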
- [123] arXiv:2503.24313 (cross-list from physics.optics) [pdf, other]
Title: 1-Tb/s/λ Transmission over Record 10714-km AR-HCF
Authors: Dawei Ge, Siyuan Liu, Qiang Qiu, Peng Li, Qiang Guo, Yiqi Li, Dong Wang, Baoluo Yan, Mingqing Zuo, Lei Zhang, Dechao Zhang, Hu Shi, Jie Luo, Han Li, Zhangyuan Chen
Subjects: Optics (physics.optics); Signal Processing (eess.SP)
We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km.
Cross submissions (showing 40 of 40 entries)
- [124] arXiv:1909.02070 (replaced) [pdf, html, other]
Title: Correct by construction requirement decomposition
Subjects: Systems and Control (eess.SY)
In systems engineering, accurately decomposing requirements is crucial for creating well-defined and manageable system components, particularly in safety-critical domains. Despite the critical need, rigorous, top-down methodologies for effectively breaking down complex requirements into precise, actionable sub-requirements are scarce, especially compared to the wealth of bottom-up verification techniques. Addressing this gap, we introduce a formal decomposition for contract-based design that guarantees the correctness of decomposed requirements if specific conditions are met. Our (semi-)automated methodology augments contract-based design with reachability analysis and constraint programming to systematically identify, verify, and validate sub-requirements representable by continuous bounded sets -- continuous relations between real-valued inputs and outputs. We demonstrate the efficacy and practicality of a correct-by-construction approach through a comprehensive case study on a cruise control system, highlighting how our methodology improves the interpretability, tractability, and verifiability of system requirements.
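A toy version of the "correct if specific conditions are met" check uses interval arithmetic as a stand-in for reachability analysis. It asks whether the set of outputs reachable under candidate sub-requirements lies inside the top-level requirement. The relation y = 2·x1 + x2 and all bounds below are hypothetical. The paper's contract machinery and constraint programming are not reproduced.

```python
def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def interval_scale(k, a):
    lo, hi = k * a[0], k * a[1]
    return (min(lo, hi), max(lo, hi))


def refines(reachable, requirement):
    """True if every reachable output value satisfies the top-level requirement."""
    return requirement[0] <= reachable[0] and reachable[1] <= requirement[1]


# Hypothetical top-level requirement on y = 2*x1 + x2
requirement = (0.0, 10.0)
ok = refines(interval_add(interval_scale(2.0, (0.0, 3.0)), (0.0, 4.0)), requirement)
too_loose = refines(interval_add(interval_scale(2.0, (0.0, 3.0)), (0.0, 5.0)), requirement)
```

Sub-requirements x1 ∈ [0, 3] and x2 ∈ [0, 4] are guaranteed to satisfy y ∈ [0, 10]. Widening x2 to [0, 5] breaks the guarantee, which the check reports before any component is built.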
- [125] arXiv:2308.15144 (replaced) [pdf, other]
Title: TKwinFormer: Top k Window Attention in Vision Transformers for Feature Matching
Comments: After careful reconsideration, we have decided to withdraw the manuscript due to data inconsistencies and issues with methodology. Given these concerns, we believe it would be inappropriate to proceed with the revised version, and we have therefore decided to retract our submission
Subjects: Image and Video Processing (eess.IV)
Local feature matching remains a challenging task, primarily due to difficulties in matching sparse keypoints and low-texture regions. The key to solving this problem lies in effectively and accurately integrating global and local information. To achieve this goal, we introduce an innovative local feature matching method called TKwinFormer. Our approach employs a multi-stage matching strategy to optimize the efficiency of information interaction. Furthermore, we propose a novel attention mechanism called Top K Window Attention, which facilitates global information interaction through window tokens prior to patch-level matching, resulting in improved matching accuracy. Additionally, we design an attention block to enhance attention between channels. Experimental results demonstrate that TKwinFormer outperforms state-of-the-art methods on various benchmarks. Code is available at: this https URL.
- [126] arXiv:2310.10414 (replaced) [pdf, html, other]
Title: Style transfer between Microscopy and Magnetic Resonance Imaging via Generative Adversarial Network in small sample size settings
Comments: 2023 IEEE International Conference on Image Processing (ICIP)
Journal-ref: 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 2023, pp. 1120-1124
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Cross-modal augmentation of Magnetic Resonance Imaging (MRI) and microscopic imaging based on the same tissue samples is promising because it could allow histopathological analysis without an invasive biopsy procedure. Here, we tested a method for generating microscopic histological images from MRI scans of the corpus callosum using a conditional generative adversarial network (cGAN) architecture. To our knowledge, this is the first multimodal translation of the brain MRI to histological volumetric representation of the same sample. The technique was assessed by training paired image translation models taking sets of images from MRI scans and microscopy. The use of a cGAN for this purpose is challenging because microscopy images are large and typically have low sample availability. The current work demonstrates that the framework reliably synthesizes histology images from MRI scans of the corpus callosum, emphasizing the network's ability to train on high-resolution histologies paired with relatively lower-resolution MRI scans. With the ultimate goal of avoiding biopsies, the proposed tool can be used for educational purposes.
- [127] arXiv:2312.12903 (replaced) [pdf, html, other]
Title: A Minimal Control Family of Dynamical Systems for Universal Approximation
Comments: 12 pages
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
The universal approximation property (UAP) holds a fundamental position in deep learning, as it provides a theoretical foundation for the expressive power of neural networks. It is widely recognized that a composition of linear and nonlinear functions, such as the rectified linear unit (ReLU) activation function, can approximate continuous functions on compact domains. In this paper, we extend this result to dynamical systems with controls. We prove that the control family $\mathcal{F}_1$ containing all affine maps and the nonlinear ReLU map is sufficient for generating flow maps that can approximate orientation-preserving (OP) diffeomorphisms on any compact domain. Since $\mathcal{F}_1$ contains only one nonlinear function and the UAP does not hold if we remove the nonlinear function, we call $\mathcal{F}_1$ a minimal control family for the UAP. On this basis, several mild sufficient conditions, such as affine invariance, are established for the control family and discussed. Our results reveal an underlying connection between the approximation power of neural networks and control systems and could provide theoretical guidance for examining the approximation power of flow-based models.
- [128] arXiv:2403.09184 (replaced) [pdf, html, other]
Title: Learning Algorithms for Verification of Markov Decision Processes
Authors: Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtěch Forejt, Jan Křetínský, Marta Kwiatkowska, Tobias Meggendorfer, David Parker, Mateusz Ujma
Comments: 82 pages. This is the TheoretiCS journal version
Journal-ref: TheoretiCS, Volume 4 (2025), Article 10, 1-82
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
We present a general framework for applying learning algorithms and heuristic guidance to the verification of Markov decision processes (MDPs). The primary goal of our techniques is to improve performance by avoiding an exhaustive exploration of the state space, instead focussing on particularly relevant areas of the system, guided by heuristics. Our work builds on the previous results of Brázdil et al., significantly extending them as well as refining several details and fixing errors.
The presented framework focuses on probabilistic reachability, which is a core problem in verification, and is instantiated in two distinct scenarios. The first assumes that full knowledge of the MDP is available, in particular precise transition probabilities. It performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP without knowing the exact transition dynamics. Here, we obtain probabilistic guarantees, again in terms of both the lower and upper bounds, which provide efficient stopping criteria for the approximation. In particular, the latter is an extension of statistical model-checking (SMC) for unbounded properties in MDPs. In contrast to other related approaches, we do not restrict our attention to time-bounded (finite-horizon) or discounted properties, nor assume any particular structural properties of the MDP.
- [129] arXiv:2403.15674 (replaced) [pdf, html, other]
Title: Safe and Stable Formation Control with Autonomous Multi-Agents Using Adaptive Control
Comments: Under Review - Modeling, Estimation and Control Conference 2025
Subjects: Systems and Control (eess.SY)
This manuscript considers the problem of ensuring stability and safety during formation control with distributed multi-agent systems in the presence of parametric uncertainty in the dynamics and limited communication. We propose an integrative approach that combines Adaptive Control, Control Barrier Functions (CBFs), and connected graphs. The main elements employed in the integrative approach are an adaptive control design that ensures stability, a CBF-based safety filter that generates safe commands based on the dynamics of a reference model, and a reference model that ensures formation control with multi-agent systems when no uncertainties are present. The overall control design is shown to lead to a closed-loop adaptive system that is stable, avoids unsafe regions, and converges to a desired formation of the multi-agents. Numerical examples are provided to support the theoretical derivations.
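Here is a minimal sketch of a CBF safety filter for one agent with assumed single-integrator dynamics dx/dt = u and safe set h(x) = x_max - x ≥ 0. The adaptive law, reference model and formation coupling from the paper are not included. The CBF condition dh/dt ≥ -α·h(x) becomes u ≤ α(x_max - x). With only this one constraint, the usual "closest safe input" quadratic program has a closed-form solution: clip the reference input.

```python
def cbf_safety_filter(x, u_ref, x_max, alpha=1.0):
    """Minimal CBF filter for dx/dt = u with safe set h(x) = x_max - x >= 0.

    Returns the input closest to u_ref that satisfies u <= alpha*(x_max - x),
    i.e. the closed-form solution of the single-constraint safety QP.
    """
    return min(u_ref, alpha * (x_max - x))


def simulate(x0, u_ref, x_max, dt=0.01, steps=1000):
    """Forward-Euler simulation of the filtered closed loop."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_safety_filter(x, u_ref, x_max)
    return x


x_final = simulate(0.0, 5.0, 1.0)
```

An aggressive reference input (u_ref = 5) is passed through unchanged only while it is safe. Near the boundary, the filter slows the agent so that x approaches x_max without crossing it.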
- [130] arXiv:2404.05863 (replaced) [pdf, html, other]
Title: Optimal robust exact first-order differentiators with Lipschitz continuous output
Journal-ref: "Optimal robust exact first-order differentiators with Lipschitz-continuous output," in IEEE Transactions on Automatic Control
Subjects: Systems and Control (eess.SY)
The signal differentiation problem involves the development of algorithms that recover a signal's derivatives from noisy measurements. This paper develops a first-order differentiator with the following combination of properties: robustness to measurement noise, exactness in the absence of noise, optimal worst-case differentiation error, and Lipschitz continuous output where the output's Lipschitz constant is a tunable parameter. No existing differentiator offers this combination of properties. Both continuous-time and sample-based versions of the differentiator are developed and theoretical guarantees are established for both. The continuous-time version of the differentiator consists of a regularized and sliding-mode-filtered linear adaptive differentiator. The sample-based, implementable version is then obtained through appropriate discretization. An illustrative example is provided to highlight the features of the developed differentiator.
- [131] arXiv:2404.14729 (replaced) [pdf, html, other]
Title: Emergent Cooperation for Energy-efficient Connectivity via Wireless Power Transfer
Subjects: Systems and Control (eess.SY)
This paper addresses the challenge of incentivizing energy-constrained, non-cooperative user equipment (UE) to serve as cooperative relays. We consider a source UE with a non-line-of-sight channel to an access point (AP), where direct communication may be infeasible or may necessitate a substantial transmit power. Other UEs in the vicinity are viewed as relay candidates, and our aim is to enable energy-efficient connectivity for the source, while accounting for the self-interested behavior and private channel state information of these candidates, by allowing the source to "pay" the candidates via wireless power transfer (WPT). We propose a cooperation-inducing protocol, inspired by Myerson auction theory, which ensures that candidates truthfully report power requirements while minimizing the expected power used by the source. Through rigorous analysis, we establish the regularity of valuations for lognormal fading channels, which allows for the efficient determination of the optimal source transmit power. Extensive simulation experiments, employing real-world communication and WPT parameters, validate our theoretical framework. Our results demonstrate over 71% reduction in outage probability with as few as 4 relay candidates, compared to the non-cooperative scenario, and as much as 70% source power savings compared to a baseline approach, highlighting the efficacy of our proposed methodology.
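In Myerson's framework for a procurement (reverse) auction, a bidder with private cost c drawn from CDF F and density f has virtual cost ψ(c) = c + F(c)/f(c). The distribution is called regular when ψ is nondecreasing, which is what lets the optimal mechanism simply favour the lowest virtual cost. The sketch below assumes that a candidate's private power requirement is lognormal with hypothetical parameters μ = 0 and σ = 1. It then checks numerically that ψ is increasing on a grid. The paper's actual channel and WPT model is not reproduced.

```python
import math


def lognormal_pdf(c, mu, sigma):
    return math.exp(-((math.log(c) - mu) ** 2) / (2 * sigma ** 2)) / (c * sigma * math.sqrt(2 * math.pi))


def lognormal_cdf(c, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(c) - mu) / (sigma * math.sqrt(2.0))))


def virtual_cost(c, mu, sigma):
    """Myerson virtual cost for a reverse (procurement) auction: c + F(c) / f(c)."""
    return c + lognormal_cdf(c, mu, sigma) / lognormal_pdf(c, mu, sigma)


grid = [0.05 * k for k in range(1, 401)]  # costs from 0.05 to 20.0
vals = [virtual_cost(c, 0.0, 1.0) for c in grid]
```

For the lognormal, F(c)/f(c) equals cσ·Φ(z)/φ(z) with z = (ln c - μ)/σ, a product of positive increasing functions. So ψ is increasing, and the grid check agrees.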
- [132] arXiv:2404.14738 (replaced) [pdf, other]
Title: Uncrewed Vehicles in 6G Networks: A Unifying Treatment of Problems, Formulations, and Tools
Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
Uncrewed Vehicles (UVs) functioning as autonomous agents are anticipated to play a crucial role in the 6th Generation of wireless networks. Their seamless integration, cost-effectiveness, and the additional controllability through motion planning make them an attractive deployment option for a wide range of applications, both as assets in the network (e.g., mobile base stations) and as consumers of network services (e.g., autonomous delivery systems). However, despite their potential, the convergence of UVs and wireless systems brings forth numerous challenges that require attention from both academia and industry. This paper aims to offer a comprehensive overview encompassing the transformative possibilities as well as the significant challenges associated with UV-assisted next-generation wireless communications. Considering the diverse landscape of possible application scenarios, problem formulations, and mathematical tools related to UV-assisted wireless systems, the underlying core theme of this paper is the unification of the problem space, providing a structured framework to understand the use cases, problem formulations, and necessary mathematical tools. Overall, the paper sets forth a clear understanding of how uncrewed vehicles can be integrated in the 6G ecosystem, paving the way towards harnessing the full potential at this intersection.
- [133] arXiv:2404.15321 (replaced) [pdf, html, other]
Title: Characteristics-Based Design of Generalized-Exponent Bandpass Filters
Comments: 16 pages, 7 figures, 2 tables, 26 equations
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We develop characteristics-based filter design methods for a class of IIR bandpass filters, which we refer to as Generalized-Exponent Filters (GEFs) and that are represented as second-order filters raised to non-unitary exponents. GEFs have a peak, are effectively linear phase, and are useful for seismic signal phase-picking, cochlear implants, and equalizers. The native frequency-domain specifications for GEFs are not on given frequency responses but rather on filter characteristics such as peak frequency, bandwidth, and group delay. Our characteristics-based method for filter design accommodates direct specification of a trio of frequency-domain characteristics from amongst the peak frequency, convexity, ndB quality factors, equivalent rectangular bandwidth, maximum group delay, and phase accumulation. We achieve this by deriving filter parameterizations in terms of sets of filter characteristics, which involves deriving closed-form analytic expressions that map sets of filter characteristics to the original filter constants using sharp-filter approximations. This results in parameterizations for GEFs including ones with simultaneous specification of magnitude-based and phase-based characteristics (e.g., bandwidths and group delays). This in turn enables the design of sharply tuned filters without significant group delay, as well as simultaneous control over frequency selectivity and synchronization, which is important in designing filterbanks. Our filter design methods with direct control over characteristics may also be utilized beyond static filter design for higher-order variable bandpass filter design and may be useful for characteristics-based adaptive filtering. Our methods are inherently stable, highly accurate in meeting strict specifications on desired characteristics, simple, and computationally efficient. The methods extend to the design of related bandpass and multiband filters.
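The effect of raising a second-order filter to an exponent can be checked numerically. This sketch assumes the illustrative form H(s) = [1/((s - p)(s - p̄))]^B with pole p = -a + jb. That form is not necessarily the paper's normalized parameterization. Since |H(jω)| = |base|^B for real B > 0, the peak frequency stays at sqrt(b² - a²) while the 3 dB bandwidth shrinks as B grows.

```python
import math


def gef_magnitude(omega, a, b, B):
    """|H(j*omega)| for H(s) = [1 / ((s - p)(s - conj(p)))]**B, with p = -a + j*b."""
    s = complex(0.0, omega)
    p = complex(-a, b)
    base = 1.0 / ((s - p) * (s - p.conjugate()))
    return abs(base) ** B


def peak_and_bandwidth(a, b, B, omegas):
    """Grid estimate of the peak frequency and the 3 dB (half-power) bandwidth."""
    mags = [gef_magnitude(w, a, b, B) for w in omegas]
    k = max(range(len(mags)), key=mags.__getitem__)
    half_power = mags[k] / math.sqrt(2.0)
    band = [w for w, m in zip(omegas, mags) if m >= half_power]
    return omegas[k], max(band) - min(band)


omegas = [0.5 + 1e-4 * i for i in range(10001)]
pk1, bw1 = peak_and_bandwidth(0.05, 1.0, 1.0, omegas)
pk2, bw2 = peak_and_bandwidth(0.05, 1.0, 2.0, omegas)
```

This decoupling is why the exponent B serves as an extra design knob: it sharpens selectivity without moving the peak. Characteristics-based design then chooses the filter constants from the desired peak, bandwidth and delay.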
- [134] arXiv:2406.13794 (replaced) [pdf, html, other]
Title: Adaptive Curves for Optimally Efficient Market Making
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Trading and Market Microstructure (q-fin.TR)
Automated Market Makers (AMMs) are essential in Decentralized Finance (DeFi) as they match liquidity supply with demand. They function through liquidity providers (LPs) who deposit assets into liquidity pools. However, the asset trading prices in these pools often trail behind those in more dynamic, centralized exchanges, leading to potential arbitrage losses for LPs. This issue is tackled by adapting market maker bonding curves to trader behavior, based on the classical market microstructure model of Glosten and Milgrom. Our approach ensures a zero-profit condition for the market maker's prices. We derive the differential equation that an optimal adaptive curve should follow to minimize arbitrage losses while remaining competitive. Solutions to this optimality equation are obtained for standard Gaussian and Lognormal price models using Kalman filtering. A key feature of our method is its ability to estimate the external market price without relying on price or loss oracles. We also provide an equivalent differential equation for the implied dynamics of canonical static bonding curves and establish conditions for their optimality. Our algorithms demonstrate robustness to changing market conditions and adversarial perturbations, and we offer an on-chain implementation using Uniswap v4 alongside off-chain AI co-processors.
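For the Gaussian price model, the abstract relies on Kalman filtering. As a generic reference point only, the sketch below is a textbook scalar Kalman filter for a random-walk price observed through Gaussian noise. The random-walk model, the variances `q` and `r`, and the direct noisy observation `z` are assumptions for illustration; the paper instead infers the external price from trader behavior without price oracles, which this sketch does not model.

```python
from dataclasses import dataclass

@dataclass
class ScalarKalman:
    """Kalman filter for p_k = p_{k-1} + w_k (random walk), observed as z_k = p_k + v_k.

    q: variance of the process noise w_k; r: variance of the observation noise v_k.
    Both are assumed known here.
    """
    mean: float
    var: float
    q: float
    r: float

    def step(self, z: float) -> float:
        # Predict: a random walk keeps the mean and adds process variance.
        prior_var = self.var + self.q
        # Update: weight the innovation (z - mean) by the Kalman gain.
        gain = prior_var / (prior_var + self.r)
        self.mean += gain * (z - self.mean)
        self.var = (1.0 - gain) * prior_var
        return self.mean

kf = ScalarKalman(mean=100.0, var=4.0, q=0.01, r=1.0)
estimates = [kf.step(z) for z in (101.2, 100.8, 101.5, 101.1)]
```

The ratio of `q` to `r` sets how quickly the estimate follows new information: a larger process variance makes the filter trust recent observations more.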
- [135] arXiv:2406.16877 (replaced) [pdf, html, other]
Title: Rational-Exponent Filters with Applications to Generalized Exponent Filters
Comments: 14 pages, 9 figures, 2 tables, 32 equations. Submitted to IEEE TCAS-I
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)
We present filters with rational exponents in order to provide a continuum of filter behavior not classically achievable. We discuss their stability, the flexibility they afford, and various representations useful for analysis, design, and implementation. We do this for a generalization of second-order filters which we refer to as rational-exponent Generalized Exponent Filters (GEFs), and which are useful for a diverse array of applications. We present equivalent representations for rational-exponent GEFs in the time and frequency domains: transfer functions, impulse responses, and integral expressions, the last of which allows for efficient real-time processing without preprocessing requirements. Rational-exponent filters place filter characteristics on a continuum rather than restricting them to discrete values, giving greater flexibility in the behavior of these filters without adding complexity to causality and stability analyses compared with classical filters. In the case of GEFs, this allows arbitrary continuous rather than discrete values for filter characteristics such as (1) the ratio of the 3dB quality factor to the maximum group delay, which is particularly important for filterbanks that have simultaneous requirements on frequency selectivity and synchronization; and (2) the ratio of the 3dB to 15dB quality factors, which dictates the shape of the frequency response magnitude.
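A rough worked example of item (2): assume, as a sharp-filter approximation rather than the paper's exact expressions, that the base second-order section's squared magnitude near the peak is about 1/(1 + x^2), with x a normalized frequency offset. Raising the filter to exponent B then puts the n dB points at x_n = sqrt(10^(n/(10B)) - 1), and since a quality factor is inversely proportional to its bandwidth, Q3/Q15 = x_15/x_3, which varies continuously with B. The sketch uses `Fraction` exponents to emphasize that B can be any rational number; the function names are hypothetical.

```python
from fractions import Fraction
import math

def ndb_halfwidth(n_db: float, b: float) -> float:
    """Normalized half-width x_n at which (1 + x^2)^(-b) has dropped by n_db decibels.

    Sharp-filter assumption: |H2|^2 ~ 1 / (1 + x^2) near the peak, so
    10 * b * log10(1 + x^2) = n_db  =>  x_n = sqrt(10**(n_db / (10 * b)) - 1).
    """
    return math.sqrt(10.0 ** (n_db / (10.0 * b)) - 1.0)

def q3_over_q15(b: float) -> float:
    """Ratio of 3 dB to 15 dB quality factors; Q_n is inversely proportional to x_n."""
    return ndb_halfwidth(15.0, b) / ndb_halfwidth(3.0, b)

for b in (Fraction(1), Fraction(5, 4), Fraction(3, 2), Fraction(2)):
    print(b, round(q3_over_q15(float(b)), 3))
```

Under this approximation the ratio falls from about 5.55 at B = 1 toward sqrt(5), roughly 2.24, as B grows, the limit in which the peak becomes Gaussian-shaped.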
- [136] arXiv:2407.04506 (replaced) [pdf, other]
Title: Balancing Operators Risk Averseness in Model Predictive Control for Real-time Reservoir Flood Control
Authors: Ja-Ho Koo (1 and 2), Edo Abraham (2), Andreja Jonoski (1), Dimitri P. Solomatine (1 and 2 and 3) ((1) Department of Hydroinformatics and Socio-Technical Innovation, IHE Delft, (2) Department of Water Management, Faculty of Civil Engineering and Geosciences, Delft University of Technology, (3) Department of river basins hydrology, Water Problems Institute of RAS, Moscow, Russia)
Comments: This article was published at the Journal of Hydroinformatics in 2025. (Koo, Ja-Ho, Edo Abraham, Andreja Jonoski, and Dimitri P. Solomatine. Balancing operators risk averseness in model predictive control for real-time reservoir flood control. Journal of Hydroinformatics (2025): jh2025019.)
Journal-ref: Journal of Hydroinformatics jh2025019 (2025)
Subjects: Systems and Control (eess.SY)
Model Predictive Control (MPC) is an optimal control strategy suited for flood control of water resources infrastructure. Despite many studies on reservoir flood control and their theoretical contribution, optimisation methodologies have not been widely applied in real-time operation due to disparities between research assumptions and practical requirements. To address this gap, we include practical objectives, such as minimising the magnitude and frequency of changes in the existing outflow schedule. Incorporating these objectives transforms the problem into a multi-objective nonlinear optimisation problem that is difficult to solve in real time. Additionally, it is reasonable to assume that the weights and some parameters, which represent the operators' preferences, vary depending on the system state. To overcome these limitations, we propose a framework that converts the original intractable problem into parameterised linear MPC problems with dynamic optimisation of weights and parameters. This is done by introducing a model-based learning concept. We refer to this framework as Parameterised Dynamic MPC (PD-MPC). The effectiveness of this framework is demonstrated through a numerical experiment for the Daecheong multipurpose reservoir in South Korea. We find that PD-MPC outperforms standard MPC-based designs without a dynamic optimisation process for the objective weights and model parameters. Moreover, we demonstrate that the weights and parameters vary with changing hydrological conditions.
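To see how such practical objectives enter the optimisation, here is a minimal linear MPC sketch for a single reservoir, assuming a mass balance S_k = S_{k-1} + dt (I_k - Q_k), a storage target, and quadratic penalties on deviating from and on changing the existing outflow schedule. Constraints are omitted so the problem reduces to weighted least squares, and the weights are fixed by hand; the function name, weights and model are hypothetical, and this is not PD-MPC, which optimises weights and parameters dynamically.

```python
import numpy as np

def plan_outflows(s0, inflow, q_sched, s_target, dt=1.0,
                  w_storage=1.0, w_sched=1.0, w_change=1.0):
    """Outflow plan Q minimizing a weighted sum of three quadratic objectives.

    Storage at the end of step k: S_k = s0 + dt * sum_{j<=k} (I_j - Q_j).
    Cost = w_storage * ||S - s_target||^2        (flood-control target)
         + w_sched   * ||Q - q_sched||^2         (magnitude of schedule changes)
         + w_change  * ||Q[1:] - Q[:-1]||^2      (frequency of outflow changes)
    Without constraints this is an ordinary least-squares problem in Q.
    """
    inflow = np.asarray(inflow, dtype=float)
    q_sched = np.asarray(q_sched, dtype=float)
    n = inflow.size
    cumsum = np.tril(np.ones((n, n)))          # maps per-step flows to cumulative volume
    diff = np.diff(np.eye(n), axis=0)           # maps Q to Q[1:] - Q[:-1]
    # Storage residual: S - s_target = (-dt * cumsum) @ Q - (s_target - s0 - dt * cumsum @ inflow)
    a = np.vstack([np.sqrt(w_storage) * (-dt * cumsum),
                   np.sqrt(w_sched) * np.eye(n),
                   np.sqrt(w_change) * diff])
    rhs = np.concatenate([np.sqrt(w_storage) * (s_target - s0 - dt * cumsum @ inflow),
                          np.sqrt(w_sched) * q_sched,
                          np.zeros(n - 1)])
    q, *_ = np.linalg.lstsq(a, rhs, rcond=None)
    return q

q_plan = plan_outflows(s0=100.0, inflow=[5, 8, 12, 9, 6], q_sched=[6, 6, 6, 6, 6],
                       s_target=100.0, w_storage=1.0, w_sched=0.5, w_change=2.0)
```

Raising `w_sched` pins the plan to the existing schedule, while raising `w_storage` makes outflows track inflows to hold the target; choosing these weights per system state is the part PD-MPC automates.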
- [137] arXiv:2408.06999 (replaced) [pdf, html, other]
Title: Robust Model Predictive Control for Aircraft Intent-Aware Collision Avoidance
Comments: 8 Pages, 10 Figs, Accepted for presentation at ECC 2025
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This paper presents the use of robust model predictive control for the design of an intent-aware collision avoidance system for multi-agent aircraft engaged in horizontal maneuvering scenarios. We assume that information from other agents is accessible in the form of waypoints or destinations. Consequently, we consider that other agents follow their optimal Dubins path (a trajectory that connects their current state to their intended state) while accounting for potential uncertainties. We propose scenario tree model predictive control as a robust and computationally efficient approach. We demonstrate that the proposed method can easily integrate intent information and offers a robust scheme that handles different uncertainties. The method is illustrated through simulation results.
- [138] arXiv:2408.13065 (replaced) [pdf, html, other]
Title: SIMPLE: Simultaneous Multi-Plane Self-Supervised Learning for Isotropic MRI Restoration from Anisotropic Data
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Magnetic resonance imaging (MRI) is crucial in diagnosing various abdominal conditions and anomalies. Traditional MRI scans often yield anisotropic data due to technical constraints, resulting in varying resolutions across spatial dimensions, which limits diagnostic accuracy and volumetric analysis. Super-resolution (SR) techniques aim to address these limitations by reconstructing isotropic high-resolution images from anisotropic data. However, current SR methods often depend on indirect mappings and scarce 3D isotropic data for training, primarily focusing on two-dimensional enhancements rather than achieving genuine three-dimensional isotropy. We introduce SIMPLE, a Simultaneous Multi-Plane Self-Supervised Learning approach for isotropic MRI restoration from anisotropic data. Our method leverages existing anisotropic clinical data acquired in different planes, bypassing the need for simulated downsampling processes. By considering the inherent three-dimensional nature of MRI data, SIMPLE ensures realistic isotropic data generation rather than solely improving through-plane slices. This approach's flexibility allows it to be extended to multiple contrast types and acquisition methods commonly used in clinical settings. Our experiments on two distinct datasets (brain and abdomen) show that SIMPLE outperforms state-of-the-art methods quantitatively using the Kernel Inception Distance (KID), semi-quantitatively through radiologist evaluations, and qualitatively through Fourier domain analysis. The generated isotropic volumes facilitate more accurate volumetric analysis and 3D reconstructions, promising significant improvements in clinical diagnostic capabilities.
- [139] arXiv:2409.12560 (replaced) [pdf, html, other]
Title: AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
Comments: Accepted by ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Current text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and difficulties due to the requirement for reference frame-level conditions. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control information. To further enhance audio generative modeling, we employ flow-based diffusion transformers with the cross-attention mechanism to incorporate text descriptions effectively into audio generation processes, which can not only simultaneously consider the content and style information in the text inputs, but also accelerate generation compared to other architectures. Furthermore, we propose a novel and comprehensive automatic data simulation pipeline to construct data with fine-grained text descriptions, which significantly alleviates the problem of data scarcity in the area. Experiments demonstrate the effectiveness of our framework using solely NLDs as inputs for content specification and style control. The generation quality and controllability surpass state-of-the-art TTA models, even with a smaller model size.
- [140] arXiv:2410.02903 (replaced) [pdf, html, other]
Title: Dissipative Avoidance Feedback for Reactive Navigation Under Second-Order Dynamics
Comments: 8 pages, 6 figures
Subjects: Systems and Control (eess.SY)
This paper addresses the problem of autonomous robot navigation in unknown, obstacle-filled environments with second-order dynamics by proposing a Dissipative Avoidance Feedback (DAF). Compared to the Artificial Potential Field (APF), which primarily uses repulsive forces based on position, DAF employs a dissipative feedback mechanism that accounts for both position and velocity, contributing to smoother and more natural obstacle avoidance. The proposed continuously differentiable controller solves the motion-to-goal problem while guaranteeing collision-free navigation by using the robot's state and local obstacle distance information. We show that the controller guarantees safe navigation in generic $n$-dimensional environments and that all undesired $\omega$-limit points are unstable under certain controlled curvature conditions. Designed for real-time implementation, DAF requires only locally measured data from limited-range sensors (e.g., LiDAR, depth cameras), making it particularly effective for robots navigating unknown workspaces. Simulations in 2D and 3D environments are conducted to validate the theoretical results and to showcase the effectiveness of our approach.
- [141] arXiv:2411.04472 (replaced) [pdf, other]
Title: Accurate Calculation of Switching Events in Electromagnetic Transient Simulation Considering State Variable Discontinuities
Comments: Accepted by the 2025 IEEE PES General Meeting
Subjects: Systems and Control (eess.SY)
Accurate calculation of switching events is important for electromagnetic transient simulation to obtain reliable results. The common presumption that differential state variables are continuous can prevent accurate calculation, leading to unreliable results. This paper explores how to calculate switching events accurately without presuming continuous differential state variables. We show that such calculation is possible by proposing related methods. The feasibility and accuracy of the proposed methods are demonstrated and analyzed via numerical case studies.
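For context, a common baseline in EMT simulation locates a switching instant inside a time step by linearly interpolating the switching criterion, for example a diode or breaker current crossing zero. The sketch below shows only that baseline step with a hypothetical function name; it is not the paper's method, and the linear-variation assumption it relies on is precisely what can fail when state variables are discontinuous.

```python
def locate_zero_crossing(t0: float, t1: float, i0: float, i1: float):
    """Interpolated time in [t0, t1] at which a sampled current reaches zero.

    Returns None if the current keeps the same nonzero sign over the step.
    Assumes the current varies linearly between the two samples.
    """
    if i0 == 0.0:
        return t0
    if i0 * i1 > 0.0:
        return None
    # Solve i0 + (i1 - i0) * (t - t0) / (t1 - t0) = 0 for t.
    return t0 + (t1 - t0) * i0 / (i0 - i1)

# A current falling from 2 A to -2 A over a 1 microsecond step crosses zero mid-step.
t_switch = locate_zero_crossing(0.0, 1e-6, 2.0, -2.0)
```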
- [142] arXiv:2411.07640 (replaced) [pdf, html, other]
Title: Minimally Conservative Controlled-Invariant Set Synthesis Using Control Barrier Certificates
Subjects: Systems and Control (eess.SY)
Finding a controlled-invariant set for a system with state and control constraints is crucial for safety-critical applications. However, existing methods often produce overly conservative solutions. This paper presents a method for generating controlled-invariant (safe) sets for nonlinear polynomial control-affine systems using Control Barrier Certificates (CBCs). We formulate CBC conditions as Sum-of-Squares (SOS) constraints and solve them via an SOS Program (SOSP). First, we generalize existing SOSPs for CBC synthesis to handle environments with complex unsafe state representations. Then, we propose an iterative algorithm that progressively enlarges the safe set constructed by the synthesized CBCs by maximizing boundary expansion at each iteration. We theoretically prove that our method guarantees strict safe set expansion at every step. Finally, we validate our approach with numerical simulations in 2D and 3D for single-input and multi-input systems. Empirical results show that, in most cases, the safe set generated by our method covers a larger portion of the state space than those of two state-of-the-art techniques.
- [143] arXiv:2411.07642 (replaced) [pdf, html, other]
Title: Safety Filter Design for Articulated Frame Steering Vehicles In the Presence of Actuator Dynamics Using High-Order Control Barrier Functions
Subjects: Systems and Control (eess.SY)
Articulated Frame Steering (AFS) vehicles are widely used in heavy-duty industries, where they often operate near operators and laborers. Therefore, designing safe controllers for AFS vehicles is essential. In this paper, we develop a Quadratic Program (QP)-based safety filter that ensures feasibility for AFS vehicles with affine actuator dynamics. To achieve this, we first derive the general equations of motion for AFS vehicles, incorporating affine actuator dynamics. We then introduce a novel High-Order Control Barrier Function (HOCBF) candidate with equal relative degrees for both system controls. Finally, we design a Parametric Adaptive HOCBF (PACBF) and an always-feasible, QP-based safety filter. Numerical simulations of AFS vehicle kinematics demonstrate the effectiveness of our approach.
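As a generic illustration of an HOCBF-based QP safety filter (not the AFS model, and without actuator dynamics), consider a double integrator x'' = u that must keep x <= x_max. The barrier h = x_max - x has relative degree 2; with psi1 = h' + alpha1*h and linear class-K gains alpha1, alpha2 (assumed values), the condition psi1' + alpha2*psi1 >= 0 becomes u <= alpha1*alpha2*(x_max - x) - (alpha1 + alpha2)*v. With a single input and a single constraint, the QP min (u - u_ref)^2 has the closed-form solution min(u_ref, bound). The function name is hypothetical.

```python
def hocbf_filter(u_ref: float, x: float, v: float, x_max: float,
                 alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Minimally modify u_ref so that a double integrator x'' = u keeps x <= x_max.

    h = x_max - x,  psi1 = h' + alpha1*h = -v + alpha1*(x_max - x),
    psi1' + alpha2*psi1 >= 0  <=>  u <= alpha1*alpha2*(x_max - x) - (alpha1 + alpha2)*v.
    The one-constraint QP  min (u - u_ref)^2  is solved by clipping to that bound.
    """
    u_bound = alpha1 * alpha2 * (x_max - x) - (alpha1 + alpha2) * v
    return min(u_ref, u_bound)

# Forward-Euler rollout: the nominal controller keeps pushing toward the limit at x = 1.
x, v, dt = 0.0, 0.0, 1e-3
for _ in range(20_000):
    u = hocbf_filter(1.0, x, v, x_max=1.0)
    x, v = x + dt * v, v + dt * u
```

With more inputs or constraints the clipping no longer suffices and a QP solver is needed, which is where feasibility guarantees such as those in the abstract become important.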
- [144] arXiv:2411.16187 (replaced) [pdf, html, other]
Title: Goal-oriented Semantic Communications for Metaverse Construction via Generative AI and Optimal Transport
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
The emergence of the metaverse has boosted productivity and creativity, driving real-time updates and personalized content, which will substantially increase data traffic. However, current bit-oriented communication networks struggle to manage this high volume of dynamic information, restricting the interactivity of metaverse applications. To address this research gap, we propose a goal-oriented semantic communication (GSC) framework for the metaverse. Building on an existing metaverse wireless construction task, our proposed GSC framework includes an hourglass network-based (HgNet) encoder to extract semantic information of objects in the metaverse, and a semantic decoder that uses this extracted information to reconstruct the metaverse content after wireless transmission, enabling efficient communication and real-time updates of object behaviour in the scenery for the metaverse construction task. To overcome wireless channel noise at the receiver, we design an optimal transport (OT)-enabled semantic denoiser, which enhances the accuracy of metaverse scenery through wireless communication. Experimental results show that, compared to conventional metaverse construction, our proposed GSC framework significantly reduces wireless metaverse construction latency by 92.6%, while improving metaverse object status accuracy and viewing experience by 45.6% and 44.7%, respectively.
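The abstract does not say how the optimal transport (OT) problem inside the denoiser is solved. A widely used option for discrete distributions is entropic regularization solved by Sinkhorn iterations; the sketch below computes such a transport plan between two histograms. The regularization `eps`, the iteration count and the toy cost are assumptions, and the sketch shows only the OT computation, not the paper's semantic denoiser.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, n_iter=1000):
    """Entropy-regularized optimal-transport plan between histograms a and b.

    Alternately rescales the Gibbs kernel K = exp(-cost / eps) so that the plan
    diag(u) @ K @ diag(v) has row sums a and column sums b.
    """
    k = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (k @ v)
        v = b / (k.T @ u)
    return u[:, None] * k * v[None, :]

# Toy example: move mass from five points on [0, 1] to the same points shifted by 0.1.
x = np.linspace(0.0, 1.0, 5)
y = x + 0.1
cost = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 0.2)
b = np.full(5, 0.2)
plan = sinkhorn(a, b, cost)
transport_cost = float((plan * cost).sum())
```

Smaller `eps` brings the plan closer to unregularized OT but needs more iterations and risks underflow in `exp(-cost / eps)`; log-domain implementations are the usual remedy.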
- [145] arXiv:2412.00422 (replaced) [pdf, html, other]
Title: IRS Aided Federated Learning: Multiple Access and Fundamental Tradeoff
Subjects: Signal Processing (eess.SP); Distributed, Parallel, and Cluster Computing (cs.DC)
This paper investigates an intelligent reflecting surface (IRS) aided wireless federated learning (FL) system, where an access point (AP) coordinates multiple edge devices to train a machine learning model without sharing their own raw data. During the training process, we exploit joint channel reconfiguration via IRS and resource allocation design to reduce the latency of an FL task. In particular, we propose three transmission protocols for assisting the local model uploading from multiple devices to an AP, namely IRS aided time division multiple access (I-TDMA), IRS aided frequency division multiple access (I-FDMA), and IRS aided non-orthogonal multiple access (I-NOMA), to investigate the impact of IRS on the multiple access for FL. Under the three protocols, we minimize the per-round latency subject to a given training loss by jointly optimizing the device scheduling, IRS phase-shifts, and communication-computation resource allocation. For the associated problem under I-TDMA, an efficient algorithm is proposed to solve it optimally by exploiting its intrinsic structure, whereas high-quality solutions of the problems under I-FDMA and I-NOMA are obtained by invoking a successive convex approximation (SCA) based approach. Then, we further develop a theoretical framework for the performance comparison of the proposed three transmission protocols. Sufficient conditions under which I-TDMA outperforms I-NOMA, and vice versa, are unveiled, in contrast to systems without IRS, where NOMA always outperforms TDMA. Simulation results validate our theoretical findings and also demonstrate the usefulness of IRS for enhancing the fundamental tradeoff between the learning latency and learning accuracy.
- [146] arXiv:2412.04648 (replaced) [pdf, html, other]
Title: Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
Subjects: Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Recorrupted-to-Recorrupted (R2R) has emerged as a methodology for training deep networks for image restoration in a self-supervised manner from noisy measurement data alone, demonstrating equivalence in expectation to the supervised squared loss in the case of Gaussian noise. However, its effectiveness with non-Gaussian noise remains unexplored. In this paper, we propose Generalized R2R (GR2R), extending the R2R framework to handle a broader class of noise distributions: additive noise such as log-Rayleigh noise, and the natural exponential family, including Poisson and Gamma noise distributions, which play a key role in many applications including low-photon imaging and synthetic aperture radar. We show that the GR2R loss is an unbiased estimator of the supervised loss and that the popular Stein's unbiased risk estimator can be seen as a special case. A series of experiments with Gaussian, Poisson, and Gamma noise validate GR2R's performance, showing its effectiveness compared to other self-supervised methods.
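The Gaussian case that GR2R generalizes can be checked numerically. From a single noisy image y = x + n with n ~ N(0, sigma^2), one draws w ~ N(0, sigma^2) and forms an input y1 = y + tau*w and a target y2 = y - w/tau; the two noises n + tau*w and n - w/tau have covariance sigma^2 - sigma^2 = 0 and, being jointly Gaussian, are independent. This `tau` parameterization is one common way to write R2R and is an assumption here; GR2R's own construction for Poisson, Gamma and other noise is not reproduced.

```python
import numpy as np

def recorrupt(y, sigma, tau, rng):
    """Gaussian R2R pair from one noisy observation y = x + n, n ~ N(0, sigma^2).

    Returns (input, target) = (y + tau*w, y - w/tau) with w ~ N(0, sigma^2), whose
    noises are uncorrelated (and, being jointly Gaussian, independent).
    """
    w = rng.normal(0.0, sigma, size=np.shape(y))
    return y + tau * w, y - w / tau

rng = np.random.default_rng(0)
sigma, tau = 0.5, 0.7
x = np.zeros(200_000)                       # clean signal (zeros make the noise visible)
y = x + rng.normal(0.0, sigma, size=x.shape)
y_in, y_target = recorrupt(y, sigma, tau, rng)
noise_corr = np.corrcoef(y_in - x, y_target - x)[0, 1]
```

Because the target noise is zero-mean and independent of the input, training a denoiser f on ||f(y_in) - y_target||^2 matches, in expectation and up to a constant, the supervised loss ||f(y_in) - x||^2.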
- [147] arXiv:2412.05580 (replaced) [pdf, html, other]
Title: Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces
Authors: Hao-Chun Yang, Sicheng Dai, Saige Rutherford, Christian Gaser, Andre F Marquand, Christian F Beckmann, Thomas Wolfers
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Unsupervised anomaly detection in brain imaging is challenging. In this paper, we propose self-supervised masked mesh learning for unsupervised anomaly detection on 3D cortical surfaces. Our framework leverages the intrinsic geometry of the cortical surface to learn a self-supervised representation that captures the underlying structure of the brain. We introduce a masked mesh convolutional neural network (MMN) that learns to predict masked regions of the cortical surface. By training the MMN on a large dataset of healthy subjects, we learn a representation that captures the normal variation in the cortical surface. We then use this representation to detect anomalies in unseen individuals by calculating anomaly scores based on the reconstruction error of the MMN. We evaluated our framework by training on the population-scale datasets UKB and HCP-Aging and testing on two datasets of Alzheimer's disease patients, ADNI and OASIS3. Our results show that our framework can detect anomalies in cortical thickness, cortical volume, and cortical sulcus characteristics, which are known to be biomarkers of Alzheimer's disease. Our proposed framework provides a promising approach for unsupervised anomaly detection based on normative variation of cortical features.
- [148] arXiv:2412.10031 (replaced) [pdf, html, other]
Title: FM2S: Towards Spatially-Correlated Noise Modeling in Zero-Shot Fluorescence Microscopy Image Denoising
Comments: 14 pages, 10 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Fluorescence microscopy image (FMI) denoising faces critical challenges due to the compound mixed Poisson-Gaussian noise with strong spatial correlation and the impracticality of acquiring paired noisy/clean data in dynamic biomedical scenarios. While supervised methods trained on synthetic noise (e.g., Gaussian/Poisson) suffer from out-of-distribution generalization issues, existing self-supervised approaches degrade under real FMI noise due to oversimplified noise assumptions and computationally intensive deep architectures. In this paper, we propose Fluorescence Micrograph to Self (FM2S), a zero-shot denoiser that achieves efficient FMI denoising through three key innovations: 1) a noise injection module that ensures training data sufficiency through adaptive Poisson-Gaussian synthesis while preserving the spatial correlation and global statistics of FMI noise for robust model generalization; 2) a two-stage progressive learning strategy that first recovers structural priors via pre-denoised targets and then refines high-frequency details through noise distribution alignment; and 3) an ultra-lightweight network (3.5k parameters) enabling rapid convergence with 270$\times$ faster training and inference than state-of-the-art methods. Extensive experiments across FMI datasets demonstrate FM2S's superiority: it outperforms CVF-SID by 1.4dB PSNR on average while requiring 0.1% of the parameters of AP-BSN. Notably, FM2S maintains stable performance across varying noise levels, proving its practicality for microscopy platforms with diverse sensor characteristics. Code and datasets will be released.
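The noise-injection idea can be illustrated with a simple Poisson-Gaussian synthesizer: a signal-dependent Poisson component scaled by a gain, plus Gaussian read noise that is box-blurred so neighboring pixels are correlated. The gain, `sigma`, the 3x3 box kernel and the function name are assumptions; FM2S's adaptive module estimates its noise parameters from the data, which this sketch does not do.

```python
import numpy as np

def inject_poisson_gaussian(img, gain, sigma, rng, kernel_size=3):
    """Synthesize Poisson-Gaussian noise with a spatially correlated Gaussian part.

    img: non-negative image (photon-scaled intensities).
    Signal-dependent part: gain * Poisson(img / gain), with mean img and variance gain*img.
    Signal-independent part: white Gaussian noise blurred by a separable box filter,
    so neighboring pixels are correlated, then rescaled to standard deviation sigma.
    """
    shot = gain * rng.poisson(img / gain)
    white = rng.normal(0.0, 1.0, size=img.shape)
    box = np.ones(kernel_size) / kernel_size
    blur_rows = np.apply_along_axis(lambda r: np.convolve(r, box, mode="same"), 1, white)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, box, mode="same"), 0, blur_rows)
    read = sigma * blurred / blurred.std()
    return shot + read

rng = np.random.default_rng(1)
clean = np.full((128, 128), 50.0)
noisy = inject_poisson_gaussian(clean, gain=0.01, sigma=1.0, rng=rng)
resid = noisy - clean
neighbor_corr = np.corrcoef(resid[:, :-1].ravel(), resid[:, 1:].ravel())[0, 1]
```

The horizontal neighbor correlation of the residual is clearly positive, which white Poisson or Gaussian noise alone would not produce; that spatial correlation is the property the abstract says simpler synthetic-noise models miss.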
- [149] arXiv:2412.12009 (replaced) [pdf, html, other]
Title: SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Authors: Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen
Comments: Accepted at IEEE ICME 2025. Project page: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models' ability to extract critical details from approximately 90-second spoken inputs. While current Speech LLMs excel at short-form tasks, they struggle with the computational and representational demands of longer audio sequences. To address this limitation, we propose SpeechPrune, a training-free token pruning strategy that uses speech-text similarity and approximated attention scores to efficiently discard irrelevant tokens. On SPIRAL, at a pruning rate of 20%, SpeechPrune improves accuracy by 29% over the original model and by up to 47% over random pruning. SpeechPrune can maintain network performance even at a pruning level of 80%. This approach highlights the potential of token-level pruning for efficient and scalable long-form speech understanding.
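A simplified sketch of similarity-based token pruning: each speech token is scored by its best cosine similarity to the text-query tokens, and the lowest-scoring fraction is dropped while the survivors keep their original order. This assumes speech and text embeddings already share a space and omits SpeechPrune's approximated attention scores; the function name and interface are hypothetical.

```python
import numpy as np

def prune_speech_tokens(speech, text, prune_rate=0.2):
    """Drop the fraction `prune_rate` of speech tokens least similar to the text query.

    speech: (n_speech, d) and text: (n_text, d) embeddings in a shared space.
    Returns the kept indices (in original order) and the kept embeddings.
    """
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    scores = (s @ t.T).max(axis=1)                  # best cosine match per speech token
    n_keep = max(1, int(round(len(speech) * (1.0 - prune_rate))))
    keep = np.sort(np.argsort(-scores)[:n_keep])    # top scores, restored to time order
    return keep, speech[keep]

rng = np.random.default_rng(0)
speech = rng.normal(size=(10, 4))
text = rng.normal(size=(2, 4))
speech[3] = 2.0 * text[0]                           # a token that matches the query exactly
keep, kept = prune_speech_tokens(speech, text, prune_rate=0.2)
```

Restoring time order after selection matters because the pruned sequence is still consumed by a model that expects tokens in temporal order.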
- [150] arXiv:2412.14846 (replaced) [pdf, html, other]
Title: Head and Neck Tumor Segmentation of MRI from Pre- and Mid-radiotherapy with Pre-training, Data Augmentation and Dual Flow UNet
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Delineating head and neck tumors and metastatic lymph nodes is crucial for treatment planning and prognostic analysis. Accurate segmentation and quantitative analysis of these structures require pixel-level annotation, making automated segmentation techniques essential for the diagnosis and treatment of head and neck cancer. In this study, we investigated the effects of multiple strategies on the segmentation of pre-radiotherapy (pre-RT) and mid-radiotherapy (mid-RT) images. For the segmentation of pre-RT images, we utilized: 1) a fully supervised learning approach, and 2) the same approach enhanced with pre-trained weights and the MixUp data augmentation technique. For mid-RT images, we introduced a novel computationally efficient network architecture that features separate encoders for mid-RT images and for registered pre-RT images with their labels. The mid-RT encoder branch progressively integrates information from pre-RT images and labels during forward propagation. We selected the highest-performing model from each fold and used their predictions to create an ensemble average for inference. In the final test, our models (submitted as team HiLab) achieved aggregated Dice Similarity Coefficients (DSC) of 82.38% for pre-RT and 72.53% for mid-RT segmentation. Our code is available at this https URL.
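The scores above use an aggregated Dice Similarity Coefficient. Assuming "aggregated" means pooling intersections and volumes over all cases before taking the ratio (a definition used in some segmentation challenges; the abstract does not spell it out), the sketch contrasts it with the mean of per-case Dice. The empty-mask convention (returning 1.0) is also an assumption.

```python
import numpy as np

def dice(pred, gt):
    """Per-case Dice: 2|P ∩ G| / (|P| + |G|); defined as 1.0 when both masks are empty."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def dice_aggregated(preds, gts):
    """Aggregated Dice: pool intersections and volumes over all cases, then take the ratio."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    denom = sum(p.sum() + g.sum() for p, g in zip(preds, gts))
    return 2.0 * inter / denom if denom else 1.0

# A large structure segmented perfectly and a small one missed entirely.
preds = [np.ones(100, dtype=bool), np.zeros(4, dtype=bool)]
gts = [np.ones(100, dtype=bool), np.ones(4, dtype=bool)]
mean_dice = np.mean([dice(p, g) for p, g in zip(preds, gts)])
agg_dice = dice_aggregated(preds, gts)
```

Here the mean per-case Dice is 0.5 while the aggregated Dice is 200/204, about 0.98, because pooling weights each case by its volume and so is less sensitive to small, missed structures.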
- [151] arXiv:2412.19475 (replaced) [pdf, html, other]
Title: Exploiting Dynamic Sparsity for Near-Field Spatial Non-Stationary XL-MIMO Channel Tracking
Comments: 13 pages, 11 figures. Submitted to IEEE TSP
Subjects: Signal Processing (eess.SP)
This work considers a spatially non-stationary channel tracking problem in broadband extremely large-scale multiple-input-multiple-output (XL-MIMO) systems. In the spatially non-stationary case, each scatterer has a certain visibility region (VR) over the antennas, and the power may vary across visible antennas. Concentrating on the temporal correlation of XL-MIMO channels, we design a three-layer Markov prior model and a hierarchical two-dimensional (2D) Markov model to exploit the dynamic sparsity of sparse channel vectors and VRs, respectively. Then, we formulate the channel tracking problem as a bilinear measurement process, and a novel dynamic alternating maximum a posteriori (DA-MAP) framework is developed to solve the problem. The DA-MAP contains four basic modules: channel estimation module, VR detection module, grid update module, and temporal correlated module. Specifically, the first module is an inverse-free variational Bayesian inference (IF-VBI) estimator that avoids computationally intensive matrix inversion in each iteration; the second module is a turbo compressive sensing (Turbo-CS) algorithm that only needs small-scale matrix operations in a parallel fashion; the third module refines the polar-delay domain grid; and the fourth module processes the temporal prior information to ensure high-efficiency channel tracking. Simulations show that the proposed method achieves strong channel tracking performance while maintaining low computational overhead.
- [152] arXiv:2501.00641 (replaced) [pdf, html, other]
Title: Rethink Delay Doppler Channels and Time-Frequency Coding
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
In this paper, we rethink delay Doppler channels (also called doubly selective channels). We prove that no modulation scheme (including the currently active VOFDM/OTFS) can compensate well for a non-trivial Doppler spread. We then discuss some of the existing methods to deal with time-varying channels, in particular time-frequency (TF) coding in an OFDM system. TF coding is mathematically equivalent to space-time coding. We also summarize the state of the art in space-time coding, which was an active research topic over a decade ago.
- [153] arXiv:2501.08163 (replaced) [pdf, html, other]
Title: DH-Mamba: Exploring Dual-domain Hierarchical State Space Models for MRI Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accelerated MRI reconstruction poses a challenging ill-posed inverse problem due to significant undersampling in k-space. Deep neural networks, such as CNNs and ViTs, have shown substantial performance improvements for this task but face a dilemma between global receptive fields and efficient computation. To this end, this paper explores selective state space models (Mamba), a new paradigm for long-range dependency modeling with linear complexity, for efficient and effective MRI reconstruction. However, directly applying Mamba to MRI reconstruction faces three significant issues: (1) Mamba typically flattens 2D images into distinct 1D sequences along rows and columns, disrupting k-space's unique spectrum and leaving its potential in k-space learning unexplored. (2) Existing approaches adopt multi-directional lengthy scanning to unfold images at the pixel level, leading to long-range forgetting and high computational burden. (3) Mamba struggles with spatially-varying contents, resulting in limited diversity of local representations. To address these issues, we propose a dual-domain hierarchical Mamba for MRI reconstruction from the following perspectives: (1) We pioneer vision Mamba in k-space learning. A circular scanning scheme is customized for spectrum unfolding, benefiting the global modeling of k-space. (2) We propose a hierarchical Mamba with an efficient scanning strategy in both image and k-space domains. It mitigates long-range forgetting and achieves a better trade-off between efficiency and performance. (3) We develop a local diversity enhancement module to improve the spatially-varying representation of Mamba. Extensive experiments are conducted on three public datasets for MRI reconstruction under various undersampling patterns. Comprehensive results demonstrate that our method significantly outperforms state-of-the-art methods with lower computational cost.
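One plausible reading of circular scanning for k-space is to order the centered k-space grid ring by ring from the DC component outward, sweeping each ring by angle, so the 1D sequence follows the spectrum's radial structure instead of rows and columns. This is an assumption for illustration with a hypothetical function name; the paper's exact scanning pattern and its hierarchical variant are not reproduced.

```python
import numpy as np

def circular_scan_order(h: int, w: int) -> np.ndarray:
    """Flat indices of an h x w centered k-space grid, ordered by radius, then angle.

    Assumes the DC component sits at (h // 2, w // 2), as after np.fft.fftshift.
    """
    yy, xx = np.mgrid[0:h, 0:w]
    dy, dx = yy - h // 2, xx - w // 2
    radius = np.hypot(dy, dx).ravel()
    angle = np.arctan2(dy, dx).ravel()
    return np.lexsort((angle, radius))      # np.lexsort uses the last key as primary

kspace = np.fft.fftshift(np.fft.fft2(np.random.default_rng(0).normal(size=(8, 8))))
order = circular_scan_order(8, 8)
sequence = kspace.ravel()[order]             # 1D sequence fed to a sequence model
inverse = np.argsort(order)                  # undo the scan after processing
restored = sequence[inverse].reshape(8, 8)
```

Keeping the inverse permutation makes the scan lossless, so a sequence model's output can be placed back on the k-space grid.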
- [154] arXiv:2501.16625 (replaced) [pdf, html, other]
Title: An Iterative Bayesian Approach for System Identification based on Linear Gaussian Models
Comments: Submitted to the IEEE CDC
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
We tackle the problem of system identification, where we select inputs, observe the corresponding outputs from the true system, and optimize the parameters of our model to best fit the data. We propose a flexible and computationally tractable methodology that is compatible with any system and parametric family of models. Our approach only requires input-output data from the system and first-order information from the model with respect to the parameters. Our algorithm consists of two modules. First, we formulate the problem of system identification from a Bayesian perspective and use a linear Gaussian model approximation to iteratively optimize the model's parameters. In each iteration, we propose to use the input-output data to tune the covariance of the linear Gaussian model. This statistically calibrates the approach. Second, we define a Gaussian-based uncertainty measure for the model parameters, which we can then minimize with respect to the next selected input. We test our method with linear and nonlinear dynamics.
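The linear Gaussian step can be sketched as follows: linearize the model around the current parameters with its Jacobian (the first-order information the abstract mentions), treat the outputs as Gaussian, and condition the Gaussian parameter belief on the observed data. The noise variance is held fixed here, whereas the paper tunes the covariance from input-output data in each iteration. The next-input rule, choosing the candidate that most reduces the trace of the predicted posterior covariance, is one possible Gaussian-based uncertainty measure and is an assumption; all function names are hypothetical.

```python
import numpy as np

def linear_gaussian_update(theta, cov, model, jac, u, y, noise_var):
    """Condition the belief N(theta, cov) on data y = model(theta_true, u) + noise.

    Linearization: y ~ model(theta, u) + J (theta_true - theta) + e, e ~ N(0, noise_var I).
    """
    j = jac(theta, u)
    s = j @ cov @ j.T + noise_var * np.eye(j.shape[0])     # predicted output covariance
    gain = np.linalg.solve(s, j @ cov).T                     # = cov J^T S^{-1}
    theta_new = theta + gain @ (y - model(theta, u))
    cov_new = cov - gain @ j @ cov
    return theta_new, cov_new

def pick_next_input(theta, cov, jac, candidates, noise_var):
    """Choose the candidate input whose predicted posterior covariance has the smallest trace."""
    def posterior_trace(u):
        j = jac(theta, u)
        s = j @ cov @ j.T + noise_var * np.eye(j.shape[0])
        return np.trace(cov - cov @ j.T @ np.linalg.solve(s, j @ cov))
    return min(candidates, key=posterior_trace)

# Example: a model that is linear in its parameters, y = U @ theta.
def model(theta, u):
    return u @ theta

def jac(theta, u):
    return u

u_data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_true = np.array([2.0, -1.0])
theta_hat, cov_hat = linear_gaussian_update(
    np.zeros(2), 1e3 * np.eye(2), model, jac, u_data, u_data @ theta_true, noise_var=1e-3)
```

For a model that is linear in its parameters a single update recovers the least-squares solution; for nonlinear models the Jacobian changes with theta, which is why the update is applied iteratively.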
- [155] arXiv:2502.10557 (replaced) [pdf, html, other]
-
Title: Can Large Language Model Agents Balance Energy Systems?Subjects: Systems and Control (eess.SY)
This paper presents a hybrid approach that integrates Large Language Models (LLMs) with a multi-scenario Stochastic Unit Commitment (SUC) framework to enhance both efficiency and reliability under high wind generation uncertainties. In a 10-trial study on the test energy system, the traditional SUC approach incurs an average total cost of 187.68 million dollars, whereas the LLM-assisted SUC (LLM-SUC) achieves a mean cost of 185.58 million dollars (range: 182.61 to 188.65 million dollars), corresponding to a mean cost reduction of 1.1 percent and a best-case reduction of 2.7 percent relative to the SUC average. Furthermore, LLM-SUC reduces load curtailment by 26.3 percent (2.24 plus/minus 0.31 GWh versus 3.04 GWh for SUC), while both methods maintain zero wind curtailment. Detailed temporal analysis shows that LLM-SUC achieves lower costs in the majority of time intervals and outperforms SUC in 90 percent of cases, with solutions clustering in a favorable cost-reliability region (Coefficient of Variation = 0.93 percent for total cost and 13.8 percent for load curtailment). By leveraging an LLM agent to guide generator commitment decisions and dynamically adjust to stochastic conditions, the proposed framework improves demand fulfillment and operational resilience.
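The percentages can be checked directly from the reported figures. The short calculation below uses only numbers stated in the abstract; note that the upper end of the LLM-SUC range (188.65) lies above the SUC average, so the worst trial corresponds to a small cost increase rather than a reduction.

```python
suc_mean = 187.68                                       # SUC average total cost, million USD
llm_mean, llm_best, llm_worst = 185.58, 182.61, 188.65  # LLM-SUC mean and range

def reduction_pct(cost):
    """Cost reduction relative to the SUC average, in percent (negative = increase)."""
    return 100.0 * (suc_mean - cost) / suc_mean

mean_red = reduction_pct(llm_mean)            # about 1.12
best_red = reduction_pct(llm_best)            # about 2.70
worst_red = reduction_pct(llm_worst)          # about -0.52, i.e. a cost increase
curtail_red = 100.0 * (3.04 - 2.24) / 3.04    # about 26.3 (load curtailment)
print(f"{mean_red:.1f} {best_red:.1f} {worst_red:.2f} {curtail_red:.1f}")
```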
- [156] arXiv:2502.14630 (replaced) [pdf, html, other]
-
Title: Understanding long-term energy use in off-grid solar home systems in sub-Saharan AfricaComments: Draft updates, including text and figure changesSubjects: Systems and Control (eess.SY)
Solar home systems provide low-cost electricity access for rural off-grid communities. As access to them increases, more long-term data becomes available on how these systems are used throughout their lifetime. This work analyses a dataset of 1,000 systems across sub-Saharan Africa. Dynamic time warping clustering was applied to the load demand data from the systems, identifying five distinct archetypal daily load profiles and their occurrence across the dataset. Temporal analysis reveals a general decline in daily energy consumption over time, with 77% of households reducing their usage compared to the start of ownership. On average, there is a 33% decrease in daily consumption by the end of the second year compared to the peak demand, which occurs on the 96th day. Combining the load demand analysis with payment data shows that this decrease in energy consumption is observed even in households that are not experiencing economic hardship, indicating there are reasons beyond financial constraints for decreasing energy use once energy access is obtained.
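Dynamic time warping (DTW) lets two daily load profiles with the same shape but shifted timing be treated as similar, which a pointwise distance would not. The classic DTW distance is sketched below with an absolute-difference cost; the clustering step itself (for example k-medoids or hierarchical clustering on the pairwise DTW matrix) is not shown, and the paper's exact variant and constraints are not stated in the abstract.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance with |x - y| cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: repeat b[j-1], repeat a[i-1], or advance both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two toy hourly profiles whose evening peak is offset by one slot.
early_peak = [0, 0, 1, 0]
late_peak = [0, 1, 0, 0]
pointwise = sum(abs(x - y) for x, y in zip(early_peak, late_peak))  # 2
warped = dtw_distance(early_peak, late_peak)                        # 0.0
```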
- [157] arXiv:2503.13528 (replaced) [pdf, other]
-
Title: Internet of Things-Based Smart Precision Farming in Soilless Agriculture: Opportunities and Challenges for Global Food SecurityAuthors: Monica Dutta, Deepali Gupta, Sumegh Tharewal, Deepam Goyal, Jasminder Kaur Sandhu, Manjit Kaur, Ahmad Ali Alzubi, Jazem Mutared AlanaziSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
The rapid growth of the global population and the continuous decline in cultivable land pose significant threats to food security. This challenge worsens as climate change further reduces the availability of farmland. Soilless agriculture, such as hydroponics, aeroponics, and aquaponics, offers a sustainable solution by enabling efficient crop cultivation in controlled environments. The integration of the Internet of Things (IoT) with smart precision farming improves resource efficiency, automates environmental control, and ensures stable and high-yield crop production. IoT-enabled smart farming systems utilize real-time monitoring, data-driven decision-making, and automation to optimize water and nutrient usage while minimizing human intervention. This paper explores the opportunities and challenges of IoT-based soilless farming, highlighting its role in sustainable agriculture, urban farming, and global food security. These advanced farming methods ensure greater productivity, resource conservation, and year-round cultivation. However, they also face challenges such as high initial investment, technological dependency, and energy consumption. Through a comprehensive study, bibliometric analysis, and comparative analysis, this research highlights current trends and research gaps. It also outlines future directions for researchers, policymakers, and industry stakeholders to drive innovation and scalability in IoT-driven soilless agriculture. By emphasizing the benefits of vertical farming and Controlled Environment Agriculture (CEA)-enabled soilless techniques, this paper supports informed decision-making to address food security challenges and promote sustainable agricultural innovations.
- [158] arXiv:2503.13583 (replaced) [pdf, html, other]
-
Title: Stability results for MIMO LTI systems via Scaled Relative GraphsComments: Submitted to CDC 2025Subjects: Systems and Control (eess.SY)
This paper proposes a new approach for stability analysis of multi-input, multi-output (MIMO) feedback systems through Scaled Relative Graphs (SRGs). Unlike traditional methods, such as the Generalized Nyquist Criterion (GNC), which relies on a coupled analysis that requires the multiplication of models, our approach enables the evaluation of system stability in a decoupled fashion and provides an intuitive, visual representation of system behavior. Our results provide conditions for certifying the stability of feedback MIMO Linear Time-Invariant (LTI) systems.
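To give a concrete feel for what a scaled relative graph (SRG) is, the sketch below samples the SRG of a real square matrix A, that is, the points (|Au|/|u|) exp(±i·angle(u, Au)) over nonzero vectors u. This covers only a static real matrix; for MIMO LTI systems the graph would be built from the frequency response, which is beyond this sketch, and the sampling approach is an illustrative assumption rather than the paper's construction.

```python
import numpy as np

def srg_samples(A, num=2000, seed=0):
    """Monte Carlo samples of the scaled relative graph of a real square matrix A."""
    rng = np.random.default_rng(seed)
    pts = []
    for _ in range(num):
        u = rng.standard_normal(A.shape[1])
        Au = A @ u
        nu, nAu = np.linalg.norm(u), np.linalg.norm(Au)
        if nAu == 0:                           # u in the null space maps to the origin
            pts.append(0j)
            continue
        cos_t = np.clip(Au @ u / (nAu * nu), -1.0, 1.0)
        theta = np.arccos(cos_t)               # angle between u and Au
        gain = nAu / nu                        # gain |Au| / |u|
        pts.extend([gain * np.exp(1j * theta), gain * np.exp(-1j * theta)])
    return np.array(pts)
```

For example, `2 * I` collapses to the single point 2, while a 90-degree rotation gives the points ±i, matching the intuition that the SRG encodes gain as radius and phase shift as angle.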
- [159] arXiv:2503.14379 (replaced) [pdf, html, other]
-
Title: On the Standard Performance Criteria for Applied Control Design: PID, MPC or Machine Learning Controller?Subjects: Systems and Control (eess.SY)
The traditional control theory and its application to basic and complex systems have reached an advanced level of maturity. This includes aerial, marine, and ground vehicles, as well as robotics, chemical, transportation, and electrical systems widely used in our daily lives. The emerging era of data-driven methods, Large Language Models (LLMs), and AI-based controllers does not indicate a weakness in well-established control theory. Instead, it aims to reduce dependence on models and uncertainties, address increasingly complex systems, and potentially achieve decision-making capabilities comparable to human-level performance. This revolution integrates knowledge from computer science, machine learning, biology, and classical control, producing promising algorithms that are yet to demonstrate widespread real-world applicability. Despite the maturity of control theory and the presence of various performance criteria, there is still a lack of standardised metrics for testing, evaluation, Verification and Validation ($V\&V$) of algorithms. This gap can lead to algorithms that, while optimal in certain aspects, may fall short of practical implementation, sparking debates within the literature. For a controller to succeed in real-world applications, it must satisfy three key categories of performance metrics: tracking quality, control effort (energy consumption), and robustness. This paper instead takes an applied perspective, proposing and consolidating standard performance criteria for testing and analysing control systems, intended for researchers and students. The proposed framework ensures the post-design applicability of a black-box algorithm, aligning with modern data analysis and $V\&V$ perspectives to prevent resource allocation to systems with limited impact or imprecise claims.
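As an illustration of the first two metric categories, the sketch below computes common textbook tracking integrals (IAE, ISE, ITAE) and a control-energy integral from uniformly sampled signals using a rectangle rule. These are standard choices assumed for illustration, not necessarily the criteria the paper consolidates; robustness metrics (for example gain and phase margins) need a model or frequency-response data and are not computed here.

```python
def performance_metrics(dt, reference, output, control):
    """Tracking and effort integrals from uniformly sampled signals (rectangle rule)."""
    errors = [r - y for r, y in zip(reference, output)]
    iae = sum(abs(e) for e in errors) * dt                           # integral of |e|
    ise = sum(e * e for e in errors) * dt                            # integral of e^2
    itae = sum(k * dt * abs(e) for k, e in enumerate(errors)) * dt   # integral of t*|e|
    energy = sum(u * u for u in control) * dt                        # integral of u^2
    return {"IAE": iae, "ISE": ise, "ITAE": itae, "control_energy": energy}

# Toy case: a constant tracking error of 1 and a constant input of 2 over 1 s.
m = performance_metrics(0.1, [1.0] * 10, [0.0] * 10, [2.0] * 10)
```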
- [160] arXiv:2503.19046 (replaced) [pdf, html, other]
-
Title: Learning Beamforming Codebooks for Active Sensing with Reconfigurable Intelligent SurfaceComments: Accepted in IEEE Transactions on Wireless CommunicationsSubjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
This paper explores the design of beamforming codebooks for the base station (BS) and for the reconfigurable intelligent surfaces (RISs) in an active sensing scheme for uplink localization, in which the mobile user transmits a sequence of pilots to the BS through reflection at the RISs, and the BS and the RISs are adaptively configured by carefully choosing the BS beamforming codeword and RIS codewords from their respective codebooks in a sequential manner to progressively focus onto the user. Most existing codebook designs for RIS are not tailored for active sensing, by which we mean that the choice of the next codeword should depend on the measurements made so far, and the sequence of codewords should dynamically focus reflection toward the user. Moreover, most existing codeword selection methods rely on exhaustive search in beam training to identify the codeword with the highest signal-to-noise ratio (SNR), thus incurring substantial pilot overhead as the size of the codebook scales. This paper proposes a learning-based approach for codebook construction and for codeword selection for active sensing. The proposed learning approach aims to locate a target in the service area by recursively selecting a sequence of BS beamforming codewords and RIS codewords from the respective codebooks as more measurements become available, without exhaustive beam training. The codebook design and the codeword selection fuse key ideas from the vector quantized variational autoencoder (VQ-VAE) and the long short-term memory (LSTM) network to learn, respectively, the discrete function space of the codebook and the temporal dependencies between measurements.
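The VQ-VAE ingredient boils down to a nearest-codeword lookup: a continuous latent vector is replaced by the closest entry of a learned codebook. The sketch below shows only that forward quantisation step; training the codebook (codebook and commitment losses, straight-through gradients) is omitted, and reading each codeword index as a BS or RIS configuration is our interpretation of the abstract.

```python
import numpy as np

def quantize(z, codebook):
    """VQ-VAE style quantisation: map each latent vector to its nearest codeword.
    z: (batch, d) latents; codebook: (K, d) codewords. Returns indices and codewords."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (batch, K) squared distances
    idx = d.argmin(axis=1)
    return idx, codebook[idx]
```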
- [161] arXiv:2503.20274 (replaced) [pdf, html, other]
-
Title: Near-Field THz Bending Beamforming: A Convex Optimization PerspectiveSubjects: Signal Processing (eess.SP)
Terahertz (THz) communication systems suffer from severe blockage issues, which may significantly degrade communication coverage and quality. Bending beams, capable of adjusting their propagation direction to bypass obstacles, have recently emerged as a promising solution to this issue by engineering the propagation trajectory of the beam. However, traditional bending beam generation methods rely heavily on the specific geometric properties of the propagation trajectory and can only achieve sub-optimal performance. In this paper, we propose a new and general bending beamforming method based on convex optimization techniques. In particular, we formulate the bending beamforming design as a max-min optimization problem that optimizes the analog or digital transmit beamforming vector to maximize the minimum received signal power among all positions along the bending beam trajectory. However, the resulting problem is non-convex and difficult to solve optimally. To tackle this difficulty, we apply the successive convex approximation (SCA) technique to obtain a high-quality suboptimal solution. Numerical results show that our proposed bending beamforming method outperforms the traditional method and is robust to obstacles in the environment.
- [162] arXiv:2503.21040 (replaced) [pdf, html, other]
-
Title: Local Stability and Stabilization of Quadratic-Bilinear Systems using Petersen's LemmaSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Quadratic-bilinear (QB) systems arise in many areas of science and engineering. In this paper, we present a scalable approach for designing locally stabilizing state-feedback control laws and certifying the local stability of QB systems. Sufficient conditions are established for local stability and stabilization based on quadratic Lyapunov functions, which also provide ellipsoidal inner-estimates for the region of attraction and region of stabilizability of an equilibrium point. Our formulation exploits Petersen's Lemma to convert the problem of certifying the sign-definiteness of the Lyapunov condition into a line search over a single scalar parameter. The resulting linear matrix inequality (LMI) conditions scale quadratically with the state dimension for both stability analysis and control synthesis, thus enabling analysis and control of QB systems with hundreds of state variables without resorting to specialized implementations. We demonstrate the approach on three benchmark problems from the existing literature. In all cases, we find our formulation yields comparable approximations of stability domains as determined by other established tools that are otherwise restricted to systems with up to tens of state variables.
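The single scalar in the line search comes from Petersen's Lemma. A commonly cited strict form is stated below for a symmetric G and nonzero matrices M, N; the paper's exact statement, and how G, M and N depend on the Lyapunov matrix, are not given in the abstract, so this is context rather than the authors' derivation.

```latex
G + M\Delta N + N^{\top}\Delta^{\top}M^{\top} \prec 0
\quad \forall\,\Delta:\ \Delta^{\top}\Delta \preceq I
\iff
\exists\,\lambda > 0:\ G + \lambda M M^{\top} + \lambda^{-1} N^{\top} N \prec 0

% For fixed \lambda > 0, a Schur complement gives an equivalent LMI:
\begin{bmatrix} G + \lambda M M^{\top} & N^{\top} \\ N & -\lambda I \end{bmatrix} \prec 0
```

Sweeping λ and solving one LMI feasibility problem per value is what makes the search one-dimensional.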
- [163] arXiv:2503.22489 (replaced) [pdf, other]
-
Title: Energy-efficient UAV movement and user-UAV association in multi-UAV networksComments: Submitted for a possible publicationSubjects: Systems and Control (eess.SY)
Unmanned aerial vehicle (UAV)-based millimeter wave (mmWave) communication systems have recently drawn considerable attention due to the increasing demand for higher data rates. Given the susceptibility of mmWave signals to obstacles and their high propagation loss, ensuring line-of-sight (LoS) connectivity is critical for maintaining robust and efficient communication. Furthermore, UAVs have limited power resources and a limited capacity in terms of the number of users each can serve. Most significantly, different users have different delay requirements, and they keep moving while interacting with the UAVs. In this paper, we first provide an efficient solution for the optimal movement of the UAVs that takes into account the energy efficiency of the UAVs as well as the mobility and delay priority of the users. Next, we propose a greedy solution for the user-UAV assignment. Finally, numerical results show how well the proposed solution performs compared with current benchmarks in terms of the delay suffered by the users, the number of unserved users, and the energy efficiency of the UAVs.
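The abstract does not describe the greedy rule, so the following is a hypothetical sketch of one natural greedy association: serve users in decreasing order of delay priority, give each the best-scoring UAV that still has free capacity, and leave a user unserved when no UAV is available. The `score` callback (for example an LoS link-quality value, or `None` when there is no LoS) is an assumption.

```python
def greedy_assign(users, uavs, capacity, score):
    """Hypothetical greedy user-UAV association.
    users: list of (user_id, priority), higher priority = tighter delay requirement.
    uavs: list of UAV ids; capacity: max users per UAV.
    score(user_id, uav_id) -> float (higher is better) or None if the link is unusable."""
    load = {k: 0 for k in uavs}
    assignment = {}
    for uid, _prio in sorted(users, key=lambda x: -x[1]):   # most urgent users first
        options = [k for k in uavs if load[k] < capacity and score(uid, k) is not None]
        if not options:
            continue                                         # user stays unserved
        best = max(options, key=lambda k: score(uid, k))
        assignment[uid] = best
        load[best] += 1
    return assignment
```

A greedy pass like this runs in roughly O(users × UAVs) after sorting, which is the usual reason for preferring it over an exact assignment solver when users move and the association must be recomputed often.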
- [164] arXiv:2210.08339 (replaced) [pdf, html, other]
-
Title: Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for Deep-Learned Control SystemsComments: Submitted to IEEE Transactions on Neural Networks and Learning Systems. arXiv admin note: text overlap with arXiv:2011.11609Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Neural networks are increasingly used in robotics as policies, state transition models, state estimation models, or all of the above. With these components being learned from data, it is important to be able to analyze what behaviors were learned and how this affects closed-loop performance. In this paper, we take steps toward this goal by developing methods for computing control invariant sets and regions of attraction (ROAs) of dynamical systems represented as neural networks. We focus our attention on feedforward neural networks with the rectified linear unit (ReLU) activation, which are known to implement continuous piecewise-affine (PWA) functions. We describe the Reachable Polyhedral Marching (RPM) algorithm for enumerating the affine pieces of a neural network through an incremental connected walk. We then use this algorithm to compute exact forward and backward reachable sets, from which we provide methods for computing control invariant sets and ROAs. Our approach is unique in that we find these sets incrementally, without Lyapunov-based tools. In our examples, we demonstrate the ability of our approach to find non-convex control invariant sets and ROAs on tasks with learned van der Pol oscillator and pendulum models. Further, we provide an accelerated algorithm for computing ROAs that leverages the incremental and connected enumeration of affine regions that RPM provides. We show that this acceleration yields a 15x speedup in our examples. Finally, we apply our methods to find a set of states that are stabilized by an image-based controller for an aircraft runway control problem.
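The piecewise-affine property that RPM exploits can be made concrete: fixing the ReLU activation pattern at an input x yields an affine map W_eff z + b_eff that equals the network on the whole polyhedral region sharing that pattern. The sketch computes that local affine piece for one point; enumerating neighbouring regions (the "marching" itself) is not implemented.

```python
import numpy as np

def affine_piece(weights, biases, x):
    """Return (W_eff, b_eff, output) such that the ReLU network equals W_eff @ z + b_eff
    on the region where the activation pattern matches that of x."""
    W_eff = np.eye(len(x))
    b_eff = np.zeros(len(x))
    h = np.asarray(x, dtype=float)
    for layer, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ h + b                      # pre-activation at x
        W_eff = W @ W_eff                    # compose the affine maps
        b_eff = W @ b_eff + b
        if layer < len(weights) - 1:         # hidden layers apply ReLU; output layer is linear
            mask = (pre > 0).astype(float)   # activation pattern defines the region
            W_eff = mask[:, None] * W_eff
            b_eff = mask * b_eff
            h = np.maximum(pre, 0.0)
        else:
            h = pre
    return W_eff, b_eff, h
```

The region itself is the polyhedron on which every pre-activation keeps the sign it has at x, i.e. a set of linear inequalities in z.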
- [165] arXiv:2301.06227 (replaced) [pdf, html, other]
-
Title: General Distribution Steering: A Sub-Optimal Solution by Convex OptimizationComments: 16 pages, 23 figuresSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
General distribution steering is intrinsically an infinite-dimensional problem when the continuous distributions to be steered are arbitrary. In [42], we put forward a moment representation of the primal system for control. However, the system trajectory there was predetermined rather than optimized toward a design criterion, which does not always ensure the most satisfactory solution. In this paper, we propose an optimization approach to the general distribution steering problem for first-order discrete-time linear systems, i.e., an optimal control law for the corresponding moment system. The domain of all feasible control inputs is non-convex and has a complex topology. We obtain a subset of it by minimizing a weighted sum of squared integral distances along the system trajectory. This feasible subset is then proved to be convex, and the optimal control problem can be treated as a convex optimization problem or solved by exhaustive search, depending on the type of the cost function. Algorithms for steering continuous and discrete distributions, respectively, are then put forward by adopting a realization scheme for the control inputs. We also demonstrate an explicit advantage of our proposed algorithm, based on truncated power moments, over the prevailing Gaussian Mixture Models. Experiments with different types of cost functions are given to validate the performance of our proposed algorithm. Since the moment system is a dimension-reduced counterpart of the primal system, we call this solution a sub-optimal one to the primal general distribution steering problem.
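To illustrate what a moment system is, assume a scalar first-order system x_{k+1} = a x_k + u_k with a random input u_k independent of x_k. The binomial expansion then gives E[x_{k+1}^n] = Σ_j C(n, j) a^j E[x_k^j] E[u_k^(n-j)], so truncated power moments propagate through a finite-dimensional map. This is a sketch under that independence assumption; the formulation in [42] may differ in detail.

```python
from math import comb

def propagate_moments(x_moments, u_moments, a):
    """x_moments[n] = E[x_k^n] and u_moments[n] = E[u_k^n] for n = 0..N (index 0 equals 1).
    Returns E[x_{k+1}^n] for x_{k+1} = a * x_k + u_k, with u_k independent of x_k."""
    N = len(x_moments) - 1
    return [
        sum(comb(n, j) * a**j * x_moments[j] * u_moments[n - j] for j in range(n + 1))
        for n in range(N + 1)
    ]
```

For example, a standard normal state (moments 1, 0, 1, 0, 3) with a = 2 and zero input yields moments 1, 0, 4, 0, 48, i.e. those of N(0, 4).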
- [166] arXiv:2308.03240 (replaced) [pdf, html, other]
-
Title: Carbon-Aware Optimal Power FlowSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
To facilitate effective decarbonization of the electric power sector, this paper introduces the generic Carbon-aware Optimal Power Flow (C-OPF) method for power system decision-making that considers demand-side carbon accounting and emission management. Built upon the classic optimal power flow (OPF) model, the C-OPF method incorporates carbon emission flow equations and constraints, as well as carbon-related objectives, to jointly optimize power flow and carbon flow. In particular, this paper establishes the feasibility and solution uniqueness of the carbon emission flow equations, and proposes modeling and linearization techniques to address the issues of undetermined power flow directions and bilinear terms in the C-OPF model. Additionally, two novel carbon emission models, together with the carbon accounting schemes, for energy storage systems are developed and integrated into the C-OPF model. Numerical simulations demonstrate the characteristics and effectiveness of the C-OPF method, in comparison with OPF solutions.
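The carbon emission flow equations attach a carbon intensity w_i (tCO2/MWh) to every bus by proportional sharing: the carbon leaving a bus is the mix of what is generated there and what flows in. The sketch below solves these linear equations for given, lossless, directed branch flows; it illustrates the standard proportional-sharing rule rather than the paper's full C-OPF model, and it requires every bus to have nonzero inflow or generation for the system to be nonsingular.

```python
import numpy as np

def nodal_carbon_intensity(gen, gen_intensity, flows):
    """gen[i]: generation at bus i (MW); gen_intensity[i]: its intensity (tCO2/MWh);
    flows[j][i]: power flowing from bus j into bus i (MW, >= 0, lossless).
    Solves (gen_i + sum_j flows[j][i]) * w_i - sum_j flows[j][i] * w_j = gen_i * e_i."""
    gen = np.asarray(gen, dtype=float)
    F = np.asarray(flows, dtype=float)
    inflow = gen + F.sum(axis=0)          # total power entering each bus
    A = np.diag(inflow) - F.T             # row i: inflow_i * w_i - sum_j F[j, i] * w_j
    return np.linalg.solve(A, gen * np.asarray(gen_intensity, dtype=float))

# Bus 0: 100 MW at 0.8 tCO2/MWh, all exported to bus 1; bus 1: 100 MW of zero-carbon generation.
w = nodal_carbon_intensity([100.0, 100.0], [0.8, 0.0], [[0.0, 100.0], [0.0, 0.0]])
```

Here bus 1 receives equal amounts of 0.8 and 0.0 tCO2/MWh power, so its intensity is 0.4.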
- [167] arXiv:2310.10545 (replaced) [pdf, other]
-
Title: Optimal vintage factor analysis with deflation varimaxSubjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Vintage factor analysis is an important type of factor analysis that aims first to find a low-dimensional representation of the original data and then to seek a rotation such that the rotated low-dimensional representation is scientifically meaningful. The most widely used vintage factor analysis is Principal Component Analysis (PCA) followed by the varimax rotation. Despite its popularity, few theoretical guarantees are available to date, mainly because the varimax rotation requires solving a non-convex optimization over the set of orthogonal matrices.
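For reference, the classical varimax rotation mentioned above is usually computed by iterating an SVD-based update of the whole orthogonal matrix at once. The sketch below is that standard simultaneous algorithm, not the deflation variant proposed in the paper, which solves for one row of the orthogonal matrix at a time.

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Classical (simultaneous) varimax rotation of a p x k loading matrix Phi.
    Returns the rotated loadings Phi @ R and the orthogonal rotation R."""
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = Phi @ R
        # gradient-like matrix of the varimax criterion at the current rotation
        grad = Phi.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt                         # nearest orthogonal matrix to grad
        d_old, d = d, s.sum()
        if d_old != 0 and d < d_old * (1 + tol):
            break                          # criterion stopped improving
    return Phi @ R, R
```

The update is guaranteed to return an orthogonal R, but the underlying problem is non-convex, which is exactly why guarantees for this step are hard to obtain.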
In this paper, we propose a deflation varimax procedure that solves for each row of an orthogonal matrix sequentially. In addition to its net computational gain and flexibility, we are able to fully establish theoretical guarantees for the proposed procedure in a broader context. Adopting this new deflation varimax as the second step after PCA, we further analyze this two-step procedure under a general class of factor models. Our results show that it estimates the factor loading matrix at the minimax optimal rate when the signal-to-noise ratio (SNR) is moderate or large. In the low SNR regime, we offer a possible improvement over PCA followed by deflation varimax when the additive noise under the factor model is structured. The modified procedure is shown to be minimax optimal in all SNR regimes. Our theory is valid for finite samples and allows the number of latent factors to grow with the sample size, as well as the ambient dimension to grow with, or even exceed, the sample size. Extensive simulation and real data analysis further corroborate our theoretical findings.
- [168] arXiv:2312.01970 (replaced) [pdf, html, other]
-
Title: Cascade Reinforcement Learning with State Space Factorization for O-RAN-based Traffic SteeringComments: 9 pages, 8 figuresSubjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
The Open Radio Access Network (O-RAN) architecture empowers intelligent and automated optimization of the RAN through applications deployed on the RAN Intelligent Controller (RIC) platform, enabling capabilities beyond what is achievable with traditional RAN solutions. Within this paradigm, Traffic Steering (TS) emerges as a pivotal RIC application that focuses on optimizing cell-level mobility settings in near-real-time, aiming to significantly improve network spectral efficiency. In this paper, we design a novel TS algorithm based on a Cascade Reinforcement Learning (CaRL) framework. We propose state space factorization and policy decomposition to reduce the need for large models and well-labeled datasets. For each sub-state space, an RL sub-policy is trained to learn an optimized mapping onto the action space. To apply CaRL to new network regions, we propose a knowledge transfer approach to initialize a new sub-policy based on knowledge learned by the trained policies. To evaluate CaRL, we build a data-driven and scalable RIC digital twin (DT) that is modeled using important real-world data, including network configuration, user geo-distribution, and traffic demand, among others, from a tier-1 mobile operator in the US. We evaluate CaRL on two DT scenarios representing two network clusters in two different cities and compare its performance with the business-as-usual (BAU) policy and other competing optimization approaches using heuristic and Q-table algorithms. Benchmarking results show that CaRL performs the best and improves the average cluster-aggregated downlink throughput over the BAU policy by 24% and 18% in these two scenarios, respectively.
- [169] arXiv:2402.01116 (replaced) [pdf, html, other]
-
Title: Scalable Multi-modal Model Predictive Control via Duality-based Interaction PredictionsComments: Accepted at IEEE Intelligent Vehicles Symposium 2024Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
We propose a hierarchical architecture designed for scalable real-time Model Predictive Control (MPC) in complex, multi-modal traffic scenarios. This architecture comprises two key components: 1) RAID-Net, a novel attention-based Recurrent Neural Network that predicts relevant interactions along the MPC prediction horizon between the autonomous vehicle and the surrounding vehicles using Lagrangian duality, and 2) a reduced Stochastic MPC problem that eliminates irrelevant collision avoidance constraints, enhancing computational efficiency. Our approach is demonstrated in a simulated traffic intersection with interactive surrounding vehicles, showcasing a 12x speed-up in solving the motion planning problem. A video demonstrating the proposed architecture in multiple complex traffic scenarios can be found here: this https URL. GitHub: this https URL
- [170] arXiv:2403.02963 (replaced) [pdf, html, other]
-
Title: Opportunistic User Scheduling for Secure RIS-aided Wireless CommunicationsSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, we provide expressions for the secrecy outage probability (SOP) for suboptimal and optimal opportunistic scheduling schemes in a reconfigurable intelligent surface (RIS) aided single-antenna system with multiple eavesdroppers in approximate closed form. A suboptimal scheduling (SS) scheme is analyzed, which is used when the channel state information (CSI) of the eavesdropping links is unavailable, and the optimal scheduling (OS) scheme is also analyzed, which is used when the global CSI is available. For each scheme, we provide a simplified expression for the SOP in the high signal-to-noise ratio (SNR) regime to demonstrate its behavior as a function of the key system parameters. At high SNR, the SOP saturates to a constant level which decreases exponentially with the number of RIS elements in the SS scheme and with the product of the number of RIS elements and the number of users in the OS scheme. We also show that the derived SOP of the SS scheme can directly provide the SOP for the best antenna-user pair scheduling scheme in a multiple antenna system. We compare the performance of the opportunistic user scheduling schemes with that of a non-orthogonal multiple access (NOMA) based scheduling scheme which chooses a pair of users in each time slot for scheduling and we show that the opportunistic schemes outperform the NOMA-based scheme. We also derive a closed-form expression for the SOP of a decode-and-forward (DF) relay-aided scheduling scheme in order to compare it with that of the RIS-aided system. It is found that the RIS-aided system outperforms the relay-aided systems when the number of RIS elements is sufficiently large. An increased number of RIS elements is required to outperform the relay-aided system at higher operating frequencies.
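Closed-form SOP expressions are commonly sanity-checked by Monte Carlo simulation. The sketch below estimates the SOP, P(C_s < R_s) with C_s = log2(1 + SNR_B) - log2(1 + SNR_E), for one user and one eavesdropper behind an N-element RIS. It rests on simplifying assumptions that are ours, not the paper's: unit-variance Rayleigh links, RIS phases co-phased to the legitimate user, and the eavesdropper's cascaded coefficients modelled as independent with random phases. It does not reproduce the SS or OS schemes.

```python
import math
import random

def sop_monte_carlo(n_elements, snr, rate, trials=5000, seed=1):
    """Monte Carlo secrecy outage probability under the simplified RIS model above."""
    rng = random.Random(seed)

    def cn():  # circularly-symmetric complex Gaussian with unit variance
        return complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))

    outages = 0
    for _ in range(trials):
        legit = sum(abs(cn()) * abs(cn()) for _ in range(n_elements))  # coherent combining
        eve = abs(sum(cn() * cn() for _ in range(n_elements)))          # random-phase sum
        secrecy = math.log2(1 + snr * legit**2) - math.log2(1 + snr * eve**2)
        if secrecy < rate:
            outages += 1
    return outages / trials
```

Under these assumptions the legitimate gain grows roughly with N squared while the eavesdropper's grows roughly with N, so the estimated SOP drops quickly as elements are added, which is qualitatively in line with the trend reported above.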
- [171] arXiv:2403.07156 (replaced) [pdf, other]
-
Title: On the Uniqueness of Participation Factors in Nonlinear Dynamical SystemsComments: AcceptedJournal-ref: Journal of Control Theory and Applications, 2025Subjects: Dynamical Systems (math.DS); Systems and Control (eess.SY)
In the modal analysis and control of nonlinear dynamical systems, the participation factors of state variables with respect to a critical or selected mode serve as a pivotal tool for simplifying stability studies by focusing on a subset of highly influential state variables. For linear systems, the participation factors of state variables regarding a mode are uniquely determined by the mode's composition and shape, defined by the system's left and right eigenvectors, respectively. However, the uniqueness of other types of participation factors necessitates further investigation. This paper establishes a sufficient condition for the uniqueness of nonlinear participation factors and five other variants of participation factors, accounting for uncertain scaling factors in a mode's shape and composition. These scaling factors arise from variations in the selection of physical units or the value ranges of state variables when analyzing and controlling real-world dynamical systems. Understanding the sufficient condition of the uniqueness is therefore crucial for the correct application of participation factors in practical scenarios. Additionally, the paper explores the relationship between perturbation magnitudes in state variables and the selection of optimal scaling factors.
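For the linear baseline, the participation factor of state k in mode i is p_ki = r_ki l_ik, where r_i and l_i are the right and left eigenvectors normalised so that l_i r_i = 1. The sketch below computes them with NumPy; rescaling r_i by c and l_i by 1/c leaves p_ki unchanged, which is the linear-case uniqueness the abstract contrasts with the nonlinear variants (not covered here).

```python
import numpy as np

def participation_factors(A):
    """Linear participation factors P[k, i] = R[k, i] * L[i, k] of state k in mode i,
    with left eigenvectors (rows of L) normalised so that L @ R = I."""
    eigvals, R = np.linalg.eig(A)   # columns of R are right eigenvectors
    L = np.linalg.inv(R)            # rows of L are the matching left eigenvectors
    return eigvals, R * L.T         # elementwise product gives P[k, i]
```

Because L @ R = R @ L = I, each column (mode) and each row (state) of P sums to one.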
- [172] arXiv:2405.01558 (replaced) [pdf, html, other]
-
Title: Configurable Holography: Towards Display and Scene AdaptationComments: 11 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optics (physics.optics)
Emerging learned holography approaches have enabled faster, higher-quality hologram synthesis, setting a new milestone toward practical holographic displays. However, these learned models require training a dedicated model for each set of display-scene parameters. To address this shortcoming, our work introduces a highly configurable learned model structure, synthesizing 3D holograms interactively while supporting diverse display-scene parameters. Our family of models relying on this structure can be conditioned continuously for varying novel scene parameters, including input images, propagation distances, volume depths, peak brightnesses, and novel display parameters of pixel pitches and wavelengths. Uniquely, our findings unearth a correlation between depth estimation and hologram synthesis tasks in the learning domain, leading to a learned model that unlocks accurate 3D hologram generation from 2D images across varied display-scene parameters. We validate our models by synthesizing high-quality 3D holograms in simulations and also verify our findings with two different holographic display prototypes. Moreover, our family of models can synthesize holograms with a 2x speed-up compared to the state-of-the-art learned holography approaches in the literature.
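The display-scene parameters named above (propagation distance, pixel pitch, wavelength) are exactly the inputs of conventional free-space propagation. As context, the sketch below is the standard angular spectrum method; it is a non-learned physical model, and the paper's learned models and simulator may differ.

```python
import numpy as np

def angular_spectrum_propagate(field, distance, wavelength, pixel_pitch):
    """Propagate a complex 2D field by `distance` (m) using the angular spectrum method.
    Evanescent components (fx^2 + fy^2 > 1 / wavelength^2) are discarded."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_pitch)              # spatial frequencies (1/m)
    fy = np.fft.fftfreq(ny, d=pixel_pitch)
    FX, FY = np.meshgrid(fx, fy)                        # shape (ny, nx)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))      # axial wavenumber
    H = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)  # transfer function
    return np.fft.ifft2(np.fft.fft2(field) * H)
```

With an 8 µm pitch and a 515 nm wavelength every sampled frequency propagates, so the transfer function has unit magnitude and total energy is preserved.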
- [173] arXiv:2405.01919 (replaced) [pdf, html, other]
-
Title: Channel Orthogonalization in Panel-Based LISComments: 6 pages, 3 figures. This work was presented at IEEE WCNC 2025, copyright has been transferred to IEEESubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Large intelligent surface (LIS) has gained momentum as a potential 6G-enabling technology that expands the benefits of massive multiple-input multiple-output (MIMO). On the other hand, orthogonal space-division multiplexing (OSDM) may give a promising direction for efficient exploitation of the spatial resources, analogous to what is achieved with orthogonal frequency-division multiplexing (OFDM) in the frequency domain. To this end, we study how to enforce channel orthogonality in a panel-based LIS (P-LIS) scenario. Our proposed method consists of having a subset of active LIS-panels coherently serving a set of users, and another subset of LIS-panels operating in a novel low-power mode by implementing a receive and re-transmit (RRTx) process. This results in an inter-symbol interference (ISI) channel, where we characterize the RRTx processing required to achieve simultaneous orthogonality in time and space. We then employ the remaining degrees of freedom (DoFs) from the orthogonality constraint to minimize the RRTx processing power, where we derive a closed-form global minimizer, allowing for efficient implementation of the proposed scheme.
- [174] arXiv:2407.02264 (replaced) [pdf, html, other]
-
Title: SOAF: Scene Occlusion-aware Neural Acoustic FieldSubjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper tackles the problem of novel view audio-visual synthesis along an arbitrary trajectory in an indoor scene, given the audio-video recordings from other known trajectories of the scene. Existing methods often overlook the effect of room geometry, particularly wall occlusions on sound propagation, making them less accurate in multi-room environments. In this work, we propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a global prior for the sound field using distance-aware parametric sound-propagation modeling and then transforms it based on the scene structure learned from the input video. We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate binaural audio for novel views with a direction-aware attention mechanism. Extensive experiments on the real dataset RWAVS and the synthetic dataset SoundSpaces demonstrate that our method outperforms previous state-of-the-art techniques in audio generation.
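A Fibonacci sphere is a simple way to spread N directions almost uniformly over the unit sphere, which is presumably why it suits sampling the local acoustic field around the receiver. The sketch below generates the standard golden-angle lattice; how SOAF extracts features along these directions is not shown.

```python
import math

def fibonacci_sphere(n):
    """Return n nearly uniformly spread unit vectors (x, y, z) on the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)         # radius of the horizontal circle at height z
        theta = golden_angle * i           # rotate by the golden angle each step
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points
```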
- [175] arXiv:2407.16623 (replaced) [pdf, html, other]
-
Title: Inverse Particle FilterComments: 16 pages, 5 figures, 4 tablesSubjects: Optimization and Control (math.OC); Signal Processing (eess.SP); Systems and Control (eess.SY); Machine Learning (stat.ML)
In cognitive systems, recent emphasis has been placed on studying the cognitive processes of the subject whose behavior was the primary focus of the system's cognitive response. This approach, known as inverse cognition, arises in counter-adversarial applications and has motivated the development of inverse Bayesian filters. In this context, a cognitive adversary, such as a radar, uses a forward Bayesian filter to track its target of interest. An inverse filter is then employed to infer the adversary's estimate of the target's or defender's state. Previous studies have addressed this inverse filtering problem by introducing methods such as the inverse Kalman filter (KF), inverse extended KF, and inverse unscented KF. However, these filters typically assume additive Gaussian noise models and/or rely on local approximations of non-linear dynamics at the state estimates, limiting their practical application. In contrast, this paper adopts a global filtering approach and presents the development of an inverse particle filter (I-PF). The particle filter framework employs Monte Carlo (MC) methods to approximate arbitrary posterior distributions. Moreover, under mild system-level conditions, the proposed I-PF is shown to converge to the optimal inverse filter. Additionally, we propose the differentiable I-PF to address scenarios where system information is unknown to the defender. Using the recursive Cramér-Rao lower bound and the non-credibility index (NCI), our numerical experiments on different systems demonstrate the estimation performance and time complexity of the proposed filter.
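The I-PF builds on the particle filter framework described above. For readers unfamiliar with it, here is a minimal sketch of a standard forward bootstrap particle filter (propagate, weight, resample) on an assumed scalar linear-Gaussian model; it is not the inverse filter from the paper, and the model parameters `a`, `q`, `r` are hypothetical.

```python
import numpy as np

def bootstrap_pf(ys, n_particles=1000, a=0.9, q=0.5, r=1.0, seed=0):
    """Bootstrap PF for x_k = a x_{k-1} + w_k, y_k = x_k + v_k with w ~ N(0, q), v ~ N(0, r)."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)   # samples from an assumed N(0, 1) prior
    estimates = []
    for y in ys:
        # 1) propagate each particle through the transition model
        particles = a * particles + rng.normal(0.0, np.sqrt(q), n_particles)
        # 2) weight by the Gaussian likelihood p(y | x), computed in log space for stability
        log_w = -0.5 * (y - particles) ** 2 / r
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # 3) Monte Carlo estimate of the posterior mean (before resampling)
        estimates.append(float(np.dot(w, particles)))
        # 4) systematic resampling to counter weight degeneracy
        positions = (rng.random() + np.arange(n_particles)) / n_particles
        idx = np.searchsorted(np.cumsum(w), positions)
        particles = particles[np.minimum(idx, n_particles - 1)]
    return np.array(estimates)
```

Because the weights come from an arbitrary likelihood evaluation, the same loop handles non-Gaussian noise and nonlinear dynamics, which is the property the abstract contrasts with KF-based inverse filters.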
- [176] arXiv:2408.04837 (replaced) [pdf, html, other]
-
Title: Multi-User MISO with Stacked Intelligent Metasurfaces: A DRL-Based Sum-Rate Optimization Approach
Comments: 15 pages, 13 figures, 3 tables. arXiv admin note: text overlap with arXiv:2402.09006
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Stacked intelligent metasurfaces (SIMs) represent a novel signal processing paradigm that enables over-the-air processing of electromagnetic waves at the speed of light. Their multi-layer architecture offers more customizable computational capabilities than conventional single-layer reconfigurable intelligent surfaces and metasurface lenses. In this paper, we deploy an SIM to improve the performance of multi-user multiple-input single-output (MISO) wireless systems in a low-complexity manner with a reduced number of transmit radio frequency chains. In particular, we present an optimization formulation for the joint design of the SIM phase shifts and the transmit power allocation, which is efficiently tackled via a customized deep reinforcement learning (DRL) approach that systematically explores pre-designed states of the SIM-parametrized smart wireless environment. The presented performance evaluation results demonstrate the proposed method's capability to effectively learn from the wireless environment, while consistently outperforming conventional precoding schemes under low transmit power conditions. Furthermore, hyperparameter tuning and a whitening process significantly enhance the robustness of the proposed DRL framework.
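The title names the sum rate as the optimization objective. The helper below computes the standard sum rate of a K-user MISO downlink with linear precoding; it assumes, hypothetically, that the SIM phase shifts have already been folded into an effective channel matrix `H`, and that power allocation is absorbed into the precoder columns of `W`.

```python
import numpy as np

def miso_sum_rate(H, W, noise_power):
    """Sum rate in bits/s/Hz of a K-user MISO downlink with linear precoding.

    H: (K, N) array, row k is user k's effective channel (assumed to include the SIM).
    W: (N, K) array, column j is the precoder (with its power) for user j.
    """
    G = np.abs(H @ W) ** 2                 # G[k, j] = received power at user k from stream j
    signal = np.diag(G)                    # desired-stream power per user
    interference = G.sum(axis=1) - signal  # power from all other users' streams
    sinr = signal / (interference + noise_power)
    return float(np.sum(np.log2(1.0 + sinr)))

# Two users on orthogonal channels, each with power 3 and unit noise: 2 * log2(4) = 4 bits/s/Hz.
rate = miso_sum_rate(np.eye(2), np.sqrt(3.0) * np.eye(2), noise_power=1.0)
```

In a DRL setup such a value would be a natural reward signal, though the paper's exact reward design is not given in the abstract.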
- [177] arXiv:2409.15132 (replaced) [pdf, html, other]
-
Title: FusionRF: High-Fidelity Satellite Neural Radiance Fields from Multispectral and Panchromatic Acquisitions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
We introduce FusionRF, a novel framework for digital surface reconstruction from satellite multispectral and panchromatic images. Recent work has demonstrated that neural photogrammetry reconstructs surfaces from optical satellite images more accurately than algorithmic methods. Common satellites produce both a panchromatic and a multispectral image, which carry high spatial and high spectral resolution, respectively. Current neural reconstruction methods require multispectral images to be upsampled with a pansharpening method using the spatial data in the panchromatic image. However, these methods may introduce biases and hallucinations due to domain gaps. FusionRF introduces joint image fusion during optimization through a novel cross-resolution kernel that learns to resolve the spatial resolution loss present in multispectral images. As input, FusionRF accepts the original multispectral and panchromatic data, eliminating the need for image preprocessing. FusionRF also leverages multimodal appearance embeddings that encode the image characteristics of each modality and view within a uniform representation. By optimizing on both modalities, FusionRF learns to fuse image modalities while performing reconstruction tasks and eliminates the need for a pansharpening preprocessing step. We evaluate our method on multispectral and panchromatic satellite images from the WorldView-3 satellite in various locations, and show that FusionRF provides an average 17% improvement in depth reconstruction accuracy and renders sharp training and novel views.
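One way to read "a cross-resolution kernel that learns to resolve spatial resolution loss" is through the usual pansharpening observation model: a band rendered at panchromatic resolution is blurred by a kernel and decimated before being compared with the observed multispectral band. The sketch below implements that degradation and loss with a fixed kernel; the function names, the edge padding, and the idea that FusionRF's learned kernel plays exactly this role are all assumptions for illustration.

```python
import numpy as np

def degrade(hr_band, kernel, factor):
    """Blur a high-resolution band with `kernel` (odd size, assumed to sum to 1), then decimate.

    Implemented as correlation, which equals convolution for symmetric kernels.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(hr_band, pad, mode="edge")
    blurred = np.zeros_like(hr_band, dtype=float)
    for i in range(k):
        for j in range(k):
            blurred += kernel[i, j] * padded[i:i + hr_band.shape[0], j:j + hr_band.shape[1]]
    return blurred[::factor, ::factor]

def ms_loss(rendered_hr_band, observed_lr_band, kernel, factor):
    """Mean squared error between the degraded high-res rendering and the observed low-res band."""
    return float(np.mean((degrade(rendered_hr_band, kernel, factor) - observed_lr_band) ** 2))
```

In a learned variant, `kernel` would be a trainable parameter optimized jointly with the radiance field, so the low-resolution observation supervises a high-resolution rendering without any pansharpened input.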
- [178] arXiv:2409.16663 (replaced) [pdf, html, other]
-
Title: Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models
Authors: Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, Stan Birchfield, Nikolai Smolyanskiy
Comments: 8 pages, 6 figures, updated in March 2025, original published in September 2024, for ICRA 2025 submission, for associated video file, see this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
We propose the use of latent space generative world models to address the covariate shift problem in autonomous driving. A world model is a neural network capable of predicting an agent's next state given past states and actions. By leveraging a world model during training, the driving policy effectively mitigates covariate shift without requiring an excessive amount of training data. During end-to-end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations, so that at runtime it can recover from perturbations outside the training distribution. Additionally, we introduce a novel transformer-based perception encoder that employs multi-view cross-attention and a learned scene query. We present qualitative and quantitative results demonstrating significant improvements over the prior state of the art in closed-loop testing in the CARLA simulator, as well as the ability to handle perturbations in both CARLA and NVIDIA's DRIVE Sim.
- [179] arXiv:2410.08856 (replaced) [pdf, other]
-
Title: FlowMRI-Net: A Generalizable Self-Supervised 4D Flow MRI Reconstruction network
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV)
Background: Image reconstruction from highly undersampled 4D flow MRI data can be very time-consuming and, depending on the regularization, may result in significant underestimation of velocities, thereby limiting the applicability of the method. The objective of the present work was to develop a generalizable self-supervised deep learning-based framework for fast and accurate reconstruction of highly undersampled 4D flow MRI and to demonstrate the utility of the framework for aortic and cerebrovascular applications.
Methods: The proposed deep-learning-based framework, called FlowMRI-Net, employs physics-driven unrolled optimization using a complex-valued convolutional recurrent neural network and is trained in a self-supervised manner. The generalizability of the framework is evaluated using aortic and cerebrovascular 4D flow MRI acquisitions from systems of two different vendors for various undersampling factors (R = 8, 16, 24) and compared with state-of-the-art compressed sensing (CS-LLR) and deep learning-based (FlowVN) reconstructions. Evaluation includes an ablation study and a qualitative and quantitative analysis of image and velocity magnitudes.
Results: FlowMRI-Net outperforms CS-LLR and FlowVN for aortic 4D flow MRI reconstruction, resulting in significantly lower vectorial normalized root mean square error and mean directional errors for velocities in the thoracic aorta. Furthermore, the feasibility of FlowMRI-Net's generalizability is demonstrated for cerebrovascular 4D flow MRI reconstruction, where FlowVN cannot be trained due to the lack of high-quality reference data. Reconstruction times ranged from 3 to 7 minutes on commodity CPU/GPU hardware.
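The two velocity metrics in the Results can be sketched as follows. The exact definitions used in the paper are not given in the abstract; this assumes, for illustration, that the vectorial nRMSE normalizes the error norm over all voxels by the reference norm, and that the directional error is the mean of one minus the cosine similarity between estimated and reference velocity vectors.

```python
import numpy as np

def vector_nrmse(v, v_ref):
    """Assumed vectorial nRMSE: ||v - v_ref|| / ||v_ref||, taken over all voxels and components.

    v, v_ref: arrays of shape (..., 3) holding velocity vectors.
    """
    return float(np.linalg.norm(v - v_ref) / np.linalg.norm(v_ref))

def mean_directional_error(v, v_ref, eps=1e-12):
    """Assumed directional error: mean over voxels of 1 - cos(angle between v and v_ref)."""
    dot = np.sum(v * v_ref, axis=-1)
    norms = np.linalg.norm(v, axis=-1) * np.linalg.norm(v_ref, axis=-1)
    return float(np.mean(1.0 - dot / (norms + eps)))
```

Under these definitions the directional error ranges from 0 (aligned) to 2 (opposite), so it captures flow-direction mistakes that a pure magnitude error could miss.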
Conclusion: FlowMRI-Net enables fast and accurate reconstruction of highly undersampled aortic and cerebrovascular 4D flow MRI, with possible applications to other vascular territories.
- [180] arXiv:2410.10741 (replaced) [pdf, html, other]
-
Title: SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems.
To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges, relative to engineering experts, in compositional tasks that involve parameter selection. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.
- [181] arXiv:2410.17081 (replaced) [pdf, html, other]
-
Title: Continuous Speech Tokenizer in Text To Speech
Comments: NAACL 2025 Findings Poster
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability, since they are convenient for joint training with text and offer good compression efficiency. However, we found that discrete speech tokenizers still suffer from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer named Cont-SPT, and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS). This enhancement is attributed to the better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain. The code and resources for Cont-SPT can be found in this https URL.
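The information loss attributed to discrete tokens can be illustrated with a toy vector-quantization step: each continuous frame is replaced by its nearest codebook vector, so anything finer than the codebook spacing is discarded, whereas a continuous token would pass the frame through unchanged. This is a generic illustration with a random, hypothetical codebook, not Cont-SPT or any specific speech tokenizer.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to its nearest codebook vector (a discrete token).

    frames: (T, D) continuous features; codebook: (K, D).
    Returns the token indices and the quantized frames.
    """
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K) squared distances
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))       # stand-in for per-frame speech features
codebook = rng.normal(size=(16, 8))      # hypothetical 16-entry codebook
tokens, quantized = quantize(frames, codebook)
quantization_mse = float(np.mean((frames - quantized) ** 2))  # > 0: detail lost by discretization
```

Enlarging the codebook (keeping the old entries) can only lower this error, but it never reaches zero for generic continuous inputs.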
- [182] arXiv:2411.19417 (replaced) [pdf, html, other]
-
Title: Any-Resolution AI-Generated Image Detection by Spectral Learning
Comments: CVPR2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations. Code is available on this https URL.
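Masked spectral learning hides part of an image's spectrum and asks a model to reconstruct it. The sketch below shows only the frequency-masking side: a radial FFT mask that splits an image into low- and high-frequency components, plus a cosine similarity that could compare a reconstructed spectrum with the original. The reconstruction network is omitted, and the radial mask and `cutoff` parameter are illustrative assumptions rather than SPAI's actual masking scheme.

```python
import numpy as np

def radial_frequency_split(img, cutoff):
    """Split a 2D grayscale image into low- and high-frequency parts with a radial FFT mask.

    cutoff: radius as a fraction of the Nyquist frequency (0..1).
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]            # cycles per pixel, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5  # normalized so 1.0 is the Nyquist radius
    low_mask = radius <= cutoff                # mask is symmetric, so outputs are real
    spectrum = np.fft.fft2(img)
    low = np.real(np.fft.ifft2(spectrum * low_mask))
    high = np.real(np.fft.ifft2(spectrum * ~low_mask))
    return low, high

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Because the two masks partition the spectrum, `low + high` recovers the image exactly; a pretext task would feed one part to the model and score how well it predicts the other.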
- [183] arXiv:2412.00162 (replaced) [pdf, html, other]
-
Title: Dynamic High-Order Control Barrier Functions with Diffuser for Safety-Critical Trajectory Planning at Signal-Free Intersections
Comments: 11 figures, 5 tables, 15 pages
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Planning safe and efficient trajectories through signal-free intersections presents significant challenges for autonomous vehicles (AVs), particularly in dynamic, multi-task environments with unpredictable interactions and an increased possibility of conflicts. This study aims to address these challenges by developing a unified, robust, adaptive framework to ensure safety and efficiency across three distinct intersection movements: left turn, right turn, and straight ahead. Existing methods often struggle to reliably ensure safety and to effectively learn multi-task behaviors from demonstrations in such environments. This study proposes a safety-critical planning method that integrates Dynamic High-Order Control Barrier Functions (DHOCBF) with a diffusion-based model, called Dynamic Safety-Critical Diffuser (DSC-Diffuser). The DSC-Diffuser leverages task-guided planning to enhance efficiency, allowing the simultaneous learning of multiple driving tasks from real-world expert demonstrations. Moreover, the incorporation of goal-oriented constraints significantly reduces displacement errors, ensuring precise trajectory execution. To further ensure driving safety in dynamic environments, the proposed DHOCBF framework dynamically adjusts to account for the movements of surrounding vehicles, offering enhanced adaptability and reduced conservatism compared to traditional control barrier functions. Validity evaluations of DHOCBF, conducted through numerical simulations, demonstrate its robustness in adapting to variations in obstacle velocities, sizes, uncertainties, and locations, effectively maintaining driving safety across a wide range of complex and uncertain scenarios. Comprehensive performance evaluations demonstrate that DSC-Diffuser generates realistic, stable, and generalizable policies, providing flexibility and reliable safety assurance in complex multi-task driving scenarios.
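To make the high-order CBF idea concrete, here is a textbook second-order CBF safety filter for a 1D double integrator following a moving obstacle; it is not the paper's DHOCBF. With gap h = p_obs - p - d_safe, define psi1 = h' + alpha1 h and require psi2 = psi1' + alpha2 psi1 >= 0. Expanding with h'' = a_obs - u gives the acceleration bound u <= a_obs + (alpha1 + alpha2)(v_obs - v) + alpha1 alpha2 h. The obstacle's velocity enters the bound, which is the sense in which such a barrier "accounts for the movements of surrounding vehicles". All numbers are hypothetical, and actuator limits are ignored.

```python
def hocbf_accel_bound(h, v_rel, a_obs, alpha1, alpha2):
    """Upper bound on ego acceleration from psi2 >= 0 (h = gap minus margin, v_rel = v_obs - v)."""
    return a_obs + (alpha1 + alpha2) * v_rel + alpha1 * alpha2 * h

def simulate(u_nominal=1.0, steps=2000, dt=0.01, alpha1=1.0, alpha2=1.0, d_safe=5.0):
    """Ego wants to accelerate toward a slower obstacle; the filter clips the command."""
    p, v = 0.0, 10.0                       # ego position and speed
    p_obs, v_obs, a_obs = 50.0, 5.0, 0.0   # constant-velocity obstacle ahead
    h_hist = []
    for _ in range(steps):
        h = p_obs - p - d_safe
        h_hist.append(h)
        bound = hocbf_accel_bound(h, v_obs - v, a_obs, alpha1, alpha2)
        u = min(u_nominal, bound)          # minimally modify the nominal command
        p += v * dt                        # forward-Euler integration
        v += u * dt
        p_obs += v_obs * dt
        v_obs += a_obs * dt
    return h_hist
```

With the initial psi1 positive (here 35 + ... > 0), the clipped command keeps the gap h positive throughout, whereas the unfiltered nominal acceleration would drive the ego vehicle into the obstacle.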
- [184] arXiv:2412.08247 (replaced) [pdf, html, other]
-
Title: MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments, which undermines the stability of AV-TSE. Despite this challenge, humans can maintain attentional momentum over time, even when the target speaker is not visible. In this paper, we introduce Momentum Multi-modal target Speaker Extraction (MoMuSE), which retains a speaker identity momentum in memory, enabling the model to continuously track the target speaker. Designed for real-time inference, MoMuSE extracts the current speech window with guidance from both visual cues and a dynamically updated speaker momentum. Experimental results demonstrate that MoMuSE achieves significant improvements, particularly in scenarios with severe impairment of visual cues.
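The abstract does not state how the speaker momentum is updated. Purely as an illustration, one plausible reading is an exponential moving average of a unit-normalized speaker embedding that is refreshed window by window and simply carried over when the current window is judged unreliable. The `reliable` flag and the decay `beta` below are hypothetical.

```python
import numpy as np

def update_momentum(momentum, embedding, beta=0.9, reliable=True):
    """Hypothetical speaker-momentum update: EMA of unit-normalized embeddings.

    momentum, embedding: 1D arrays; if the current window is unreliable,
    the stored momentum is kept unchanged so the target can still be tracked.
    """
    if not reliable:
        return momentum
    mixed = beta * momentum + (1.0 - beta) * embedding
    return mixed / np.linalg.norm(mixed)
```

A larger `beta` makes the identity memory more persistent through long cue dropouts, at the cost of adapting more slowly when the target's voice characteristics drift.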
- [185] arXiv:2412.10625 (replaced) [pdf, html, other]
-
Title: Certainty-Equivalence Model Predictive Control: Stability, Performance, and Beyond
Comments: 16 pages with some proofs omitted for brevity; simulation is included. Submitted to IEEE Transactions on Automatic Control
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Handling model mismatch is a common challenge in model-based controller design, particularly in model predictive control (MPC). While robust MPC is effective in managing uncertainties, its conservatism often makes it less desirable in practice. Certainty-equivalence MPC (CE-MPC), which relies on a nominal model, offers an appealing alternative due to its design simplicity and low computational requirements. Unlike existing analyses, in which MPC has access to the true model, this paper investigates CE-MPC for uncertain nonlinear systems with input constraints and parametric uncertainty. The primary contributions of the paper are twofold. First, a novel perturbation analysis of the MPC value function is provided without relying on the common assumption of Lipschitz continuity of the stage cost; this makes the analysis better suited to the popular quadratic cost and broadly applicable to value function approximation, online model learning in MPC, and performance-driven MPC design. Second, stability and performance analyses of CE-MPC are provided, quantifying the suboptimality of CE-MPC compared to the infinite-horizon optimal controller with perfect model knowledge. The results provide valuable insights into how the prediction horizon and model mismatch jointly affect stability and performance. Furthermore, the general results are specialized to linear quadratic control, and a competitive-ratio bound is derived, serving as the first competitive-ratio bound for MPC of uncertain linear systems with input constraints and multiplicative uncertainty.
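The certainty-equivalence idea can be seen in a scalar linear-quadratic example: the controller is designed with a nominal input gain `b_nominal` and then applied to a plant whose true gain `b_true` differs (a multiplicative error). Input constraints are omitted in this sketch, so receding-horizon MPC reduces to a constant feedback gain obtained from a finite-horizon Riccati recursion. All numbers are hypothetical; this is not the paper's analysis.

```python
def riccati_gain(a, b, q, r, horizon):
    """First-step gain of an unconstrained finite-horizon scalar LQ problem (terminal weight q)."""
    p = q
    k = 0.0
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)   # optimal gain given the cost-to-go p
        p = q + a * a * p - a * b * p * k # Riccati update of the cost-to-go
    return k

def simulate_ce(a=1.2, b_true=1.0, b_nominal=0.8, q=1.0, r=0.1, horizon=10, x0=1.0, steps=50):
    """Design with the nominal model, apply to the true plant x+ = a x + b_true u."""
    k = riccati_gain(a, b_nominal, q, r, horizon)
    x = x0
    traj = [x]
    for _ in range(steps):
        u = -k * x                 # receding-horizon control: first gain applied at every step
        x = a * x + b_true * u     # the true plant, which the controller does not know
        traj.append(x)
    return k, traj

k, traj = simulate_ce()
```

Here the open-loop plant is unstable (a = 1.2), yet the CE controller still stabilizes it despite a 20% error in the input gain; larger mismatch or shorter horizons can break this, which is the kind of trade-off the paper quantifies.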
- [186] arXiv:2501.10629 (replaced) [pdf, html, other]
-
Title: Prompt-Enabled Large AI Models for CSI Feedback
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Artificial intelligence (AI) has emerged as a promising tool for channel state information (CSI) feedback. While recent research primarily focuses on improving feedback accuracy on a specific dataset through novel architectures, the underlying mechanism of AI-based CSI feedback remains unclear. This study explores the mechanism by analyzing performance across diverse datasets, with findings suggesting that superior feedback performance stems from AI models' strong fitting capabilities and their ability to leverage environmental knowledge. Building on these findings, we propose a prompt-enabled large AI model (LAM) for CSI feedback. The LAM employs powerful transformer blocks and is trained on extensive datasets from various scenarios. To further enhance reconstruction quality, the channel distribution (environmental knowledge), represented as the mean of the channel magnitude in the angular domain, is incorporated as a prompt within the decoder. Simulation results confirm that the proposed prompt-enabled LAM significantly improves feedback accuracy and generalization performance while reducing data collection requirements in new scenarios.
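The prompt is described as the mean channel magnitude in the angular domain. A minimal sketch of computing such a statistic follows, assuming a uniform linear array so that a DFT across the antenna axis maps spatial samples to angular bins; the array layout `(num_samples, num_antennas, num_subcarriers)` is a hypothetical choice.

```python
import numpy as np

def angular_mean_magnitude(H):
    """Mean channel magnitude per angular bin and subcarrier.

    H: complex array (num_samples, num_antennas, num_subcarriers) of CSI samples
    from one scenario. A unitary DFT over the antenna axis (ULA assumption)
    gives the angular-domain channel; averaging |.| over samples summarizes
    the scenario's channel distribution.
    """
    H_angular = np.fft.fft(H, axis=1, norm="ortho")
    return np.abs(H_angular).mean(axis=0)   # shape (num_antennas, num_subcarriers)
```

Because this statistic depends only on the propagation environment and not on any single channel realization, it can be computed once per scenario and reused as a fixed conditioning input.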
- [187] arXiv:2502.12005 (replaced) [pdf, html, other]
-
Title: Feasibility Evaluation of Quadratic Programs for Constrained Control
Comments: Submitted to CDC 2025
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper presents a computationally efficient method for evaluating the feasibility of Quadratic Programs (QPs) for online constrained control. Based on the duality principle, we first show that the feasibility of a QP can be determined by the solution of a properly defined Linear Program (LP). Our analysis yields an LP that can be solved more efficiently than the original QP and, more importantly, is simpler in form and can be solved more efficiently than the LPs used by existing methods that assess feasibility. The computational efficiency of the proposed method relative to existing feasibility-evaluation methods is demonstrated in comparative case studies as well as in a feasible-constraint selection problem, indicating its promise for online feasibility evaluation of optimization-based controllers.
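A QP's feasibility depends only on its linear constraints A x <= b, and by Farkas' lemma (the duality fact behind LP-based tests) the set is empty exactly when some y >= 0 gives A^T y = 0 and b^T y < 0. The paper decides this with an LP. As a dependency-free stand-in suitable only for tiny problems, the sketch below uses Fourier-Motzkin elimination, which is exact but exponential in the worst case; it is not the paper's method.

```python
from fractions import Fraction

def fm_feasible(A, b):
    """Decide whether {x : A x <= b} is nonempty by Fourier-Motzkin elimination (exact arithmetic)."""
    rows = [([Fraction(a) for a in row], Fraction(bi)) for row, bi in zip(A, b)]
    n = len(A[0]) if A else 0
    for j in range(n):
        pos, neg, zero = [], [], []
        for coeffs, rhs in rows:
            if coeffs[j] > 0:
                pos.append((coeffs, rhs))
            elif coeffs[j] < 0:
                neg.append((coeffs, rhs))
            else:
                zero.append((coeffs, rhs))
        new_rows = zero[:]
        # Every nonnegative pairing of an upper and a lower bound on x_j cancels x_j.
        for cp, rp in pos:
            for cn, rn in neg:
                lp, ln = cp[j], -cn[j]          # both positive
                coeffs = [ln * p + lp * q for p, q in zip(cp, cn)]
                new_rows.append((coeffs, ln * rp + lp * rn))
        rows = new_rows
    # All variables eliminated: each remaining row reads 0 <= rhs.
    return all(rhs >= 0 for _, rhs in rows)

# x + y <= 1 together with x >= 1 and y >= 1 has no solution.
print(fm_feasible([[1, 1], [-1, 0], [0, -1]], [1, -1, -1]))  # False
```

The nonnegative row combinations that eliminate variables are exactly the multipliers y in the Farkas certificate, which is why an infeasible system always ends with a row 0 <= negative.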
- [188] arXiv:2502.15849 (replaced) [pdf, html, other]
-
Title: Deriving Representative Structure from Music Corpora
Authors: Ilana Shapiro, Ruanqianqian Huang, Zachary Novack, Cheng-i Wang, Hao-Wen Dong, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Sorin Lerner
Comments: 12 pages, 8 figures, 7 tables
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Audio and Speech Processing (eess.AS)
Western music is an innately hierarchical system of interacting levels of structure, from fine-grained melody to high-level form. In order to analyze music compositions holistically and at multiple granularities, we propose a unified, hierarchical meta-representation of musical structure called the structural temporal graph (STG). For a single piece, the STG is a data structure that defines a hierarchy of progressively finer structural musical features and the temporal relationships between them. We use the STG to enable a novel approach for deriving a representative structural summary of a music corpus, which we formalize as a dually NP-hard combinatorial optimization problem extending the Generalized Median Graph problem. Our approach first applies simulated annealing to develop a graph-isomorphism-based measure of structural distance between two music pieces. Our approach then combines the formal guarantees of SMT solvers with nested simulated annealing over structural distances to produce a structurally sound, representative centroid STG for an entire corpus of STGs from individual pieces. To evaluate our approach, we conduct experiments verifying that structural distance accurately differentiates between music pieces and that derived centroids accurately characterize the structure of their corpora.
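A distance "rooted in graph isomorphism" can be illustrated with a simplified stand-in: count the node pairs whose edge status disagrees under a node mapping, and search for the best mapping with simulated annealing. The distance is zero exactly when the graphs are isomorphic. This sketch uses small undirected graphs of equal size as adjacency matrices and ignores the STG's hierarchy and node labels; the function names and cooling schedule are illustrative.

```python
import math
import random
from itertools import permutations

def mismatch(adj1, adj2, perm):
    """Number of node pairs whose edge status differs when node i of graph 1 maps to perm[i]."""
    n = len(adj1)
    return sum(adj1[i][j] != adj2[perm[i]][perm[j]] for i in range(n) for j in range(i + 1, n))

def sa_distance(adj1, adj2, iters=5000, t0=2.0, seed=0):
    """Simulated-annealing upper bound on the minimum mismatch over node mappings."""
    rng = random.Random(seed)
    n = len(adj1)
    perm = list(range(n))
    cost = mismatch(adj1, adj2, perm)
    best = cost
    for step in range(iters):
        temp = t0 * (1.0 - step / iters) + 1e-9       # linear cooling schedule
        i, j = rng.sample(range(n), 2)
        perm[i], perm[j] = perm[j], perm[i]            # propose swapping two assignments
        new_cost = mismatch(adj1, adj2, perm)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                            # accept (always if not worse)
            best = min(best, cost)
        else:
            perm[i], perm[j] = perm[j], perm[i]        # reject: undo the swap
    return best

def exact_distance(adj1, adj2):
    """Brute-force minimum over all node mappings (only feasible for very small graphs)."""
    return min(mismatch(adj1, adj2, p) for p in permutations(range(len(adj1))))
```

Since every value the annealer records is the cost of an actual mapping, its result can never fall below the exact minimum; it trades that guarantee of optimality for scaling to graphs where brute force is hopeless.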
- [189] arXiv:2502.20225 (replaced) [pdf, html, other]
-
Title: DIN-CTS: Low-Complexity Depthwise-Inception Neural Network with Contrastive Training Strategy for Deepfake Speech Detection
Authors: Lam Pham, Dat Tran, Phat Lam, Florian Skopik, Alexander Schindler, Silvia Poletti, David Fischinger, Martin Boyer
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
In this paper, we propose a deep neural network approach for deepfake speech detection (DSD) based on a low-complexity Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). In this framework, input audio recordings are first transformed into spectrograms using the Short-Time Fourier Transform (STFT) and a Linear Filter (LF), which are then used to train the DIN. Once trained, the DIN processes bonafide utterances to extract audio embeddings, which are used to construct a Gaussian distribution representing genuine speech. Deepfake detection is then performed by computing the distance between a test utterance and this distribution to determine whether the utterance is fake or bonafide. To evaluate our proposed systems, we conducted extensive experiments on the ASVspoof 2019 LA benchmark dataset. The experimental results demonstrate the effectiveness of combining the Depthwise-Inception Network with the contrastive learning strategy in distinguishing between fake and bonafide utterances. We achieved an Equal Error Rate (EER), accuracy (Acc.), F1 score, and AUC of 4.6%, 95.4%, 97.3%, and 98.9%, respectively, using a single low-complexity DIN with just 1.77 M parameters and 985 M FLOPs on short audio segments (4 seconds). Furthermore, our proposed system outperforms the single-system submissions in the ASVspoof 2019 LA challenge, showcasing its potential for real-time applications.
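The scoring stage described above (fit a Gaussian to bonafide embeddings, then measure a test utterance's distance to it) can be sketched as follows. The abstract does not name the distance; Mahalanobis distance is assumed here, and the small ridge `reg` added to the covariance is an illustrative choice for numerical stability.

```python
import numpy as np

class GaussianScorer:
    """Fit a Gaussian to bonafide embeddings; score test embeddings by (assumed) Mahalanobis distance."""

    def __init__(self, reg=1e-6):
        self.reg = reg

    def fit(self, bonafide):
        """bonafide: (num_utterances, dim) embeddings of genuine speech."""
        self.mean = bonafide.mean(axis=0)
        cov = np.cov(bonafide, rowvar=False) + self.reg * np.eye(bonafide.shape[1])
        self.precision = np.linalg.inv(cov)
        return self

    def score(self, x):
        """x: (num_test, dim). Larger scores mean farther from genuine speech (more likely fake)."""
        d = x - self.mean
        return np.sqrt(np.einsum("ij,jk,ik->i", d, self.precision, d))
```

A decision threshold on the score would then be chosen on development data, for example at the operating point where false acceptances and false rejections are equal (the EER point).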
- [190] arXiv:2503.06924 (replaced) [pdf, other]
-
Title: Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling
Comments: 26 pages, 10 figures
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Automatic speech recognition (ASR) has been an essential component of computer-assisted language learning (CALL) and computer-assisted language testing (CALT) for many years. As this technology continues to develop rapidly, it is important to evaluate the accuracy of current ASR systems for language learning applications. This study assesses how accurately five cutting-edge ASR systems recognize non-native accented English speech, using recordings from the L2-ARCTIC corpus featuring speakers from six different L1 backgrounds (Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese), in the form of both read and spontaneous speech. The read speech consisted of 2,400 single-sentence recordings from 24 speakers, while the spontaneous speech included narrative recordings from 22 speakers. Results showed that for read speech, Whisper and AssemblyAI achieved the best accuracy, with mean Match Error Rates (MER) of 0.054 and 0.056, respectively, approaching human-level accuracy. For spontaneous speech, RevAI performed best with a mean MER of 0.063. The study also examined how each system handled disfluencies such as filler words, repetitions, and revisions, finding significant variation in performance across systems and disfluency types. While processing speed varied considerably between systems, longer processing times did not necessarily correlate with better accuracy. By detailing the performance of several of the most recent, widely available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases.
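Match Error Rate differs from word error rate in its denominator: MER = (S + D + I) / (H + S + D + I), where H, S, D, and I count hits, substitutions, deletions, and insertions. The sketch below obtains these counts from a minimum-edit-distance word alignment; when several alignments tie, the backtrace picks one, so counts on such edge cases may differ slightly from other toolkits.

```python
def match_error_rate(reference, hypothesis):
    """MER = (S + D + I) / (H + S + D + I) from a minimum-edit-distance word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to count hits, substitutions, deletions, and insertions.
    i, j = n, m
    hits = subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    errors = subs + dels + ins
    total = hits + errors
    return errors / total if total else 0.0
```

Unlike WER, MER is bounded by 1 even when the hypothesis contains many insertions, which keeps scores comparable across systems that hallucinate extra words on disfluent speech.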
- [191] arXiv:2503.13268 (replaced) [pdf, html, other]
-
Title: Channel Estimation for Pinching-Antenna Systems (PASS)
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Pinching Antennas (PAs) represent a revolutionary flexible antenna technology that leverages dielectric waveguides and electromagnetic coupling to mitigate large-scale path loss. This letter is the first to explore channel estimation for Pinching-Antenna SyStems (PASS), addressing their uniquely ill-conditioned and underdetermined channel characteristics. In particular, two efficient deep learning-based channel estimators are proposed. 1) PAMoE: This estimator incorporates dynamic padding, feature embedding, fusion, and mixture of experts (MoE) modules, which effectively leverage the positional information of PAs and exploit expert diversity. 2) PAformer: This Transformer-style estimator employs the self-attention mechanism to predict channel coefficients in a per-antenna manner, which offers more flexibility to adaptively deal with dynamic numbers of PAs in practical deployment. Numerical results demonstrate that 1) the proposed deep learning-based channel estimators outperform conventional methods and exhibit excellent zero-shot learning capabilities, and 2) PAMoE delivers higher channel estimation accuracy via MoE specialization, while PAformer natively handles an arbitrary number of PAs, trading self-attention complexity for superior scalability.
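PAMoE relies on a mixture of experts (MoE). The abstract does not specify its gating or expert design, so the sketch below shows only the generic dense, softmax-gated MoE pattern with linear experts: a gate produces input-dependent weights and the output is the weighted sum of expert outputs. The shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_moe(x, gate_w, expert_ws):
    """Generic dense MoE: y = sum_e g_e(x) * (x @ W_e).

    x: (batch, d_in); gate_w: (d_in, E); expert_ws: (E, d_in, d_out).
    Returns the mixed output (batch, d_out) and the gate weights (batch, E).
    """
    gates = softmax(x @ gate_w)                              # per-input expert weights
    expert_out = np.einsum("bi,eio->beo", x, expert_ws)      # every expert's output
    return np.einsum("be,beo->bo", gates, expert_out), gates
```

Because the gate depends on the input, different experts can specialize in different regimes, which is the "expert diversity" the abstract credits for PAMoE's accuracy.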
- [192] arXiv:2503.22558 (replaced) [pdf, html, other]
-
Title: Algorithmic analysis of systems with affine input and polynomial state
Comments: technical report
Subjects: Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
The goal of this paper is to provide exact and terminating algorithms for the formal analysis of deterministic continuous-time control systems with affine input and polynomial state dynamics (in short, polynomial systems). We consider the following semantic properties: zeroness and equivalence, input independence, linearity, and analyticity. Our approach is based on Chen-Fliess series, which provide a unique representation of the dynamics of such systems via their formal generating series.
Our starting point is Fliess' seminal work showing how the semantic properties above are mirrored by corresponding combinatorial properties on generating series. Next, we observe that the generating series of polynomial systems coincide with the class of shuffle-finite series, a nonlinear generalisation of Schützenberger's rational series which has recently been studied in the context of automata theory and enumerative combinatorics. We exploit and extend recent results in the algorithmic analysis of shuffle-finite series (such as zeroness, equivalence, and commutativity) to show that the semantic properties above can be decided exactly and in finite time for polynomial systems. Some of our analyses rely on a novel technical contribution, namely that shuffle-finite series are closed under support restrictions with commutative regular languages, a result of independent interest.
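Shuffle-finite series are built around the shuffle product on words, which interleaves two words in every order-preserving way. The sketch below implements the standard word-level shuffle via the recurrence (a u) shuffle (b v) = a (u shuffle b v) + b (a u shuffle v), returning each interleaving with its multiplicity; it illustrates the definition only, not the paper's decision procedures.

```python
from collections import Counter

def shuffle(u, v):
    """Shuffle product of words u and v as a multiset {word: multiplicity}."""
    if not u:
        return Counter({v: 1})
    if not v:
        return Counter({u: 1})
    result = Counter()
    for w, c in shuffle(u[1:], v).items():   # first letter taken from u
        result[u[0] + w] += c
    for w, c in shuffle(u, v[1:]).items():   # first letter taken from v
        result[v[0] + w] += c
    return result

print(shuffle("ab", "c"))  # Counter({'abc': 1, 'acb': 1, 'cab': 1})
```

Multiplicities matter: the shuffle of "a" with itself yields "aa" twice, and in general the multiplicities sum to the binomial coefficient C(|u| + |v|, |u|).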