Electrical Engineering and Systems Science
Showing new listings for Friday, 28 March 2025
- [1] arXiv:2503.20789 [pdf, other]
Title: Neuro-Informed Adaptive Learning (NIAL) Algorithm: A Hybrid Deep Learning Approach for ECG Signal Classification
Comments: 1 figure, 2 pages
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
The detection of cardiac abnormalities using electrocardiogram (ECG) signals is crucial for early diagnosis and intervention in cardiovascular diseases. Traditional deep learning models often lack adaptability to varying signal patterns. This study introduces the Neuro-Informed Adaptive Learning (NIAL) algorithm, a hybrid approach integrating convolutional neural networks (CNNs) and transformer-based attention mechanisms to enhance ECG signal classification. The algorithm dynamically adjusts learning rates based on real-time validation performance, ensuring efficient convergence. Using the MIT-BIH Arrhythmia and PTB Diagnostic ECG datasets, our model achieves high classification accuracy, outperforming conventional approaches. These findings highlight the potential of NIAL in real-time cardiovascular monitoring applications.
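The abstract does not say how NIAL adjusts its learning rate, so the following is only a generic reduce-on-plateau sketch of the idea of adapting the learning rate to validation performance. The class name `PlateauScheduler`, the helper `run_schedule`, and every parameter value are illustrative assumptions, not the authors' method.

```python
class PlateauScheduler:
    """Hypothetical helper: shrink the learning rate when validation loss stalls."""

    def __init__(self, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr                # current learning rate
        self.factor = factor        # multiplicative decay applied on a plateau
        self.patience = patience    # non-improving epochs tolerated before decay
        self.min_lr = min_lr        # lower bound on the learning rate
        self.best = float("inf")    # best validation loss seen so far
        self.bad_epochs = 0         # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


def run_schedule(val_losses, **kwargs):
    """Feed a sequence of validation losses and collect the rate after each epoch."""
    sched = PlateauScheduler(**kwargs)
    return [sched.step(v) for v in val_losses]
```

In a training loop, `step` would be called once per epoch with the validation loss, and the returned value would be written back to the optimizer.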
- [2] arXiv:2503.20815 [pdf, html, other]
Title: D2SA: Dual-Stage Distribution and Slice Adaptation for Efficient Test-Time Adaptation in MRI Reconstruction
Authors: Lipei Zhang, Rui Sun, Zhongying Deng, Yanqi Cheng, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Comments: 9 pages, 10 pages (supplementary)
Subjects: Image and Video Processing (eess.IV)
Variations in magnetic resonance imaging (MRI) scanners and acquisition protocols cause distribution shifts that degrade reconstruction performance on unseen data. Test-time adaptation (TTA) offers a promising solution to address these discrepancies. However, previous single-shot TTA approaches are inefficient due to repeated training and suboptimal distributional models. Self-supervised learning methods are also limited in scarce-data scenarios. To address these challenges, we propose a novel Dual-Stage Distribution and Slice Adaptation (D2SA) via MRI implicit neural representation (MR-INR) to improve MRI reconstruction performance and efficiency, which features two stages. In the first stage, an MR-INR branch performs patient-wise distribution adaptation by learning shared representations across slices and modelling patient-specific shifts with mean and variance adjustments. In the second stage, single-slice adaptation refines the output from frozen convolutional layers with a learnable anisotropic diffusion module, preventing over-smoothing and reducing computation. Experiments across four MRI distribution shifts demonstrate that our method can integrate well with various self-supervised learning (SSL) frameworks, improving performance and accelerating convergence under diverse conditions.
- [3] arXiv:2503.20822 [pdf, html, other]
Title: Synthetic Video Enhances Physical Fidelity in Video Synthesis
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: this https URL
- [4] arXiv:2503.20824 [pdf, html, other]
Title: Exploiting Temporal State Space Sharing for Video Semantic Segmentation
Authors: Syed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, Xudong Jiang
Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at this https URL.
- [5] arXiv:2503.20826 [pdf, html, other]
Title: Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
Comments: CVPR2025
Subjects: Image and Video Processing (eess.IV)
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Activation Maps (CAMs). Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced in WSSS. However, recent methods primarily focus on image-text alignment for CAM generation, while CLIP's potential in patch-text alignment remains unexplored. In this work, we propose ExCEL to explore CLIP's dense knowledge via a novel patch-text alignment paradigm for WSSS. Specifically, we propose Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules to improve the dense alignment across both text and vision modalities. To make text embeddings semantically informative, our TSE module applies Large Language Models (LLMs) to build a dataset-wide knowledge base and enriches the text representations with an implicit attribute-hunting process. To mine fine-grained knowledge from visual features, our VC module first proposes Static Visual Calibration (SVC) to propagate fine-grained knowledge in a non-parametric manner. Then Learnable Visual Calibration (LVC) is further proposed to dynamically shift the frozen features towards distributions with diverse semantics. With these enhancements, ExCEL not only retains CLIP's training-free advantages but also significantly outperforms other state-of-the-art methods with much less training cost on PASCAL VOC and MS COCO.
- [6] arXiv:2503.20838 [pdf, other]
Title: Channel impulse response peak clustering using neural networks
Authors: Petr Horky, Ales Prokes, Radek Zavorka, Josef Vychodil, Jan M. Kelner, Cezary Ziolkowski, Aniruddha Chandra
Comments: 7 pages, 10 figures, 5 tables
Journal-ref: Proceedings of the 2023 6th International Conference on Advanced Communication Technologies and Networking (CommNet), Rabat, Morocco, 11-13 Dec. 2023, pp. 1-7
Subjects: Signal Processing (eess.SP)
This paper introduces an approach to process channel sounder data acquired from the Channel Impulse Response (CIR) of 60 GHz and 80 GHz channel sounder systems, through the integration of a Long Short-Term Memory (LSTM) Neural Network (NN) and a Fully Connected Neural Network (FCNN). The primary goal is to enhance and automate cluster detection within peaks from noisy CIR data. The study initially compares the performance of the LSTM NN and FCNN across different input sequence lengths. Notably, the LSTM surpasses the FCNN due to its incorporation of memory cells, which prove beneficial for handling longer sequences. Next, the paper investigates the robustness of the LSTM NN through various architectural configurations. The findings suggest that robust neural networks tend to closely mimic the input function, whereas smaller neural networks are better at generalizing trends in time series data, which is desirable for anomaly detection, where function peaks are regarded as anomalies. Finally, the selected LSTM NN is compared with traditional signal filters, including Butterworth, Savitzky-Golay, Bessel/Thomson, and median filters. Visual observations indicate that the most effective methods for peak detection within channel impulse response data are either the LSTM NN or the median filter, as they yield similar results.
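As a rough illustration of the median-filter baseline mentioned above, the sketch below smooths a sampled CIR magnitude with a sliding median and flags samples that rise well above that trend. The function names, window length, threshold, and the rule "peak = sample minus median baseline exceeds a threshold" are assumptions made for this example; the paper's actual procedure may differ.

```python
from statistics import median


def median_filter(samples, window=5):
    """Sliding median; near the edges the window is truncated to what exists."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(median(samples[lo:hi]))
    return out


def find_peaks(samples, window=5, threshold=1.0):
    """Indices whose value exceeds the median baseline by more than `threshold`."""
    baseline = median_filter(samples, window)
    return [
        i
        for i, (value, trend) in enumerate(zip(samples, baseline))
        if value - trend > threshold
    ]
```

Because the median ignores isolated large values, the baseline stays close to the noise floor and the peaks stand out as deviations from it.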
- [7] arXiv:2503.20907 [pdf, html, other]
Title: Generalized Ray Tracing with Basis functions for Tomographic Projections
Subjects: Image and Video Processing (eess.IV)
This work aims at the precise and efficient computation of the x-ray projection of an image represented by a linear combination of general shifted basis functions that typically overlap. We achieve this with a suitable adaptation of ray tracing, which is one of the most efficient methods to compute line integrals. In our work, the cases in which the image is expressed as a spline are of particular relevance. The proposed implementation is applicable to any projection geometry as it computes the forward and backward operators over a collection of arbitrary lines. We validate our work with experiments in the context of inverse problems for image reconstruction and maximize the image quality for a given resolution of the reconstruction grid.
- [8] arXiv:2503.20966 [pdf, html, other]
Title: Filtered Multi-Tone Spread Spectrum with Overlapping Subbands
Comments: 13 pages, 7 figures. Submitted to IEEE Open Journal of Communications
Subjects: Signal Processing (eess.SP)
A new form of the filter bank multi-carrier spread spectrum (FBMC-SS) waveform is presented. This new waveform modifies the filtered multi-tone spread spectrum (FMT-SS) system, and is intended to whiten the power spectral density (PSD) of the transmit signal. In the conventional FMT-SS, subcarrier bands are non-overlapping, leaving a spectral null between the adjacent subcarrier bands. To make FMT-SS more appealing for a broader set of applications than those studied in the past, we propose adding additional subcarriers centered at these nulls and thoroughly explore the impact of the added subcarriers on the system performance. This modified form of FMT-SS is referred to as overlapped FMT-SS (OFMT-SS). We explore the conditions required for maximally flattening the PSD of the synthesized OFMT-SS signal and for cancelling the interference caused by overlapping subbands. We also explore the choices of spreading gains that result in a low peak-to-average power ratio (PAPR) for a number of different scenarios. Further reduction of the PAPR of the synthesized signal through clipping methods is also explored. Additionally, we propose methods of multi-coding for increasing the data rate of the OFMT-SS waveform, while minimally impacting its PAPR.
- [9] arXiv:2503.21007 [pdf, html, other]
Title: Bounds on Deep Neural Network Partial Derivatives with Respect to Parameters
Comments: 8 pages
Subjects: Systems and Control (eess.SY)
Deep neural networks (DNNs) have emerged as a powerful tool with a growing body of literature exploring Lyapunov-based approaches for real-time system identification and control. These methods depend on establishing bounds for the second partial derivatives of DNNs with respect to their parameters, a requirement often assumed but rarely addressed explicitly. This paper provides rigorous mathematical formulations of polynomial bounds on both the first and second partial derivatives of DNNs with respect to their parameters. We present lemmas that characterize these bounds for fully-connected DNNs, while accommodating various classes of activation functions, including sigmoidal and ReLU-like functions. Our analysis yields closed-form expressions that enable precise stability guarantees for Lyapunov-based deep neural networks (Lb-DNNs). Furthermore, we extend our results to bound the higher-order terms in first-order Taylor approximations of DNNs, providing important tools for convergence analysis in gradient-based learning algorithms. The resulting theoretical framework provides explicit, computable expressions for previously assumed bounds, thereby strengthening the mathematical foundation of neural network applications in safety-critical control systems.
- [10] arXiv:2503.21021 [pdf, html, other]
Title: RIS-Enabled Self-Localization with FMCW Radar
Authors: Hyowon Kim, Navid Amani, Musa Furkan Keskin, Zhongxia Simon He, Jorge Gil, Gonzalo-Seco Granados, Henk Wymeersch
Subjects: Signal Processing (eess.SP)
In upcoming vehicular networks, reconfigurable intelligent surfaces (RISs) are considered a key enabler of user self-localization without the intervention of access points (APs). In this paper, we investigate the feasibility of RIS-enabled self-localization with no APs. We first develop a digital signal processing (DSP) unit for estimating geometric parameters such as the angle, distance, and velocity, and for RIS-enabled self-localization. Second, we set up an experimental testbed consisting of a Texas Instruments frequency modulated continuous wave (FMCW) radar for the user and a SilversIMA module for the RIS. Our results confirm the validity of the developed DSP unit and demonstrate the feasibility of RIS-enabled self-localization.
- [11] arXiv:2503.21040 [pdf, html, other]
Title: Local Stability and Stabilization of Quadratic-Bilinear Systems using Petersen's Lemma
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Quadratic-bilinear (QB) systems arise in many areas of science and engineering. In this paper, we present a scalable approach for designing locally stabilizing state-feedback control laws and certifying the local stability of QB systems. Sufficient conditions are established for local stability and stabilization based on quadratic Lyapunov functions, which also provide ellipsoidal inner-estimates for the region of attraction and region of stabilizability of an equilibrium point. Our formulation exploits Petersen's Lemma to convert the problem of certifying the sign-definiteness of the Lyapunov condition into a line search over a single scalar parameter. The resulting linear matrix inequality (LMI) conditions scale quadratically with the state dimension for both stability analysis and control synthesis, thus enabling analysis and control of QB systems with hundreds of state variables without resorting to specialized implementations. We demonstrate the approach on three benchmark problems from the existing literature. In all cases, we find that our formulation yields approximations of stability domains comparable to those determined by other established tools, which are otherwise restricted to systems with up to tens of state variables.
- [12] arXiv:2503.21042 [pdf, html, other]
Title: Dissipativity-Based Distributed Control and Communication Topology Co-Design for DC Microgrids with ZIP Loads
Subjects: Systems and Control (eess.SY)
This paper presents a novel dissipativity-based distributed droop-free control approach for voltage regulation, current sharing, and Constant Power Load (CPL) stability in DC microgrids (MGs). We describe the closed-loop DC MG as a networked system where distributed generators (DGs), lines, and nonlinear loads (including destabilizing CPLs) are interconnected via a static interconnection matrix. Each DG has a local controller and a distributed global controller, designed using dissipativity properties and sector-bounded techniques. For controller synthesis, we formulate a Linear Matrix Inequality (LMI) problem that simultaneously addresses voltage regulation, current sharing, and CPL stability guarantees. To support the feasibility of this problem, we propose a sector-bounded approach that characterizes CPL nonlinearities and integrates them into the dissipativity framework through S-procedure techniques. Our approach provides a unified framework for co-designing distributed controllers and communication topologies that ensure stability despite the presence of destabilizing CPL effects. The effectiveness of the proposed solution was verified by simulating an islanded DC MG under different scenarios, demonstrating superior performance compared to traditional control approaches when handling CPLs.
- [13] arXiv:2503.21054 [pdf, html, other]
Title: Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to existing state-of-the-art methods.
- [14] arXiv:2503.21057 [pdf, html, other]
Title: Validation and Calibration of Energy Models with Real Vehicle Data from Chassis Dynamometer Experiments
Authors: Joy Carpio, Sulaiman Almatrudi, Nour Khoudari, Zhe Fu, Kenneth Butts, Jonathan Lee, Benjamin Seibold, Alexandre Bayen
Subjects: Systems and Control (eess.SY)
Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications, such as traffic flow modeling, optimization, and control, require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to physical energy behavior. This work builds upon a recently developed model reduction pipeline that derives physics-like energy models from high-fidelity Autonomie vehicle simulations. These reduced models preserve essential vehicle dynamics, enabling realistic fuel consumption estimation with minimal computational overhead. While the reduced models have demonstrated strong agreement with their Autonomie counterparts, previous validation efforts have been confined to simulation environments. This study extends the validation by comparing the reduced energy model's outputs against real-world vehicle data. Focusing on the MidSUV category, we tune the baseline Autonomie model to closely replicate the characteristics of a Toyota RAV4. We then assess the accuracy of the resulting reduced model in estimating fuel consumption under actual drive conditions. Our findings suggest that, when the reference Autonomie model is properly calibrated, the simplified model produced by the reduction pipeline can provide reliable, semi-principled fuel rate estimates suitable for large-scale transportation applications.
- [15] arXiv:2503.21070 [pdf, html, other]
Title: Cubature Kalman Filter as a Robust State Estimator Against Model Uncertainty and Cyber Attacks in Power Systems
Subjects: Systems and Control (eess.SY)
It is known that conventional estimators such as the extended Kalman filter (EKF) and the unscented Kalman filter (UKF) may provide favorable performance; however, they may not guarantee robustness against model uncertainty and cyber attacks. In this paper, we compare the performance of the cubature Kalman filter (CKF) to that of the conventional nonlinear estimator, the EKF, under the effect of model uncertainty and cyber attacks. We show that the CKF has better estimation accuracy than the EKF under some conditions. To verify our claim, we have tested the performance of various nonlinear estimators on the single-machine infinite-bus (SMIB) system under different scenarios. We show that (1) the CKF provides better estimation results than the EKF; (2) the CKF is able to detect different types of cyber attacks reliably, which makes it superior to the EKF.
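For readers unfamiliar with the CKF, the sketch below generates the points of the standard third-degree spherical-radial cubature rule: 2n equally weighted points placed at the mean plus or minus sqrt(n) times each column of a square root of the covariance. It is a generic illustration in plain Python, not code from the paper; the function names and the choice of a Cholesky factor as the square root are assumptions.

```python
import math


def cholesky(P):
    """Lower-triangular S with S * S^T = P for a symmetric positive-definite P."""
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = P[i][j] - sum(S[i][k] * S[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                S[i][j] = math.sqrt(acc)
            else:
                S[i][j] = acc / S[j][j]
    return S


def cubature_points(mean, cov):
    """Return the 2n cubature points and their common weight 1/(2n)."""
    n = len(mean)
    S = cholesky(cov)
    scale = math.sqrt(n)
    points = []
    for sign in (1.0, -1.0):
        for j in range(n):
            # Column j of S, scaled by +/- sqrt(n), shifted by the mean.
            points.append([mean[i] + sign * scale * S[i][j] for i in range(n)])
    return points, 1.0 / (2 * n)
```

In a filter, each point would be propagated through the nonlinear state or measurement function, and the predicted mean and covariance would be recovered as equally weighted sample moments.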
- [16] arXiv:2503.21102 [pdf, html, other]
Title: Amplitude-Domain Reflection Modulation for Active RIS-Assisted Wireless Communications
Subjects: Signal Processing (eess.SP)
In this paper, we propose a novel active reconfigurable intelligent surface (RIS)-assisted amplitude-domain reflection modulation (ADRM) transmission scheme, termed ARIS-ADRM. This approach leverages the additional degree of freedom (DoF) provided by the amplitude domain of the active RIS to perform index modulation (IM), thereby enhancing spectral efficiency (SE) without increasing the costs associated with additional radio frequency (RF) chains. Specifically, the ARIS-ADRM scheme transmits information bits through both the modulation symbol and the index of active RIS amplitude allocation patterns (AAPs). To evaluate the performance of the proposed ARIS-ADRM scheme, we provide an achievable rate analysis and derive a closed-form expression for the upper bound on the average bit error probability (ABEP). Furthermore, we formulate an optimization problem to construct the AAP codebook, aiming to minimize the ABEP. Simulation results demonstrate that the proposed scheme significantly improves error performance under the same SE conditions compared to its benchmarks. This improvement is due to its ability to flexibly adapt the transmission rate by fully exploiting the amplitude domain DoF provided by the active RIS.
- [17] arXiv:2503.21107 [pdf, html, other]
Title: In-situ Physical Adjoint Computing in multiple-scattering electromagnetic environments for wave control
Subjects: Signal Processing (eess.SP); Chaotic Dynamics (nlin.CD); Optics (physics.optics)
Controlling electromagnetic wave propagation in multiple scattering systems is a challenging endeavor due to the extraordinary sensitivity generated by strong multi-path contributions at any given location. Overcoming such complexity has emerged as a central research theme in recent years, motivated both by a wide range of applications (from wireless communications and imaging to optical micromanipulations) and by the fundamental principles underlying these efforts. Here, we show that an in-situ manipulation of the myriad scattering events, achieved through time- and energy-efficient adjoint optimization (AO) methodologies, enables real-time wave-driven functionalities such as targeted channel emission, coherent perfect absorption, and camouflage. Our paradigm shift exploits the highly multi-path nature of these complex environments, where repeated wave-scattering dramatically amplifies small local AO-informed system variations. Our approach can be immediately applied to indoor wireless technologies and incorporated into diverse wave-based frameworks including imaging, power electronic and optical neural networks.
- [18] arXiv:2503.21110 [pdf, html, other]
Title: Fundamental Limit of Angular Resolution in Partly Calibrated Arrays with Position Errors
Subjects: Signal Processing (eess.SP)
We consider high angular resolution detection using distributed mobile platforms implemented with so-called partly calibrated arrays, where position errors between subarrays exist and the counterparts within each subarray are ideally calibrated. Since position errors between antenna arrays affect the coherent processing of measurements from these arrays, it is commonly believed that the angular resolution is affected as well. A key question is whether and how much the angular resolution of partly calibrated arrays is affected by the position errors, in comparison with ideally calibrated arrays. To address this fundamental problem, we show theoretically that partly calibrated arrays approximately achieve high angular resolution. Our analysis uses a special characteristic of the Cramer-Rao lower bound (CRB) w.r.t. the source separation: when the source separation increases, the CRB first declines rapidly, then plateaus, and the turning point is close to the angular resolution limit. This means that the turning point of the CRB can be used to indicate angular resolution. We then theoretically analyze the declining and plateau phases of the CRB, and explain that the turning point of the CRB in partly calibrated arrays is close to the angular resolution limit of distributed arrays without errors, demonstrating high resolution ability. This work thus provides a theoretical guarantee for the high-resolution performance of distributed antenna arrays in mobile platforms.
- [19] arXiv:2503.21142 [pdf, html, other]
Title: Expressive Timing in Hindustani Vocal Music
Subjects: Audio and Speech Processing (eess.AS)
Temporal dynamics are among the cues to expressiveness in music performance in different cultures. In the case of Hindustani music, it is well known that expert vocalists often take liberties with the beat, intentionally not aligning their singing precisely with the relatively steady beat provided by the accompanying tabla. This becomes evident when comparing performances of the same composition such as a bandish. We present a methodology for the quantitative study of differences across performed pieces using computational techniques. This is applied to a small study of two performances of a popular bandish in raga Yaman, to demonstrate how we can effectively capture the nuances of timing variations that bring out stylistic constraints along with the individual signature of a performer. This work articulates an important step towards the broader goals of music analysis and generative modelling for Indian classical music performance.
- [20] arXiv:2503.21165 [pdf, html, other]
Title: Extending Silicon Lifetime: A Review of Design Techniques for Reliable Integrated Circuits
Comments: This work is under review by ACM
Subjects: Systems and Control (eess.SY); Hardware Architecture (cs.AR)
Reliability has become an increasing concern in modern computing. Integrated circuits (ICs) are the backbone of modern computing devices across industries, including artificial intelligence (AI), consumer electronics, healthcare, automotive, industrial, and aerospace. Moore's Law has driven the semiconductor IC industry toward smaller dimensions, improved performance, and greater energy efficiency. However, as transistors shrink to atomic scales, aging-related degradation mechanisms such as Bias Temperature Instability (BTI), Hot Carrier Injection (HCI), Time-Dependent Dielectric Breakdown (TDDB), Electromigration (EM), and stochastic aging-induced variations have become major reliability threats. From an application perspective, workloads like AI training and autonomous driving require continuous and sustainable operation to minimize recovery costs and enhance safety. Additionally, the high cost of chip replacement and reproduction underscores the need for extended lifespans. These factors highlight the urgency of designing more reliable ICs. This survey addresses the critical aging issues in ICs, focusing on fundamental degradation mechanisms and mitigation strategies. It provides a comprehensive overview of aging impact and the methods to counter it, starting with the root causes of aging and summarizing key monitoring techniques at both circuit and system levels. A detailed analysis of circuit-level mitigation strategies highlights the distinct aging characteristics of digital, analog, and SRAM circuits, emphasizing the need for tailored solutions. The survey also explores emerging software approaches in design automation, aging characterization, and mitigation, which are transforming traditional reliability optimization. Finally, it outlines the challenges and future directions for improving aging management and ensuring the long-term reliability of ICs across diverse applications.
- [21] arXiv:2503.21202 [pdf, html, other]
Title: System-wide Instrument Transformer Calibration and Line Parameter Estimation Using PMU Data
Subjects: Systems and Control (eess.SY)
Uncalibrated instrument transformers (ITs) can degrade the performance of downstream applications that rely on the voltage and current measurements that ITs provide. It is also well-known that phasor measurement unit (PMU)-based system-wide IT calibration and line parameter estimation (LPE) are interdependent problems. In this paper, we present a statistical framework for solving the simultaneous LPE and IT calibration (SLIC) problem using synchrophasor data. The proposed approach not only avoids the need for a perfect IT by judiciously placing a revenue quality meter (which is an expensive but non-perfect IT), but also accounts for the variations typically occurring in the line parameters. The results obtained using the IEEE 118-bus system as well as actual power system data demonstrate the high accuracy, robustness, and practical utility of the proposed approach.
- [22] arXiv:2503.21239 [pdf, other]
Title: The Optimal Tradeoff Between PAPR and Ambiguity Functions for Generalized OFDM Waveform Set in ISAC Systems
Comments: 13 pages, 8 figures
Subjects: Signal Processing (eess.SP)
Integrated sensing and communications (ISAC) has been identified as one of the six usage scenarios for IMT-2030. Compared with communication performance, sensing performance is much more vulnerable to interference, and the received backscattered sensing signal with target information is usually too weak to be detected. It is interesting to understand the optimal tradeoff between interference rejection and signal strength improvement for the best sensing performance, but unfortunately it remains unknown. In this paper, the trinity of auto-ambiguity function (AF), cross-AF and peak-to-average-power ratio (PAPR) is proposed to describe the interference- and coverage-related aspects of ISAC systems where a multi-carrier waveform is usually assumed. We extend the existing orthogonal frequency division multiplexing (OFDM) waveforms in 5G to a generalized OFDM waveform set with some new members and a unified parametric representation. Then the optimal Pareto tradeoff between PAPR, auto-AF and cross-AF (i.e., the union bound) is developed for the generalized OFDM waveform set. To achieve the optimal Pareto union bound with reasonable computational complexity, we further propose a framework to optimize waveform parameters and sequences jointly. Finally, some practical design examples are provided and numerical results reveal that significant improvements can be achieved compared to the state-of-the-art 5G waveforms and sequences.
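To make the PAPR term concrete, the sketch below computes the peak-to-average power ratio of a single, critically sampled OFDM symbol from its frequency-domain symbols. It is a textbook illustration only: the function names are hypothetical, there is no oversampling or cyclic prefix, and it does not reproduce the paper's generalized waveform set or its ambiguity-function metrics.

```python
import cmath
import math


def ofdm_time_signal(symbols):
    """Inverse DFT of the subcarrier symbols (one OFDM symbol, no cyclic prefix)."""
    n = len(symbols)
    return [
        sum(X * cmath.exp(2j * math.pi * k * t / n) for k, X in enumerate(symbols)) / n
        for t in range(n)
    ]


def papr_db(signal):
    """Peak-to-average power ratio of a complex baseband signal, in dB."""
    powers = [abs(x) ** 2 for x in signal]
    mean_power = sum(powers) / len(powers)
    return 10.0 * math.log10(max(powers) / mean_power)
```

A single active subcarrier gives a constant envelope (0 dB), whereas N identical symbols add coherently at one sample and give the worst case of 10 log10(N) dB, which is why the choice of sequences matters for PAPR.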
- [23] arXiv:2503.21242 [pdf, html, other]
-
Title: PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication
Comments: Submitted to the IEEE Transactions on Wireless Communications. Code available at GitHub: this https URL
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within it. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is reduced to a lower dimensionality without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.
- [24] arXiv:2503.21282 [pdf, other]
-
Title: Low-Cost Phase Precoding for Short-Reach Fiber Links with Direct Detection
Subjects: Signal Processing (eess.SP)
Low-cost analog phase precoding is used to compensate chromatic dispersion (CD) in fibers with intensity modulation and direct detection (IM/DD). In contrast to conventional precoding with an in-phase and quadrature (IQ) Mach-Zehnder modulator (MZM), only a single additional phase modulator (PM) is required at the transmitter. Depending on the CD, the PM generates a periodic phase modulation that is modelled by a Fourier series and optimized via a mean squared error (MSE) cost criterion. Numerical results compare achievable information rates (AIRs) for 4- and 6-PAM. With the additional PM, energy gains of up to 3 dB are achieved for moderate fiber lengths.
- [25] arXiv:2503.21298 [pdf, other]
-
Title: Génération de Matrices de Corrélation avec des Structures de Graphe par Optimisation Convexe
Ali Fahkar (STATIFY, LJK), Kévin Polisano (SVH, LJK), Irène Gannaz (G-SCOP_GROG, G-SCOP), Sophie Achard (STATIFY, LJK)
Comments: in French language
Subjects: Signal Processing (eess.SP); Optimization and Control (math.OC); Statistics Theory (math.ST); Methodology (stat.ME)
This work deals with the generation of theoretical correlation matrices with specific sparsity patterns associated with graph structures. We present a novel approach based on convex optimization, offering greater flexibility compared to existing techniques, notably by controlling the mean of the entry distribution in the generated correlation matrices. This allows for the generation of correlation matrices that better represent realistic data and can be used to benchmark statistical methods for graph inference.
- [26] arXiv:2503.21433 [pdf, other]
-
Title: On Learning-Based Traffic Monitoring With a Swarm of Drones
Comments: Extended version of the paper accepted for presentation at the 23rd IEEE European Control Conference (ECC 2025), Thessaloniki, Greece
Subjects: Systems and Control (eess.SY)
Efficient traffic monitoring is crucial for managing urban transportation networks, especially under congested and dynamically changing traffic conditions. Drones offer a scalable and cost-effective alternative to fixed sensor networks. However, deploying fleets of low-cost drones for traffic monitoring poses challenges in adaptability, scalability, and real-time operation. To address these issues, we propose a learning-based framework for decentralized traffic monitoring with drone swarms, targeting the uneven and unpredictable distribution of monitoring needs across urban areas. Our approach introduces a semi-decentralized reinforcement learning model, which trains a single Q-function using the collective experience of the swarm. This model supports full scalability, flexible deployment, and, when hardware allows, the online adaptation of each drone's action-selection mechanism. We first train and evaluate the model in a synthetic traffic environment, followed by a case study using real traffic data from Shenzhen, China, to validate its performance and demonstrate its potential for real-world applications in complex urban monitoring tasks.
- [27] arXiv:2503.21469 [pdf, html, other]
-
Title: Embedding Compression Distortion in Video Coding for Machines
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our code and supplementary materials are released at this https URL.
- [28] arXiv:2503.21487 [pdf, html, other]
-
Title: On Tensor-based Polynomial Hamiltonian Systems
Subjects: Systems and Control (eess.SY)
It is known that a linear system with a system matrix A constitutes a Hamiltonian system with a quadratic Hamiltonian if and only if A is a Hamiltonian matrix. This provides a straightforward method to verify whether a linear system is Hamiltonian or whether a given Hamiltonian function corresponds to a linear system. These techniques fundamentally rely on the properties of Hamiltonian matrices. Building on recent advances in tensor algebra, this paper generalizes such results to a broad class of polynomial systems. As the systems of interest can be naturally represented in tensor forms, we name them tensor-based polynomial systems. Our main contribution is that we formally define Hamiltonian cubical tensors and characterize their properties. Crucially, we demonstrate that a tensor-based polynomial system is a Hamiltonian system with a polynomial Hamiltonian if and only if all associated system tensors are Hamiltonian cubical tensors, a direct parallel to the linear case. Additionally, we establish a computationally tractable stability criterion for tensor-based polynomial Hamiltonian systems. Finally, we validate all theoretical results through numerical examples and provide a further intuitive discussion.
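The linear-case test mentioned at the start of this abstract can be sketched in a few lines. The sketch uses the standard definition that a 2n×2n matrix A is Hamiltonian when JA is symmetric, with J = [[0, I], [-I, 0]]. It covers only the matrix case, not the paper's cubical-tensor generalization, and the function name `is_hamiltonian` and the tolerance are illustrative assumptions.

```python
def is_hamiltonian(a, tol=1e-12):
    """Return True if the square matrix `a` (list of rows) is a Hamiltonian matrix."""
    size = len(a)
    if size % 2 or any(len(row) != size for row in a):
        return False  # Hamiltonian matrices are square with even dimension
    n = size // 2
    # With J = [[0, I], [-I, 0]], the product J A stacks the bottom n rows of A
    # on top of the negated top n rows of A.
    ja = [a[n + i][:] for i in range(n)] + [[-v for v in a[i]] for i in range(n)]
    # A is Hamiltonian exactly when J A is symmetric.
    return all(abs(ja[i][j] - ja[j][i]) <= tol for i in range(size) for j in range(size))


# Undamped oscillator q' = p, p' = -q, with H = (q^2 + p^2) / 2.
undamped = [[0.0, 1.0], [-1.0, 0.0]]
# Adding damping to the momentum equation breaks the Hamiltonian structure.
damped = [[0.0, 1.0], [-1.0, -0.5]]

print(is_hamiltonian(undamped), is_hamiltonian(damped))
```

For the undamped oscillator, JA equals minus the identity, which is symmetric; the damping term makes JA asymmetric, so the check fails.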
- [29] arXiv:2503.21498 [pdf, html, other]
-
Title: Distributed Forgetting-factor Regret-based Online Optimization over Undirected Connected Networks
Comments: 11 pages, 6 figures
Subjects: Systems and Control (eess.SY)
The evaluation of final-iteration tracking performance is a formidable obstacle in distributed online optimization algorithms. To address this issue, this paper proposes a novel evaluation metric named distributed forgetting-factor regret (DFFR). It incorporates a weight into the loss function at each iteration, which progressively reduces the weights of historical loss functions while enabling dynamic weight allocation across the optimization horizon. Furthermore, we develop two distributed online optimization algorithms based on DFFR over undirected connected networks: the Distributed Online Gradient-free Algorithm for bandit-feedback problems and the Distributed Online Projection-free Algorithm for high-dimensional problems. Through theoretical analysis, we derive the upper bounds of DFFR for both algorithms and further prove that under mild conditions, DFFR either converges to zero or maintains a tight upper bound as iterations approach infinity. Experimental simulations demonstrate the effectiveness of the algorithms and the superior performance of DFFR.
- [30] arXiv:2503.21501 [pdf, html, other]
-
Title: Double Blind Imaging with Generative Modeling
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Blind inverse problems in imaging arise from uncertainties in the system used to collect (noisy) measurements of images. Recovering clean images from these measurements typically requires identifying the imaging system, either implicitly or explicitly. A common solution leverages generative models as priors for both the images and the imaging system parameters (e.g., a class of point spread functions). Learning these priors in a straightforward manner requires access to a dataset of clean images as well as samples of the imaging system. We propose an AmbientGAN-based generative technique to identify the distribution of parameters in unknown imaging systems, using only unpaired clean images and corrupted measurements. This learned distribution can then be used in model-based recovery algorithms to solve blind inverse problems such as blind deconvolution. We successfully demonstrate our technique for learning Gaussian blur and motion blur priors from noisy measurements and show their utility in solving blind deconvolution with diffusion posterior sampling.
- [31] arXiv:2503.21502 [pdf, html, other]
-
Title: ALADIN-$β$: A Distributed Optimization Algorithm for Solving MPCC Problems
Subjects: Systems and Control (eess.SY)
Mathematical Programs with Complementarity Constraints (MPCC) are critical in various real-world applications but notoriously challenging due to non-smoothness and degeneracy from complementarity constraints. The $\ell_1$-Exact Penalty-Barrier enhanced \texttt{IPOPT} improves performance and robustness by introducing additional inequality constraints and decision variables. However, this comes at the cost of increased computational complexity due to the higher dimensionality and additional constraints introduced in the centralized formulation. To mitigate this, we propose a distributed structure-splitting reformulation that decomposes these inequality constraints and auxiliary variables into independent sub-problems. Furthermore, we introduce Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN)-$\beta$, a novel approach that integrates the $\ell_1$-Exact Penalty-Barrier method with ALADIN to efficiently solve the distributed reformulation. Numerical experiments demonstrate that even without a globalization strategy, the proposed distributed approach achieves fast convergence while maintaining high precision.
- [32] arXiv:2503.21503 [pdf, html, other]
-
Title: Distributed observer-based leak detection in pipe flow with nonlinear friction
Comments: 4 pages, 3 figures, article was presented at IFAC CMWRS2022 (this https URL) in the "Extended Abstract" category and is not available anywhere else
Subjects: Systems and Control (eess.SY)
The problem of leak detection in a pipeline with nonlinear friction is considered. A distributed observer-based method is proposed which applies a linearised, distributed adaptive observer design to the nonlinear model. The methodology is tested in simulations for two different operating points.
- [33] arXiv:2503.21529 [pdf, html, other]
-
Title: Physics-Informed Neural Network-Based Control for Grid-Forming Converter's Stability Under Overload Conditions
Subjects: Systems and Control (eess.SY)
Grid-forming converters (GFCs) are pivotal in maintaining frequency and voltage stability in modern distribution systems. However, a critical challenge arises when these converters encounter sudden power demands that exceed their rated capacity. Although GFCs are designed to manage DC source saturation and limit excessive AC currents, their ability to ensure sufficient power delivery under such constraints remains a significant concern. Existing studies often overlook this limitation, potentially compromising system stability during high-demand scenarios. This paper proposes a control strategy based on a physics-informed neural network (PINN) to improve GFC performance under overloaded conditions, effectively preventing switch failures and mitigating DC source saturation. The proposed approach outperforms conventional methods by maintaining stable voltage and frequency, even under significant load increases where traditional droop control alone proves inadequate. With PINN-based control, the post-disturbance operating point of the GFCs remains unchanged, and the peak voltage deviation observed during the transient is reduced to 42.85%. Furthermore, the proposed method ensures that the rate of change of frequency (ROCOF) and the rate of change of voltage (ROCOV) remain within acceptable limits, significantly improving system resilience in inertia-less power networks.
- [34] arXiv:2503.21537 [pdf, html, other]
-
Title: Polarization-Aware Antenna Selection for Joint Radar and Communication in XL-MIMO Systems
Comments: 11 pages, submitted to IEEE Journal
Subjects: Signal Processing (eess.SP)
A key challenge in dual-polarized multiplexing for joint radar and communication (JRC) systems is cross-polarization (cross-pol) leakage caused by depolarization. In conventional MIMO systems, depolarization arises solely from the channel; however, in XL-MIMO systems, non-stationary properties of the array cause additional polarization shifts at each antenna element, further degrading JRC performance. This paper introduces a channel model incorporating polarization shifts due to the propagation channel and antenna elements in the near-field. We also propose an antenna selection (AS) scheme that dynamically chooses antennas based on polarization imbalance and cross-pol leakage, enhancing spectral efficiency, symbol error rate, and radar detection probability. Simulations show that the proposed AS significantly outperforms traditional methods, providing scalable benefits for XL-MIMO JRC systems.
- [35] arXiv:2503.21542 [pdf, html, other]
-
Title: Shape Adaptive Reconfigurable Holographic Surfaces
Subjects: Signal Processing (eess.SP)
Reconfigurable Intelligent Surfaces (RIS) have emerged as a key solution to dynamically adjust wireless propagation by tuning the reflection coefficients of large arrays of passive elements. Reconfigurable Holographic Surfaces (RHS) build on the same foundation as RIS but extend it by employing holographic principles for finer-grained wave manipulation, that is, applying higher spatial control over the reflected signals for more precise beam steering. In this paper, we investigate shape-adaptive RHS deployments in a multi-user network. Rather than treating each RHS as a uniform reflecting surface, we propose a selective element activation strategy that dynamically adapts the spatial arrangement of deployed RHS regions to a subset of predefined shapes. In particular, we formulate a system throughput maximization problem that optimizes the shape of the selected RHS elements, active beamforming at the access point (AP), and passive beamforming at the RHS to enhance coverage and mitigate signal blockage. The resulting problem is non-convex and becomes even more challenging to solve as the number of RHS and users increases; to tackle this, we introduce an alternating optimization (AO) approach that efficiently finds near-optimal solutions irrespective of the number or spatial configuration of RHS. Numerical results demonstrate that shape adaptation enables more efficient resource distribution, enhancing the effectiveness of multi-RHS deployments as the network scales.
- [36] arXiv:2503.21548 [pdf, html, other]
-
Title: Combining Graph Attention Networks and Distributed Optimization for Multi-Robot Mixed-Integer Convex Programming
Comments: submitted to CDC 2025
Subjects: Systems and Control (eess.SY)
In this paper, we develop a fast mixed-integer convex programming (MICP) framework for multi-robot navigation by combining graph attention networks and distributed optimization. We formulate a mixed-integer optimization problem for receding horizon motion planning of a multi-robot system, taking into account the surrounding obstacles. To address the resulting multi-agent MICP problem in real time, we propose a framework that utilizes heterogeneous graph attention networks to learn the latent mapping from problem parameters to optimal binary solutions. Furthermore, we apply a distributed proximal alternating direction method of multipliers algorithm for solving the convex continuous optimization problem. We demonstrate the effectiveness of our proposed framework through experiments conducted on a robotic testbed.
- [37] arXiv:2503.21552 [pdf, html, other]
-
Title: Real-time Tracking System with partially coupled sources
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
We consider a pull-based real-time tracking system consisting of multiple partially coupled sources and a sink. The sink monitors the sources in real-time and can request one source for an update at each time instant. The sources send updates over an unreliable wireless channel. The sources are partially coupled, and updates about one source can provide partial knowledge about other sources. We study the problem of minimizing the sum of an average distortion function and a transmission cost. Since the controller is at the sink side, the controller (sink) has only partial knowledge about the source states, and thus, we model the problem as a partially observable Markov decision process (POMDP) and then cast it as a belief-MDP problem. Using the relative value iteration algorithm, we solve the problem and propose a control policy. Simulation results show the proposed policy's effectiveness and superiority compared to a baseline policy.
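To make the relative value iteration step concrete, here is a minimal sketch on a toy, fully observed two-state average-cost MDP. It is not the paper's belief-MDP: the state and action meanings, the costs, the 0.8 success probability, and the function name `relative_value_iteration` are all hypothetical values chosen so the result can be checked by hand.

```python
def relative_value_iteration(costs, trans, ref_state=0, iters=500):
    """Average-cost relative value iteration.

    costs[s][a] is the stage cost; trans[s][a][s2] is the transition probability.
    Returns (average cost estimate, relative values, greedy policy).
    """
    n = len(costs)
    h = [0.0] * n
    gain = 0.0
    q = [[0.0] * len(costs[s]) for s in range(n)]
    for _ in range(iters):
        # One Bellman backup for every state-action pair.
        q = [
            [
                costs[s][a] + sum(p * h[s2] for s2, p in enumerate(trans[s][a]))
                for a in range(len(costs[s]))
            ]
            for s in range(n)
        ]
        backed_up = [min(row) for row in q]
        # Subtract the reference state's value so the iterates stay bounded.
        gain = backed_up[ref_state]
        h = [v - gain for v in backed_up]
    policy = [min(range(len(q[s])), key=lambda a: q[s][a]) for s in range(n)]
    return gain, h, policy


# Toy model (assumed): state 0 = estimate fresh, state 1 = estimate stale.
# Action 0 = stay idle, action 1 = request an update (transmission cost 0.3).
# A stale estimate costs 1 per step; a request succeeds with probability 0.8.
costs = [[0.0, 0.3], [1.0, 1.3]]
trans = [
    [[0.0, 1.0], [0.8, 0.2]],  # from state 0: idle, request
    [[0.0, 1.0], [0.8, 0.2]],  # from state 1: idle, request
]
gain, bias, policy = relative_value_iteration(costs, trans)
print(gain, bias, policy)
```

In this toy model, requesting an update in every state minimizes the long-run average of distortion plus transmission cost. The paper's setting differs in that the sink only holds a belief over the source states.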
- [38] arXiv:2503.21594 [pdf, html, other]
-
Title: AUTOBargeSim: MATLAB(R) toolbox for the design and analysis of the guidance and control system for autonomous inland vessels
Abhishek Dhyani, Amirreza Haqshenas Mojaveri, Chengqian Zhang, Dhanika Mahipala, Hoang Anh Tran, Yan-Yun Zhang, Zhongbi Luo, Vasso Reppa
Subjects: Systems and Control (eess.SY)
This paper introduces AUTOBargeSim, a simulation toolbox for autonomous inland vessel guidance and control system design. AUTOBargeSim is developed using MATLAB and provides an easy-to-use introduction to various aspects of autonomous inland navigation, including mapping, modelling, control design, and collision avoidance, through examples and extensively documented code. Applying modular design principles in the simulator structure allows it to be easily modified according to the user's requirements. Furthermore, a GUI facilitates simple and quick execution. Key performance indices for evaluating the performance of the controller and collision avoidance method in confined space are also provided. The current version of AUTOBargeSim attempts to improve reproducibility in the design and simulation of marine systems while serving as a foundation for simulating and evaluating vessel behaviour considering operational, system, and environmental constraints.
- [39] arXiv:2503.21599 [pdf, html, other]
-
Title: Leveraging Line-of-Sight Propagation for Near-Field Beamfocusing in Cell-Free Networks
Subjects: Signal Processing (eess.SP)
Cell-free (CF) massive multiple-input multiple-output (MIMO) is a promising approach for next-generation wireless networks, enabling scalable deployments of multiple small access points (APs) to enhance coverage and service for multiple user equipments (UEs). While most existing research focuses on low-frequency bands with Rayleigh fading models, emerging 5G trends are shifting toward higher frequencies, where geometric channel models and line-of-sight (LoS) propagation become more relevant. In this work, we explore how distributed massive MIMO in the LoS regime can achieve near-field-like conditions by forming artificially large arrays through coordinated AP deployments. We investigate centralized and decentralized CF architectures, leveraging structured channel estimation (SCE) techniques that exploit the line-of-sight properties of geometric channels. Our results demonstrate that dense distributed AP deployments significantly improve system performance w.r.t. the case of a co-located array, even in highly populated UE scenarios, while SCE approaches the performance of perfect CSI.
- [40] arXiv:2503.21666 [pdf, html, other]
-
Title: Economy and sustainability analysis with a novel modular configurable multi-modal white-box building model
Subjects: Systems and Control (eess.SY)
This paper presents a novel modeling approach for building performance simulation, characterized as a white-box model with a high degree of modularity and flexibility, enabling direct integration into complex large-scale energy system co-simulations. The model is described in detail, with a focus on its modular structure. Configurations covering different building insulation levels, heating methods, occupancy patterns, and weather data are proposed, and energy consumption, CO2 emissions, and heating costs are compared across the resulting 36 scenarios. The thermodynamic behavior of the model is shown to be consistent with real-world conditions, and the comparison of the scenarios concludes that the use of heat pumps for indoor heating in well-insulated buildings has significant economic and sustainability benefits, whereas the use of natural gas-fueled boilers is more cost-effective for buildings with low energy ratings.
- [41] arXiv:2503.21667 [pdf, other]
-
Title: The Construction of Asymptotic Bode Plots: A New Direct Method
Subjects: Systems and Control (eess.SY)
Bode plots represent an essential tool in control and systems engineering. In order to perform an initial qualitative analysis of the considered systems, the construction of asymptotic Bode plots is often sufficient. The standard methods for constructing asymptotic Bode plots share the same drawbacks: they are not systematic, may be imprecise, and are time-consuming. This is because they require the detailed analysis of the different factors composing the considered transfer function, meaning that more and more intermediate steps are required as the number of factors increases. In this paper, a new method for the construction of asymptotic Bode plots is proposed, which is based on the systematic calculation of the so-called generalized approximating functions and on the use of well-defined properties. The proposed method is referred to as a direct method since it allows the asymptotic Bode magnitude and phase plots of the complete transfer function to be drawn directly, without requiring the detailed analysis or the plot construction of each factor. This latter feature also makes the proposed direct method more systematic, potentially more precise and less time-consuming compared to standard methods, especially when dealing with a large number of factors in the transfer function. The proposed direct method is compared with the standard approaches in order to examine the benefits it offers.
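For context, the standard factor-by-factor construction that the abstract contrasts against can be sketched as follows. This is not the proposed direct method, whose generalized approximating functions are not described here. The sketch assumes only real first-order factors in the form G(s) = K·Π(1 + s/z_i) / (s^N·Π(1 + s/p_j)) with positive corner frequencies, and the function name and example values are illustrative.

```python
import math


def asymptotic_magnitude_db(omega, gain, zeros=(), poles=(), integrators=0):
    """Asymptotic Bode magnitude (dB) at angular frequency `omega` (rad/s).

    `zeros` and `poles` hold positive corner frequencies of first-order factors;
    `integrators` is the number of poles at the origin.
    """
    db = 20 * math.log10(abs(gain)) - 20 * integrators * math.log10(omega)
    for corner in zeros:
        # Each zero is flat below its corner and rises 20 dB/decade above it.
        db += 20 * math.log10(max(1.0, omega / corner))
    for corner in poles:
        # Each pole is flat below its corner and falls 20 dB/decade above it.
        db -= 20 * math.log10(max(1.0, omega / corner))
    return db


# Example (assumed): G(s) = 10 (1 + s/100) / ((1 + s)(1 + s/10)).
for w in (0.1, 10.0, 100.0, 1000.0):
    print(w, round(asymptotic_magnitude_db(w, 10.0, zeros=(100.0,), poles=(1.0, 10.0)), 6))
```

Every additional factor adds another term to the sum, which is the growth in intermediate steps that the abstract attributes to the standard methods.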
New submissions (showing 41 of 41 entries)
- [42] arXiv:2503.20819 (cross-list from cs.GR) [pdf, html, other]
-
Title: Reflections on Diversity: A Real-time Virtual Mirror for Inclusive 3D Face Transformations
Subjects: Graphics (cs.GR); Image and Video Processing (eess.IV)
Real-time 3D face manipulation has significant applications in virtual reality, social media and human-computer interaction. This paper introduces a novel system, which we call Mirror of Diversity (MOD), that combines Generative Adversarial Networks (GANs) for texture manipulation and 3D Morphable Models (3DMMs) for facial geometry to achieve realistic face transformations that reflect various demographic characteristics, emphasizing the beauty of diversity and the universality of human features. As participants sit in front of a computer monitor with a camera positioned above, their facial characteristics are captured in real time, and they can further alter their digital face reconstruction with transformations reflecting different demographic characteristics, such as gender and ethnicity (e.g., a person from Africa, Asia, Europe). Another feature of our system, which we call Collective Face, generates an averaged face representation from multiple participants' facial data. A comprehensive evaluation protocol is implemented to assess the realism and demographic accuracy of the transformations. Qualitative feedback is gathered through participant questionnaires, which include comparisons of MOD transformations with similar filters on platforms like Snapchat and TikTok. Additionally, quantitative analysis is conducted using a pretrained Convolutional Neural Network that predicts gender and ethnicity, to validate the accuracy of demographic transformations.
- [43] arXiv:2503.20839 (cross-list from cs.RO) [pdf, html, other]
-
Title: TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion
Comments: This work has been submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025 for review
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged "Teacher". Training to peak performance is accelerated by 2x compared to state-of-the-art baselines, and generalization in OOD scenarios improves by 40 percent on average compared to existing methods. Additionally, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at this https URL.
- [44] arXiv:2503.21056 (cross-list from cs.CV) [pdf, html, other]
-
Title: Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where, given an implicit query, an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) at three levels of reasoning-chain complexity.
- [45] arXiv:2503.21058 (cross-list from physics.optics) [pdf, html, other]
-
Title: 5.7 Tb/s Transmission Over a 4.6 km Field-Deployed Free-Space Optical Link in Urban Environment
Comments: Accepted for presentation at Optical Fiber Communication (OFC) Conference 2025
Subjects: Optics (physics.optics); Signal Processing (eess.SP)
We transmitted 5.7 Tb/s over a 4.6 km free-space optical link in an urban environment, spanning the city of Eindhoven, the Netherlands, using a 1.1 THz wide wavelength-division multiplexed signal.
- [46] arXiv:2503.21168 (cross-list from cs.RO) [pdf, html, other]
-
Title: TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups
Comments: 6 pages, 3 figures. Submitted as a conference paper in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Robot navigation in densely populated environments presents significant challenges, particularly regarding the interplay between individual and group dynamics. Current navigation models predominantly address interactions with individual pedestrians while failing to account for human groups that naturally form in real-world settings. Conversely, the limited models implementing group-aware navigation typically prioritize group dynamics at the expense of individual interactions, both of which are essential for socially appropriate navigation. This research extends an existing simulation framework to incorporate both individual pedestrians and human groups. We present Tangent Action for Group Avoidance (TAGA), a modular reactive mechanism that can be integrated with existing navigation frameworks to enhance their group-awareness capabilities. TAGA dynamically modifies robot trajectories using tangent action-based avoidance strategies while preserving the underlying model's capacity to navigate around individuals. Additionally, we introduce Group Collision Rate (GCR), a novel metric to quantitatively assess how effectively robots maintain group integrity during navigation. Through comprehensive simulation-based benchmarking, we demonstrate that integrating TAGA with state-of-the-art navigation models (ORCA, Social Force, DS-RNN, and AG-RL) reduces group intrusions by 45.7-78.6% while maintaining comparable success rates and navigation efficiency. Future work will focus on real-world implementation and validation of this approach.
- [47] arXiv:2503.21254 (cross-list from cs.CV) [pdf, html, other]
-
Title: Vision-to-Music Generation: A SurveyZhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue LiaoSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without a comprehensive discussion of vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at this https URL.
- [48] arXiv:2503.21335 (cross-list from cs.AR) [pdf, html, other]
-
Title: A Low-Power Streaming Speech Enhancement Accelerator For Edge DevicesJournal-ref: in IEEE Open Journal of Circuits and Systems, vol. 5, pp. 128-140, 2024Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution with the co-design of model compression and target application, which reduces model size by 93.9% through the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.
- [49] arXiv:2503.21337 (cross-list from cs.AR) [pdf, html, other]
-
Title: A 71.2-$μ$W Speech Recognition Accelerator with Recurrent Spiking Neural NetworkJournal-ref: in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, no. 7, pp. 3203-3213, July 2024Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper introduces a 71.2-μW speech recognition accelerator designed for real-time applications on edge devices, emphasizing an ultra-low-power design. Through algorithm and hardware co-optimization, we propose a compact recurrent spiking neural network with two recurrent layers, one fully connected layer, and a low time step (1 or 2). The 2.79-MB model undergoes pruning and 4-bit fixed-point quantization, shrinking it by 96.42% to 0.1 MB. On the hardware front, we take advantage of mixed-level pruning, zero-skipping and merged spike techniques, reducing complexity by 90.49% to 13.86 MMAC/s. The parallel time-step execution addresses inter-time-step data dependencies and enables weight buffer power savings through weight sharing. Capitalizing on the sparse spike activity, an input broadcasting scheme eliminates zero computations, further saving power. Implemented on the TSMC 28-nm process, the design operates in real time at 100 kHz, consuming 71.2 μW, surpassing state-of-the-art designs. At 500 MHz, it has 28.41 TOPS/W and 1903.11 GOPS/mm² in energy and area efficiency, respectively.
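The compression figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract; the baseline complexity it prints is derived here from those numbers and is not a figure reported by the authors.

```python
# Sanity-check the compression figures quoted in the abstract.
original_mb = 2.79        # model size before pruning and quantization (from the abstract)
size_reduction = 0.9642   # 96.42% size reduction (from the abstract)
compressed_mb = original_mb * (1.0 - size_reduction)

reduced_mmacs = 13.86     # MMAC/s after the hardware techniques (from the abstract)
complexity_reduction = 0.9049
# Implied baseline complexity: derived here, not stated in the abstract.
baseline_mmacs = reduced_mmacs / (1.0 - complexity_reduction)

print(f"compressed size: {compressed_mb:.3f} MB")
print(f"implied baseline complexity: {baseline_mmacs:.1f} MMAC/s")
```

The first value rounds to the 0.1 MB quoted in the abstract, which suggests the two size figures are mutually consistent.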
- [50] arXiv:2503.21401 (cross-list from cs.RO) [pdf, html, other]
-
Title: AcL: Action Learner for Fault-Tolerant Quadruped Locomotion ControlTianyu Xu (1), Yaoyu Cheng (2), Pinxi Shen (2), Lin Zhao (1) (1)Electrical, Computer Engineering, National University of Singapore, Singapore, (2)Mechanical Engineering, National University of Singapore, SingaporeSubjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Quadrupedal robots can learn versatile locomotion skills but remain vulnerable when one or more joints lose power. In contrast, dogs and cats can adopt limping gaits when injured, demonstrating their remarkable ability to adapt to physical conditions. Inspired by such adaptability, this paper presents Action Learner (AcL), a novel teacher-student reinforcement learning framework that enables quadrupeds to autonomously adapt their gait for stable walking under multiple joint faults. Unlike conventional teacher-student approaches that enforce strict imitation, AcL leverages teacher policies to generate style rewards, guiding the student policy without requiring precise replication. We train multiple teacher policies, each corresponding to a different fault condition, and subsequently distill them into a single student policy with an encoder-decoder architecture. While prior works primarily address single-joint faults, AcL enables quadrupeds to walk with up to four faulty joints across one or two legs, autonomously switching between different limping gaits when faults occur. We validate AcL on a real Go2 quadruped robot under single- and double-joint faults, demonstrating fault-tolerant, stable walking, smooth gait transitions between normal and lame gaits, and robustness against external disturbances.
- [51] arXiv:2503.21414 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Brain Age Group Classification Based on Resting State Functional Connectivity MetricsSubjects: Neurons and Cognition (q-bio.NC); Signal Processing (eess.SP)
This study investigated age-related changes in functional connectivity using resting-state fMRI and explored the efficacy of traditional deep learning for classifying brain developmental stages (BDS). Functional connectivity was assessed using Seed-Based Phase Synchronization (SBPS) and Pearson correlation across 160 ROIs. Clustering was performed using t-SNE, and network topology was analyzed through graph-theoretic metrics. Adaptive learning was implemented to classify the age group by extracting bottleneck features through MobileNetV2. These deep features were embedded and classified using Random Forest and PCA. Results showed a shift in phase synchronization patterns from sensory-driven networks in youth to more distributed networks with aging. t-SNE revealed that SBPS provided the most distinct clustering of BDS. Global efficiency and participation coefficient followed an inverted U-shaped trajectory, while clustering coefficient and modularity exhibited a U-shaped pattern. MobileNet outperformed other models, achieving the highest classification accuracy for BDS. Aging was associated with reduced global integration and increased local connectivity, indicating functional network reorganization. While this study focused solely on functional connectivity from resting-state fMRI and a limited set of connectivity features, deep learning demonstrated superior classification performance, highlighting its potential for characterizing age-related brain changes.
- [52] arXiv:2503.21491 (cross-list from cs.RO) [pdf, html, other]
-
Title: Data-Driven Contact-Aware Control Method for Real-Time Deformable Tool Manipulation: A Case Study in the Environmental SwabbingComments: Submitted for Journal ReviewSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Deformable Object Manipulation (DOM) remains a critical challenge in robotics due to the complexities of developing suitable model-based control strategies. Deformable Tool Manipulation (DTM) further complicates this task by introducing additional uncertainties between the robot and its environment. While humans effortlessly manipulate deformable tools using touch and experience, robotic systems struggle to maintain stability and precision. To address these challenges, we present a novel State-Adaptive Koopman LQR (SA-KLQR) control framework for real-time deformable tool manipulation, demonstrated through a case study in environmental swab sampling for food safety. This method leverages Koopman operator-based control to linearize nonlinear dynamics while adapting to state-dependent variations in tool deformation and contact forces. A tactile-based feedback system dynamically estimates and regulates the swab tool's angle, contact pressure, and surface coverage, ensuring compliance with food safety standards. Additionally, a sensor-embedded contact pad monitors force distribution to mitigate tool pivoting and deformation, improving stability during dynamic interactions. Experimental results validate the SA-KLQR approach, demonstrating accurate contact angle estimation, robust trajectory tracking, and reliable force regulation. The proposed framework enhances precision, adaptability, and real-time control in deformable tool manipulation, bridging the gap between data-driven learning and optimal control in robotic interaction tasks.
- [53] arXiv:2503.21538 (cross-list from math.OC) [pdf, other]
-
Title: Formation Shape Control using the Gromov-Wasserstein MetricComments: To appear in the proceedings of Learning for Dynamics and Control (L4DC) conference, PMLR, 2025Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
This article introduces a formation shape control algorithm, in the optimal control framework, for steering an initial population of agents to a desired configuration by employing the Gromov-Wasserstein distance. The underlying dynamical system is assumed to be a constrained linear system, and the objective function is the sum of a quadratic control-dependent stage cost and a Gromov-Wasserstein terminal cost. The inclusion of the Gromov-Wasserstein cost transforms the resulting optimal control problem into a well-known NP-hard problem, making it both numerically demanding and difficult to solve with high accuracy. To that end, we employ a recent semi-definite relaxation-driven technique to tackle the Gromov-Wasserstein distance. A numerical example is provided to illustrate our results.
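For orientation, the standard discrete form of the squared Gromov-Wasserstein distance is sketched below. The notation is an assumption made here and is not taken from the paper: $\mu$ and $\nu$ are the weights of the two point sets, $d_X$ and $d_Y$ are their pairwise distances, and $\Pi(\mu,\nu)$ is the set of couplings with those marginals.

```latex
\mathrm{GW}^2(\mu,\nu)
  = \min_{\pi \in \Pi(\mu,\nu)}
    \sum_{i,k} \sum_{j,l}
    \bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^2 \, \pi_{ij} \, \pi_{kl}
```

The objective is quadratic and generally non-convex in the coupling $\pi$, which is the source of the numerical difficulty the abstract mentions and the reason a convex relaxation is attractive.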
- [54] arXiv:2503.21546 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: consexpressionR: an R package for consensus differential gene expression analysisSubjects: Genomics (q-bio.GN); Systems and Control (eess.SY)
Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Computational methods have been developed for each stage of the identification of differentially expressed genes. Nevertheless, there are few studies exploring the association between different types of methods. In this study, we evaluated the impact of combining methodologies on the results of differential expression analysis. Using two data sets with qPCR data (as a gold-standard reference), seven methods implemented in R packages (EBSeq, edgeR, DESeq2, limma, SAMseq, NOISeq, and Knowseq) were run and assessed separately and in association. The results were evaluated against the qPCR data. Results: Here, we introduce consexpressionR, an R package that automates differential expression analysis using a consensus of at least seven methodologies, producing more assertive results with a significant reduction in false positives. Availability: consexpressionR is an R package; source code and support are available on GitHub (this https URL).
- [55] arXiv:2503.21571 (cross-list from cs.SD) [pdf, html, other]
-
Title: Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch BoostingComments: Main paper (6 pages). Accepted for publication by ICME 2025Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online at this https URL.
Cross submissions (showing 14 of 14 entries)
- [56] arXiv:2401.10389 (replaced) [pdf, html, other]
-
Title: Inverse Problem Approach to Aberration Correction for in vivo Transcranial Imaging Based on a Sparse Representation of Contrast-enhanced Ultrasound DataSubjects: Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
Transcranial ultrasound imaging is currently limited by attenuation and aberration induced by the skull. First used in contrast-enhanced ultrasound (CEUS), highly echoic microbubbles allowed for the development of novel imaging modalities such as ultrasound localization microscopy (ULM). Herein, we develop an inverse problem approach to aberration correction (IPAC) that leverages the sparsity of microbubble signals. We propose to use the a priori knowledge of the medium based upon microbubble localization and wave propagation to build a forward model to link the measured signals directly to the aberration function. A standard least-squares inversion is then used to retrieve the aberration function. We first validated IPAC on simulated data of a vascular network using plane wave as well as divergent wave emissions. We then evaluated the reproducibility of IPAC in vivo in 5 mouse brains. We showed that aberration correction improved the contrast of CEUS images by 4.6 dB. For ULM images, IPAC yielded sharper vessels, reduced vessel duplications, and improved the resolution from 21.1 μm to 18.3 μm. Aberration correction also improved hemodynamic quantification for velocity magnitude and flow direction.
- [57] arXiv:2403.16711 (replaced) [pdf, html, other]
-
Title: Predictable Interval MDPs through Entropy RegularizationComments: This paper has been presented at the 2024 63rd IEEE Conference on Decision and Control (CDC)Subjects: Systems and Control (eess.SY)
Regularization of control policies using entropy can be instrumental in adjusting predictability of real-world systems. Applications benefiting from such approaches range from, e.g., cybersecurity, which aims at maximal unpredictability, to human-robot interaction, where predictable behavior is highly desirable. In this paper, we consider entropy regularization for interval Markov decision processes (IMDPs). IMDPs are uncertain MDPs, where transition probabilities are only known to belong to intervals. Lately, IMDPs have gained significant popularity in the context of abstracting stochastic systems for control design. In this work, we address robust minimization of the linear combination of entropy and a standard cumulative cost in IMDPs, thereby establishing a trade-off between optimality and predictability. We show that optimal deterministic policies exist, and devise a value-iteration algorithm to compute them. The algorithm solves a number of convex programs at each step. Finally, through an illustrative example we show the benefits of penalizing entropy in IMDPs.
- [58] arXiv:2404.01901 (replaced) [pdf, html, other]
-
Title: Learning-based model augmentation with LFRsComments: Accepted for ECC 2025Subjects: Systems and Control (eess.SY)
Nonlinear system identification (NL-SI) has proven to be effective in obtaining accurate models for highly complex systems. In particular, recent encoder-based methods for artificial neural network state-space (ANN-SS) models have achieved state-of-the-art performance on various benchmarks, while offering consistency and computational efficiency. Inclusion of prior knowledge of the system can be exploited to increase (i) estimation speed, (ii) accuracy, and (iii) interpretability of the resulting models. This paper proposes an encoder-based model augmentation method that incorporates prior knowledge from first-principles (FP) models. We introduce a novel linear-fractional-representation (LFR) model structure that allows for the unified representation of various augmentation structures, including the ones that are commonly used in the literature, and an identification algorithm for estimating the proposed structure together with appropriate initialization methods. The performance and generalization capabilities of the proposed method are demonstrated in a hardening mass-spring-damper simulation.
- [59] arXiv:2407.11828 (replaced) [pdf, html, other]
-
Title: Vibravox: A Dataset of French Speech Captured with Body-conduction Audio SensorsJulien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poirée, Véronique Zimpfer, Éric BavuComments: 23 pages, 42 figuresSubjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 hours per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.
- [60] arXiv:2408.12691 (replaced) [pdf, html, other]
-
Title: Quantization-aware Matrix Factorization for Low Bit Rate Image CompressionPooya Ashtari, Pourya Behmandpoor, Fateme Nateghi Haredasht, Jonathan H. Chen, Panagiotis Patrinos, Sabine Van HuffelComments: 22 pages, 6 figures, 1 table, 1 algorithmSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Lossy image compression is essential for efficient transmission and storage. Traditional compression methods mainly rely on discrete cosine transform (DCT) or singular value decomposition (SVD), both of which represent image data in continuous domains and, therefore, necessitate carefully designed quantizers. Notably, these methods consider quantization as a separate step, where quantization errors cannot be incorporated into the compression process. The sensitivity of these methods, especially SVD-based ones, to quantization errors significantly degrades reconstruction quality. To address this issue, we introduce a quantization-aware matrix factorization (QMF) to develop a novel lossy image compression method. QMF provides a low-rank representation of the image data as a product of two smaller factor matrices, with elements constrained to bounded integer values, thereby effectively integrating quantization with low-rank approximation. We propose an efficient, provably convergent iterative algorithm for QMF using a block coordinate descent (BCD) scheme, with subproblems having closed-form solutions. Our experiments on the Kodak and CLIC 2024 datasets demonstrate that our QMF compression method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates. We also assessed our method's capability to preserve visual semantics by evaluating an ImageNet pre-trained classifier on compressed images. Remarkably, our method improved top-1 accuracy by over 5 percentage points compared to JPEG at bit rates under 0.25 bpp. The project is available at this https URL.
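To make the idea of integer-constrained low-rank factorization concrete, here is a minimal sketch of one way a block coordinate descent with closed-form subproblems could look. It is an illustration under assumptions, not the authors' algorithm: the function name `qmf_sketch`, the bounds `lo` and `hi`, the column-wise block choice, and the random initialization are all hypothetical. Each column update solves its subproblem exactly, because a one-dimensional quadratic over bounded integers is minimized by rounding the unconstrained minimizer and clipping it to the bounds.

```python
import numpy as np

def qmf_sketch(X, rank, lo=-8, hi=7, iters=20, seed=0):
    """Approximate X by U @ V.T with integer entries of U and V in [lo, hi].

    Illustrative only: block choice and initialization are assumptions.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.integers(lo, hi + 1, size=(m, rank)).astype(float)
    V = rng.integers(lo, hi + 1, size=(n, rank)).astype(float)
    errors = []
    for _ in range(iters):
        for k in range(rank):
            # Residual with the k-th rank-one term removed.
            R = X - U @ V.T + np.outer(U[:, k], V[:, k])
            vv = V[:, k] @ V[:, k]
            if vv > 0:
                # Exact minimizer over bounded integers: round, then clip.
                U[:, k] = np.clip(np.round(R @ V[:, k] / vv), lo, hi)
            uu = U[:, k] @ U[:, k]
            if uu > 0:
                V[:, k] = np.clip(np.round(R.T @ U[:, k] / uu), lo, hi)
        errors.append(np.linalg.norm(X - U @ V.T))
    return U, V, errors

# Small synthetic example: a rank-2 integer matrix.
demo_rng = np.random.default_rng(1)
X_demo = (demo_rng.integers(-3, 4, size=(12, 2))
          @ demo_rng.integers(-3, 4, size=(2, 10))).astype(float)
U_demo, V_demo, errs = qmf_sketch(X_demo, rank=2)
print(errs[0], errs[-1])
```

Because every block update is an exact minimization over the feasible set, the Frobenius error recorded after each sweep cannot increase, although the sketch may stop at a local minimum.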
- [61] arXiv:2410.00068 (replaced) [pdf, other]
-
Title: Denoising VAE as an Explainable Feature Reduction and Diagnostic Pipeline for Autism Based on Resting state fMRIXinyuan Zheng, Orren Ravid, Robert A.J. Barry, Yoojean Kim, Qian Wang, Young-geun Kim, Xi Zhu, Xiaofu HeSubjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Applications (stat.AP)
Autism spectrum disorders (ASDs) are developmental conditions characterized by restricted interests and difficulties in communication. The complexity of ASD has resulted in a deficiency of objective diagnostic biomarkers. Deep learning methods have gained recognition for addressing these challenges in neuroimaging analysis, but finding and interpreting such diagnostic biomarkers remains computationally challenging. Here, we propose a feature reduction pipeline using resting-state fMRI data. We used the Craddock atlas and the Power atlas to extract functional connectivity data from rs-fMRI, resulting in over 30 thousand features. By using a denoising variational autoencoder (DVAE), our proposed pipeline further compresses the connectivity features into 5 latent Gaussian distributions, providing a low-dimensional representation of the data to promote computational efficiency and interpretability. To test the method, we employed the extracted latent representations to classify ASD using traditional classifiers such as SVM on a large multi-site dataset. The 95% confidence interval for the prediction accuracy of SVM is [0.63, 0.76] after site harmonization using the extracted latent distributions. Without using DVAE for dimensionality reduction, the prediction accuracy is 0.70, which falls within the interval. The DVAE successfully encoded the diagnostic information from rs-fMRI data without sacrificing prediction performance. The runtime for training the DVAE and obtaining classification results from its extracted latent features was 7 times shorter compared to training classifiers directly on the raw data. Our findings suggest that the Power atlas provides more effective brain connectivity insights for diagnosing ASD than the Craddock atlas. Additionally, we visualized the latent representations to gain insights into the brain networks contributing to the differences between ASD and neurotypical brains.
- [62] arXiv:2410.15660 (replaced) [pdf, html, other]
-
Title: SPARC: Prediction-Based Safe Control for Coupled Controllable and Uncontrollable Agents with Conformal PredictionsSubjects: Systems and Control (eess.SY)
We investigate the problem of safe control synthesis for systems operating in environments with uncontrollable agents whose dynamics are unknown but coupled with those of the controlled system. This scenario naturally arises in various applications, such as autonomous driving and human-robot collaboration, where the behavior of uncontrollable agents, like pedestrians, cannot be directly controlled but is influenced by the actions of the autonomous vehicle or robot. In this paper, we present SPARC (Safe Prediction-Based Robust Controller for Coupled Agents), a novel framework designed to ensure safe control in the presence of coupled uncontrollable agents. SPARC leverages conformal prediction to quantify uncertainty in data-driven prediction of agent behavior. Particularly, we introduce a joint distribution-based approach to account for the coupled dynamics of the controlled system and uncontrollable agents. By integrating the control barrier function (CBF) technique, SPARC provides provable safety guarantees at a high confidence level. We illustrate our framework with a case study involving an autonomous driving scenario with walking pedestrians.
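As background for the uncertainty quantification step, the sketch below shows generic split conformal prediction: a threshold is chosen from held-out prediction errors so that a new error falls below it with probability at least 1 - alpha, assuming exchangeable data. This is the textbook construction, not SPARC's joint-distribution approach, and the function name and calibration values are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha."""
    n = len(scores)
    # Rank of the finite-sample-corrected quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(scores)[k - 1]

# Hypothetical calibration errors (metres) between predicted and
# observed pedestrian positions.
calibration_errors = [0.12, 0.30, 0.08, 0.25, 0.40,
                      0.18, 0.22, 0.15, 0.35, 0.28]
radius = conformal_quantile(calibration_errors, alpha=0.2)
print(radius)
```

A controller could then treat a ball of this radius around each predicted position as the region to avoid, for example inside a control barrier function constraint.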
- [63] arXiv:2411.04844 (replaced) [pdf, html, other]
-
Title: Discretized Gaussian Representation for Tomographic ReconstructionShaokai Wu, Yuxiang Lu, Wei Ji, Suizhi Huang, Fengyu Yang, Shalayiding Sirejiding, Qichen He, Jing Tong, Yanbiao Ji, Yue Ding, Hongtao LuSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Computed Tomography (CT) is a widely used imaging technique that provides detailed cross-sectional views of objects. Over the past decade, Deep Learning-based Reconstruction (DLR) methods have led efforts to enhance image quality and reduce noise, yet they often require large amounts of data and are computationally intensive. Inspired by recent advancements in scene reconstruction, some approaches have adapted NeRF and 3D Gaussian Splatting (3DGS) techniques for CT reconstruction. However, these methods are not ideal for direct 3D volume reconstruction. In this paper, we propose a novel Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs the 3D volume using a set of discretized Gaussian functions in an end-to-end manner. To further enhance computational efficiency, we introduce a Fast Volume Reconstruction technique that aggregates the contributions of these Gaussians into a discretized volume in a highly parallelized fashion. Our extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and significantly improved computational efficiency compared to existing DLR and instance reconstruction methods. Our code has been provided for review purposes and will be made publicly available upon publication.
- [64] arXiv:2411.05989 (replaced) [pdf, html, other]
-
Title: Filter-Banks for Ultra-Wideband for Communications, Sensing, and LocalizationComments: 7 pages, 8 figures, accepted in IEEE International Communications Conference Workshop 5Subjects: Signal Processing (eess.SP)
Recently, filter-bank multicarrier spread spectrum (FBMC-SS) has been proposed as a candidate waveform for ultra-wideband (UWB) communications, sensing, and localization. It has been noted that FBMC-SS is a perfect match to this application, leading to a trivial method of matching to the required spectral mask at different regions of the world. FBMC-SS also allows easy rejection of high-power interfering signals that may appear over different parts of the UWB spectral band. Moreover, passing the received signal through a matched filter provides precise information for sensing and localization. In this paper, we concentrate on the use of staggered multitone spread spectrum (SMT-SS) for UWB communications. SMT makes use of offset quadrature amplitude modulation (OQAM) to transmit data symbols over overlapping subcarrier bands. This form of FBMC-SS is well-suited to UWB communications because it has good spectral efficiency and a flat power spectral density (PSD), resulting in good utilization of the UWB spectral mask.
- [65] arXiv:2411.08904 (replaced) [pdf, html, other]
-
Title: Generalized Scattering Matrix of Antenna: Moment Solution, Compression Storage and ApplicationSubjects: Signal Processing (eess.SP)
This paper presents a method for computing the generalized scattering matrix (GSM) based on integral equations and the method of moments (MoM), specifically designed for antennas excited through waveguide ports. By leveraging two distinct formulations -- magnetic-type and electric-type integral equations -- we establish concise algebraic relations linking the GSM directly to the impedance matrices obtained from MoM. To address practical challenges in storing GSM data across wide frequency bands and multiple antenna scenarios, we propose an efficient compression scheme. This approach alleviates memory demands by selectively storing the dominant eigencomponents that govern scattering behavior. Numerical validation examples confirm the accuracy of our method by comparisons with full-wave simulation results. Furthermore, we introduce an efficient iterative procedure to predict antenna array performance, highlighting remarkable improvements in computational speed compared to conventional numerical methods. These results collectively demonstrate the GSM framework's strong potential for antenna-array design processes.
- [66] arXiv:2412.07428 (replaced) [pdf, html, other]
-
Title: Latency Minimization for UAV-Enabled Federated Learning: Trajectory Design and Resource AllocationComments: This manuscript has been submitted to IEEESubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Federated learning (FL) has become a transformative paradigm for distributed machine learning across wireless networks. However, the performance of FL is often hindered by the unreliable communication links between resource-constrained Internet of Things (IoT) devices and the central server. To overcome this challenge, we propose a novel framework that employs an unmanned aerial vehicle (UAV) as a mobile server to enhance the FL training process. By capitalizing on the UAV's mobility, we establish strong line-of-sight connections with IoT devices, thereby enhancing communication reliability and capacity. To maximize training efficiency, we formulate a latency minimization problem that jointly optimizes bandwidth allocation, computing frequencies, transmit power for both the UAV and IoT devices, and the UAV's flight trajectory. Subsequently, we analyze the number of IoT device training rounds and UAV aggregation rounds required for FL convergence. Based on the convergence constraint, we transform the problem into three subproblems and develop an efficient alternating optimization algorithm to solve it. Additionally, we provide a thorough analysis of the algorithm's convergence and computational complexity. Extensive numerical results demonstrate that our proposed scheme not only surpasses existing benchmark schemes, reducing latency by up to 15.29%, but also achieves training efficiency that nearly matches the ideal scenario.
- [67] arXiv:2501.15128 (replaced) [pdf, html, other]
-
Title: MAP-based Problem-Agnostic diffusion model for Inverse Problems
Comments: 17 pages, 10 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Diffusion models have shown great promise in solving inverse problems in image processing. In this paper, we propose a novel, problem-agnostic diffusion model called the maximum a posteriori (MAP)-based guided term estimation method for inverse problems. To leverage unconditionally pretrained diffusion models to address conditional generation tasks, we divide the conditional score function into two terms according to Bayes' rule: an unconditional score function (approximated by a pretrained score network) and a guided term, which is estimated using a novel MAP-based method that incorporates a Gaussian-type prior of natural images. This innovation allows us to better capture the intrinsic properties of the data, leading to improved performance. Numerical results demonstrate that our method preserves content more effectively than state-of-the-art methods -- for example, maintaining the structure of glasses in super-resolution tasks and producing more coherent results in the neighborhood of masked regions during inpainting.
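The two-term split described in this abstract follows directly from Bayes' rule. As a sketch, writing $x_t$ for the noisy image at diffusion time $t$, $y$ for the measurement, and $s_\theta$ for the pretrained score network (notation assumed here, not taken from the paper):

```latex
\begin{align}
p_t(x_t \mid y) &\propto p_t(x_t)\, p_t(y \mid x_t), \\
\nabla_{x_t} \log p_t(x_t \mid y)
  &= \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\approx\, s_\theta(x_t,\, t)}
   + \underbrace{\nabla_{x_t} \log p_t(y \mid x_t)}_{\text{guided term}} .
\end{align}
```

The normalizing constant $p(y)$ does not depend on $x_t$, so it vanishes under the gradient. Only the guided term requires a problem-specific estimate; how the paper's MAP-based method estimates it is not shown here.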
- [68] arXiv:2502.18924 (replaced) [pdf, html, other]
-
Title: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at this https URL.
- [69] arXiv:2503.02892 (replaced) [pdf, html, other]
-
Title: Segmenting Bi-Atrial Structures Using ResNext Based Framework
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Atrial fibrillation (AF) is the most common cardiac arrhythmia, significantly contributing to mortality, particularly in older populations. While pulmonary vein isolation is a standard treatment, its effectiveness is limited in patients with persistent AF. Recent research highlights the importance of targeting additional atrial regions, particularly fibrotic areas identified via late gadolinium-enhanced MRI (LGE-MRI). However, existing manual segmentation methods are time-consuming and prone to variability. Deep learning techniques, particularly convolutional neural networks (CNNs), have shown promise in automating segmentation. However, most studies focus solely on the left atrium (LA) and rely on small datasets, limiting generalizability. In this paper, we propose a novel two-stage framework incorporating ResNeXt encoders and a cyclic learning rate to segment both the right atrium (RA) and LA walls and cavities in LGE-MRIs. Our method aims to improve the segmentation of challenging small structures, such as atrial walls, while maintaining high performance in larger regions like the atrial cavities. The results demonstrate that our approach offers superior segmentation accuracy and robustness compared to traditional architectures, particularly for imbalanced class structures.
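The abstract mentions a cyclic learning rate without specifying the schedule. As a minimal sketch, the commonly used triangular policy can be written as follows; the function name, the default bounds, and the cycle length are illustrative assumptions, not values from the paper.

```python
import math


def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100):
    """Triangular cyclic learning rate (hypothetical defaults).

    The rate rises linearly from base_lr to max_lr over half_cycle steps,
    then falls back to base_lr over the next half_cycle steps, and repeats.
    """
    # Index of the current cycle, starting at 1.
    cycle = math.floor(1 + step / (2 * half_cycle))
    # Distance from the peak of the current cycle, in [0, 1].
    x = abs(step / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

Periodically raising the rate is often motivated as a way to help training escape flat or poor regions of the loss surface; whether the paper uses this exact policy is not stated in the abstract.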
- [70] arXiv:2503.05988 (replaced) [pdf, html, other]
-
Title: Physics-Informed Generative Approaches for Wireless Channel Modeling
Satyavrat Wagle, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, David J. Love, Christopher G. Brinton
Subjects: Signal Processing (eess.SP)
In recent years, machine learning (ML) methods have become increasingly popular in wireless communication systems for several applications. A critical bottleneck for designing ML systems for wireless communications is the availability of realistic wireless channel datasets, which are extremely resource intensive to produce. To this end, the generation of realistic wireless channels plays a key role in the subsequent design of effective ML algorithms for wireless communication systems. Generative models have been proposed to synthesize channel matrices, but outputs produced by such methods may not correspond to geometrically viable channels and do not provide any insight into the scenario of interest. In this work, we aim to address both these issues by integrating a parametric, physics-based geometric channel (PBGC) modeling framework with generative methods. To address limitations with gradient flow through the PBGC model, a linearized reformulation is presented, which ensures smooth gradient flow during generative model training, while also capturing insights about the underlying physical environment. We evaluate our model against prior baselines by comparing the generated samples in terms of the 2-Wasserstein distance and through the utility of generated data when used for downstream compression tasks.
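To make the 2-Wasserstein evaluation metric concrete, here is a minimal sketch for the simplest case: two equal-size sets of scalar samples, where the optimal transport plan pairs the sorted values. The paper compares generated channel samples, which are high-dimensional, so this one-dimensional helper (a hypothetical name) only illustrates the definition.

```python
import math


def wasserstein2_1d(xs, ys):
    """Empirical 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches order statistics, so
    W_2^2 is the mean squared difference of the sorted samples.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    mean_sq = sum((a - b) ** 2 for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
    return math.sqrt(mean_sq)
```

For example, shifting every sample by a constant c gives a distance of exactly |c|, which matches the intuition that the metric measures how far mass must be moved.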
- [71] arXiv:2503.14222 (replaced) [pdf, html, other]
-
Title: Stacked-Residual PINN for State Reconstruction of Hyperbolic Systems
Subjects: Systems and Control (eess.SY)
In a more connected world, modeling multi-agent systems with hyperbolic partial differential equations (PDEs) offers a potential solution to the curse of dimensionality. However, classical control tools need adaptation for these complex systems. Physics-informed neural networks (PINNs) provide a powerful framework to fix this issue by inferring solutions to PDEs by embedding governing equations into the neural network. A major limitation of original PINNs is their inability to capture steep gradients and discontinuities in hyperbolic PDEs. This paper proposes a stacked residual PINN method enhanced with a vanishing viscosity mechanism. Initially, a basic PINN with a small viscosity coefficient provides a stable, low-fidelity solution. Residual correction blocks with learnable scaling parameters then iteratively refine this solution, progressively decreasing the viscosity coefficient to transition from parabolic to hyperbolic PDEs. Applying this method to traffic state reconstruction improved results by an order of magnitude in relative $\mathcal{L}^2$ error, demonstrating its potential to accurately estimate solutions where original PINNs struggle with instability and low fidelity.
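The vanishing viscosity idea can be illustrated on a scalar one-dimensional conservation law; this particular form, with flux $f$ and viscosity coefficient $\nu$, is an assumption chosen for illustration and may differ from the system treated in the paper.

```latex
\begin{align}
\partial_t u^{\nu} + \partial_x f(u^{\nu}) &= \nu\, \partial_{xx} u^{\nu},
  \qquad \nu > 0 \quad \text{(parabolic, smooth solutions)}, \\
\partial_t u + \partial_x f(u) &= 0
  \qquad \text{(hyperbolic limit as } \nu \to 0^{+}\text{)} .
\end{align}
```

For $\nu > 0$ the diffusion term smooths shocks, which makes the residual easier for a network to fit; decreasing $\nu$ stage by stage moves the regularized problem toward the original hyperbolic one.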
- [72] arXiv:2305.15364 (replaced) [pdf, html, other]
-
Title: LQG Risk-Sensitive Single-Agent and Major-Minor Mean-Field Game Systems: A Variational Framework
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Probability (math.PR); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
We develop a variational approach to address risk-sensitive optimal control problems with an exponential-of-integral cost functional in a general linear-quadratic-Gaussian (LQG) single-agent setup, offering new insights into such problems. Our analysis leads to the derivation of a nonlinear necessary and sufficient condition of optimality, expressed in terms of martingale processes. Subject to specific conditions, we find an equivalent risk-neutral measure, under which a linear state feedback form can be obtained for the optimal control. It is then shown that the obtained feedback control is consistent with the imposed condition and remains optimal under the original measure. Building upon this development, we (i) propose a variational framework for general LQG risk-sensitive mean-field games (MFGs) and (ii) advance the LQG risk-sensitive MFG theory by incorporating a major agent in the framework. The major agent interacts with a large number of minor agents, and unlike the minor agents, its influence on the system remains significant even with an increasing number of minor agents. We derive the Markovian closed-loop best-response strategies of agents in the limiting case where the number of agents goes to infinity. We establish that the set of obtained best-response strategies yields a Nash equilibrium in the limiting case and an $\varepsilon$-Nash equilibrium in the finite-player case.
- [73] arXiv:2310.04722 (replaced) [pdf, html, other]
-
Title: A Holistic Evaluation of Piano Sound Quality
Comments: 15 pages, 9 figures
Journal-ref: Proceedings of the 10th Conference on Sound and Music Technology. CSMT 2023. Lecture Notes in Electrical Engineering, vol 1268. Springer, Singapore
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper aims to develop a holistic evaluation method for piano sound quality to assist in purchasing decisions. Unlike previous studies that focused on the effect of piano performance techniques on sound quality, this study evaluates the inherent sound quality of different pianos. To derive quality evaluation systems, the study uses subjective questionnaires based on a piano sound quality dataset. The method selects the optimal piano classification models by comparing the fine-tuning results of different pre-training models of Convolutional Neural Networks (CNN). To improve the interpretability of the models, the study applies Equivalent Rectangular Bandwidth (ERB) analysis. The results reveal that musically trained individuals are better able to distinguish between the sound quality differences of different pianos. The best fine-tuned CNN pre-trained backbone achieves a high accuracy of 98.3% as the piano classifier. However, the dataset is limited, and the audio is sliced to increase its quantity, resulting in a lack of diversity and balance, so we use focal loss to reduce the impact of data imbalance. To optimize the method, the dataset will be expanded, or few-shot learning techniques will be employed in future research.
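The focal loss mentioned above down-weights examples the classifier already gets right, so rare or hard classes contribute more to training. A minimal single-example sketch follows; the function name and the default values of gamma and alpha are illustrative assumptions, since the abstract does not give the settings used.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers weighted cross-entropy
    alpha  : class weight (hypothetical default of 1.0)
    """
    # Clamp to avoid log(0) for a true class predicted with zero probability.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 2, an example predicted at p_t = 0.9 has its cross-entropy scaled by (0.1)^2 = 0.01, while one predicted at p_t = 0.1 is scaled by only 0.81, so the loss concentrates on the misclassified example.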
- [74] arXiv:2312.00206 (replaced) [pdf, html, other]
-
Title: SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting
Comments: Version accepted to 3DV 2025. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
3D Gaussian Splatting (3DGS) has recently enabled real-time rendering of unbounded 3D scenes for novel view synthesis. However, this technique requires dense training views to accurately reconstruct 3D geometry. A limited number of input views will significantly degrade reconstruction quality, resulting in artifacts such as "floaters" and "background collapse" at unseen viewpoints. In this work, we introduce SparseGS, an efficient training pipeline designed to address the limitations of 3DGS in scenarios with sparse training views. SparseGS incorporates depth priors, novel depth rendering techniques, and a pruning heuristic to mitigate floater artifacts, alongside an Unseen Viewpoint Regularization module to alleviate background collapses. Our extensive evaluations on the Mip-NeRF360, LLFF, and DTU datasets demonstrate that SparseGS achieves high-quality reconstruction in both unbounded and forward-facing scenarios, with as few as 12 and 3 input images, respectively, while maintaining fast training and real-time rendering capabilities.
- [75] arXiv:2402.13901 (replaced) [pdf, other]
-
Title: Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\mathcal O(1/\epsilon^2)$ rate of vanilla diffusion models, where $\epsilon$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed to all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie's formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.
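For reference, Tweedie's formula in the standard DDPM setting can be stated as follows. The forward-process notation ($\bar\alpha_t$, $q_t$) is the usual DDPM convention and is assumed here; the paper's own notation may differ.

```latex
\begin{align}
x_t &= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon,
  \qquad \varepsilon \sim \mathcal{N}(0, I_d), \\
\mathbb{E}\left[ x_0 \mid x_t \right]
  &= \frac{1}{\sqrt{\bar\alpha_t}}
     \left( x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log q_t(x_t) \right) .
\end{align}
```

The formula expresses the posterior mean of the clean sample through the score of the noisy marginal $q_t$, which is why it is useful for rewriting conditional-expectation terms that arise in a Taylor expansion in terms of score quantities.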
- [76] arXiv:2403.05944 (replaced) [pdf, html, other]
-
Title: Model-Predictive Trajectory Generation for Aerial Search and Coverage
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper introduces a trajectory planning algorithm for search and coverage missions with an Unmanned Aerial Vehicle (UAV) based on an uncertainty map that represents prior knowledge of the target region, modeled by a Gaussian Mixture Model (GMM). The trajectory planning problem is formulated as an Optimal Control Problem (OCP), which aims to maximize the uncertainty reduction within a specified mission duration. However, this results in an intractable OCP whose objective functional cannot be expressed in closed form. To address this, we propose a Model Predictive Control (MPC) algorithm based on a relaxed formulation of the objective function to approximate the optimal solutions. This relaxation promotes efficient map exploration by penalizing overlaps in the UAV's visibility regions along the trajectory. The algorithm can produce efficient and smooth trajectories, and it can be efficiently implemented using standard Nonlinear Programming solvers, being suitable for real-time planning. Unlike traditional methods, which often rely on discretizing the mission space and using complex mixed-integer formulations, our approach is computationally efficient and easier to implement. The MPC algorithm is initially assessed in MATLAB, followed by Gazebo simulations and actual experimental tests conducted in an outdoor environment. The results demonstrate that the proposed strategy can generate efficient and smooth trajectories for search and coverage missions.
- [77] arXiv:2406.02166 (replaced) [pdf, html, other]
-
Title: Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision
Comments: Accepted by IEEE-TASLP
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we release the code, models and data for the entire pipeline of Whistle at this https URL.
- [78] arXiv:2407.00258 (replaced) [pdf, html, other]
-
Title: Topological Graph Simplification Solutions to the Street Intersection Miscount Problem
Journal-ref: Transactions in GIS, 2025
Subjects: Physics and Society (physics.soc-ph); Discrete Mathematics (cs.DM); Systems and Control (eess.SY); Computation (stat.CO)
Street intersection counts and densities are ubiquitous measures in transport geography and planning. However, typical street network data and typical street network analysis tools can substantially overcount them. This article explains the three main reasons why this happens and presents solutions to each. It contributes algorithms to automatically simplify spatial graphs of urban street networks -- via edge simplification and node consolidation -- resulting in faster parsimonious models and more accurate network measures like intersection counts and densities, street segment lengths, and node degrees. These algorithms' information compression improves downstream graph analytics' memory and runtime efficiency, boosting analytical tractability without loss of model fidelity. Finally, this article validates these algorithms and empirically assesses intersection count biases worldwide to demonstrate the problem's widespread prevalence. Without consolidation, traditional methods would overestimate the median urban area intersection count by 14%. However, this bias varies drastically across regions, underscoring these algorithms' importance for consistent comparative empirical analyses.
- [79] arXiv:2407.05608 (replaced) [pdf, html, other]
-
Title: A Benchmark for Multi-speaker Anonymization
Comments: Accepted by TIFS
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus particularly on single-speaker scenarios. However, they lack practicality for real-world applications, i.e., multi-speaker scenarios. In this paper, we present an initial attempt to provide a multi-speaker anonymization benchmark by defining the task and evaluation protocol, proposing benchmarking solutions, and discussing the privacy leakage of overlapping conversations. The proposed benchmark solutions are based on a cascaded system that integrates spectral-clustering-based speaker diarization and disentanglement-based speaker anonymization using a selection-based anonymizer. To improve utility, the benchmark solutions are further enhanced by two conversation-level speaker vector anonymization methods. The first method minimizes the differential similarity across speaker pairs in the original and anonymized conversations, which maintains original speaker relationships in the anonymized version. The other minimizes the aggregated similarity across anonymized speakers, which achieves better differentiation between speakers. Experiments conducted on both non-overlap simulated and real-world datasets demonstrate the effectiveness of the multi-speaker anonymization system with the proposed speaker anonymizers. Additionally, we analyzed overlapping speech regarding privacy leakage and provided potential solutions.
- [80] arXiv:2408.16315 (replaced) [pdf, other]
-
Title: Passenger hazard perception based on EEG signals for highly automated driving vehicles
Ashton Yu Xuan Tan, Yingkai Yang, Xiaofei Zhang, Bowen Li, Xiaorong Gao, Sifa Zheng, Jianqiang Wang, Xinyu Gu, Jun Li, Yang Zhao, Yuxin Zhang, Tania Stathaki
Comments: We have decided to withdraw this submission due to ongoing revisions and further refinements in our research. A revised version may be resubmitted in the future. We appreciate the feedback and interest from the community
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and the Passenger EEG Decoding Strategy (PEDS). Central to PEDS is a novel Convolutional Recurrent Neural Network (CRNN) that captures spatial and temporal EEG data patterns. The CRNN, combined with stacking algorithms, achieves an accuracy of $85.0\% \pm 3.18\%$. Our findings highlight the predictive power of pre-event EEG data, enhancing the detection of hazardous scenarios and offering a network-driven framework for safer autonomous vehicles.
- [81] arXiv:2410.12399 (replaced) [pdf, html, other]
-
Title: SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, postulating standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections within the fitted targets of flow matching, which presents challenges to model training and increases the curvature of the learned generation trajectories. These curved trajectories restrict the capacity of ODE models to generate desirable samples in a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by jointly training it with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms the state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page: this https URL.
- [82] arXiv:2410.21897 (replaced) [pdf, html, other]
-
Title: Semi-Supervised Self-Learning Enhanced Music Emotion Recognition
Comments: 12 pages, 2 figures
Journal-ref: Proceedings of the 11th Conference on Sound and Music Technology. CSMT 2024. Lecture Notes in Electrical Engineering. Springer, Singapore
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. However, the public datasets currently available for MER have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. The predicted segment-level results are then aggregated to obtain the prediction for the entire song. Most commonly, each segment inherits the label of the clip containing it, but music emotion is not constant throughout a clip. Doing so introduces label noise and makes training prone to overfitting. To handle the noisy label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.
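The segment-to-clip aggregation step can be sketched as below. Averaging the segment-level class probabilities is one common choice and is an assumption here, since the abstract does not say which aggregation rule is used; the function name is hypothetical.

```python
def aggregate_segments(segment_probs):
    """Average segment-level class probabilities into a clip-level prediction.

    segment_probs : list of per-segment probability lists, one per segment
    Returns (mean_probs, predicted_label).
    """
    if not segment_probs:
        raise ValueError("at least one segment is required")
    n = len(segment_probs)
    num_classes = len(segment_probs[0])
    # Class-wise mean over all segments of the clip.
    mean_probs = [sum(p[c] for p in segment_probs) / n for c in range(num_classes)]
    # Clip-level label is the class with the highest mean probability.
    label = max(range(num_classes), key=lambda c: mean_probs[c])
    return mean_probs, label
```

Because every segment carries the clip's label during training, a segment whose emotion differs from the clip as a whole acts as a noisily labeled sample, which is the issue the proposed method targets.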
- [83] arXiv:2412.02798 (replaced) [pdf, html, other]
-
Title: Grayscale to Hyperspectral at Any Resolution Using a Phase-Only Lens
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
We consider the problem of reconstructing a HxWx31 hyperspectral image from a HxW grayscale snapshot measurement that is captured using only a single diffractive optic and a filterless panchromatic photosensor. This problem is severely ill-posed, but we present the first model that produces high-quality results. We make efficient use of limited data by training a conditional denoising diffusion model that operates on small patches in a shift-invariant manner. During inference, we synchronize per-patch hyperspectral predictions using guidance derived from the optical point spread function (PSF). Surprisingly, our experiments reveal that patch sizes as small as the PSF's support achieve excellent results, and they show that local optical cues are sufficient to capture full spectral information. Moreover, by drawing multiple samples, our model provides per-pixel uncertainty estimates that strongly correlate with reconstruction error. Our work lays the foundation for a new class of high-resolution snapshot hyperspectral imagers that are compact and light-efficient.
- [84] arXiv:2412.06602 (replaced) [pdf, html, other]
-
Title: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
Comments: A comprehensive survey on controllable TTS, 26 pages, 7 tables, 6 figures, 317 references. Under review
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. In addition, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this work, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industrial practitioners.
- [85] arXiv:2412.17667 (replaced) [pdf, html, other]
-
Title: VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile enough to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at this https URL.
- [86] arXiv:2503.12101 (replaced) [pdf, html, other]
-
Title: MUSE: A Real-Time Multi-Sensor State Estimator for Quadruped Robots
Comments: Accepted for publication in IEEE Robotics and Automation Letters
Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
This paper introduces an innovative state estimator, MUSE (MUlti-sensor State Estimator), designed to enhance the accuracy and real-time performance of state estimation in quadruped robot navigation. The proposed state estimator builds upon our previous work presented in [1]. It integrates data from a range of onboard sensors, including IMUs, encoders, cameras, and LiDARs, to deliver a comprehensive and reliable estimation of the robot's pose and motion, even in slippery scenarios. We tested MUSE on a Unitree Aliengo robot, successfully closing the locomotion control loop in difficult scenarios, including slippery and uneven terrain. Benchmarking against Pronto [2] and VILENS [3] showed 67.6% and 26.7% reductions in translational errors, respectively. Additionally, MUSE outperformed DLIO [4], a LiDAR-inertial odometry system, in rotational errors and frequency, while the proprioceptive version of MUSE (P-MUSE) outperformed TSIF [5], with a 45.9% reduction in absolute trajectory error (ATE).
- [87] arXiv:2503.20646 (replaced) [pdf, html, other]
-
Title: Immersive and Wearable Thermal Rendering for Augmented Reality
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
In augmented reality (AR), where digital content is overlaid onto the real world, realistic thermal feedback has been shown to enhance immersion. Yet current thermal feedback devices, heavily influenced by the needs of virtual reality, often hinder physical interactions and are ineffective for immersion in AR. To bridge this gap, we have identified three design considerations relevant for AR thermal feedback: indirect feedback to maintain dexterity, thermal passthrough to preserve real-world temperature perception, and spatiotemporal rendering for dynamic sensations. We then created a unique and innovative thermal feedback device that satisfies these criteria. Human subject experiments assessing perceptual sensitivity, object temperature matching, spatial pattern recognition, and moving thermal stimuli demonstrated the impact of our design, enabling realistic temperature discrimination, virtual object perception, and enhanced immersion. These findings demonstrate that carefully designed thermal feedback systems can bridge the sensory gap between physical and virtual interactions, enhancing AR realism and usability.