Electrical Engineering and Systems Science
Showing new listings for Monday, 20 January 2025
- [1] arXiv:2501.09759 [pdf, other]
Title: A wideband amplifying and filtering reconfigurable intelligent surface for wireless relay
Authors: Lijie Wu, Qun Yan Zhou, Jun Yan Dai, Siran Wang, Junwei Zhang, Zhen Jie Qi, Hanqing Yang, Ruizhe Jiang, Zheng Xing Wang, Huidong Li, Zhen Zhang, Jiang Luo, Qiang Cheng, Tie Jun Cui
Subjects: Signal Processing (eess.SP); Applied Physics (physics.app-ph)
Programmable metasurfaces have garnered significant attention due to their exceptional ability to manipulate electromagnetic (EM) waves in real time, leading to the emergence of a prominent area in wireless communication, namely reconfigurable intelligent surfaces (RISs), to control the signal propagation and coverage. However, the existing RISs usually suffer from limited operating distance and band interference, which hinder their practical applications in wireless relay and communication systems. To overcome the limitations, we propose an amplifying and filtering RIS (AF-RIS) to enhance the in-band signal energy and filter the out-of-band signal of the incident EM waves, ensuring the miniaturization of the RIS array and enabling its anti-interference ability. In addition, each AF-RIS element is equipped with a 2-bit phase control capability, further endowing the entire array with great beamforming performance. An elaborately designed 4*8 AF-RIS array is presented by integrating the power dividing and combining networks, which substantially reduces the number of amplifiers and filters, thereby reducing the hardware costs and power consumption. Experimental results showcase the powerful capabilities of AF-RIS in beam-steering, frequency selectivity, and signal amplification. Therefore, the proposed AF-RIS holds significant promise for critical applications in wireless relay systems by offering an efficient solution to improve frequency selectivity, enhance signal coverage, and reduce hardware size.
- [2] arXiv:2501.09761 [pdf, other]
Title: VERITAS: Verifying the Performance of AI-native Transceiver Actions in Base-Stations
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Artificial Intelligence (AI)-native receivers provide significant performance improvements in high-noise regimes and can potentially reduce communication overhead compared to traditional receivers. However, their performance highly depends on the representativeness of the training dataset. A major issue is the uncertainty of whether the training dataset covers all test environments and waveform configurations, and thus, whether the trained model is robust in practical deployment conditions. To this end, we propose a joint measurement-recovery framework for AI-native transceivers post deployment, called VERITAS, that continuously looks for distribution shifts in the received signals and triggers finite re-training spurts. VERITAS monitors the wireless channel using 5G pilots fed to an auxiliary neural network that detects out-of-distribution channel profile, transmitter speed, and delay spread. As soon as such a change is detected, a traditional (reference) receiver is activated, which runs for a period of time in parallel to the AI-native receiver. Finally, VERITAS compares the bit probabilities of the AI-native and the reference receivers for the same received data inputs, and decides whether or not a retraining process needs to be initiated. Our evaluations reveal that VERITAS can detect changes in the channel profile, transmitter speed, and delay spread with 99%, 97%, and 69% accuracies, respectively, followed by timely initiation of retraining for 86%, 93.3%, and 94.8% of inputs in channel profile, transmitter speed, and delay spread test sets, respectively.
- [3] arXiv:2501.09799 [pdf, html, other]
Title: Scan-Adaptive MRI Undersampling Using Neighbor-based Optimization (SUNO)
Subjects: Image and Video Processing (eess.IV)
Accelerated MRI involves collecting partial k-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or hand-designed schemes. Recent works have studied population-adaptive sampling patterns that are learned from a group of patients (or scans) based on population-specific metrics. However, such a general sampling pattern can be sub-optimal for any specific scan since it may lack scan or slice adaptive details. To overcome this issue, we propose a framework for jointly learning scan-adaptive Cartesian undersampling patterns and a corresponding reconstruction model from a training set. We use an alternating algorithm for learning the sampling patterns and reconstruction model where we use an iterative coordinate descent (ICD) based offline optimization of scan-adaptive k-space sampling patterns for each example in the training set. A nearest neighbor search is then used to select the scan-adaptive sampling pattern at test time from initially acquired low-frequency k-space information. We applied the proposed framework (dubbed SUNO) to the fastMRI multi-coil knee and brain datasets, demonstrating improved performance over currently used undersampling patterns at both 4x and 8x acceleration factors in terms of both visual quality and quantitative metrics. The code for the proposed framework is available at this https URL.
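The test-time selection step described in the abstract above (a nearest neighbor search over initially acquired low-frequency k-space information) could be sketched as follows. This is a minimal illustration and not the SUNO implementation: the feature vectors, the Euclidean distance, the toy masks, and the name `select_pattern` are all assumptions made for the example.

```python
import math


def select_pattern(query_features, train_features, train_patterns):
    """Return the sampling pattern of the training scan closest to the query.

    The features stand in for hypothetical low-frequency k-space descriptors,
    and Euclidean distance is an assumed similarity measure.
    """
    nearest = min(
        range(len(train_features)),
        key=lambda i: math.dist(query_features, train_features[i]),
    )
    return train_patterns[nearest]


# Toy data: three training scans, each with a 2-D feature vector and a binary
# mask over 8 phase-encode lines (1 = line acquired).
train_features = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.5)]
train_patterns = [
    [1, 0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1, 1, 0, 1],
]

# The query is closest to the second training scan, so its mask is selected.
chosen = select_pattern((0.9, 1.2), train_features, train_patterns)
```

In this sketch the per-scan patterns are assumed to have been optimized offline; only the lookup happens at test time.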
- [4] arXiv:2501.09832 [pdf, other]
Title: Crossover-BPSO Driven Multi-Agent Technology for Managing Local Energy Systems
Subjects: Systems and Control (eess.SY)
This article presents a new hybrid algorithm, crossover binary particle swarm optimization (crBPSO), for allocating resources in local energy systems via multi-agent (MA) technology. Initially, a hierarchical MA-based architecture in a grid-connected local energy setup is presented. In this architecture, task-specific agents operate in a master-slave manner, where the master runs a well-formulated optimization routine aiming at minimizing the costs of energy procurement, battery degradation, and load scheduling delay. The slaves update the master on their current status and receive optimal action plans accordingly. Simulation results demonstrate that the proposed algorithm outperforms selected existing ones by 21% in terms of average energy system costs while satisfying customers' energy demand and maintaining the required quality of service.
- [5] arXiv:2501.09837 [pdf, html, other]
Title: Complex-Valued Neural Networks for Ultra-Reliable Massive MIMO
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
In the evolving landscape of 5G and 6G networks, the demands extend beyond high data rates, ultra-low latency, and extensive coverage, increasingly emphasizing the need for reliability. This paper proposes an ultra-reliable multiple-input multiple-output (MIMO) scheme utilizing quasi-orthogonal space-time block coding (QOSTBC) combined with singular value decomposition (SVD) for channel state information (CSI) correction, significantly improving performance over QOSTBC and traditional orthogonal STBC (OSTBC) when analyzing spectral efficiency. Although QOSTBC enhances spectral efficiency, it also increases computational complexity at the maximum likelihood (ML) decoder. To address this, a neural network-based decoding scheme using phase-transmittance radial basis function (PT-RBF) architecture is also introduced to manage QOSTBC's complexity. Simulation results demonstrate improved system robustness and performance, making this approach a potential candidate for ultra-reliable communication in next-generation networks.
- [6] arXiv:2501.09853 [pdf, html, other]
Title: Greening the Grid: Electricity Market Clearing with Consumer-Based Carbon Cost
Comments: 10 pages, 8 figures
Subjects: Systems and Control (eess.SY)
To enhance decarbonization efforts in electric power systems, we propose a novel electricity market clearing model that internalizes the allocation of emissions from generations to loads and allows for consideration of consumer-side carbon costs. Specifically, consumers can not only bid for power but also assign a cost to the carbon emissions incurred by their electricity use. These carbon costs provide consumers, ranging from carbon-agnostic to carbon-sensitive, with a tool to actively manage their roles in carbon emission mitigation. By incorporating carbon allocation and consumer-side carbon costs, the market clearing is influenced not solely by production and demand dynamics but also by the allocation of carbon emission responsibilities. To demonstrate the effect of our proposed model, we conduct a case study comparing market clearing outcomes across various percentages of carbon-sensitive consumers with differing carbon costs.
- [7] arXiv:2501.09857 [pdf, html, other]
Title: Efficient Probabilistic Assessment of Power System Resilience Using the Polynomial Chaos Expansion Method with Enhanced Stability
Comments: Submitted to IEEE PESGM 2025
Subjects: Systems and Control (eess.SY)
The increasing frequency and intensity of extreme weather events motivate the assessment of power system resilience. The random nature of these events and the resulting failures mandates probabilistic resilience assessment, but state-of-the-art methods (e.g., Monte Carlo simulation) are computationally inefficient. This paper leverages the polynomial chaos expansion (PCE) method to efficiently quantify uncertainty in power system resilience. To address repeatability issues arising from PCE computation with different sample sets, we propose the integration of the Maximin-LHS experiment design method with the PCE method. Numerical studies on the IEEE 39-bus system illustrate the improved repeatability and convergence of the proposed method. The enhanced PCE method is then used to assess the resilience of the system and propose adaptation measures to improve it.
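The maximin Latin hypercube sampling (LHS) design named in the abstract above can be illustrated with a small sketch. It assumes the simplest form of the criterion: draw several random Latin hypercube designs and keep the one whose smallest pairwise distance is largest. The function names, the candidate count, and the random-search strategy are assumptions for illustration, not the paper's procedure, and the PCE fitting step is not shown.

```python
import math
import random


def latin_hypercube(n, dim, rng):
    """Draw n points in [0, 1)^dim with exactly one point per stratum per axis."""
    columns = []
    for _ in range(dim):
        strata = list(range(n))
        rng.shuffle(strata)
        # Place one uniformly jittered point inside each of the n strata.
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(columns[d][i] for d in range(dim)) for i in range(n)]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two distinct points."""
    return min(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )


def maximin_lhs(n, dim, candidates=50, seed=0):
    """Keep the candidate design with the largest minimum pairwise distance."""
    rng = random.Random(seed)
    designs = [latin_hypercube(n, dim, rng) for _ in range(candidates)]
    return max(designs, key=min_pairwise_distance)


design = maximin_lhs(n=8, dim=2)
```

Fixing the seed makes the design reproducible, which is the kind of run-to-run repeatability the abstract is concerned with.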
- [8] arXiv:2501.09863 [pdf, other]
Title: Detection of Vascular Leukoencephalopathy in CT Images
Journal-ref: Artificial Intelligence XLI. SGAI 2024. Lecture Notes in Computer Science, vol 15446. Springer, Cham (2025)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Artificial intelligence (AI) has seen a significant surge in popularity, particularly in its application to medicine. This study explores AI's role in diagnosing leukoencephalopathy, a small vessel disease of the brain, and a leading cause of vascular dementia and hemorrhagic strokes. We utilized a dataset of approximately 1200 patients with axial brain CT scans to train convolutional neural networks (CNNs) for binary disease classification. Addressing the challenge of varying scan dimensions due to different patient physiologies, we processed the data to a uniform size and applied three preprocessing methods to improve model accuracy. We compared four neural network architectures: ResNet50, ResNet50 3D, ConvNext, and Densenet. The ConvNext model achieved the highest accuracy of 98.5% without any preprocessing, outperforming models with 3D convolutions. To gain insights into model decision-making, we implemented Grad-CAM heatmaps, which highlighted the focus areas of the models on the scans. Our results demonstrate that AI, particularly the ConvNext architecture, can significantly enhance diagnostic accuracy for leukoencephalopathy. This study underscores AI's potential in advancing diagnostic methodologies for brain diseases and highlights the effectiveness of CNNs in medical imaging applications.
- [9] arXiv:2501.09877 [pdf, html, other]
Title: CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition
Comments: Accepted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. As a non-conventional acoustic sensor, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at this https URL.
- [10] arXiv:2501.09889 [pdf, html, other]
Title: Learning port maneuvers from data for automatic guidance of Unmanned Surface Vehicles
Comments: Preprint submitted to journal (under review). 25 pages, 13 figures, 3 tables
Subjects: Systems and Control (eess.SY)
At shipping ports, some repetitive maneuvering tasks, such as entering or leaving port, transporting goods inside it, or carrying out surveillance activities, can be efficiently and quickly carried out by a domestic pilot according to his experience. This know-how can be captured by Unmanned Surface Vehicles (USVs) in order to autonomously replicate the same tasks. However, the inherent nonlinearity of ship trajectories and environmental perturbations such as wind or marine currents make it difficult to learn a model and its respective control. We therefore present a data-driven learning and control methodology for USVs, which is based on Gaussian Mixture Models, Gaussian Mixture Regression, and Sontag's universal formula. Our approach is capable of learning the nonlinear dynamics as well as guaranteeing convergence toward the target with a robust controller. Real data have been collected through experiments with a vessel at the port of Ceuta. The complex trajectories followed by an expert have been learned, including the robust controller. The effect of the controller over noise/perturbations is presented, a measure of error is used to compare estimates and real data trajectories, and finally, an analysis of computational complexity is performed.
- [11] arXiv:2501.09935 [pdf, html, other]
Title: Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Diffusion models show remarkable potential for sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding and generalization of data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. A two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.
- [12] arXiv:2501.09944 [pdf, html, other]
Title: Minimum-Time Sequential Traversal by a Team of Small Unmanned Aerial Vehicles in an Unknown Environment with Winds
Comments: Draft submitted to the 2025 American Control Conference
Subjects: Systems and Control (eess.SY)
We consider the problem of transporting multiple packages from an initial location to a destination location in a windy urban environment using a team of small unmanned aerial vehicles (SUAVs). Each SUAV carries one package. We assume that the wind field is unknown, but wind speed can be measured by SUAVs during flight. The SUAVs fly sequentially one after the other, measure wind speeds along their trajectories, and report the measurements to a central computer. The overall objective is to minimize the total travel time of all SUAVs, which is in turn related to the number of SUAV traversals through the environment. For a discretized environment modeled by a graph, we describe a method to estimate wind speeds and the time of traversal for each SUAV path. Each SUAV traverses a minimum-time path planned based on the current wind field estimate. We study cases of static and time-varying wind fields with and without measurement noise. For each case, we demonstrate via numerical simulation that the proposed method finds the optimal path after a minimal number of traversals.
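The planning step in the abstract above, a minimum-time path over a graph whose edge traversal times depend on the current wind estimate, can be sketched with a standard shortest-path search. Everything specific here is hypothetical: the toy graph, the airspeed, the tailwind values, the simple "ground speed = airspeed + tailwind" model, and the function names. Plain Dijkstra is used as a stand-in and is not claimed to be the authors' method.

```python
import heapq


def traversal_time(length, airspeed, tailwind):
    """Edge traversal time under an assumed ground-speed model."""
    ground_speed = airspeed + tailwind
    if ground_speed <= 0:
        return float("inf")  # the edge cannot be flown against this headwind
    return length / ground_speed


def min_time_path(edges, source, target, airspeed):
    """Dijkstra search; edges maps node -> [(neighbor, length, tailwind), ...]."""
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == target:
            break
        if t > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, length, wind in edges.get(u, []):
            cand = t + traversal_time(length, airspeed, wind)
            if cand < best.get(v, float("inf")):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (cand, v))
    if best.get(target, float("inf")) == float("inf"):
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


# Toy wind estimate: the shorter route A-B-D has a headwind, while the longer
# route A-C-D has a tailwind and ends up faster.
edges = {
    "A": [("B", 100.0, -5.0), ("C", 120.0, 5.0)],
    "B": [("D", 100.0, -5.0)],
    "C": [("D", 120.0, 5.0)],
}
path, total = min_time_path(edges, "A", "D", airspeed=10.0)
```

In a sequential scheme like the one described, each flight's measurements would update the tailwind estimates before the next call to the planner.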
- [13] arXiv:2501.09948 [pdf, other]
Title: AI Explainability for Power Electronics: From a Lipschitz Continuity Perspective
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Lifecycle management of power converters continues to thrive with emerging artificial intelligence (AI) solutions, yet AI mathematical explainability remains unexplored in the power electronics (PE) community. The lack of theoretical rigor challenges adoption in mission-critical applications. Therefore, this letter proposes a generic framework to evaluate mathematical explainability, highlighting inference stability and training convergence from a Lipschitz continuity perspective. Inference stability governs consistent outputs under input perturbations, essential for robust real-time control and fault diagnosis. Training convergence guarantees stable learning dynamics, facilitating accurate modeling in PE contexts. Additionally, a Lipschitz-aware learning rate selection strategy is introduced to accelerate convergence while mitigating overshoots and oscillations. The feasibility of the proposed Lipschitz-oriented framework is demonstrated by validating the mathematical explainability of a state-of-the-art physics-in-architecture neural network, and substantiated through empirical case studies on dual-active-bridge converters. This letter serves as a clarion call for the PE community to embrace mathematical explainability, heralding a transformative era of trustworthy and explainable AI solutions that potentially redefine the future of power electronics.
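The link between a Lipschitz constant and the learning rate mentioned in the abstract above can be illustrated with the textbook case of gradient descent on a quadratic. The sketch assumes a loss f(x) = 2x², whose gradient 4x has Lipschitz constant L = 4: a step of 1/L converges, while a step above 2/L overshoots and oscillates with growing amplitude. This is the standard result only, with hypothetical names and values, and not the selection strategy proposed in the letter.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


L = 4.0  # Lipschitz constant of the gradient of f(x) = 2 * x**2


def grad(x):
    return L * x


# Step 1/L: lands on the minimizer x = 0 (in one step for this quadratic).
safe = gradient_descent(grad, x0=1.0, lr=1.0 / L, steps=5)

# Step 2.5/L (> 2/L): every update multiplies x by -1.5, so the iterate
# flips sign and grows, which is the overshoot/oscillation regime.
unstable = gradient_descent(grad, x0=1.0, lr=2.5 / L, steps=5)
```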
- [14] arXiv:2501.09992 [pdf, other]
Title: A Novel Modulation Scheme Based on the Kramers-Kronig Relations for Optical IM/DD Systems
Subjects: Signal Processing (eess.SP)
The ever-growing demand for higher data rates in optical communication systems necessitates the development of advanced modulation formats capable of significantly enhancing system performance. In this work, we propose a novel modulation format derived from the Kramers-Kronig relations. This scheme effectively reduces the complexity of digital filtering and alleviates the demands on the digital-to-analog converter, offering a practical solution for high-speed optical communication. The proposed modulation format was rigorously validated through experimental investigations using an optical wireless link. The results demonstrate a notable improvement in bit error rate (BER) performance and receiver sensitivity compared to PAM-4 and CAP-16 modulation schemes, with enhancements of 0.6 dB and 1.5 dB in receiver sensitivity, respectively. These improvements enable higher data transmission rates, positioning the Kramers-Kronig relations-based modulation format as a promising alternative to existing modulation techniques. Its potential to enhance the efficiency and capacity of optical communication systems is clearly evident. Future work will focus on extending its application to more complex scenarios, such as high-speed underwater optical communication systems, where advanced modulation formats are critical for overcoming bandwidth limitations.
- [15] arXiv:2501.10030 [pdf, html, other]
Title: Informativity Conditions for Multiple Signals: Properties, Experimental Design, and Applications
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT)
Recent studies highlight the importance of the persistently exciting condition on a single signal sequence for model identification and data-driven control methodologies. However, maintaining prolonged excitation in control signals introduces significant challenges, as continuous excitation can reduce the lifetime of mechanical devices. In this paper, we introduce three informativity conditions for various types of multi-signal data, each augmented by weight factors. We explore the interrelations between these conditions and their rank properties in linear time-invariant systems. Furthermore, we introduce open-loop experimental design methods tailored to each of the three conditions, which can synthesize the required excitation conditions either offline or online, even in the presence of limited information within each signal segment. We demonstrate the effectiveness of these informativity conditions in least-squares identification. Additionally, all three conditions can extend Willems' fundamental lemma and are utilized to assess the properties of the system. Illustrative examples confirm that these conditions yield satisfactory outcomes in both least-squares identification and the construction of data-driven controllers.
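As background for the abstract above, the classical single-signal condition it starts from can be checked numerically: a scalar sequence is persistently exciting of order k when its Hankel matrix with k rows has full row rank. The sketch below covers only that standard single-sequence test. The helper names are hypothetical, and the paper's weighted multi-signal conditions are not reproduced.

```python
def hankel(signal, depth):
    """Hankel matrix with `depth` rows built from a scalar sequence."""
    cols = len(signal) - depth + 1
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]


def rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= factor * m[r][j]
        r += 1
    return r


def is_persistently_exciting(signal, order):
    """True if the depth-`order` Hankel matrix has full row rank."""
    return rank(hankel(signal, order)) == order


constant = [1, 1, 1, 1, 1]  # excites order 1 only
impulse = [0, 0, 1, 0, 0]   # excites up to order 3 at this length
```

A constant input fails the test for order 2 because all rows of its Hankel matrix are identical, which is the kind of uninformative data the conditions are meant to rule out.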
- [16] arXiv:2501.10063 [pdf, html, other]
Title: Hybrid Parallel Collaborative Simulation Framework Integrating Device Physics with Circuit Dynamics for PDAE-Modeled Power Electronic Equipment
Subjects: Systems and Control (eess.SY)
Optimizing high-performance power electronic equipment, such as power converters, requires multiscale simulations that incorporate the physics of power semiconductor devices and the dynamics of other circuit components, especially in conducting Design of Experiments (DoEs), defining the safe operating area of devices, and analyzing failures related to semiconductor devices. However, current methodologies either overlook the intricacies of device physics or do not achieve satisfactory computational speeds. To bridge this gap, this paper proposes a Hybrid-Parallel Collaborative (HPC) framework specifically designed to analyze the Partial Differential Algebraic Equation (PDAE) modeled power electronic equipment, integrating the device physics and circuit dynamics. The HPC framework employs a dynamic iteration to tackle the challenges inherent in solving the coupled nonlinear PDAE system, and utilizes a hybrid-parallel computing strategy to reduce computing time. Physics-based system partitioning along with hybrid-process-thread parallelization on shared and distributed memory are employed, facilitating the simulation of hundreds of partial differential equations (PDEs)-modeled devices simultaneously without compromising speed. Experiments based on the hybrid line commutated converter and reverse-blocking integrated gate-commutated thyristors are conducted under 3 typical real-world scenarios: semiconductor device optimization for the converter; converter design optimization; and device failure analysis. The HPC framework delivers simulation speed up to 60 times faster than the leading commercial software, while maintaining carrier-level accuracy in the experiments. This shows great potential for comprehensive analysis and collaborative optimization of devices and electronic power equipment, particularly in extreme conditions and failure scenarios.
- [17] arXiv:2501.10068 [pdf, other]
Title: The R-Vessel-X Project
Authors: Abir Affane (IP), Mohamed Amine Chetoui (IP), Jonas Lamy (LIRIS), Guillaume Lienemann (IP), Raphaël Peron (IP), P. Beaurepaire (IP), Guillaume Dollé (LMR), Marie-Ange Lèbre (IP), Benoit Magnin (IP), Odyssée Merveille (CREATIS), Mathilde Morvan (IP), Phuc Ngo (LORIA), Thibault Pelletier, Hugo Rositi (LORIA), Stéphanie Salmon (LMR), Julien Finet, Bertrand Kerautret (LIRIS), Nicolas Passat (CRESTIC), Antoine Vacavant (IP)
Comments: Innovation and Research in BioMedical engineering, In press
Subjects: Image and Video Processing (eess.IV)
1) Objectives: This technical report presents a synthetic summary and the principal outcomes of the project R-Vessel-X ("Robust vascular network extraction and understanding within hepatic biomedical images") funded by the French Agence Nationale de la Recherche, and developed between 2019 and 2023. 2) Material and methods: We used publicly available datasets and tools such as IRCAD, Bullitt or VascuSynth to obtain real or synthetic angiographic images. The main contributions lie in the field of 3D angiographic image analysis: filtering, segmentation, modeling and simulation, with a specific focus on the liver. 3) Results: We paid particular attention to the open-source software dissemination of the developed methods, by means of 3D Slicer plugins for the liver anatomy segmentation (SlicerRVXLiverSegmentation) and vesselness filtering (Slicer-RVXVesselnessFilters), and an online demo for the generation of synthetic and realistic vessels in 2D and 3D (OpenCCO). 4) Conclusion: The R-Vessel-X project provided extensive research outcomes, covering various topics related to 3D angiographic image analysis, such as filtering, segmentation, modeling and simulation. We also developed free and open-source software so that the research communities in biomedical engineering can use these results in their future research.
- [18] arXiv:2501.10097 [pdf, html, other]
Title: Decomposition and Quantification of SOTIF Requirements for Perception Systems of Autonomous Vehicles
Comments: 14 pages, 13 figures, 4 tables, Journal Article
Subjects: Systems and Control (eess.SY)
Ensuring the safety of autonomous vehicles (AVs) is paramount before they can be introduced to the market. More specifically, securing the Safety of the Intended Functionality (SOTIF) poses a notable challenge; while ISO 21448 outlines numerous activities to refine the performance of AVs, it offers minimal quantitative guidance. This paper endeavors to decompose the acceptance criterion into quantitative perception requirements, aiming to furnish developers with requirements that are not only understandable but also actionable. This paper introduces a risk decomposition methodology to derive SOTIF requirements for perception. More explicitly, for subsystem-level safety requirements, we define a collision severity model to establish requirements for state uncertainty and present a Bayesian model to discern requirements for existence uncertainty. For component-level safety requirements, we propose a decomposition method based on the Shapley value. Our findings indicate that these methods can effectively decompose the system-level safety requirements into quantitative perception requirements, potentially facilitating the safety verification of various AV components.
- [19] arXiv:2501.10128 [pdf, html, other]
Title: FECT: Classification of Breast Cancer Pathological Images Based on Fusion Features
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Breast cancer is one of the most common cancers among women globally, with early diagnosis and precise classification being crucial. With the advancement of deep learning and computer vision, the automatic classification of breast tissue pathological images has emerged as a research focus. Existing methods typically rely on singular cell or tissue features and lack design considerations for morphological characteristics of challenging-to-classify categories, resulting in suboptimal classification performance. To address these problems, we propose a novel breast cancer tissue classification model that fuses features of Edges, Cells, and Tissues (FECT), employing the ResMTUNet and an attention-based aggregator to extract and aggregate these features. Extensive testing on the BRACS dataset demonstrates that our model surpasses current advanced methods in terms of classification accuracy and F1 scores. Moreover, due to its feature fusion that aligns with the diagnostic approach of pathologists, our model exhibits interpretability and holds promise for significant roles in future clinical applications.
- [20] arXiv:2501.10136 [pdf, html, other]
Title: Two-Stage Distributed Beamforming Design in Cell-Free Massive MIMO ISAC Systems
Subjects: Signal Processing (eess.SP)
Integrating radio-sensing functionalities into future cell-free (CF) wireless networks promises efficient resource utilization and facilitates the seamless roll-out of applications such as public safety and smart infrastructure. While the beamforming design problem for the CF integrated sensing and communication (ISAC) paradigm has been addressed in the literature, existing methods rely on centralized signal processing, leading to fronthaul load and scalability issues. This paper presents a two-stage beamforming design for the CF ISAC paradigm, aiming to significantly reduce the fronthaul load by distributing the signal processing tasks between the central unit (CU) and the access points (APs). The design optimizes the sum signal-to-interference-plus-noise ratio (SINR) for communication users, subject to per-AP power constraints and signal-to-noise ratio (SNR) requirements for radio-sensing purposes. The resulting optimization problems are non-convex and challenging to solve. To address this, we employ a majorization-minimization (MM) approach, which decomposes the problem into simpler convex subproblems. The results show that the two-stage beamforming design achieves performance comparable to centralized methods while substantially reducing the fronthaul load, thus minimizing data transmission requirements over the fronthaul network. This advancement highlights the potential of the proposed method to enhance the efficiency and scalability of cell-free MIMO ISAC systems.
- [21] arXiv:2501.10155 [pdf, html, other]
Title: A scalable event-driven spatiotemporal feature extraction circuit
Authors: Hugh Greatorex, Michele Mastella, Ole Richter, Madison Cotteret, Willian Soares Girão, Ella Janotte, Elisabetta Chicca
Comments: 4 pages, 7 figures
Subjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Event-driven sensors, which produce data only when there is a change in the input signal, are increasingly used in applications that require low-latency and low-power real-time sensing, such as robotics and edge devices. To fully achieve the latency and power advantages on offer, however, similarly event-driven data processing methods are required. A promising solution is the TDE: an event-based processing element which encodes the time difference between events on different channels into an output event stream. In this work we introduce a novel TDE implementation on CMOS. The circuit is robust to device mismatch and allows the linear integration of input events. This is crucial for enabling a high-density implementation of many TDEs on the same die, and for realising real-time parallel processing of the high-event-rate data produced by event-driven sensors.
- [22] arXiv:2501.10166 [pdf, html, other]
Title: Implementing Finite Impulse Response Filters on Quantum Computers
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Quantum Physics (quant-ph)
While signal processing is a mature area, its connections with quantum computing have received less attention. In this work, we propose approaches that perform classical discrete-time signal processing using quantum systems. Our approaches encode the classical discrete-time input signal into quantum states, and design unitaries to realize classical concepts of finite impulse response (FIR) filters. We also develop strategies to cascade lower-order filters to realize higher-order filters through designing appropriate unitary operators. Finally, a few directions for processing quantum states on classical systems after converting them to classical signals are suggested for future work.
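As a classical point of reference for the FIR concepts named in this abstract (this is not the paper's quantum construction), the sketch below implements a direct-form FIR filter and checks that cascading two lower-order filters matches a single higher-order filter whose taps are the convolution of the two tap sets. The names `fir_filter` and `convolve` and the tap values are illustrative assumptions.

```python
def fir_filter(taps, signal):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def convolve(a, b):
    """Full linear convolution of two tap lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

h1 = [0.5, 0.5]    # first-order averaging filter (illustrative)
h2 = [1.0, -1.0]   # first-order differencing filter (illustrative)
x = [1.0, 2.0, 3.0, 4.0, 5.0]

cascaded = fir_filter(h2, fir_filter(h1, x))   # two low-order filters in series
combined = fir_filter(convolve(h1, h2), x)     # one second-order filter
```

With zero initial conditions, `cascaded` and `combined` agree sample by sample, which is the classical fact that a cascade of FIR filters is itself an FIR filter.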
- [23] arXiv:2501.10196 [pdf, html, other]
-
Title: Pricing Mechanisms versus Non-Pricing Mechanisms for Demand Side Management in Microgrids
Subjects: Systems and Control (eess.SY)
In this paper, we compare pricing and non-pricing mechanisms for implementing demand-side management (DSM) in a neighborhood in Helsinki, Finland. We compare load steering based on peak-load reduction using the profile steering method with load steering based on market price signals, in terms of peak loads, losses, and device profiles. We find significant differences between the two methods: the peak-load reduction control strategies contribute to reducing peak power and improving power flow stability, while strategies primarily based on prices result in higher peaks and increased grid losses. Our results highlight a potential need to move away from market-price-based DSM to DSM incentivization and control strategies that are based on peak-load reductions and other system requirements.
- [24] arXiv:2501.10219 [pdf, html, other]
-
Title: Robust Egoistic Rigid Body Localization
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
We consider a robust and self-reliant (or "egoistic") variation of the rigid body localization (RBL) problem, in which a primary rigid body seeks to estimate the pose (i.e., location and orientation) of another rigid body (or "target"), relative to its own, without the assistance of external infrastructure, without prior knowledge of the shape of the target, and taking into account the possibility that the available observations are incomplete. Three complementary contributions are then offered for such a scenario. The first is a method to estimate the translation vector between the center points of the two rigid bodies, which unlike existing techniques does not require that both objects have the same shape or even the same number of landmark points. This technique is shown to significantly outperform the state of the art (SotA) under complete information, but to be sensitive to data erasures, even when enhanced by matrix completion methods. The second contribution, designed to offer improved performance in the presence of incomplete information, offers a robust alternative to the first, at the expense of a slight relative loss under complete information. Finally, the third contribution is a scheme for the estimation of the rotation matrix describing the relative orientation of the target rigid body with respect to the primary. Comparisons of the proposed schemes and SotA techniques demonstrate the advantage of the contributed methods in terms of root mean square error (RMSE) performance under both complete and incomplete information.
- [25] arXiv:2501.10227 [pdf, html, other]
-
Title: Joint Active and Passive Beamforming Optimization for Beyond Diagonal RIS-aided Multi-User Communications
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Benefiting from its capability to generalize existing reconfigurable intelligent surface (RIS) architectures and provide additional design flexibility via interactions between RIS elements, beyond-diagonal RIS (BD-RIS) has attracted considerable research interest recently. However, due to the symmetric and unitary passive beamforming constraint imposed on BD-RIS, existing joint active and passive beamforming optimization algorithms for BD-RIS either exhibit high computational complexity to achieve near-optimal solutions or rely on heuristic algorithms with substantial performance loss. In this paper, we address this issue by proposing an efficient optimization framework for BD-RIS-assisted multi-user multi-antenna communication networks. Specifically, we solve the weighted sum rate maximization problem by introducing a novel beamforming optimization algorithm that alternately optimizes active and passive beamforming matrices using iterative closed-form solutions. Numerical results demonstrate that our algorithm significantly reduces computational complexity while ensuring a sub-optimal solution.
- [26] arXiv:2501.10236 [pdf, html, other]
-
Title: Actively Coupled Sensor Configuration and Planning in Unknown Dynamic Environments
Comments: Draft submitted to the 2025 American Control Conference
Subjects: Systems and Control (eess.SY)
We address the problem of path-planning for an autonomous mobile vehicle, called the ego vehicle, in an unknown and time-varying environment. The objective is for the ego vehicle to minimize exposure to a spatiotemporally-varying unknown scalar field called the threat field. Noisy measurements of the threat field are provided by a network of mobile sensors. We address the problem of optimally configuring (placing) these sensors in the environment. To this end, we propose sensor reconfiguration by maximizing a reward function composed of three different elements. First, the reward includes an information measure that we call context-relevant mutual information (CRMI). Unlike typical sensor placement techniques that maximize mutual information of the measurements and environment state, CRMI directly quantifies uncertainty reduction in the ego path cost while it moves in the environment. Therefore, the CRMI introduces active coupling between the ego vehicle and the sensor network. Second, the reward includes a penalty on the distances traveled by the sensors. Third, the reward includes a measure of proximity of the sensors to the ego vehicle. Although we do not consider communication issues in this paper, such proximity is of relevance for future work that addresses communications between the sensors and the ego vehicle. We illustrate and analyze the proposed technique via numerical simulations.
- [27] arXiv:2501.10256 [pdf, html, other]
-
Title: Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR
Comments: Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech Pathology Analysis and DEtection (SPADE)
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at this https URL .
- [28] arXiv:2501.10292 [pdf, html, other]
-
Title: Enhancing AI Transparency: XRL-Based Resource Management and RAN Slicing for 6G ORAN Architecture
Subjects: Signal Processing (eess.SP)
This research introduces an advanced Explainable Artificial Intelligence (XAI) framework designed to elucidate the decision-making processes of Deep Reinforcement Learning (DRL) agents in ORAN architectures. By offering network-oriented explanations, the proposed scheme addresses the critical challenge of understanding and optimizing the control actions of DRL agents for resource management and allocation. Traditional methods, both model-agnostic and model-specific approaches, fail to address the unique challenges presented by XAI in the dynamic and complex environment of RAN slicing. This paper transcends these limitations by incorporating intent-based action steering, allowing for precise embedding and configuration across various operational timescales. This is particularly evident in its integration with xAPP and rAPP sitting at near-real-time and non-real-time RIC, respectively, enhancing the system's adaptability and performance. Our findings demonstrate the framework's significant impact on improving Key Performance Indicator (KPI)-based rewards, facilitated by the ability to make informed multimodal decisions involving multiple control parameters by a DRL agent. Thus, our work marks a significant step forward in the practical application and effectiveness of XAI in optimizing ORAN resource management strategies.
- [29] arXiv:2501.10305 [pdf, html, other]
-
Title: On Ambisonic Source Separation with Spatially Informed Non-negative Tensor Factorization
Journal-ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3238-3255, 2024
Subjects: Audio and Speech Processing (eess.AS)
This article presents a Non-negative Tensor Factorization based method for sound source separation from Ambisonic microphone signals. The proposed method enables the use of prior knowledge about the Directions-of-Arrival (DOAs) of the sources, incorporated through a constraint on the Spatial Covariance Matrix (SCM) within a Maximum a Posteriori (MAP) framework. Specifically, this article presents a detailed derivation of four algorithms that are based on two types of cost functions, namely the squared Euclidean distance and the Itakura-Saito divergence, which are then combined with two prior probability distributions on the SCM, namely the Wishart and the Inverse Wishart distributions. The experimental evaluation of the baseline Maximum Likelihood (ML) and the proposed MAP methods is primarily based on first-order Ambisonic recordings, using four different source signal datasets, three with musical pieces and one containing speech utterances. We consider under-determined, determined, as well as over-determined scenarios by separating two, four and six sound sources, respectively. Furthermore, we evaluate the proposed algorithms for different spherical harmonic orders and at different reverberation time levels, as well as in non-ideal prior knowledge conditions, for increasingly corrupted DOAs. Overall, in comparison with beamforming and a state-of-the-art separation technique, as well as the baseline ML methods, the proposed MAP approach offers superior separation performance in a variety of scenarios, as shown by the analysis of the experimental evaluation results, in terms of the standard objective separation measures, such as the SDR, ISR, SIR and SAR.
- [30] arXiv:2501.10337 [pdf, html, other]
-
Title: Uncertainty-Aware Digital Twins: Robust Model Predictive Control using Time-Series Deep Quantile Learning
Subjects: Systems and Control (eess.SY)
Digital Twins, virtual replicas of physical systems that enable real-time monitoring, model updates, predictions, and decision-making, present novel avenues for proactive control strategies for autonomous systems. However, achieving real-time decision-making in Digital Twins considering uncertainty necessitates an efficient uncertainty quantification (UQ) approach and optimization driven by accurate predictions of system behaviors, which remains a challenge for learning-based methods. This paper presents a simultaneous multi-step robust model predictive control (MPC) framework that incorporates real-time decision-making with uncertainty awareness for Digital Twin systems. Leveraging a multi-step-ahead predictor named Time-Series Dense Encoder (TiDE) as the surrogate model, this framework differs from conventional MPC models that provide only one-step-ahead predictions. In contrast, TiDE can predict future states within the prediction horizon in one shot, significantly accelerating MPC. Furthermore, quantile regression is employed in the training of TiDE to perform flexible yet computationally efficient UQ on data uncertainty. Consequently, with the deep learning quantiles, the robust MPC problem is formulated into a deterministic optimization problem and provides a safety buffer that accommodates disturbances to enhance constraint satisfaction rate. As a result, the proposed method outperforms existing robust MPC methods by providing less-conservative UQ and has demonstrated efficacy in an engineering case study involving Directed Energy Deposition (DED) additive manufacturing. This proactive yet uncertainty-aware control capability positions the proposed method as a potent tool for future Digital Twin applications and real-time process control in engineering systems.
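Quantile regression, mentioned in this abstract, is commonly trained with the pinball (quantile) loss. The sketch below shows that loss on its own, under the assumption that a standard pinball loss is used; it does not reproduce TiDE or the paper's MPC formulation, and the function name and sample data are illustrative. It also checks the defining property: the constant prediction that minimizes the loss at level `q` is the empirical `q`-quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q (0 < q < 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        # Under-prediction (err > 0) costs q * err;
        # over-prediction (err < 0) costs (1 - q) * |err|.
        total += max(q * err, (q - 1.0) * err)
    return total / len(y_true)

samples = [float(v) for v in range(1, 10)]   # illustrative data: 1.0 ... 9.0
q = 0.8

# Search over constant predictions drawn from the data itself.
best_constant = min(
    samples,
    key=lambda c: pinball_loss(samples, [c] * len(samples), q),
)
```

Training one output per quantile level with this loss gives lower and upper predictions whose gap can serve as an uncertainty band, which is the kind of quantity a robust controller can use to tighten constraints.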
New submissions (showing 30 of 30 entries)
- [31] arXiv:2501.09815 (cross-list from cs.CV) [pdf, html, other]
-
Title: Lossy Compression with Pretrained Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
We apply the DiffC algorithm (Theis et al. 2022) to Stable Diffusion 1.5, 2.1, XL, and Flux-dev, and demonstrate that these pretrained models are remarkably capable lossy image compressors. A principled algorithm for lossy compression using pretrained diffusion models has been understood since at least Ho et al. 2020, but challenges in reverse-channel coding have prevented such algorithms from ever being fully implemented. We introduce simple workarounds that lead to the first complete implementation of DiffC, which is capable of compressing and decompressing images using Stable Diffusion in under 10 seconds. Despite requiring no additional training, our method is competitive with other state-of-the-art generative compression methods at ultra-low bitrates.
- [32] arXiv:2501.09838 (cross-list from cs.CV) [pdf, html, other]
-
Title: CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation
Comments: Accepted in the 2025 WACV workshop GeoCV
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Geospatial imaging leverages data from diverse sensing modalities, such as EO, SAR, and LiDAR, with viewpoints ranging from ground-level drones to satellite views. These heterogeneous inputs offer significant opportunities for scene understanding but present challenges in interpreting geometry accurately, particularly in the absence of precise ground truth data. To address this, we propose CrossModalityDiffusion, a modular framework designed to generate images across different modalities and viewpoints without prior knowledge of scene geometry. CrossModalityDiffusion employs modality-specific encoders that take multiple input images and produce geometry-aware feature volumes that encode scene structure relative to their input camera positions. The space where the feature volumes are placed acts as a common ground for unifying input modalities. These feature volumes are overlapped and rendered into feature images from novel perspectives using volumetric rendering techniques. The rendered feature images are used as conditioning inputs for a modality-specific diffusion model, enabling the synthesis of novel images for the desired output modality. In this paper, we show that jointly training different modules ensures consistent geometric understanding across all modalities within the framework. We validate CrossModalityDiffusion's capabilities on the synthetic ShapeNet cars dataset, demonstrating its effectiveness in generating accurate and consistent novel views across multiple imaging modalities and perspectives.
- [33] arXiv:2501.09858 (cross-list from cs.LG) [pdf, html, other]
-
Title: From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation
Comments: Accepted to Deployable AI (DAI) Workshop at the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black-box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel approach that applies Shapley values to policy interpretation beyond local explanations, and a general framework applicable to off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models' performance but also generates more stable interpretable policies.
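For readers unfamiliar with Shapley values, the sketch below computes them exactly for a tiny set of "features" by enumerating all subsets. It only illustrates the standard definition; the toy value function `toy_value` is a made-up stand-in for a policy's output on a feature subset and is not the paper's method. Exact enumeration scales exponentially, so practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each
    player over all subsets of the remaining players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):  # r = size of the coalition joined by p
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(subset):
    """Hypothetical value function: two additive effects plus an interaction."""
    v = 0.0
    if "a" in subset:
        v += 1.0
    if "b" in subset:
        v += 2.0
    if "a" in subset and "b" in subset:
        v += 2.0  # interaction term, split evenly by the Shapley rule
    return v

phi = shapley_values(["a", "b"], toy_value)
```

The attributions sum to the value of the full set minus the value of the empty set (the efficiency property), and the interaction term is shared equally between the two features.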
- [34] arXiv:2501.09918 (cross-list from cs.AI) [pdf, html, other]
-
Title: GenSC-6G: A Prototype Testbed for Integrated Generative AI, Quantum, and Semantic Communication
Comments: SUBMITTED FOR PUBLICATION IN IEEE COMMUNICATIONS MAGAZINE
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Quantum Physics (quant-ph)
We introduce a prototyping testbed, GenSC-6G, developed to generate a comprehensive dataset that supports the integration of generative artificial intelligence (AI), quantum computing, and semantic communication for emerging sixth-generation (6G) applications. The GenSC-6G dataset is designed with noise-augmented synthetic data optimized for semantic decoding, classification, and localization tasks, significantly enhancing flexibility for diverse AI-driven communication applications. This adaptable prototype supports seamless modifications across baseline models, communication modules, and goal-oriented decoders. Case studies demonstrate its application in lightweight classification, semantic upsampling, and edge-based language inference under noise conditions. The GenSC-6G dataset serves as a scalable and robust resource for developing goal-oriented communication systems tailored to the growing demands of 6G networks.
- [35] arXiv:2501.09972 (cross-list from cs.SD) [pdf, html, other]
-
Title: GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
Comments: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed for generating music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.
- [36] arXiv:2501.09994 (cross-list from cs.CV) [pdf, html, other]
-
Title: Multi-Modal Attention Networks for Enhanced Segmentation and Depth Estimation of Subsurface Defects in Pulse Thermography
Comments: Pulse thermography, infrared thermography, defect segmentation, multi-modal networks, attention mechanism
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
AI-driven pulse thermography (PT) has become a crucial tool in non-destructive testing (NDT), enabling automatic detection of hidden anomalies in various industrial components. Current state-of-the-art techniques feed segmentation and depth estimation networks compressed PT sequences using either Principal Component Analysis (PCA) or Thermographic Signal Reconstruction (TSR). However, treating these two modalities independently constrains the performance of PT inspection models as these representations possess complementary semantic features. To address this limitation, this work proposes PT-Fusion, a multi-modal attention-based fusion network that fuses both PCA and TSR modalities for defect segmentation and depth estimation of subsurface defects in PT setups. PT-Fusion introduces novel feature fusion modules, Encoder Attention Fusion Gate (EAFG) and Attention Enhanced Decoding Block (AEDB), to fuse PCA and TSR features for enhanced segmentation and depth estimation of subsurface defects. In addition, a novel data augmentation technique is proposed based on random data sampling from thermographic sequences to alleviate the scarcity of PT datasets. The proposed method is benchmarked against state-of-the-art PT inspection models, including U-Net, attention U-Net, and 3D-CNN on the Université Laval IRT-PVC dataset. The results demonstrate that PT-Fusion outperforms the aforementioned models in defect segmentation and depth estimation accuracies with a margin of 10%.
- [37] arXiv:2501.09999 (cross-list from cs.CV) [pdf, html, other]
-
Title: Deep Learning for Early Alzheimer Disease Detection with MRI Scans
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Alzheimer's Disease (AD) is a neurodegenerative condition characterized by dementia and impairment in neurological function. The study primarily focuses on individuals above age 40, in whom the disease affects memory, behavior, and the cognitive processes of the brain. Alzheimer's disease requires diagnosis by a detailed assessment of MRI scans and neuropsychological tests of the patients. This project compares existing deep learning models in the pursuit of enhancing the accuracy and efficiency of AD diagnosis, specifically focusing on the Convolutional Neural Network, Bayesian Convolutional Neural Network, and the U-net model with the Open Access Series of Imaging Studies brain MRI dataset. In addition, to ensure robustness and reliability in the model evaluations, we address the challenge of imbalance in data. We then perform rigorous evaluation to determine strengths and weaknesses for each model by considering sensitivity, specificity, and computational efficiency. This comparative analysis not only sheds light on the future role of AI in revolutionizing AD diagnostics but also paves the way for future innovation in medical imaging and the management of neurodegenerative diseases.
- [38] arXiv:2501.10045 (cross-list from cs.SD) [pdf, html, other]
-
Title: HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
Comments: 5 pages, 5 figures, accepted by ICASSP 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss in the adversarial training process. HiFi-SR is versatile, capable of upscaling any input speech signal between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (this https URL).
- [39] arXiv:2501.10052 (cross-list from cs.SD) [pdf, html, other]
-
Title: Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning
Comments: 5 pages, 1 figure, accepted by ICASSP 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise by the DCL process, and a parameterized model is trained to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments demonstrate the strong performance of the proposed approach compared to existing diffusion-based methods, even with fewer iterative steps, and highlight the superior generalization capability of our models to out-of-domain noise datasets (this https URL).
- [40] arXiv:2501.10093 (cross-list from cs.ET) [pdf, other]
-
Title: An Energy-Aware RIoT System: Analysis, Modeling and Prediction in the SUPERIOT Framework
Mohammud J. Bocus, Juha Hakkinen, Helder Fontes, Marcin Drzewiecki, Senhui Qiu, Kerstin Eder, Robert Piechocki
Comments: 14 pages, 13 figures, 11 tables
Subjects: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR); Networking and Internet Architecture (cs.NI); Performance (cs.PF); Systems and Control (eess.SY)
This paper presents a comprehensive analysis of the energy consumption characteristics of a Silicon (Si)-based Reconfigurable IoT (RIoT) node developed in the initial phase of the SUPERIOT project, focusing on key operating states, including Bluetooth Low Energy (BLE) communication, Narrow-Band Visible Light Communication (NBVLC), sensing, and E-ink display. Extensive measurements were conducted to establish a detailed energy profile, which serves as a benchmark for evaluating the effectiveness of subsequent optimizations and future node iterations. To minimize the energy consumption, multiple optimizations were implemented at both the software and hardware levels, achieving a reduction of over 60% in total energy usage through software modifications alone. Further improvements were realized by optimizing the E-ink display driving waveform and implementing a very low-power mode for non-communication activities. Based on the measured data, three measurement-based energy consumption models were developed to characterize the energy behavior of the node under: (i) normal, unoptimized operation, (ii) low-power, software-optimized operation, and (iii) very low-power, hardware-optimized operation. These models, validated with new measurement data, achieved an accuracy exceeding 97%, confirming their reliability for predicting energy consumption in diverse configurations.
- [41] arXiv:2501.10111 (cross-list from cs.SD) [pdf, html, other]
-
Title: AI-Generated Music Detection and its Challenges
Comments: Accepted for IEEE ICASSP 2025. arXiv admin note: substantial text overlap with arXiv:2405.04181
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute-long synthetic music in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of an AI-music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that getting a good test score is not the end of the story. We expose and discuss several facets that could be problematic with such a deployed detector, including robustness to audio manipulation and generalisation to unseen models. This second part acts as a position statement on future research steps in the field and a caveat to a flourishing market of artificial content checkers.
- [42] arXiv:2501.10182 (cross-list from cs.CR) [pdf, html, other]
-
Title: Secure Semantic Communication With Homomorphic Encryption
Rui Meng, Dayu Fan, Haixiao Gao, Yifan Yuan, Bizhu Wang, Xiaodong Xu, Mengying Sun, Chen Dong, Xiaofeng Tao, Ping Zhang, Dusit Niyato
Comments: 8 pages, 3 figures
Subjects: Cryptography and Security (cs.CR); Signal Processing (eess.SP)
In recent years, Semantic Communication (SemCom), which aims to achieve efficient and reliable transmission of meaning between agents, has garnered significant attention from both academia and industry. To ensure the security of communication systems, encryption techniques are employed to safeguard confidentiality and integrity. However, traditional cryptography-based encryption algorithms encounter obstacles when applied to SemCom. Motivated by this, this paper explores the feasibility of applying homomorphic encryption to SemCom. Initially, we review the encryption algorithms utilized in mobile communication systems and analyze the challenges associated with their application to SemCom. Subsequently, we employ the scale-invariant feature transform to demonstrate that semantic features can be preserved in homomorphically encrypted ciphertext. Based on this finding, we propose a task-oriented SemCom scheme secured through homomorphic encryption. We design a privacy-preserving deep joint source-channel coding (JSCC) encoder and decoder, and the frequency of key updates can be adjusted according to service requirements without compromising transmission performance. Simulation results validate that the proposed scheme achieves almost the same classification accuracy on homomorphic ciphertext images as on plaintext images. Furthermore, we provide potential future research directions for homomorphically encrypted SemCom.
- [43] arXiv:2501.10199 (cross-list from cs.CV) [pdf, html, other]
-
Title: Adaptive Clustering for Efficient Phenotype Segmentation of UAV Hyperspectral DataComments: accepted WACV 2025 GeoCV workshopSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Unmanned Aerial Vehicles (UAVs) combined with Hyperspectral imaging (HSI) offer potential for environmental and agricultural applications by capturing detailed spectral information that enables the prediction of invisible features like biochemical leaf properties. However, the data-intensive nature of HSI poses challenges for remote devices, which have limited computational resources and storage. This paper introduces an Online Hyperspectral Simple Linear Iterative Clustering algorithm (OHSLIC) framework for real-time tree phenotype segmentation. OHSLIC reduces inherent noise and computational demands through adaptive incremental clustering and a lightweight neural network, which phenotypes trees using leaf contents such as chlorophyll, carotenoids, and anthocyanins. A hyperspectral dataset is created using a custom simulator that incorporates realistic leaf parameters and light interactions. Results demonstrate that OHSLIC achieves superior regression accuracy and segmentation performance compared to pixel- or window-based methods while significantly reducing inference time. The method's adaptive clustering enables dynamic trade-offs between computational efficiency and accuracy, paving the way for scalable edge-device deployment in HSI applications.
- [44] arXiv:2501.10201 (cross-list from cs.ET) [pdf, html, other]
-
Title: ODMA-Based Cell-Free Unsourced Random Access with Successive Interference CancellationSubjects: Emerging Technologies (cs.ET); Information Theory (cs.IT); Systems and Control (eess.SY)
We consider the unsourced random access problem with multiple receivers and propose a cell-free solution for it. In our proposed scheme, the active users transmit their signals to the access points (APs) distributed in a geographical area and connected to a central processing unit (CPU). The transmitted signals are composed of a pilot and a polar codeword, where the polar codeword bits occupy a small fraction of the data part of the transmission frame. The receiver operations of pilot detection and channel and symbol estimation take place at the APs, while the actual message bits are detected at the CPU by combining the symbol estimates from the APs forwarded over the fronthaul. The effect of the successfully decoded messages is then subtracted at the APs. Numerical examples illustrate that the proposed scheme can support up to 1400 users with a high energy efficiency, and the distributed structure decreases the error probability by more than two orders of magnitude.
- [45] arXiv:2501.10222 (cross-list from cs.SD) [pdf, html, other]
-
Title: Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music ScoresComments: Accepted by ICASSP 2025Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper presents an integrated system that transforms symbolic music scores into expressive piano performance audio. By combining a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, our approach directly generates expressive audio performances from score inputs. To the best of our knowledge, this is the first system to offer a streamlined method for converting score MIDI files lacking expression control into rich, expressive piano performances. We conducted experiments using subsets of the ATEPP dataset, evaluating the system with both objective metrics and subjective listening tests. Our system not only accurately reconstructs human-like expressiveness, but also captures the acoustic ambience of environments such as concert halls and recording studios. Additionally, the proposed system demonstrates its ability to achieve musical expressiveness while ensuring good audio quality in its outputs.
- [46] arXiv:2501.10262 (cross-list from cs.RO) [pdf, html, other]
-
Title: Deployment of an Aerial Multi-agent System for Automated Task Execution in Large-scale Underground Mining EnvironmentsNiklas Dahlquist, Samuel Nordström, Nikolaos Stathoulopoulos, Björn Lindqvist, Akshit Saradagi, George NikolakopoulosComments: Submitted to IEEE Transactions on Field RoboticsSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this article, we present a framework for deploying an aerial multi-agent system in large-scale subterranean environments with minimal infrastructure for supporting multi-agent operations. The multi-agent objective is to optimally and reactively allocate and execute inspection tasks in a mine, which are entered by a mine operator on-the-fly. The assignment of currently available tasks to the team of agents is accomplished through an auction-based system, where the agents bid for available tasks and a central auctioneer uses the bids to optimally assign tasks to agents. A mobile Wi-Fi mesh supports inter-agent communication and bi-directional communication between the agents and the task allocator, while the task execution is performed completely infrastructure-free. Given a task to be accomplished, a reliable and modular agent behavior is synthesized by generating behavior trees from a pool of agent capabilities, using a back-chaining approach. The auction system in the proposed framework is reactive and supports the addition of new operator-specified tasks on-the-go, at any point, through a user-friendly operator interface. The framework has been validated in a real underground mining environment using three aerial agents, with several inspection locations spread in an environment of almost 200 meters. The proposed framework can be utilized for missions involving rapid inspection, gas detection, distributed sensing and mapping etc. in a subterranean environment. The proposed framework and its field deployment contribute towards furthering reliable automation in large-scale subterranean environments to offload both routine and dangerous tasks from human operators to autonomous aerial robots.
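As an illustration of the central-auctioneer step described in this abstract, the following minimal sketch picks the assignment of tasks to distinct agents that minimises the summed bids. The abstract does not specify the bid definition or the solver, so the cost matrix, the function name `assign_tasks`, and the brute-force search are all assumptions made for illustration only.

```python
from itertools import permutations


def assign_tasks(bids):
    """Return (assignment, total_cost) minimising the summed bids.

    bids[a][t] is the cost that agent a bids for task t (hypothetical
    units, e.g. estimated flight distance). Assumes there are at least
    as many agents as tasks.
    """
    n_agents = len(bids)
    n_tasks = len(bids[0])
    best_cost = float("inf")
    best = None
    # Try every way of giving each task to a distinct agent.
    for agents in permutations(range(n_agents), n_tasks):
        cost = sum(bids[a][t] for t, a in enumerate(agents))
        if cost < best_cost:
            best_cost, best = cost, agents
    # Map task index -> index of the agent that wins it.
    return {t: a for t, a in enumerate(best)}, best_cost


# Three agents bidding on two inspection tasks (made-up numbers).
bids = [
    [4.0, 9.0],  # agent 0
    [3.0, 8.0],  # agent 1
    [7.0, 2.0],  # agent 2
]
assignment, total = assign_tasks(bids)
print(assignment, total)
```

Exhaustive search is only practical for a handful of agents and tasks; a larger team would call for a polynomial-time assignment solver instead.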
Cross submissions (showing 16 of 16 entries)
- [47] arXiv:2202.02300 (replaced) [pdf, html, other]
-
Title: From Semi-Infinite Constraints to Structured Robust Policies: Optimal Gain Selection for Financial SystemsComments: Submitted for possible publicationSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF)
This paper studies the robust optimal gain selection problem for financial trading systems, formulated within a \emph{double linear policy} framework, which allocates capital across long and short positions. The key objective is to guarantee \emph{robust positive expected} (RPE) profits uniformly across a range of uncertain market conditions while ensuring risk control. This problem leads to a robust optimization formulation with \emph{semi-infinite} constraints, where the uncertainty is modeled by a bounded set of possible return parameters. We address this by transforming semi-infinite constraints into structured policies -- the \emph{balanced} policy and the \emph{complementary} policy -- which enable explicit characterization of the optimal solution. Additionally, we propose a novel graphical approach to efficiently solve the robust gain selection problem, drastically reducing computational complexity. Empirical validation on historical stock price data demonstrates superior performance in terms of risk-adjusted returns and downside risk compared to conventional strategies. This framework generalizes classical mean-variance optimization by incorporating robustness considerations, offering a systematic and efficient solution for robust trading under uncertainty.
- [48] arXiv:2203.03415 (replaced) [pdf, html, other]
-
Title: Keep It Accurate and Robust: An Enhanced Nuclei Analysis FrameworkSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate segmentation and classification of nuclei in histology images is critical but challenging due to nuclei heterogeneity, staining variations, and tissue complexity. Existing methods often struggle with limited dataset variability, with patches extracted from similar whole slide images (WSI), making models prone to falling into local optima. Here we propose a new framework to address this limitation and enable robust nuclear analysis. Our method leverages dual-level ensemble modeling to overcome issues stemming from limited dataset variation. Intra-ensembling applies diverse transformations to individual samples, while inter-ensembling combines networks of different scales. We also introduce enhancements to the HoVer-Net architecture, including updated encoders, nested dense decoding and model regularization strategy. We achieve state-of-the-art results on public benchmarks, including 1st place for nuclear composition prediction and 3rd place for segmentation/classification in the 2022 Colon Nuclei Identification and Counting (CoNIC) Challenge. This success validates our approach for accurate histological nuclei analysis. Extensive experiments and ablation studies provide insights into optimal network design choices and training techniques. In conclusion, this work proposes an improved framework advancing the state-of-the-art in nuclei analysis. We release our code and models (this https URL) to serve as a toolkit for the community.
- [49] arXiv:2309.00494 (replaced) [pdf, html, other]
-
Title: Multi-stage Deep Learning Artifact Reduction for Parallel-beam Computed TomographySubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Computed Tomography (CT) using synchrotron radiation is a powerful technique that, compared to lab-CT techniques, boasts high spatial and temporal resolution while also providing access to a range of contrast-formation mechanisms. The acquired projection data is typically processed by a computational pipeline composed of multiple stages. Artifacts introduced during data acquisition can propagate through the pipeline and degrade image quality in the reconstructed images. Recently, deep learning has shown significant promise in enhancing image quality for images representing scientific data. This success has driven increasing adoption of deep learning techniques in CT imaging. Various approaches have been proposed to incorporate deep learning into computational pipelines, but each has limitations in synchrotron CT, either in properly addressing the specific artifacts or in computational efficiency.
Recognizing these challenges, we introduce a novel method that incorporates separate deep learning models at each stage of the tomography pipeline (projection, sinogram, and reconstruction) to address specific artifacts locally in a data-driven way. Our approach includes bypass connections that feed both the outputs from previous stages and raw data to subsequent stages, minimizing the risk of error propagation. Extensive evaluations on both simulated and real-world datasets illustrate that our approach effectively reduces artifacts and outperforms comparison methods.
- [50] arXiv:2311.03911 (replaced) [pdf, other]
-
Title: Distributed Parameter Estimation with Gaussian Observation Noises in Time-varying DigraphsSubjects: Signal Processing (eess.SP)
In this paper, we consider the problem of distributed parameter estimation in sensor networks. Each sensor makes successive observations of an unknown $d$-dimensional parameter, which might be subject to Gaussian random noises. The sensors aim to infer the true value of the unknown parameter by cooperating with each other. To this end, we first generalize the so-called dynamic regressor extension and mixing (DREM) algorithm to stochastic systems, with which the problem of estimating a $d$-dimensional vector parameter is transformed to that of $d$ scalar ones: one for each of the unknown parameters. For each of the scalar problems, both combine-then-adapt (CTA) and adapt-then-combine (ATC) diffusion-based estimation algorithms are given, where each sensor performs a combination step to fuse the local estimates in its in-neighborhood, alongside an adaptation step to process its streaming observations. Under weak conditions on network topology and excitation of regressors, we show that the proposed estimators guarantee that each sensor infers the true parameter, even if no individual sensor can do so by itself. Specifically, it is required that the union of topologies over an interval with fixed length is strongly connected. Moreover, the sensors must collectively satisfy a cooperative persistent excitation (PE) condition, which relaxes the traditional PE condition. Numerical examples are finally provided to illustrate the established results.
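The adapt-then-combine (ATC) idea mentioned in this abstract can be sketched for a single scalar parameter as follows. This is a generic diffusion least-mean-squares toy, not the estimator of the paper: the fixed directed ring (the paper treats time-varying digraphs), the constant step size `mu`, the regressor pattern, and all names are assumptions chosen so that only one sensor is excited and the others can learn the parameter only through cooperation.

```python
import random


def atc_diffusion(theta_true=2.0, steps=2000, mu=0.05, sigma=0.1, seed=0):
    """Scalar adapt-then-combine diffusion on a fixed directed ring."""
    rng = random.Random(seed)
    n = 4
    # Only sensor 0 has a non-zero regressor, so sensors 1..3 cannot
    # identify theta from their own observations.
    phi = [1.0, 0.0, 0.0, 0.0]
    # Directed ring with self-loops: sensor i listens to itself and to
    # sensor i-1. Each row of the weight matrix sums to one.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights[i][i] = 0.5
        weights[i][(i - 1) % n] = 0.5
    theta = [0.0] * n
    for _ in range(steps):
        # Adapt: local stochastic-gradient step on the newest observation.
        psi = []
        for i in range(n):
            y = phi[i] * theta_true + rng.gauss(0.0, sigma)
            psi.append(theta[i] + mu * phi[i] * (y - phi[i] * theta[i]))
        # Combine: weighted average of the in-neighbours' intermediate estimates.
        theta = [sum(weights[i][j] * psi[j] for j in range(n)) for i in range(n)]
    return theta


estimates = atc_diffusion()
print(estimates)
```

With a constant step size the estimates settle in a small noisy neighbourhood of the true value rather than converging exactly; a decaying step size would be one way to trade speed for accuracy.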
- [51] arXiv:2311.11668 (replaced) [pdf, html, other]
-
Title: AIaaS for ORAN-based 6G Networks: Multi-time Scale Slice Resource Management with DRLComments: Updated to reflect acceptance in IEEE ICC 2024: IEEE International Conference on Communications, Denver, CO, USA, 2024, pp. 5407-5412, doi: https://doi.org/10.1109/ICC51166.2024.10622601Journal-ref: ICC 2024 - IEEE International Conference on Communications, Denver, CO, USA, 2024, pp. 5407-5412Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
This paper addresses how to handle slice resources for 6G networks at different time scales in an architecture based on an open radio access network (ORAN). The proposed solution includes artificial intelligence (AI) at the edge of the network and applies two control-level loops to obtain optimal performance compared to other techniques. The ORAN facilitates programmable network architectures to support such multi-time scale management using AI approaches. The proposed algorithms analyze the maximum utilization of resources from slice performance to make decisions at the inter-slice level. Inter-slice intelligent agents work at a non-real-time level to reconfigure resources within various slices. Beyond meeting the slice requirements, the intra-slice objective must also include the minimization of maximum resource utilization. This enables smart utilization of the resources within each slice without affecting slice performance. Here, each xApp that is an intra-slice agent aims at meeting the optimal quality of service (QoS) of the users, but at the same time, some inter-slice objectives should be included to coordinate intra- and inter-slice agents. This is done without penalizing the main intra-slice objective. All intelligent agents use deep reinforcement learning (DRL) algorithms to meet their objectives. We present results for enhanced mobile broadband (eMBB), ultra-reliable low latency (URLLC), and massive machine type communication (mMTC) slice categories.
- [52] arXiv:2403.15780 (replaced) [pdf, html, other]
-
Title: A Fairness-Oriented Reinforcement Learning Approach for the Operation and Control of Shared Micromobility ServicesMatteo Cederle, Luca Vittorio Piron, Marina Ceccon, Federico Chiariotti, Alessandro Fabris, Marco Fabris, Gian Antonio SustoComments: 6 pages, 3 figures, accepted at the 2025 American Control Conference (ACC) on January 17th, 2025Subjects: Systems and Control (eess.SY); Computers and Society (cs.CY); Machine Learning (cs.LG)
As Machine Learning grows in popularity across various fields, equity has become a key focus for the AI community. However, fairness-oriented approaches are still underexplored in smart mobility. Addressing this gap, our study investigates the balance between performance optimization and algorithmic fairness in shared micromobility services, providing a novel framework based on Reinforcement Learning. Exploiting Q-learning, the proposed methodology achieves equitable outcomes in terms of the Gini index across different areas characterized by their distance from central hubs. Through vehicle rebalancing, the provided scheme maximizes operator performance while ensuring fairness principles for users, reducing inequity by up to 85% while only increasing costs by 30% (w.r.t. applying no equity adjustment). A case study with synthetic data validates our insights and highlights the importance of fairness in urban micromobility (source code: this https URL).
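For readers unfamiliar with the Gini index used as the fairness measure in this abstract, a minimal sketch of the standard mean-absolute-difference definition follows. The per-area service levels in the example are made-up numbers, and how the paper maps areas to values is not stated in the abstract.

```python
def gini(values):
    """Gini index of non-negative values: 0 means perfectly equal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs.
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    # Equivalent to abs_diff_sum / (2 * n**2 * mean).
    return abs_diff_sum / (2 * n * total)


# Hypothetical vehicle availability in four areas.
print(gini([1.0, 1.0, 1.0, 1.0]))  # all areas served equally
print(gini([0.0, 0.0, 0.0, 1.0]))  # one area receives everything
```

For a finite sample of size n the maximum value is (n - 1) / n, so the second call stays below 1 even though the distribution is as unequal as possible.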
- [53] arXiv:2404.03703 (replaced) [pdf, other]
-
Title: Mitigating analytical variability in fMRI results with style transferSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We propose a novel approach to improve the reproducibility of neuroimaging results by converting statistic maps across different functional MRI pipelines. We make the assumption that pipelines used to compute fMRI statistic maps can be considered as a style component and we propose to use different generative models, among which, Generative Adversarial Networks (GAN) and Diffusion Models (DM) to convert statistic maps across different pipelines. We explore the performance of multiple GAN frameworks, and design a new DM framework for unsupervised multi-domain style transfer. We constrain the generation of 3D fMRI statistic maps using the latent space of an auxiliary classifier that distinguishes statistic maps from different pipelines and extend traditional sampling techniques used in DM to improve the transition performance. Our experiments demonstrate that our proposed methods are successful: pipelines can indeed be transferred as a style component, providing an important source of data augmentation for future medical studies.
- [54] arXiv:2406.14534 (replaced) [pdf, html, other]
-
Title: Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume RegistrationLong Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann HengComments: This paper has been accepted by MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.
- [55] arXiv:2407.09232 (replaced) [pdf, html, other]
-
Title: Belief Propagation-based Rotation and Translation Estimation for Rigid Body LocalizationSubjects: Signal Processing (eess.SP)
We propose a novel solution to the rigid body localization (RBL) problem, in which the three-dimensional (3D) rotation and translation are estimated by only utilizing the range measurements between the wireless sensors on the rigid body and the anchor sensors. The proposed framework first constructs a linear Gaussian belief propagation (GaBP) algorithm to estimate the absolute sensor positions utilizing the range-based received signal model, which is used for the reconstruction of the RBL transformation model, linearized with a small-angle approximation. In light of the reformulated system, a second bivariate GaBP is designed to directly estimate the 3D rotation angles and translation distances, with an interference cancellation (IC) refinement to improve the angle estimation performance. The effectiveness of the proposed method is verified via numerical simulations, highlighting the superior performance of the proposed method against the state-of-the-art (SotA) techniques for position, rotation, and translation estimation.
- [56] arXiv:2408.08881 (replaced) [pdf, html, other]
-
Title: Challenge Summary U-MedSAM: Uncertainty-aware MedSAM for Medical Image SegmentationComments: arXiv admin note: text overlap with arXiv:2405.17496Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Medical Image Foundation Models have proven to be powerful tools for mask prediction across various datasets. However, accurately assessing the uncertainty of their predictions remains a significant challenge. To address this, we propose a new model, U-MedSAM, which integrates the MedSAM model with an uncertainty-aware loss function and the Sharpness-Aware Minimization (SharpMin) optimizer. The uncertainty-aware loss function automatically combines region-based, distribution-based, and pixel-based loss designs to enhance segmentation accuracy and robustness. SharpMin improves generalization by finding flat minima in the loss landscape, thereby reducing overfitting. Our method was evaluated in the CVPR24 MedSAM on Laptop challenge, where U-MedSAM demonstrated promising performance.
- [57] arXiv:2409.08850 (replaced) [pdf, html, other]
-
Title: DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Computed tomography (CT) provides high-resolution medical imaging, but it can expose patients to high radiation. X-ray scanners have low radiation exposure, but their resolutions are low. This paper proposes a new conditional diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes from bi or mono-planar X-ray image(s). The proposed DX2CT consists of two key components: 1) modulating feature maps extracted from two-dimensional (2D) X-ray(s) with 3D positions of the CT volume using a new transformer and 2) effectively using the modulated 3D position-aware feature maps as conditions of DX2CT. In particular, the proposed transformer can provide conditions with rich information of a target CT slice to the conditional diffusion model, enabling high-quality CT reconstruction. Our experiments with the bi or mono-planar X-ray(s) benchmark datasets show that the proposed DX2CT outperforms several state-of-the-art methods. Our codes and model will be available at: this https URL.
- [58] arXiv:2409.16302 (replaced) [pdf, html, other]
-
Title: How Redundant Is the Transformer Stack in Speech Representation Models?Comments: To appear at ICASSP 2025 (excluding appendix)Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model's predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
- [59] arXiv:2411.18967 (replaced) [pdf, other]
-
Title: Deep Plug-and-Play HIO Approach for Phase RetrievalComments: 16 pages, 5 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In the phase retrieval problem, the aim is the recovery of an unknown image from intensity-only measurements such as Fourier intensity. Although there are several solution approaches, solving this problem is challenging due to its nonlinear and ill-posed nature. Recently, learning-based approaches have emerged as powerful alternatives to the analytical methods for several inverse problems. In the context of phase retrieval, a novel plug-and-play approach that exploits a learning-based prior and efficient update steps has been presented at the Computational Optical Sensing and Imaging topical meeting, with demonstrated state-of-the-art performance. The key idea was to incorporate a learning-based prior into Gerchberg-Saxton type algorithms through plug-and-play regularization. In this paper, we present the mathematical development of the method, including the derivation of its analytical update steps based on half-quadratic splitting, and comparatively evaluate its performance through extensive simulations on a large test dataset. The results show the effectiveness of the method in terms of image quality, computational efficiency, and robustness to initialization and noise.
- [60] arXiv:2412.07240 (replaced) [pdf, html, other]
-
Title: Efficient Spectral Differentiation in Grid-Based Continuous State EstimationComments: Accepted to FUSION2024Subjects: Signal Processing (eess.SP)
This paper deals with the state estimation of stochastic models with continuous dynamics. The aim is to incorporate spectral differentiation methods into the solution to the Fokker-Planck equation in a grid-based state estimation routine, while taking into account the specifics of the field, such as probability density function (PDF) features, moving grid, zero boundary conditions, etc. The spectral methods, in general, achieve a very fast convergence rate of $O(c^N)$ ($0 < c < 1$) for analytical functions such as the probability density function, where $N$ is the number of grid points. This is significantly better than the standard finite difference method (or midpoint rule used in discrete estimation) typically used in grid-based filter design, with convergence rate $O(1/N^2)$. As a consequence, the proposed spectral-method-based filter provides better state estimation accuracy with a lower number of grid points, and thus, with lower computational complexity.
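The gap between the geometric and the second-order convergence rates quoted in this abstract can be seen on a small example. The sketch below differentiates a smooth periodic test function on 32 grid points, once with a Fourier spectral derivative and once with central differences. The Fourier basis, the periodic test function exp(sin x), and all function names are assumptions for illustration; the paper's filter works with PDFs, a moving grid, and zero boundary conditions, which this toy does not model.

```python
import cmath
import math


def spectral_derivative(samples):
    """First derivative of 2*pi-periodic samples via a naive DFT."""
    n = len(samples)
    # Forward DFT (O(n^2) form, to stay dependency-free).
    coeffs = [
        sum(samples[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
    deriv = []
    for j in range(n):
        acc = 0j
        for k in range(n):
            # Signed integer wavenumber; the Nyquist mode is dropped,
            # as is usual for odd-order derivatives.
            if 2 * k < n:
                wave = k
            elif 2 * k == n:
                wave = 0
            else:
                wave = k - n
            acc += 1j * wave * coeffs[k] * cmath.exp(2j * math.pi * j * k / n)
        deriv.append((acc / n).real)
    return deriv


def central_difference(samples, step):
    """Second-order periodic central difference."""
    n = len(samples)
    return [(samples[(j + 1) % n] - samples[(j - 1) % n]) / (2 * step) for j in range(n)]


n = 32
step = 2 * math.pi / n
xs = [j * step for j in range(n)]
f = [math.exp(math.sin(x)) for x in xs]
exact = [math.cos(x) * math.exp(math.sin(x)) for x in xs]

spectral_error = max(abs(a - b) for a, b in zip(spectral_derivative(f), exact))
fd_error = max(abs(a - b) for a, b in zip(central_difference(f, step), exact))
print(spectral_error, fd_error)
```

On this grid the spectral error is close to floating-point round-off, while the finite-difference error is many orders of magnitude larger.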
- [61] arXiv:2412.13086 (replaced) [pdf, html, other]
-
Title: Higher-Order Sinusoidal Input Describing Functions for Open-Loop and Closed-Loop Reset Control with Application to Mechatronics SystemsSubjects: Systems and Control (eess.SY)
Reset control enhances the performance of high-precision mechatronics systems. This paper introduces a generalized reset feedback control structure that integrates a single reset-state reset controller, a shaping filter for tuning reset actions, and linear compensators arranged in series and parallel configurations with the reset controller. This structure offers greater tuning flexibility to optimize reset control performance. However, frequency-domain analysis for such systems remains underdeveloped. To address this gap, this study makes three key contributions: (1) developing Higher-Order Sinusoidal Input Describing Functions (HOSIDFs) for open-loop reset control systems; (2) deriving HOSIDFs for closed-loop reset control systems and establishing a connection with open-loop analysis; and (3) creating a MATLAB-based App to implement these methods, providing mechatronics engineers with a practical tool for reset control system design and analysis. The accuracy of the proposed methods is validated through simulations and experiments. Finally, the utility of the proposed methods is demonstrated through case studies that analyze and compare the performance of three controllers: a PID controller, a reset controller, and a shaped reset controller on a precision motion stage. Both analytical and experimental results demonstrate that the shaped reset controller provides higher tracking precision while reducing actuation forces, outperforming both the reset and PID controllers. These findings highlight the effectiveness of the proposed frequency-domain methods in analyzing and optimizing the performance of reset-controlled mechatronics systems.
- [62] arXiv:2412.17129 (replaced) [pdf, html, other]
-
Title: Uncovering the Visual Contribution in Audio-Visual Speech RecognitionComments: 5 pages, 2 figures. Accepted to ICASSP 2025Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual information, and we recommend future research to report effective SNR gains alongside WERs.
- [63] arXiv:2501.00472 (replaced) [pdf, html, other]
-
Title: Jointly optimal array geometries and waveforms in active sensing: New insights into array design via the Cramér-Rao boundComments: ©2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksSubjects: Signal Processing (eess.SP)
This paper investigates jointly optimal array geometry and waveform designs for active sensing. Specifically, we focus on minimizing the Cramér-Rao lower bound (CRB) of the angle of a single target in white Gaussian noise. We first find that several array-waveform pairs can yield the same CRB by virtue of sequences with equal sums of squares, i.e., solutions to certain Diophantine equations. Furthermore, we show that under physical aperture and sensor number constraints, the CRB-minimizing receive array geometry is unique, whereas the transmit array can be chosen flexibly. We leverage this freedom to design a novel sparse array geometry that not only minimizes the single-target CRB given an optimal waveform, but also has a nonredundant and contiguous sum co-array, a desirable property when launching independent waveforms, with relevance also to the multi-target case.
- [64] arXiv:2501.01170 (replaced) [pdf, other]
-
Title: Automated monitoring of bee colony movement in the hive during winter season
Comments: Paper Accepted at BAIT 2024 CEUR-WS, see this https URL
Journal-ref: Proceedings of the 1st International Workshop on Bioinformatics and Applied Information Technologies (BAIT 2024), Zboriv, Ukraine, October 02-04, 2024
Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
In this study, we have experimentally modelled the movement of a bee colony in a hive during the winter season and developed a monitoring system that allows tracking the movement of the bee colony and honey consumption. The monitoring system consists of four load cells connected to the RP2040 controller based on the Raspberry Pi Pico board, from which data is transmitted via the MQTT protocol to the Raspberry Pi 5 microcomputer via a Wi-Fi network. The processed data from the Raspberry Pi 5 is recorded in a MySQL database. The algorithm for finding the location of the bee colony in the hive works correctly: the trajectory reconstructed from the sensor data repeats the physical movement in the experiment, which imitates the movement of the bee colony in real conditions. The proposed monitoring system provides continuous observation of the bee colony without adversely affecting its natural activities and can be integrated with various wireless data networks. This is a promising tool for improving the efficiency of beekeeping and maintaining the health of bee colonies.
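The abstract does not spell out the localisation algorithm. One plausible reading, sketched here purely as an assumption, is a load-weighted centroid over the four load cells. The function name `colony_position`, the corner coordinates, and the sample readings are all hypothetical and are not taken from the paper.

```python
def colony_position(loads, corners):
    """Estimate the centre of mass as the load-weighted centroid of the cells.

    loads   -- one reading per load cell (any consistent unit)
    corners -- (x, y) position of each load cell, in the same order
    """
    total = sum(loads)
    if total <= 0:
        raise ValueError("total load must be positive")
    x = sum(w * cx for w, (cx, _) in zip(loads, corners)) / total
    y = sum(w * cy for w, (_, cy) in zip(loads, corners)) / total
    return x, y


# Hypothetical hive footprint in metres, one load cell per corner.
CORNERS = [(0.0, 0.0), (0.4, 0.0), (0.4, 0.5), (0.0, 0.5)]

# Equal loads place the estimate at the geometric centre of the footprint.
centre = colony_position([5.0, 5.0, 5.0, 5.0], CORNERS)

# Extra weight on the two cells at x = 0.4 shifts the estimate towards them.
shifted = colony_position([4.0, 6.0, 6.0, 4.0], CORNERS)
```

Logging such estimates over time would give a trajectory of the kind the abstract describes, while the sum of the four readings tracks the total hive weight.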
- [65] arXiv:2501.07942 (replaced) [pdf, html, other]
-
Title: Tensor Train Discrete Grid-Based Filters: Breaking the Curse of Dimensionality
Comments: This work has been accepted for IFAC SYSID24
Subjects: Signal Processing (eess.SP)
This paper deals with the state estimation of stochastic systems and examines the possible employment of tensor decompositions in grid-based filtering routines, in particular, the tensor-train decomposition. The aim is to show that these techniques can lead to a massive reduction in both the computational and storage complexity of grid-based filtering algorithms without considerable tradeoffs in accuracy. This claim is supported by algorithm descriptions and numerical illustrations.
- [66] arXiv:2501.09054 (replaced) [pdf, html, other]
-
Title: NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
Subjects: Image and Video Processing (eess.IV); Graphics (cs.GR)
Most publicly accessible remote sensing data suffer from low resolution, limiting their practical applications. To address this, we propose a diffusion model guided by neural operators for continuous remote sensing image super-resolution (NeurOp-Diff). Neural operators are used to learn resolution representations at arbitrary scales, encoding low-resolution (LR) images into high-dimensional features, which are then used as prior conditions to guide the diffusion model for denoising. This effectively addresses the artifacts and excessive smoothing issues present in existing super-resolution (SR) methods, enabling the generation of high-quality, continuous super-resolution images. Specifically, we adjust the super-resolution scale by a scaling factor s, allowing the model to adapt to different super-resolution magnifications. Furthermore, experiments on multiple datasets demonstrate the effectiveness of NeurOp-Diff. Our code is available at this https URL.
- [67] arXiv:2501.09113 (replaced) [pdf, other]
-
Title: persoDA: Personalized Data Augmentation for Personalized ASR
Pablo Peso Parada, Spyros Fontalis, Md Asif Jalal, Karthikeyan Saravanan, Anastasios Drosou, Mete Ozay, Gil Ho Lee, Jungin Lee, Seokyeong Jung
Comments: ICASSP'25-Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Data augmentation (DA) is ubiquitously used in training of Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned towards the acoustic characteristics of the end-user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with an ASR conformer-based baseline trained on Librispeech and personalized for VOICES shows that persoDA achieves a 13.9% relative WER reduction over using standard data augmentation (using random noise & reverberation). Furthermore, persoDA shows 16% to 20% faster convergence over MCT.
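For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a fraction of the baseline WER. A minimal sketch follows; the function name and the absolute WER values are hypothetical, chosen only so that the result matches the 13.9% figure quoted above, and they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative reduction (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs (in percent), not taken from the paper.
reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=8.61)
print(f"relative WER reduction: {reduction:.1%}")
```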
- [68] arXiv:2211.15652 (replaced) [pdf, html, other]
-
Title: Stochastic Optimal Control via Local Occupation Measures
Comments: 22 pages, 4 figures, associated implementation: this https URL
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Viewing stochastic processes through the lens of occupation measures has proved to be a powerful angle of attack for the theoretical and computational analysis of stochastic optimal control problems. We present a simple modification of the traditional occupation measure framework derived from resolving the occupation measures locally on a partition of the control problem's space-time domain. This notion of local occupation measures provides fine-grained control over the construction of structured semidefinite programming relaxations for a rich class of stochastic optimal control problems with embedded diffusion and jump processes via the moment-sum-of-squares hierarchy. As such, it bridges the gap between discretization-based approximations to the Hamilton-Jacobi-Bellman equations and occupation measure relaxations. We demonstrate with examples that this approach enables the computation of high quality bounds for the optimal value of a large class of stochastic optimal control problems with significant performance gains relative to the traditional occupation measure framework.
- [69] arXiv:2302.04344 (replaced) [pdf, html, other]
-
Title: Learning Dynamical Systems by Leveraging Data from Similar Systems
Comments: 15 pages, 9 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide finite sample error bounds of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems, such as upper bounds on noise levels and model difference, is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.
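The weighted least squares idea can be illustrated on a scalar system x_{t+1} = a·x_t + w_t. The sketch below is a toy instance under stated assumptions, not the paper's estimator: the function name `weighted_ls`, the weight symbol `q`, and the data are hypothetical. It minimises the squared one-step prediction error over the true-system data plus `q` times the same error over the auxiliary data, which has a closed-form solution in the scalar case.

```python
def weighted_ls(true_pairs, aux_pairs, q):
    """Estimate a in x_next = a * x from (x, x_next) pairs.

    Minimises  sum_true (x_next - a*x)^2 + q * sum_aux (x_next - a*x)^2.
    q = 0 ignores the auxiliary data; a larger q trusts it more.
    """
    num = sum(x * y for x, y in true_pairs) + q * sum(x * y for x, y in aux_pairs)
    den = sum(x * x for x, _ in true_pairs) + q * sum(x * x for x, _ in aux_pairs)
    if den == 0:
        raise ValueError("data carry no information about a")
    return num / den


# Hypothetical noise-free toy data: true system a = 0.5, auxiliary system a = 0.7.
true_pairs = [(1.0, 0.5), (2.0, 1.0)]
aux_pairs = [(1.0, 0.7)]

a_true_only = weighted_ls(true_pairs, aux_pairs, q=0.0)  # uses true data only
a_mixed = weighted_ls(true_pairs, aux_pairs, q=1.0)      # pulled towards 0.7
```

With noisy data, increasing `q` averages over more samples and so reduces the noise-induced error, but it also pulls the estimate towards the auxiliary system, which is the trade-off the abstract describes.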
- [70] arXiv:2305.15595 (replaced) [pdf, html, other]
-
Title: Time-Varying Convex Optimization: A Contraction and Equilibrium Tracking Approach
Subjects: Optimization and Control (math.OC); Signal Processing (eess.SP); Systems and Control (eess.SY)
In this article, we provide a novel and broadly applicable contraction-theoretic approach to continuous-time time-varying convex optimization. For any parameter-dependent contracting dynamics, we show that the tracking error is asymptotically proportional to the rate of change of the parameter and that the proportionality constant is upper bounded by the Lipschitz constant in which the parameter appears, divided by the square of the contraction rate of the dynamics. We additionally establish that augmenting any parameter-dependent contracting dynamics with a feedforward prediction term ensures that the tracking error vanishes exponentially quickly. To apply these results to time-varying convex optimization, we establish the strong infinitesimal contractivity of dynamics solving three canonical problems: monotone inclusions, linear equality-constrained problems, and composite minimization problems. For each case, we derive the sharpest-known contraction rates and provide explicit bounds on the tracking error between solution trajectories and minimizing trajectories. We validate our theoretical results on two numerical examples and on an application to control barrier function-based controller design that involves real hardware.
- [71] arXiv:2308.04162 (replaced) [pdf, html, other]
-
Title: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
Comments: Accepted to Knowledge-Based Systems (KBS). The source code will be made publicly available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly related tasks that both aim to segment specific objects from video sequences according to expression prompts. However, due to the challenges of modeling representations for different modalities, existing methods struggle to strike a balance between interaction flexibility and localization precision. In this paper, we address this problem from two perspectives: the alignment of audio and text and the deep interaction among audio, text, and visual modalities. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text. The proposed EPCFormer exploits the fact that audio and text prompts referring to the same objects are semantically equivalent by using contrastive learning for both types of expressions. Then, to facilitate deep interactions among audio, text, and visual modalities, we introduce an Expression-Visual Attention (EVA) module. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our EPCFormer attains state-of-the-art results on both tasks. The source code will be made publicly available at this https URL.
- [72] arXiv:2404.01604 (replaced) [pdf, html, other]
-
Title: WaveDH: Wavelet Sub-bands Guided ConvNet for Efficient Image Dehazing
Comments: Under Review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
The surge in interest regarding image dehazing has led to notable advancements in deep learning-based single image dehazing approaches, exhibiting impressive performance in recent studies. Despite these strides, many existing methods fall short in meeting the efficiency demands of practical applications. In this paper, we introduce WaveDH, a novel and compact ConvNet designed to address this efficiency gap in image dehazing. Our WaveDH leverages wavelet sub-bands for guided up-and-downsampling and frequency-aware feature refinement. The key idea lies in utilizing wavelet decomposition to extract low-and-high frequency components from feature levels, allowing for faster processing while upholding high-quality reconstruction. The downsampling block employs a novel squeeze-and-attention scheme to optimize the feature downsampling process in a structurally compact manner through wavelet domain learning, preserving discriminative features while discarding noise components. In our upsampling block, we introduce a dual-upsample and fusion mechanism to enhance high-frequency component awareness, aiding in the reconstruction of high-frequency details. Departing from conventional dehazing methods that treat low-and-high frequency components equally, our feature refinement block strategically processes features with a frequency-aware approach. By employing a coarse-to-fine methodology, it not only refines the details at frequency levels but also significantly optimizes computational costs. The refinement is performed in a maximum 8x downsampled feature space, striking a favorable efficiency-vs-accuracy trade-off. Extensive experiments demonstrate that our method, WaveDH, outperforms many state-of-the-art methods on several image dehazing benchmarks with significantly reduced computational costs. Our code is available at this https URL.
- [73] arXiv:2405.18251 (replaced) [pdf, html, other]
-
Title: Sensor-Based Distributionally Robust Control for Safe Robot Navigation in Dynamic Environments
Comments: Project page: this https URL
Subjects: Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
We introduce a novel method for mobile robot navigation in dynamic, unknown environments, leveraging onboard sensing and distributionally robust optimization to impose probabilistic safety constraints. Our method introduces a distributionally robust control barrier function (DR-CBF) that directly integrates noisy sensor measurements and state estimates to define safety constraints. This approach is applicable to a wide range of control-affine dynamics, generalizable to robots with complex geometries, and capable of operating at real-time control frequencies. Coupled with a control Lyapunov function (CLF) for path following, the proposed CLF-DR-CBF control synthesis method achieves safe, robust, and efficient navigation in challenging environments. We demonstrate the effectiveness and robustness of our approach for safe autonomous navigation under uncertainty in simulations and real-world experiments with differential-drive robots.
- [74] arXiv:2406.03814 (replaced) [pdf, html, other]
-
Title: Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
Comments: Accepted by ICASSP 2025
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios like code-switching presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process. We apply this framework to cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive experiments demonstrate the remarkable effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
- [75] arXiv:2407.15580 (replaced) [pdf, html, other]
-
Title: Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing
David Perera, Victor Letzelter, Théo Mariotte, Adrien Cortés, Mickael Chen, Slim Essid, Gaël Richard
Comments: NeurIPS 2024
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Probability (math.PR); Machine Learning (stat.ML)
We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
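One common way to relax a winner-takes-all assignment is a temperature-controlled softmin over the per-hypothesis losses. The sketch below illustrates that general idea only; the function name `assignment_weights` and the loss values are hypothetical, and the exact weighting and schedule used by aMCL should be taken from the paper itself.

```python
import math


def assignment_weights(losses, temperature):
    """Weight each hypothesis by exp(-loss / T), normalised to sum to 1.

    temperature <= 0 falls back to hard winner-takes-all (one-hot on the
    lowest loss); a large temperature approaches uniform weights.
    """
    if temperature <= 0:
        best = min(range(len(losses)), key=losses.__getitem__)
        return [1.0 if k == best else 0.0 for k in range(len(losses))]
    lowest = min(losses)  # subtract the minimum for numerical stability
    exps = [math.exp(-(loss - lowest) / temperature) for loss in losses]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical per-hypothesis losses for one training sample.
losses = [0.2, 0.5, 1.5]

hot = assignment_weights(losses, temperature=100.0)  # nearly uniform
cool = assignment_weights(losses, temperature=0.1)   # concentrated on the best
hard = assignment_weights(losses, temperature=0.0)   # plain winner-takes-all
```

Lowering the temperature over the course of training moves gradually from the exploratory, near-uniform regime to the greedy WTA regime.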
- [76] arXiv:2409.00753 (replaced) [pdf, html, other]
-
Title: Generalized Multi-hop Traffic Pressure for Heterogeneous Traffic Perimeter Control
Comments: 11 pages main body, 13 figures, journal paper
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Perimeter control (PC) prevents loss of traffic network capacity due to congestion in urban areas. Homogeneous PC allows all access points to a protected region to have identical permitted inflow. However, homogeneous PC performs poorly when the congestion in the protected region is heterogeneous (e.g., imbalanced demand) since the homogeneous PC does not consider specific traffic conditions around each perimeter intersection. When the protected region has spatially heterogeneous congestion, one needs to modulate the perimeter inflow rate to be higher near low-density regions and vice versa for high-density regions. A naïve approach is to leverage 1-hop traffic pressure to measure traffic conditions around perimeter intersections, but such a metric is too spatially myopic for PC. To address this issue, we formulate multi-hop downstream pressure grounded on Markov chain theory, which "looks deeper" into the protected region beyond perimeter intersections. In addition, we formulate a two-stage hierarchical control scheme that can leverage this novel multi-hop pressure to redistribute the total permitted inflow provided by a pre-trained deep reinforcement learning homogeneous control policy. Experimental results show that our heterogeneous PC approaches leveraging multi-hop pressure significantly outperform homogeneous PC in scenarios where the origin-destination flows are highly imbalanced with high spatial heterogeneity. Moreover, our approach is shown to be robust against turning ratio uncertainties by a sensitivity analysis.
- [77] arXiv:2409.10048 (replaced) [pdf, html, other]
-
Title: Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments
Comments: Accepted at ICASSP 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Although deep reinforcement learning (DRL) approaches in audio signal processing have seen substantial progress in recent years, audio-driven DRL for tasks such as navigation, gaze control and head-orientation control in the context of human-robot interaction has received little attention. Here, we propose an audio-driven DRL framework in which we utilise deep Q-learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent's performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio-driven DRL for tasks such as head-orientation control and highlights the need for training strategies that enable robust generalization across environments for real-world audio-driven DRL applications.
- [78] arXiv:2409.20539 (replaced) [pdf, html, other]
-
Title: Visual collective behaviors on spherical robots
Comments: 26 pages, 16 figures, journal bioinspired and biomimetics
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Implementations of collective motion traditionally disregard the limited sensing capabilities of an individual, instead assuming an omniscient perception of the environment. This study implements a visual flocking model in a "robot-in-the-loop" approach to reproduce these behaviors with a flock composed of 10 independent spherical robots. The model achieves robotic collective motion by only using panoramic visual information of each robot, such as retinal position, optical size and optic flow of the neighboring robots. We introduce a virtual anchor to confine the collective robotic movements so as to avoid wall interactions. For the first time, a simple visual robot-in-the-loop approach succeeds in reproducing several collective motion phases, in particular, swarming and milling. Another milestone achieved by this model is bridging the gap between simulation and physical experiments by demonstrating nearly identical behaviors in both environments with the same visual model. To conclude, we show that our minimal visual collective motion model is sufficient to recreate most collective behaviors on a robot-in-the-loop system that is scalable, behaves as numerical simulations predict and is easily comparable to traditional models.
- [79] arXiv:2410.02895 (replaced) [pdf, html, other]
-
Title: Near Optimal Approximations and Finite Memory Policies for POMDPs with Continuous Spaces
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We study an approximation method for partially observed Markov decision processes (POMDPs) with continuous spaces. Belief MDP reduction, which has been the standard approach to study POMDPs, requires rigorous approximation methods for practical applications, due to the state space being lifted to the space of probability measures. Generalizing recent work, in this paper we present rigorous approximation methods via discretizing the observation space and constructing a fully observed finite MDP model using a finite length history of the discrete observations and control actions. We show that the resulting policy is near-optimal under some regularity assumptions on the channel, and under certain controlled filter stability requirements for the hidden state process. Furthermore, by quantizing the measurements, we are able to utilize refined filter stability conditions. We also provide a Q-learning algorithm that uses a finite memory of discretized information variables, and prove its convergence to the optimality equation of the finite fully observed MDP constructed using the approximation method.
- [80] arXiv:2411.07271 (replaced) [pdf, html, other]
-
Title: Multi-hop Upstream Anticipatory Traffic Signal Control with Deep Reinforcement Learning
Comments: 5 tables, 11 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Probability (math.PR)
Coordination in traffic signal control is crucial for managing congestion in urban networks. Existing pressure-based control methods focus only on immediate upstream links, leading to suboptimal green time allocation and increased network delays. However, effective signal control inherently requires coordination across a broader spatial scope, as the effect of upstream traffic should influence signal control decisions at downstream intersections, impacting a large area in the traffic network. Although agent communication using neural network-based feature extraction can implicitly enhance spatial awareness, it significantly increases the learning complexity, adding a further layer of difficulty to the challenging task of control in deep reinforcement learning. To address the issue of learning complexity and myopic traffic pressure definition, our work introduces a novel concept based on Markov chain theory, namely multi-hop upstream pressure, which generalizes the conventional pressure to account for traffic conditions beyond the immediate upstream links. This farsighted and compact metric informs the deep reinforcement learning agent to preemptively clear the multi-hop upstream queues, guiding the agent to optimize signal timings with a broader spatial awareness. Simulations on synthetic and realistic (Toronto) scenarios demonstrate that controllers utilizing multi-hop upstream pressure significantly reduce overall network delay by prioritizing traffic movements based on a broader understanding of upstream congestion.
- [81] arXiv:2411.14593 (replaced) [pdf, html, other]
-
Title: A Systematic Study of Multi-Agent Deep Reinforcement Learning for Safe and Robust Autonomous Highway Ramp Entry
Comments: 9 pages, 9 figures; added support ack
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Vehicles today can drive themselves on highways, and driverless robotaxis operate in major cities, with more sophisticated levels of autonomous driving expected to be available and become more common in the future. Yet, technically speaking, so-called "Level 5" (L5) operation, corresponding to full autonomy, has not been achieved. For that to happen, functions such as fully autonomous highway ramp entry must be available and provide provably safe and reliably robust behavior to enable full autonomy. We present a systematic study of a highway ramp function that controls the vehicle's forward-moving actions to minimize collisions with the stream of highway traffic into which a merging (ego) vehicle enters. We take a game-theoretic multi-agent (MA) approach to this problem and study the use of controllers based on deep reinforcement learning (DRL). The virtual environment of the MA DRL uses self-play with simulated data where merging vehicles safely learn to control longitudinal position during a taper-type merge. The work presented in this paper extends existing work by studying the interaction of more than two vehicles (agents) and does so by systematically expanding the road scene with additional traffic and ego vehicles. While previous work on the two-vehicle setting established that collision-free controllers are theoretically impossible in fully decentralized, non-coordinated environments, we empirically show that controllers learned using our approach are nearly ideal when measured against idealized optimal controllers.
- [82] arXiv:2412.18836 (replaced) [pdf, html, other]
-
Title: MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
Comments: Accepted at IEEE ICASSP 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method's generalization ability to unseen speakers. We assess our framework's performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a 15.18% Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at this https URL.
- [83] arXiv:2501.05050 (replaced) [pdf, html, other]
-
Title: Music Tagging with Classifier Group Chains
Comments: Accepted to ICASSP 2025, 5 pages, 2 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
We propose music tagging with classifier chains that model the interplay of music tags. Most conventional methods estimate multiple tags independently by treating them as multiple independent binary classification problems. This treatment overlooks the conditional dependencies among music tags, leading to suboptimal tagging performance. Unlike most music taggers, the proposed method sequentially estimates each tag based on the idea of the classifier chains. Beyond the naive classifier chains, the proposed method groups the multiple tags by category, such as genre, and performs chains by unit of groups, which we call classifier group chains. Our method allows the modeling of the dependence between tag groups. We evaluate the effectiveness of the proposed method for music tagging performance through music tagging experiments using the MTG-Jamendo dataset. Furthermore, we investigate the effective order of chains for music tagging.
- [84] arXiv:2501.07329 (replaced) [pdf, html, other]
-
Title: Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding
Comments: 5 pages, 2 figures, accepted by ICASSP 2025
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, this method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.