Electrical Engineering and Systems Science
Showing new listings for Friday, 20 December 2024
- [1] arXiv:2412.14221 [pdf, html, other]
Title: Improving diabetic retinopathy screening using Artificial Intelligence: design, evaluation and before-and-after study of a custom development
Authors: Imanol Pinto, Álvaro Olazarán, David Jurío, Borja de la Osa, Miguel Sainz, Aritz Oscoz, Jerónimo Ballaz, Javier Gorricho, Mikel Galar, José Andonegui
Subjects: Image and Video Processing (eess.IV)
Background: The worst outcomes of diabetic retinopathy (DR) can be prevented by implementing DR screening programs assisted by AI. At the University Hospital of Navarre (HUN), Spain, general practitioners (GPs) grade fundus images in an ongoing DR screening program and refer target patients to a second screening level (an ophthalmologist).
Methods: After collecting their requirements, HUN decided to develop a custom AI tool, called NaIA-RD, to assist their GPs in DR screening. This paper introduces NaIA-RD, details its implementation, and highlights its unique combination of DR and retinal image quality grading in a single system. Its impact is measured in an unprecedented before-and-after study that compares 19,828 patients screened before NaIA-RD's implementation and 22,962 patients screened after.
Results: NaIA-RD influenced the screening criteria of 3/4 GPs, increasing their sensitivity. Agreement between NaIA-RD and the GPs was high for non-referral proposals (94.6% or more), but lower and variable (from 23.4% to 86.6%) for referral proposals. An ophthalmologist ruled out a NaIA-RD error in most of the contradicted referral proposals, labeling 93% of a sample of them as referable. In an autonomous setup, NaIA-RD would have reduced the study visualization workload by 4.27 times without missing a single case of sight-threatening DR referred by a GP.
Conclusion: DR screening was more effective when supported by NaIA-RD, which could be safely used to autonomously perform the first level of screening. This shows how AI devices, when seamlessly integrated into clinical workflows, can help improve clinical pathways in the long term.
- [2] arXiv:2412.14349 [pdf, html, other]
Title: Near-Optimal Cell-Free Beamforming for Physical Layer Multigroup Multicasting
Comments: 6 pages, accepted at IEEE Global Communications Conference (Globecom), 2024
Subjects: Signal Processing (eess.SP)
Physical layer multicasting is an efficient transmission technique that exploits the beamforming potential at the transmitting nodes and the broadcast nature of the wireless channel, together with the demand for the same content from several UEs. This paper addresses the max-min fair multigroup multicast beamforming optimization, which is an NP-hard problem. We propose a novel iterative elimination procedure coupled with semidefinite relaxation (SDR) to find the near-global optimum rank-1 beamforming vectors in a cell-free massive MIMO (multiple-input multiple-output) network setup. The proposed optimization procedure shows significant improvements in computational complexity and spectral efficiency performance compared to the SDR followed by the commonly used randomization procedure and the state-of-the-art difference-of-convex approximation algorithm. The significance of the proposed procedure is that it can be utilized as a rank reduction method for any problem in conjunction with SDR.
- [3] arXiv:2412.14369 [pdf, html, other]
Title: Uncertainty Awareness in Wireless Communications, Sensing, and Learning
Subjects: Signal Processing (eess.SP)
Wireless communications and sensing (WCS) establish the backbone of modern information exchange and environment perception. Typical applications range from mobile networks and the Internet of Things to radar and sensor grids. The incorporation of machine learning further expands WCS's boundaries, unlocking automated and high-quality data analytics, together with advisable and efficient decision-making. Despite transformative capabilities, wireless systems often face numerous uncertainties in design and operation, such as modeling errors due to incomplete physical knowledge, statistical errors arising from data scarcity, measurement errors caused by sensor imperfections, computational errors owing to resource limitation, and unpredictability of environmental evolution. Once ignored, these uncertainties can lead to severe outcomes, e.g., performance degradation, system untrustworthiness, inefficient resource utilization, and security vulnerabilities. As such, this article reviews mature and emerging architectural, computational, and operational countermeasures, encompassing uncertainty-aware designs of signals and systems (e.g., diversity, adaptivity, modularity), as well as uncertainty-aware modeling and computational frameworks (e.g., risk-informed optimization, robust signal processing, and trustworthy machine learning). Trade-offs to employ these methods, e.g., robustness vs optimality, are also highlighted.
- [4] arXiv:2412.14376 [pdf, html, other]
Title: Heartbeat Detection from Ballistocardiogram using Transformer Network
Authors: Ruhan Yi, Mihail Popescu, James M. Keller, Grant Scott, Laurel Despins, David Heise, Marjorie Skubic
Subjects: Signal Processing (eess.SP)
Longitudinal monitoring of heart rate (HR) and heart rate variability (HRV) can aid in tracking cardiovascular diseases (CVDs), sleep quality, sleep disorders, and reflect autonomic nervous system activity, stress levels, and overall well-being. These metrics are valuable in both clinical and everyday settings. In this paper, we present a transformer network aimed primarily at detecting the precise timing of heart beats from predicted electrocardiogram (ECG), derived from input Ballistocardiogram (BCG). We compared the performance of segment and subject models across three datasets: a lab dataset with 46 young subjects, an elder dataset with 28 elderly adults, and a combined dataset. The segment model demonstrated superior performance, with correlation coefficients of 0.97 for HR and mean heart beat interval (MHBI) when compared to ground truth. This non-invasive method offers significant potential for long-term, in-home HR and HRV monitoring, aiding in the early indication and prevention of cardiovascular issues.
- [5] arXiv:2412.14458 [pdf, html, other]
Title: Estimation in the Gaussian Multiplex Channel
Comments: 28 pages, 5 figures
Subjects: Signal Processing (eess.SP)
An abstraction for multisensor communication termed the Gaussian Multiplex Channel is presented and analyzed. In this model, the sensor outputs can be added together in any combination through a network of switches, and the combinations can be changed arbitrarily during the observation interval. The sensor output sums are observed in additive Gaussian noise. Using a mean square error cost function and a constraint on the total observation time, an optimal set of combinations (switch positions) and observation times is determined. The solution exhibits high complexity (number of different combinations) even for moderate numbers of sensors. It is then shown that there exists an alternative solution based on Hadamard designs, which achieves the same minimizing MSE cost function and only requires a number of combinations equal to the number of sensors.
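The abstract does not spell out its Hadamard-based construction, so the following is only a rough sketch of the property such designs rely on: a Hadamard matrix of order n has n mutually orthogonal rows, so n combinations suffice for n sensors. The Sylvester doubling construction and the function names here are illustrative assumptions, and the ±1 entries are not necessarily the switch patterns the paper uses (a physical switch network that only adds outputs would need a related 0/1 design).

```python
def sylvester_hadamard(order):
    """Return a Hadamard matrix of the given order (a power of two) as lists of +1/-1."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # Doubling step: build [[H, H], [H, -H]] from the current matrix H.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def gram(matrix):
    """Return the matrix product M * M^T as nested lists."""
    n = len(matrix)
    return [
        [sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
        for i in range(n)
    ]


H = sylvester_hadamard(4)  # 4 combinations for 4 hypothetical sensors
G = gram(H)                # orthogonal rows give n times the identity
```

Each row of `H` corresponds to one combination, and the diagonal Gram matrix is what makes the n measurements easy to invert.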
- [6] arXiv:2412.14511 [pdf, html, other]
Title: High-Accuracy Model Predictive Control with Inverse Hysteresis for High-Speed Trajectory Tracking of Piezoelectric Fast Steering Mirror
Subjects: Systems and Control (eess.SY)
Piezoelectric fast steering mirrors (PFSM) are widely utilized in beam precision-pointing systems but encounter considerable challenges in achieving high-precision tracking of fast trajectories due to nonlinear hysteresis and mechanical dual-axis cross-coupling. This paper proposes a model predictive control (MPC) approach integrated with a hysteresis inverse based on the Hammerstein modeling structure of the PFSM. The MPC is designed to decouple the rate-dependent dual-axis linear components, with an augmented error integral variable introduced in the state space to eliminate steady-state errors. Moreover, proofs of zero steady-state error and disturbance rejection are provided. The hysteresis inverse model is then cascaded to compensate for the rate-independent nonlinear components. Finally, PFSM tracking experiments are conducted on step, sinusoidal, triangular, and composite trajectories. Compared to traditional model-free and existing model-based controllers, the proposed method significantly enhances tracking accuracy, demonstrating superior tracking performance and robustness to frequency variations. These results offer valuable insights for engineering applications.
- [7] arXiv:2412.14614 [pdf, html, other]
Title: A Model-free Biomimetics Algorithm for Deterministic Partially Observable Markov Decision Process
Comments: 27 pages, 5 figures
Subjects: Systems and Control (eess.SY)
Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modeling decision-making under uncertainty, where the agent's observations are incomplete and the underlying system dynamics are probabilistic. Solving the POMDP problem within the model-free paradigm is challenging for agents due to the inherent difficulty in accurately identifying and distinguishing between states and observations. We define such a difficult problem as a DETerministic Partially Observable Markov Decision Process (DET-POMDP) problem, which is a specific setting of POMDP. In this problem, states and observations are in a many-to-one relationship. The state is obscured, and its relationship is less apparent to the agent. This creates obstacles for the agent to infer the state through observations. To effectively address this problem, we convert DET-POMDP into a fully observable MDP using a model-free biomimetics algorithm called BIOMAP. BIOMAP is based on the MDP Graph Automaton framework to distinguish authentic environmental information from fraudulent data. Thus, it enhances the agent's ability to develop stable policies against DET-POMDP. The experimental results highlight the superior capabilities of BIOMAP in maintaining operational effectiveness and environmental reparability in the presence of environmental deceptions when compared with existing POMDP solvers. This research opens up new avenues for the deployment of reliable POMDP-based systems in fields that are particularly susceptible to DET-POMDP problems.
- [8] arXiv:2412.14616 [pdf, html, other]
Title: An Age of Information Characterization of SPS for V2X Applications
Comments: 13 pages, 8 figures
Subjects: Signal Processing (eess.SP)
We derive a closed-form approximation of the stationary distribution of the Age of Information (AoI) of the semi-persistent scheduling (SPS) protocol, which is a core part of NR-V2X, an important standard for vehicular communications. While prior works have studied the average AoI under similar assumptions, in this work we provide a full statistical characterization of the AoI by deriving an approximation of its probability mass function. As a result, besides the average AoI, we are able to evaluate the age-violation probability, which is of particular relevance for safety-critical applications in vehicular domains, where the priority is to ensure that the AoI does not exceed a predefined threshold during system operation. The study reveals complementary behavior of the age-violation probability compared to the average AoI and highlights the role of the duration of the reservation as a key parameter in the SPS protocol. We use this to demonstrate how this crucial parameter should be tuned according to the performance requirements of the application.
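To make the distinction between the two metrics concrete, here is a minimal sketch of how the average AoI and the age-violation probability could both be read off a probability mass function. The PMF values below are made up for illustration and are not the paper's approximation; `aoi_metrics` is a hypothetical helper name.

```python
def aoi_metrics(pmf, threshold):
    """Return (average AoI, P[AoI > threshold]) for a PMF given as {age: probability}."""
    total = sum(pmf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    average = sum(age * prob for age, prob in pmf.items())
    violation = sum(prob for age, prob in pmf.items() if age > threshold)
    return average, violation


# Hypothetical PMF over AoI values measured in slots (illustrative numbers only).
example_pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
average_aoi, violation_prob = aoi_metrics(example_pmf, threshold=2)
```

Two distributions with the same average can have very different tail mass above the threshold, which is why the full PMF matters for safety-critical use.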
- [9] arXiv:2412.14638 [pdf, html, other]
Title: TuneS: Patient-specific model-based optimization of contact configuration in deep brain stimulation
Comments: 8 pages, 9 figures, submitted to IEEE Transactions on Biomedical Engineering
Subjects: Systems and Control (eess.SY)
Objective: The objective of this study is to develop and evaluate a systematic approach to optimize Deep Brain Stimulation (DBS) parameters, addressing the challenge of identifying patient-specific settings and optimal stimulation targets for various neurological and mental disorders. Methods: TuneS, a novel pipeline to predict clinically optimal DBS contact configurations based on predefined targets and constraints, is introduced. The method relies upon patient-specific models of stimulation spread and extends optimization beyond traditional neural structures to include automated, model-based targeting of streamlines. Results: Initial findings demonstrate that STN motor streamlines consistently receive a significant portion of the allocated stimulation volume, suggesting that a consistent portion of the stimulation should ideally focus on the STN motor streamlines. Using the example of a small cohort of Parkinson's disease patients, the value of model-based contact predictions for assessing stimulation targets while observing constraints is demonstrated. Conclusion: TuneS shows promise as a research tool, enabling systematic assessment of DBS target effectiveness and facilitating constraint-aware optimization of stimulation parameters. Significance: The presented pipeline offers a pathway to improve patient-specific DBS therapies and contributes to the broader understanding of effective DBS targeting strategies.
- [10] arXiv:2412.14658 [pdf, html, other]
Title: Robustness Evaluation of a Physical Internet-based Intermodal Logistic Network
Comments: 25 pages, 8 figures
Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
The Physical Internet (PI) paradigm, which has gained attention in research and academia in recent years, leverages advanced logistics and interconnected networks to revolutionize the way goods are transported and delivered, thereby enhancing efficiency, reducing costs and delays, and minimizing environmental impact. Within this system, PI-hubs function similarly to cross-docks enabling the splitting of PI-containers into smaller modules to be delivered through a network of interconnected hubs, allowing dynamic routing optimization and efficient consolidation of PI-containers. Nevertheless, the impact of the system parameters and of the relevant uncertainties on the performance of this innovative logistics framework is still unclear. For this reason, this work proposes a robustness analysis to understand how the PI logistic framework is affected by how PI-containers are handled, consolidated, and processed at the PI-hubs. To this end, the considered PI logistic system is represented via a mathematical programming model that determines the best allocation of PI-containers in an intermodal setting with different transportation modes. In doing so, four Key Performance Indicators (KPIs) are separately considered to investigate different aspects of the PI system's performance and the relevant robustness is assessed with respect to the PI-hubs' processing times and the number of modules per PI-container. In particular, a Global Sensitivity Analysis (GSA) is considered to evaluate, by means of a case study, the individual relevance of each input parameter on the resulting performance.
- [11] arXiv:2412.14673 [pdf, html, other]
Title: Classification of Linear Observed Systems on Multi-Frame Groups via Automorphisms
Comments: 6 pages, 2 figures
Subjects: Systems and Control (eess.SY)
Many navigation problems can be formulated as observer design on linear observed systems with a two-frame group structure, on which an invariant filter can be implemented with guaranteed consistency and stability. It is still unclear how this could be generalized to simultaneous estimation of the poses of multiple frames, and the general forms of linear observed systems involving multiple frames remain unknown. In this letter, we propose a multi-frame group structure by semi-direct product using the two-frame group as building blocks, covering all natural extensions. More importantly, we give a systematic direct calculation to classify all possible forms of linear observed systems, including process ODEs and algebraic observations, on such multi-frame groups through their automorphism structure, in contrast to the existing classification on two-frame groups, which relies on ingenious construction. Depth-camera inertial odometry with online extrinsics calibration is provided as an application.
- [12] arXiv:2412.14812 [pdf, html, other]
Title: Generative CKM Construction using Partially Observed Data with Diffusion Model
Subjects: Signal Processing (eess.SP)
Channel knowledge map (CKM) is a promising technique that enables environment-aware wireless networks by utilizing location-specific channel prior information to improve communication and sensing performance. A fundamental problem for CKM construction is how to utilize partially observed channel knowledge data to reconstruct a complete CKM for all possible locations of interest. This problem resembles the long-standing ill-posed inverse problem, which tries to infer from a set of limited observations the cause factors that produced them. By utilizing the recent advances in solving inverse problems with generative artificial intelligence (AI), in this paper we propose a generative CKM construction method using partially observed data by solving inverse problems with diffusion models. Simulation results show that the proposed method significantly improves the performance of CKM construction compared with benchmarking schemes.
- [13] arXiv:2412.14846 [pdf, html, other]
Title: Head and Neck Tumor Segmentation of MRI from Pre- and Mid-radiotherapy with Pre-training, Data Augmentation and Dual Flow UNet
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Head and neck tumors and metastatic lymph nodes are crucial for treatment planning and prognostic analysis. Accurate segmentation and quantitative analysis of these structures require pixel-level annotation, making automated segmentation techniques essential for the diagnosis and treatment of head and neck cancer. In this study, we investigated the effects of multiple strategies on the segmentation of pre-radiotherapy (pre-RT) and mid-radiotherapy (mid-RT) images. For the segmentation of pre-RT images, we utilized: 1) a fully supervised learning approach, and 2) the same approach enhanced with pre-trained weights and the MixUp data augmentation technique. For mid-RT images, we introduced a novel, computationally friendly network architecture that features separate encoders for mid-RT images and registered pre-RT images with their labels. The mid-RT encoder branch integrates information from pre-RT images and labels progressively during the forward propagation. We selected the highest-performing model from each fold and used their predictions to create an ensemble average for inference. In the final test, our models achieved a segmentation performance of 82.38% for pre-RT and 72.53% for mid-RT on aggregated Dice Similarity Coefficient (DSC) as HiLab. Our code is available at this https URL.
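For readers unfamiliar with the metric, the sketch below shows one common definition of an aggregated Dice score, in which intersections and mask sizes are summed over all cases before dividing, rather than averaging per-case scores. The abstract does not give the exact formula used in the evaluation, so this definition, the flat 0/1 mask representation, and the name `aggregated_dice` are assumptions.

```python
def aggregated_dice(pairs):
    """Aggregated Dice over (prediction, ground_truth) pairs of flat 0/1 masks.

    Computes 2 * sum_i |P_i & T_i| / sum_i (|P_i| + |T_i|).
    """
    intersection = 0
    total = 0
    for pred, truth in pairs:
        if len(pred) != len(truth):
            raise ValueError("prediction and ground truth must have the same size")
        intersection += sum(1 for p, t in zip(pred, truth) if p and t)
        total += sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: empty predictions and empty truths count as perfect agreement.
    return 2 * intersection / total if total else 1.0


# Two toy cases with four voxels each (illustrative data only).
cases = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([0, 1, 1, 1], [0, 1, 1, 0]),
]
score = aggregated_dice(cases)
```

Pooling before dividing keeps cases with very small or empty structures from dominating the score.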
- [14] arXiv:2412.14848 [pdf, html, other]
Title: ElectraSight: Smart Glasses with Fully Onboard Non-Invasive Eye Tracking Using Hybrid Contact and Contactless EOG
Authors: Nicolas Schärer, Federico Villani, Aishwarya Melatur, Steven Peter, Tommaso Polonelli, Michele Magno
Subjects: Signal Processing (eess.SP)
Smart glasses with integrated eye tracking technology are revolutionizing diverse fields, from immersive augmented reality experiences to cutting-edge health monitoring solutions. However, traditional eye tracking systems rely heavily on cameras and significant computational power, leading to high-energy demand and privacy issues. Alternatively, systems based on electrooculography (EOG) provide superior battery life but are less accurate and primarily effective for detecting blinks, while being highly invasive. The paper introduces ElectraSight, a non-invasive plug-and-play low-power eye tracking system for smart glasses. The hardware-software co-design of the system is detailed, along with the integration of a hybrid EOG (hEOG) solution that incorporates both contact and contactless electrodes. Within 79 kB of memory, the proposed tinyML model performs real-time eye movement classification with 81% accuracy for 10 classes and 92% for 6 classes, not requiring any calibration or user-specific fine-tuning. Experimental results demonstrate that ElectraSight delivers high accuracy in eye movement and blink classification, with minimal overall movement detection latency (90% within 60 ms) and an ultra-low computing time (301 µs). The power consumption settles down to 7.75 mW for continuous data acquisition and 46 mJ for the tinyML inference. This efficiency enables continuous operation for over 3 days on a compact 175 mAh battery. This work opens new possibilities for eye tracking in commercial applications, offering an unobtrusive solution that enables advancements in user interfaces, health diagnostics, and hands-free control systems.
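As a rough sanity check on the "over 3 days" figure, the sketch below estimates runtime from the stated 175 mAh capacity and 7.75 mW acquisition power. The 3.7 V nominal cell voltage is an assumption (the abstract gives no voltage), and the estimate ignores inference energy and power-conversion losses.

```python
def runtime_days(capacity_mah, average_power_mw, nominal_voltage_v=3.7):
    """Estimate battery runtime in days from capacity, average power, and nominal voltage."""
    if average_power_mw <= 0:
        raise ValueError("average power must be positive")
    energy_mwh = capacity_mah * nominal_voltage_v  # mAh * V = mWh
    hours = energy_mwh / average_power_mw
    return hours / 24.0


# Figures from the abstract; the voltage is the assumed default above.
estimated_days = runtime_days(capacity_mah=175, average_power_mw=7.75)
```

Under these assumptions the estimate comes out at roughly three and a half days, which is consistent with the abstract's claim.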
- [15] arXiv:2412.14890 [pdf, html, other]
Title: Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Recent speech enhancement models have shown impressive performance gains by scaling up model complexity and training data. However, the impact of dataset variability (e.g. text, language, speaker, and noise) has been underexplored. Analyzing each attribute individually is often challenging, as multiple attributes are usually entangled in commonly used datasets, posing a significant obstacle in understanding the distinct contributions of each attribute to the model's performance. To address this challenge, we propose a generation-training-evaluation framework that leverages zero-shot text-to-speech systems to investigate the impact of controlled attribute variations on speech enhancement performance. It enables us to synthesize training datasets in a scalable manner while carefully altering each attribute. Based on the proposed framework, we analyze the scaling effects of various dataset attributes on the performance of both discriminative and generative SE models. Extensive experiments on multi-domain corpora imply that acoustic attributes (e.g., speaker and noise) are much more important to current speech enhancement models than semantic attributes (e.g., language and text), offering new insights for future research.
- [16] arXiv:2412.14968 [pdf, html, other]
Title: An Overview on Over-the-air Electromagnetic Signal Processing
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
This article provides a tutorial on over-the-air electromagnetic signal processing (ESP) for next-generation wireless networks, addressing the limitations of digital processing to enhance the efficiency and sustainability of future 6th Generation (6G) systems. It explores the integration of electromagnetism and signal processing (SP) highlighting how their convergence can drive innovations for 6G technologies. Key topics include electromagnetic (EM) wave-based processing, the application of metamaterials and advanced antennas to optimize EM field manipulation with a reduced number of radiofrequency chains, and their applications in holographic multiple-input multiple-output systems. By showcasing enabling technologies and use cases, the article demonstrates how wave-based processing can minimize energy consumption, complexity, and latency, offering an effective framework for more sustainable and efficient wireless systems. This article aims to assist researchers and professionals in integrating advanced EM technologies with conventional SP methods.
- [17] arXiv:2412.14984 [pdf, html, other]
Title: Co-optimization of Vehicle Dynamics and Powertrain Management for Connected and Automated Electric Vehicles
Subjects: Systems and Control (eess.SY)
Connected and automated vehicles (CAVs) represent the future of transportation, utilizing detailed traffic information to enhance control and decision-making. Eco-driving of CAVs has the potential to significantly improve energy efficiency, and the benefits are maximized when both vehicle speed and powertrain operation are optimized. In this paper, we studied the co-optimization of vehicle speed and powertrain management for energy savings in a dual-motor electric vehicle. Control-oriented vehicle dynamics and electric powertrain models were developed to transform the problem into an optimal control problem specifically designed to facilitate real-time computation. Simulation validation was conducted using real-world data calibrated traffic simulation scenarios in Chattanooga, TN. Evaluation results demonstrated a 12.80-24.52% reduction in the vehicle's power consumption under ideal predicted traffic conditions, while maintaining benefits with various prediction uncertainties, such as Gaussian process uncertainties on acceleration and time-shift effects on predicted speed. The energy savings of the proposed eco-driving strategy are achieved through effective speed control and optimized torque allocation. The proposed model can be extended to various CAV and electric vehicle applications, with potential adaptability to diverse traffic scenarios.
- [18] arXiv:2412.15007 [pdf, html, other]
Title: Cramér-Rao Bound Optimization for Near-Field Sensing with Continuous-Aperture Arrays
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Signal Processing (eess.SP)
A Cramér-Rao bound (CRB) optimization framework for near-field sensing (NISE) with continuous-aperture arrays (CAPAs) is proposed. In contrast to conventional spatially discrete arrays (SPDAs), CAPAs emit electromagnetic (EM) probing signals through continuous source currents for target sensing, thereby exploiting the full spatial degrees of freedom (DoFs). The maximum likelihood estimation (MLE) method for estimating target locations in the near-field region is developed. To evaluate the NISE performance with CAPAs, the CRB for estimating target locations is derived based on continuous transmit and receive array responses of CAPAs. Subsequently, a CRB minimization problem is formulated to optimize the continuous source current of CAPAs. This results in a non-convex, integral-based functional optimization problem. To address this challenge, the optimal structure of the source current is derived and proven to be spanned by a series of basis functions determined by the system geometry. To solve the CRB minimization problem, a low-complexity subspace manifold gradient descent (SMGD) method is proposed, leveraging the derived optimal structure of the source current. Our simulation results validate the effectiveness of the proposed SMGD method and further demonstrate that i) the proposed SMGD method can effectively solve the CRB minimization problem with reduced computational complexity, and ii) CAPA achieves a tenfold improvement in sensing performance compared to its SPDA counterpart, due to full exploitation of spatial DoFs.
- [19] arXiv:2412.15040 [pdf, html, other]
Title: Noise Analysis and Modeling of the PMD Flexx2 Depth Camera for Robotic Applications
Comments: Accepted by COINS 2024
Journal-ref: IEEE International Conference on Omni-layer Intelligent Systems (COINS), 2024, pp. 422-427
Subjects: Image and Video Processing (eess.IV); Robotics (cs.RO)
Time-of-Flight (ToF) cameras, renowned for their ability to capture real-time 3D information, have become indispensable for agile mobile robotics. These cameras utilize light signals to accurately measure distances, enabling robots to navigate complex environments with precision. Innovative depth cameras characterized by their compact size and lightweight design, such as the recently released PMD Flexx2, are particularly suited for mobile robots. Capable of achieving high frame rates while capturing depth information, this innovative sensor is suitable for tasks such as robot navigation and terrain mapping. Operating on the ToF measurement principle, the sensor offers multiple benefits over classic stereo-based depth cameras. However, the depth images produced by the camera are subject to noise from multiple sources, complicating their simulation. This paper proposes an accurate quantification and modeling of the non-systematic noise of the PMD Flexx2. We propose models for both axial and lateral noise across various camera modes, assuming Gaussian distributions. Axial noise, modeled as a function of distance and incidence angle, demonstrated a low average Kullback-Leibler (KL) divergence of 0.015 nats, reflecting precise noise characterization. Lateral noise, deviating from a Gaussian distribution, was modeled conservatively, yielding a satisfactory KL divergence of 0.868 nats. These results validate our noise models, crucial for accurately simulating sensor behavior in virtual environments and reducing the sim-to-real gap in learning-based control approaches.
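To give a feel for what a KL divergence in nats measures, here is the standard closed form for two univariate Gaussians. The abstract does not state how its KL values were computed (for instance, between an empirical distribution and the fitted model), so this is only an illustration of the quantity, not the paper's evaluation procedure.

```python
import math


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) in nats for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


identical = kl_gaussian(0.0, 1.0, 0.0, 1.0)  # same distribution
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)    # mean offset of one standard deviation
```

A value near zero means the model distribution is almost indistinguishable from the reference, while a mean offset of one standard deviation already costs half a nat.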
- [20] arXiv:2412.15078 [pdf, html, other]
Title: Novel Conditions for the Finite-Region Stability of 2D-Systems with Application to Iterative Learning Control
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Some recent papers have extended the concept of finite-time stability (FTS) to the context of 2D linear systems, where it has been referred to as finite-region stability (FRS). FRS methodologies make even more sense than the classical FTS approach developed for 1D-systems, since, typically, at least one of the state variables of 2D-systems is a space coordinate, rather than a time variable. Since space coordinates clearly belong to finite intervals, FRS techniques are much more effective than the classical Lyapunov approach, which looks at the asymptotic behavior of the system over an infinite interval. In this regard, the novel contribution of this paper goes in several directions. First, we provide a novel sufficient condition for the FRS of linear time-varying (LTV) discrete-time 2D-systems, which turns out to be less conservative than those provided in the existing literature. Then, an interesting application of FRS to the context of iterative learning control (ILC) is investigated, by exploiting the previously developed theory. In particular, a new procedure is proposed so that the tracking errors of the ILC law converge within the desired bound in a finite number of iterations. Finally, a sufficient condition to solve the finite-region stabilization problem is proposed. All the results provided in the paper lead to optimization problems constrained by linear matrix inequalities (LMIs), which can be solved via widely available software. Numerical examples illustrate and validate the effectiveness of the proposed technique.
- [21] arXiv:2412.15079 [pdf, html, other]
Title: A Traffic Adapative Physics-informed Learning Control for Energy Savings of Connected and Automated Vehicles
Subjects: Systems and Control (eess.SY)
Model predictive control has emerged as an effective approach for real-time optimal control of connected and automated vehicles. However, nonlinear dynamics of vehicle and traffic systems make accurate modeling and real-time optimization challenging. Learning-based control offers a promising alternative, as it adapts to the environment without requiring an explicit model. For a learning control framework, an augmented state space system design is necessary since optimal control depends on both the ego vehicle's state and the predicted states of other vehicles. This work develops a traffic adaptive augmented state space system that allows the control strategy to intelligently adapt to varying traffic conditions. This design ensures that while different vehicle trajectories alter initial conditions, the system dynamics remain independent of specific trajectories. Additionally, a physics-informed learning control framework is presented that combines the value function from Bellman's equation with the derivative of the value function from Pontryagin's Maximum Principle into a unified loss function. This method aims to reduce required training data and time while enhancing robustness and efficiency. The proposed control framework is applied to car-following scenarios in real-world data calibrated simulation environments. The results show that this learning control approach alleviates real-time computational requirements while achieving car-following behaviors comparable to model-based methods, resulting in 9% energy savings in scenarios not previously seen in the training dataset.
- [22] arXiv:2412.15105 [pdf, html, other]
-
Title: Exploiting sparse structures and synergy designs to advance situational awareness of electrical power grid
Comments: PhD thesis
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
The growing threats of uncertainties, anomalies, and cyberattacks on power grids are driving a critical need to advance situational awareness, which allows system operators to form a complete and accurate picture of the present and future state. Simulation and estimation are foundational tools in this process. However, existing tools lack the robustness and efficiency required to achieve the level of situational awareness needed for the ever-evolving threat landscape. Industry-standard (steady-state) simulators are not robust to blackouts, often leading to non-converging or non-actionable results. Estimation tools lack robustness to anomalous data, returning erroneous system states. Efficiency is the other major concern, as nonlinearities and scalability issues make large systems slow to converge.
This thesis addresses robustness and efficiency gaps through a two-fold contribution. We first address the inherent limitations in the existing physics-based and data-driven worlds, and then transcend the boundaries of conventional algorithmic design in the direction of a new paradigm, Physics-ML Synergy, which integrates the strengths of the two worlds. Our approaches are built on a circuit formulation, which provides a unified framework that applies to both transmission and distribution. Sparse optimization acts as the key enabler to make these tools intrinsically robust and immune to random threats, pinpointing dominant sources of (random) blackouts and data errors. Further, we explore sparsity-exploiting optimizations to develop lightweight ML models whose prediction and detection capabilities complement physics-based tools, and whose lightweight designs advance generalization and scalability. Finally, Physics-ML Synergy further improves robustness and efficiency against targeted cyberthreats by interconnecting our physics-based tools with lightweight ML.
- [23] arXiv:2412.15133 [pdf, html, other]
-
Title: Blind Deconvolution of Graph Signals: Robustness to Graph Perturbations
Comments: 6 pages, 3 figures, submitted for publication to the IEEE Signal Processing Letters
Subjects: Signal Processing (eess.SP)
We study blind deconvolution of signals defined on the nodes of an undirected graph. Although observations are bilinear functions of both unknowns, namely the forward convolutional filter coefficients and the graph signal input, a filter invertibility requirement along with input sparsity allows for an efficient linear programming reformulation. Unlike prior art that relied on perfect knowledge of the graph eigenbasis, here we derive stable recovery conditions in the presence of small graph perturbations. We also contribute a provably convergent robust algorithm, which alternates between blind deconvolution of graph signals and eigenbasis denoising in the Stiefel manifold. Reproducible numerical tests showcase the algorithm's robustness under several graph eigenbasis perturbation models.
- [24] arXiv:2412.15137 [pdf, html, other]
-
Title: Hydrogen in Aviation: Evaluating the Feasibility and Benefits of a Green Fuel Alternative
Subjects: Systems and Control (eess.SY)
Growing concerns regarding environmental health have highlighted the aviation industry's impact and potential mitigation strategies. Previous research has indicated hydrogen's significant potential for reducing the industry's environmental impact, yet implementation challenges remain. Through analysis of light aircraft and military applications, we demonstrate that hydrogen-based systems can achieve performance metrics approaching those of traditional fuels while reducing emissions by up to 74.7%. Our findings show that hydrogen's superior energy-to-mass ratio (120 MJ/kg versus 43 MJ/kg for jet fuel) makes it particularly advantageous for aviation applications compared to battery-electric alternatives. Primary implementation challenges involve cryogenic storage systems (-253°C), tank placement optimization, and fueling infrastructure development. The observed efficiency penalties of only 2.23% in military applications suggest hydrogen's viability as a sustainable aviation fuel alternative.
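The energy-to-mass comparison quoted in this abstract can be made concrete with a little arithmetic. The sketch below uses only the two specific-energy figures from the abstract (120 MJ/kg and 43 MJ/kg); the mission energy of 10,000 MJ is a hypothetical value chosen for illustration, and the calculation deliberately ignores tank mass, storage volume, and conversion efficiency, which the abstract lists among the implementation challenges.

```python
# Specific energies quoted in the abstract, in MJ per kg of fuel.
HYDROGEN_MJ_PER_KG = 120.0
JET_FUEL_MJ_PER_KG = 43.0


def fuel_mass_for_energy(energy_mj: float, specific_energy_mj_per_kg: float) -> float:
    """Return the fuel mass (kg) that stores the requested chemical energy."""
    if specific_energy_mj_per_kg <= 0:
        raise ValueError("specific energy must be positive")
    return energy_mj / specific_energy_mj_per_kg


# How much more energy one kilogram of hydrogen carries than one kilogram of jet fuel.
specific_energy_ratio = HYDROGEN_MJ_PER_KG / JET_FUEL_MJ_PER_KG

# Hypothetical mission energy, used only to illustrate the mass difference.
mission_energy_mj = 10_000.0
hydrogen_mass_kg = fuel_mass_for_energy(mission_energy_mj, HYDROGEN_MJ_PER_KG)
jet_fuel_mass_kg = fuel_mass_for_energy(mission_energy_mj, JET_FUEL_MJ_PER_KG)

print(f"ratio: {specific_energy_ratio:.2f}")
print(f"hydrogen: {hydrogen_mass_kg:.1f} kg, jet fuel: {jet_fuel_mass_kg:.1f} kg")
```

Per unit of fuel mass, hydrogen carries roughly 2.8 times the energy of jet fuel under these figures; the comparison says nothing about volume, which is where cryogenic storage enters.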
- [25] arXiv:2412.15186 [pdf, html, other]
-
Title: Surface-Based Authentication System for Integrated Circuit Chips
Subjects: Signal Processing (eess.SP)
The rapid development of the semiconductor industry and the ubiquity of electronic devices have led to a significant increase in the counterfeiting of integrated circuits (ICs). This poses a major threat to public health, the banking industry, and military defense sectors that are heavily reliant on electronic systems. Electronic physically unclonable functions (PUFs) are widely used to authenticate IC chips at the unit level. However, electronic PUFs are limited by their requirement for IC chips to be in working condition for measurements and by their sensitivity to environmental variations. This paper proposes using optical PUFs for IC chip authentication by leveraging the unique microscopic structures of the packaging surface of individual IC chips. The proposed method relies on color images of IC chip surfaces acquired using a flatbed scanner or mobile camera. Our initial study reveals that these consumer-grade imaging devices can capture meaningful physical features from IC chip surfaces. We then propose an efficient, lightweight verification scheme leveraging specular-reflection-based features extracted from videos, achieving an equal error rate (EER) of 0.0008. We conducted factor, sensitivity, and ablation studies to understand the detailed characteristics of the proposed lightweight verification scheme. This work is the first to apply the optical PUF principle for the authentication of IC chips and has the potential to significantly enhance the security of the semiconductor supply chain.
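For readers unfamiliar with the equal error rate reported above, the following is a minimal sketch of how an EER is commonly estimated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means a better match, and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of similarity scores.

    Assumes a higher score means "more likely the same chip". Each observed
    score is tried as a decision threshold, and the threshold where the false
    accept rate (FAR) and false reject rate (FRR) are closest is reported.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        # Impostor comparisons wrongly accepted at this threshold.
        far = sum(score >= threshold for score in impostor) / len(impostor)
        # Genuine comparisons wrongly rejected at this threshold.
        frr = sum(score < threshold for score in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy scores, invented for illustration only.
genuine_scores = [0.9, 0.8, 0.75, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.7]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)  # 0.25 0.7
```

With perfectly separated score lists the same function returns an EER of 0, which is the limit that a reported EER of 0.0008 approaches.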
New submissions (showing 25 of 25 entries)
- [26] arXiv:2412.14373 (cross-list from cs.CL) [pdf, html, other]
-
Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
Comments: 26 pages, 17 figures
Subjects: Computation and Language (cs.CL); Signal Processing (eess.SP)
Large Language Models (LLMs) have shown remarkable adaptability across domains beyond text, including electrocardiograms (ECGs). In particular, there is a growing body of work exploring the task of generating text from a multi-channeled ECG and a corresponding textual prompt. Current approaches typically involve pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective and using the features output by the pretrained encoder to finetune an LLM for natural language generation (NLG). However, these methods are limited by 1) inefficiency from two-stage training and 2) interpretability challenges with encoder-generated features. To address these limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. This approach compresses and encodes ECG signals into tokens, enabling end-to-end LLM training by combining ECG and text tokens directly, while being much more interpretable since the ECG tokens can be directly mapped back to the original signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only half the time and ~48% of the data required by two-stage approaches.
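The general idea of applying byte pair encoding to a signal can be sketched in a few lines. The code below is an illustrative assumption, not the ECG-Byte pipeline: the uniform 8-level quantizer, the symbol alphabet, and the helper names `quantize` and `learn_bpe` are all hypothetical. It shows only the two generic steps the abstract alludes to, namely turning samples into discrete symbols and then repeatedly merging the most frequent adjacent pair.

```python
from collections import Counter


def quantize(signal, levels="abcdefgh"):
    """Map each sample to one of len(levels) symbols by uniform binning."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    n = len(levels)
    return [levels[min(int((x - lo) / span * n), n - 1)] for x in signal]


def learn_bpe(tokens, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # the new token is the concatenated symbols
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


# A toy periodic trace, invented for illustration.
signal = [0.0, 0.1, 0.9, 0.2, 0.0] * 4
symbols = quantize(signal)
tokens, merges = learn_bpe(symbols, num_merges=10)
print(len(symbols), len(tokens), merges)
```

Because every merged token is just the concatenation of its parts, joining the tokens recovers the quantized symbol sequence exactly, which is the sense in which such tokens stay traceable to the signal; the quantization step itself is lossy.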
- [27] arXiv:2412.14403 (cross-list from physics.flu-dyn) [pdf, html, other]
-
Title: Short-term wind forecasting via surface pressure measurements: stochastic modeling and sensor placement
Comments: 23 pages, 24 figures
Subjects: Fluid Dynamics (physics.flu-dyn); Systems and Control (eess.SY); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph)
We propose a short-term wind forecasting framework for predicting real-time variations in atmospheric turbulence based on nacelle-mounted anemometer and ground-level air-pressure measurements. Our approach combines linear stochastic estimation and Kalman filtering algorithms to assimilate and process real-time field measurements with the predictions of a stochastic reduced-order model that is confined to a two-dimensional plane at the hub height of turbines. We bridge the vertical gap between the computational plane of the model at hub height and the measurement plane on the ground using a projection technique that allows us to infer the pressure in one plane from the other. Depending on the quality of this inference, we show that customized variants of the extended and ensemble Kalman filters can be tuned to balance estimation quality and computational speed 1-1.5 diameters ahead of and behind leading turbines. In particular, we show how synchronizing the sign of estimates with that of velocity fluctuations recorded at the nacelle can significantly improve the ability to follow temporal variations upwind of the leading turbine. We also propose a convex optimization-based framework for selecting a subset of pressure sensors that achieves a desired level of accuracy relative to the optimal Kalman filter that uses all sensing capabilities.
- [28] arXiv:2412.14432 (cross-list from cs.CV) [pdf, html, other]
-
Title: IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features
Comments: 16 pages, 17 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns regarding data privacy and copyright infringement among artists. Consequently, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as introspective style attribution (IntroStyle) and demonstrates superior performance to state-of-the-art models for style retrieval. We also introduce a synthetic dataset of Style Hacks (SHacks) to isolate artistic style and evaluate fine-grained style attribution performance.
- [29] arXiv:2412.14449 (cross-list from cs.CV) [pdf, html, other]
-
Title: Color Enhancement for V-PCC Compressed Point Cloud via 2D Attribute Map Optimization
Comments: IEEE VCIP 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Video-based point cloud compression (V-PCC) converts dynamic point cloud data into video sequences and uses traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality of V-PCC compressed point clouds. We propose the lightweight de-compression Unet (LDC-Unet), a 2D neural network, to optimize the projection maps generated during V-PCC encoding. The optimized 2D maps are then back-projected to 3D space to enhance the corresponding point cloud attributes. Additionally, we introduce a transfer learning strategy and develop a customized natural image dataset for the initial training. The model is then fine-tuned using the projection maps of the compressed point clouds. The whole strategy effectively addresses the scarcity of point cloud training data. Our experiments, conducted on the public 8i voxelized full bodies long sequences (8iVSLF) dataset, demonstrate the effectiveness of our proposed method in improving the color quality.
- [30] arXiv:2412.14456 (cross-list from cs.CV) [pdf, html, other]
-
Title: LEDiff: Latent Exposure Diffusion for HDR Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that equips a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.
- [31] arXiv:2412.14492 (cross-list from cs.AI) [pdf, html, other]
-
Title: FaultExplainer: Leveraging Large Language Models for Interpretable Fault Detection and Diagnosis
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Machine learning algorithms are increasingly being applied to fault detection and diagnosis (FDD) in chemical processes. However, existing data-driven FDD platforms often lack interpretability for process operators and struggle to identify root causes of previously unseen faults. This paper presents FaultExplainer, an interactive tool designed to improve fault detection, diagnosis, and explanation in the Tennessee Eastman Process (TEP). FaultExplainer integrates real-time sensor data visualization, Principal Component Analysis (PCA)-based fault detection, and identification of top contributing variables within an interactive user interface powered by large language models (LLMs). We evaluate the LLMs' reasoning capabilities in two scenarios: one where historical root causes are provided, and one where they are not, to mimic the challenge of previously unseen faults. Experimental results using GPT-4o and o1-preview models demonstrate the system's strengths in generating plausible and actionable explanations, while also highlighting its limitations, including reliance on PCA-selected features and occasional hallucinations.
- [32] arXiv:2412.14522 (cross-list from cs.LG) [pdf, html, other]
-
Title: CAE-T: A Channelwise AutoEncoder with Transformer for EEG Abnormality Detection
Comments: The manuscript consists of 10 pages, including 5 figures. The experimental results are based on evaluations using the TUH Abnormal EEG Corpus
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
Electroencephalogram (EEG) signals are critical for detecting abnormal brain activity, but their high dimensionality and complexity pose significant challenges for effective analysis. In this paper, we propose CAE-T, a novel framework that combines a channelwise CNN-based autoencoder with a single-head transformer classifier for efficient EEG abnormality detection. The channelwise autoencoder compresses raw EEG signals while preserving channel independence, reducing computational costs and retaining biologically meaningful features. The compressed representations are then fed into the transformer-based classifier, which efficiently models long-term dependencies to distinguish between normal and abnormal signals. Evaluated on the TUH Abnormal EEG Corpus, the proposed model achieves 85.0% accuracy, 76.2% sensitivity, and 91.2% specificity at the per-case level, outperforming baseline models such as EEGNet, Deep4Conv, and FusionCNN. Furthermore, CAE-T requires only 202M FLOPs and 2.9M parameters, making it significantly more efficient than transformer-based alternatives. The framework retains interpretability through its channelwise design, demonstrating great potential for future applications in neuroscience research and clinical practice. The source code is available at this https URL.
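The accuracy, sensitivity, and specificity figures quoted above follow the usual confusion-matrix definitions. The sketch below shows those definitions only; the counts are hypothetical round numbers and are not taken from the TUH Abnormal EEG Corpus or from the paper, and the function name `binary_metrics` is an assumption.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts.

    "Positive" is taken to mean an abnormal recording.
    """
    return {
        # Share of abnormal cases that were flagged as abnormal.
        "sensitivity": tp / (tp + fn),
        # Share of normal cases that were correctly left unflagged.
        "specificity": tn / (tn + fp),
        # Share of all cases that were classified correctly.
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts chosen for easy arithmetic.
metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)
print(metrics)
```

A pattern like the one in the abstract, with specificity well above sensitivity, corresponds to a classifier that misses more abnormal cases than it raises false alarms.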
- [33] arXiv:2412.14538 (cross-list from cs.NI) [pdf, html, other]
-
Title: Overview of AI and Communication for 6G Network: Fundamentals, Challenges, and Future Research Opportunities
Qimei Cui, Xiaohu You, Ni Wei, Guoshun Nan, Xuefei Zhang, Jianhua Zhang, Xinchen Lyu, Ming Ai, Xiaofeng Tao, Zhiyong Feng, Ping Zhang, Qingqing Wu, Meixia Tao, Yongming Huang, Chongwen Huang, Guangyi Liu, Chenghui Peng, Zhiwen Pan, Tao Sun, Dusit Niyato, Tao Chen, Muhammad Khurram Khan, Abbas Jamalipour, Mohsen Guizani, Chau Yuen
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
With the increasing demand for seamless connectivity and intelligent communication, the integration of artificial intelligence (AI) and communication for sixth-generation (6G) networks is emerging as a revolutionary architecture. This paper presents a comprehensive overview of AI and communication for 6G networks, emphasizing their foundational principles, inherent challenges, and future research opportunities. We begin with a retrospective analysis of AI and the evolution of large-scale AI models, underscoring their pivotal roles in shaping contemporary communication technologies. We then give a detailed account of the envisioned integration of AI within 6G networks, organized into three progressive developmental stages. The initial stage, AI for Network, focuses on employing AI to augment network performance, optimize efficiency, and enhance user service experiences. The subsequent stage, Network for AI, highlights the role of the network in facilitating and supporting AI operations and presents key enabling technologies, including digital twins for AI and semantic communication. In the final stage, AI as a Service, it is anticipated that future 6G networks will natively provide AI functions as services and support application scenarios like immersive communication and intelligent industrial robots. Specifically, we define the quality of AI service, which refers to the measurement framework for AI services within the network. In addition to these developmental stages, we thoroughly examine the standardization processes pertinent to AI in network contexts, highlighting key milestones and ongoing efforts. Finally, we outline promising future research opportunities that could drive the evolution and refinement of AI and communication for 6G, positioning them as a cornerstone of next-generation communication infrastructure.
- [34] arXiv:2412.14547 (cross-list from cs.CV) [pdf, html, other]
-
Title: Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images
Comments: Accepted by AAAI 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Neural Radiance Fields (NeRFs) have demonstrated prominent performance in novel view synthesis. However, their input heavily relies on image acquisition under normal light conditions, making it challenging to learn accurate scene representation in low-light environments where images typically exhibit significant noise and severe color distortion. To address these challenges, we propose a novel approach, Bright-NeRF, which learns enhanced and high-quality radiance fields from multi-view low-light raw images in an unsupervised manner. Our method simultaneously achieves color restoration, denoising, and enhanced novel view synthesis. Specifically, we leverage a physically-inspired model of the sensor's response to illumination and introduce a chromatic adaptation loss to constrain the learning of the response, enabling consistent color perception of objects regardless of lighting conditions. We further utilize the raw data's properties to expose the scene's intensity automatically. Additionally, we have collected a multi-view low-light raw image dataset to advance research in this field. Experimental results demonstrate that our proposed method significantly outperforms existing 2D and 3D approaches. Our code and dataset will be made publicly available.
- [35] arXiv:2412.14571 (cross-list from cs.CV) [pdf, html, other]
-
Title: SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection
Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, Eryun Liu
Comments: Accepted by AAAI 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeter-wave radar is very attractive, since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with radar point clouds, the performance of existing methods is still much lower than expected. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D radar-based 3D object detection. It is characterized by its capability to learn features from a Lidar-radar-fused teacher network through semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Then, two feature distillation modules are designed to facilitate the cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD boosts the mAP by 10.38% over the baseline and outperforms the state-of-the-art works on the VoD dataset. The experiment on ZJUODset also shows a 5.12% mAP improvement on the moderate difficulty level over the baseline when extra unlabeled data are available. Code is available at this https URL.
- [36] arXiv:2412.14657 (cross-list from cs.IT) [pdf, html, other]
-
Title: Directivity-Aware Degrees of Freedom Analysis for Extremely Large-Scale MIMO
Comments: 5 pages, 6 figures
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Extremely large-scale multiple-input multiple-output (XL-MIMO) communications, enabled by numerous antenna elements integrated into large antenna surfaces, can provide increased effective degrees of freedom (EDoF) to achieve high diversity gain. However, how the EDoF is influenced by the directional radiation pattern of antenna elements remains an open problem. In this work, empowered by the wavenumber-domain channel representation, we analyze the EDoF in a general case where the directivity of antennas, determined by the antenna structure and element spacing, is considered. Specifically, we first reveal the uneven distribution of directivity-aware wavenumber-domain coupling coefficients, i.e., channel gain towards different directions, in the isotropic Rayleigh fading channel. EDoF is then calculated based on this distribution of coupling coefficients. A numerical method is also provided to obtain coupling coefficients via electromagnetic full-wave simulations. Given the influence of antenna directivity, we explore via simulations how EDoF and ergodic channel capacity vary with the element spacing for different antenna types.
- [37] arXiv:2412.14824 (cross-list from math.OC) [pdf, html, other]
-
Title: Provably Convergent Plug-and-play Proximal Block Coordinate Descent Method for Hyperspectral Anomaly Detection
Subjects: Optimization and Control (math.OC); Image and Video Processing (eess.IV)
Hyperspectral anomaly detection refers to identifying pixels in the hyperspectral images that have spectral characteristics significantly different from the background. In this paper, we introduce a novel model that represents the background information using a low-rank representation. We integrate an implicit proximal denoiser prior, associated with a deep learning based denoiser, within a plug-and-play (PnP) framework to effectively remove noise from the eigenimages linked to the low-rank representation. Anomalies are characterized using a generalized group sparsity measure, denoted as $\|\cdot\|_{2,\psi}$. To solve the resulting orthogonal constrained nonconvex nonsmooth optimization problem, we develop a PnP-proximal block coordinate descent (PnP-PBCD) method, where the eigenimages are updated using a proximal denoiser within the PnP framework. We prove that any accumulation point of the sequence generated by the PnP-PBCD method is a stationary point. We evaluate the effectiveness of the PnP-PBCD method on hyperspectral anomaly detection in scenarios with and without Gaussian noise contamination. The results demonstrate that the proposed method can effectively detect anomalous objects, outperforming the competing methods that may mistakenly identify noise as anomalies or misidentify the anomalous objects due to noise interference.
- [38] arXiv:2412.14925 (cross-list from cs.CV) [pdf, html, other]
-
Title: Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Hyperspectral images (HSIs) densely sample the world in both the space and frequency domains and are therefore more distinctive than RGB images. Usually, an HSI needs to be calibrated to minimize the impact of various illumination conditions. The traditional way to calibrate an HSI utilizes a physical reference, which involves manual operations, causes occlusions, and/or limits camera mobility. These limitations motivate this paper to automatically calibrate HSIs using a learning-based method. Towards this goal, a large-scale HSI calibration dataset is created, which has 765 high-quality HSI pairs covering diversified natural scenes and illuminations. The dataset is further expanded to 7650 pairs by combining it with 10 different physically measured illuminations. A spectral illumination transformer (SIT) together with an illumination attention module is proposed. Extensive benchmarks demonstrate the state-of-the-art performance of the proposed SIT. The benchmarks also indicate that low-light conditions are more challenging than normal conditions. The dataset and codes are available online: this https URL
- [39] arXiv:2412.15000 (cross-list from cs.RO) [pdf, html, other]
-
Title: Autonomous Navigation in Dynamic Human Environments with an Embedded 2D LiDAR-based Person Tracker
Comments: Accepted by SAS 2024
Journal-ref: IEEE Sensors Applications Symposium (SAS), 2024, pp. 1-6
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
In the rapidly evolving landscape of autonomous mobile robots, the emphasis on seamless human-robot interactions has shifted towards autonomous decision-making. This paper delves into the intricate challenges associated with robotic autonomy, focusing on navigation in dynamic environments shared with humans. It introduces an embedded real-time tracking pipeline, integrated into a navigation planning framework for effective person tracking and avoidance, adapting a state-of-the-art 2D LiDAR-based human detection network and an efficient multi-object tracker. By addressing the key components of detection, tracking, and planning separately, the proposed approach highlights the modularity and transferability of each component to other applications. Our tracking approach is validated on a quadruped robot equipped with 270° 2D-LiDAR against motion capture system data, with the preferred configuration achieving an average MOTA of 85.45% in three newly recorded datasets, while reliably running in real-time at 20 Hz on the NVIDIA Jetson Xavier NX embedded GPU-accelerated platform. Furthermore, the integrated tracking and avoidance system is evaluated in real-world navigation experiments, demonstrating how accurate person tracking benefits the planner in optimizing the generated trajectories, enhancing its collision avoidance capabilities. This paper contributes to safer human-robot cohabitation, blending recent advances in human detection with responsive planning to navigate shared spaces effectively and securely.
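The MOTA figure reported above is the standard multiple object tracking accuracy, which aggregates misses, false positives, and identity switches relative to the number of ground-truth objects. The sketch below implements that textbook definition; the function name and the counts are hypothetical and are not the paper's data.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT.

    All counts are summed over every frame of the sequence. The value can be
    negative when the tracker makes more errors than there are ground-truth
    objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical error counts over a sequence with 1000 ground-truth person instances.
score = mota(false_negatives=80, false_positives=50, id_switches=15, num_gt=1000)
print(f"MOTA = {score:.3f}")  # MOTA = 0.855
```

Because the three error types are simply added, MOTA alone does not reveal whether a tracker mostly misses people or mostly swaps their identities; the individual counts are needed for that.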
- [40] arXiv:2412.15023 (cross-list from cs.SD) [pdf, html, other]
-
Title: Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls
Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, so that they can focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video, and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by the use of the envelope as a ControlNet input, while semantic alignment is achieved through the use of sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at this https URL.
- [41] arXiv:2412.15032 (cross-list from cs.CV) [pdf, html, other]
-
Title: DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space
Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
Comments: 23 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to high-resolution generation without using the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why 'image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at this https URL.
- [42] arXiv:2412.15054 (cross-list from cs.CV) [pdf, html, other]
-
Title: GIRAFE: Glottal Imaging Dataset for Advanced Segmentation, Analysis, and Facilitative Playbacks Evaluation
Comments: 18 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The advances in the development of Facilitative Playbacks extracted from High-Speed videoendoscopic sequences of the vocal folds are hindered by a notable lack of publicly available datasets annotated with the semantic segmentations corresponding to the area of the glottal gap. This fact also limits the reproducibility and further exploration of existing research in this field.
To address this gap, GIRAFE is a data repository designed to facilitate the development of advanced techniques for the semantic segmentation, analysis, and fast evaluation of High-Speed videoendoscopic sequences of the vocal folds. The repository includes 65 high-speed videoendoscopic recordings from a cohort of 50 patients (30 female, 20 male). The dataset comprises 15 recordings from healthy controls, 26 from patients with diagnosed voice disorders, and 24 with an unknown health condition. All of them were manually annotated by an expert, including the masks corresponding to the semantic segmentation of the glottal gap. The repository is also complemented with the automatic segmentation of the glottal area using different state-of-the-art approaches.
This data set has already supported several studies, which demonstrates its usefulness for the development of new glottal gap segmentation algorithms from High-Speed-Videoendoscopic sequences to improve or create new Facilitative Playbacks. Despite these advances and others in the field, the broader challenge of performing an accurate and completely automatic semantic segmentation method of the glottal area remains open.
- [43] arXiv:2412.15058 (cross-list from cs.CV) [pdf, html, other]
-
Title: MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
Comments: Project Website: this https URL Keywords: interactive segmentation, in-context learning, medical image analysis, biomedical imaging, image annotation, visual prompting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of manually labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, using MultiverSeg reduced the total number of scribble steps by 53% and clicks by 36% to achieve 90% Dice on sets of images from unseen tasks. We release code and model weights at this https URL
- [44] arXiv:2412.15182 (cross-list from cs.RO) [pdf, html, other]
-
Title: STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
Comments: Project website at this https URL
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. Notably, while these generalist policies can improve the average performance across many tasks, the performance of generalist policies on any one task is often suboptimal due to negative transfer between partitions of the data, compared to task-specific specialist policies. In this work, we argue for the paradigm of training policies during deployment given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve and train models directly on relevant data at test time. Furthermore, we show that many robotics tasks share considerable amounts of low-level behaviors and that retrieval at the "sub"-trajectory granularity enables significantly improved data utilization, generalization, and robustness in adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss out on shared cross-task content. This work proposes STRAP, a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, showing the ability to scale to much larger offline datasets in the real world as well as the ability to learn robust control policies with just a handful of real-world demonstrations.
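The retrieval idea described above can be sketched with subsequence dynamic time warping (DTW), which finds the span of a long sequence that best matches a short query. This is a toy version under stated assumptions: features are scalars compared by absolute difference, whereas the abstract describes features from pre-trained vision foundation models; the function name `subsequence_dtw` is hypothetical and this is not the STRAP implementation.

```python
def subsequence_dtw(query, corpus):
    """Find the corpus span that best matches `query` under DTW.

    Returns (cost, start, end) with `end` inclusive. Scalar features and
    absolute-difference distance are illustrative assumptions.
    """
    n, m = len(query), len(corpus)
    if n == 0 or m == 0:
        raise ValueError("query and corpus must be non-empty")
    cost = [[0.0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(query[i] - corpus[j])
            if i == 0:
                # Free start: a match may begin at any corpus position.
                prev_cost, prev_start = 0.0, j
            elif j == 0:
                prev_cost, prev_start = cost[i - 1][0], start[i - 1][0]
            else:
                # Best of diagonal, vertical, and horizontal predecessors.
                prev_cost, prev_start = min(
                    (cost[i - 1][j - 1], start[i - 1][j - 1]),
                    (cost[i - 1][j], start[i - 1][j]),
                    (cost[i][j - 1], start[i][j - 1]),
                )
            cost[i][j] = d + prev_cost
            start[i][j] = prev_start
    # Free end: pick the cheapest position in the last query row.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end


if __name__ == "__main__":
    print(subsequence_dtw([1, 2, 3], [9, 9, 1, 2, 3, 9]))
```

Allowing a free start and free end is what distinguishes sub-trajectory matching from full-trajectory DTW, where both endpoints are pinned.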
- [45] arXiv:2412.15191 (cross-list from cs.CV) [pdf, html, other]
-
Title: AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Comments: Project Page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL
Cross submissions (showing 20 of 20 entries)
- [46] arXiv:2111.02363 (replaced) [pdf, html, other]
-
Title: Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 31, pp. 54-70, 2023
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in short-time objective intelligibility (STOI) prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in mean opinion score (MOS) prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
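The linear correlation coefficient (LCC) quoted above is the Pearson correlation between predicted and reference scores. A minimal sketch of how such a value can be computed follows; the function name `pearson_lcc` and the example scores are illustrative assumptions and do not come from the paper.

```python
import math


def pearson_lcc(predicted, reference):
    """Pearson linear correlation between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0.0 or var_r == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Made-up predicted vs. reference quality scores, for illustration only.
    predicted = [1.9, 2.6, 3.1, 3.8]
    reference = [2.0, 2.5, 3.2, 3.7]
    print(round(pearson_lcc(predicted, reference), 3))
```

An LCC close to 1 means the predicted scores track the reference scores linearly, which is the sense in which the abstract's reported gains (for example 0.990 vs 0.964) should be read.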
- [47] arXiv:2204.13620 (replaced) [pdf, html, other]
-
Title: Generative Adversarial Networks for Image Super-Resolution: A Survey
Comments: 31 pages, 10 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Single image super-resolution (SISR) has played an important role in the field of image processing. Recent generative adversarial networks (GANs) can achieve excellent results on low-resolution images with small samples. However, there is little literature summarizing different GANs in SISR. In this paper, we conduct a comparative study of GANs from different perspectives. We first review the development of GANs. Second, we present popular GAN architectures for image applications with large and small samples. Then, we analyze the motivations, implementations, and differences of GAN-based optimization methods and discriminative learning for image super-resolution in supervised, semi-supervised, and unsupervised manners, where these GANs are analyzed by integrating different network architectures, prior knowledge, loss functions, and multiple tasks. Next, we compare the performance of these popular GANs on public datasets via quantitative and qualitative analysis in SISR. Finally, we highlight challenges of GANs and potential research points for SISR.
- [48] arXiv:2306.08918 (replaced) [pdf, html, other]
-
Title: PUGAN: Physical Model-Guided Underwater Image Enhancement Using GAN with Dual-Discriminators
Comments: 8 pages, 4 figures, Accepted by IEEE Transactions on Image Processing 2023
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Due to the light absorption and scattering induced by the water medium, underwater images usually suffer from degradation problems such as low contrast, color distortion, and blurred details, which make downstream underwater understanding tasks more difficult. Obtaining clear and visually pleasing images has therefore become a common concern, motivating the task of underwater image enhancement (UIE). Among existing UIE methods, Generative Adversarial Network (GAN)-based methods perform well in visual aesthetics, while physical model-based methods have better scene adaptability. Inheriting the advantages of these two types of models, we propose a physical model-guided GAN model for UIE in this paper, referred to as PUGAN. The entire network is under the GAN architecture. On the one hand, we design a Parameters Estimation subnetwork (Par-subnet) to learn the parameters for physical model inversion, and use the generated color enhancement image as auxiliary information for the Two-Stream Interaction Enhancement sub-network (TSIE-subnet). Meanwhile, we design a Degradation Quantization (DQ) module in TSIE-subnet to quantize scene degradation, thereby reinforcing the enhancement of key regions. On the other hand, we design the Dual-Discriminators for the style-content adversarial constraint, promoting the authenticity and visual aesthetics of the results. Extensive experiments on three benchmark datasets demonstrate that our PUGAN outperforms state-of-the-art methods in both qualitative and quantitative metrics.
- [49] arXiv:2307.04111 (replaced) [pdf, html, other]
-
Title: Model-Based End-to-End Learning for Multi-Target Integrated Sensing and Communication under Hardware Impairments
Comments: 15 pages, 10 figures, accepted to TWC
Subjects: Signal Processing (eess.SP)
We study model-based end-to-end learning in the context of integrated sensing and communication (ISAC) under hardware impairments. Hardware impairments are usually addressed by means of array calibration with a focus on communication performance. However, residual impairments may exist that affect sensing performance. This paper proposes a data-driven framework for mitigating such impairments. A monostatic orthogonal frequency-division multiplexing (OFDM) sensing and multiple-input single-output (MISO) communication scenario is considered, incorporating hardware imperfections at the ISAC transceiver antenna array. We propose a novel differentiable version of the orthogonal matching pursuit (OMP) algorithm that is suitable for multi-target sensing and allows for efficient end-to-end learning of the hardware impairments. Based on the differentiable OMP, we devise two model-based parameterization strategies of the ISAC beamformer and sensing receiver to account for hardware impairments: (i) learning a dictionary of steering vectors for different angles and (ii) learning the parameterized hardware impairments. We carry out a comprehensive performance analysis of the proposed model-based learning approaches and a strong baseline consisting of least-squares beamforming, conventional OMP, and maximum-likelihood symbol detection for communication. Results show that by parameterizing the hardware impairments, learning approaches offer gains in terms of higher detection probability, position estimation accuracy, and lower symbol error rate (SER) compared to the baseline. We demonstrate that learning the parameterized hardware impairments outperforms learning a dictionary of steering vectors, also exhibiting the lowest complexity.
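For context, the conventional orthogonal matching pursuit (OMP) used in the baseline above can be sketched as a greedy loop: pick the dictionary atom most correlated with the residual, refit all selected atoms by least squares, and update the residual. The version below is a real-valued toy in plain Python, assuming unit-norm atoms; it is neither the paper's differentiable variant nor a complex-valued steering-vector implementation, and all function names are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _solve(a, b):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def omp(atoms, y, sparsity):
    """Greedy sparse recovery; returns {atom_index: coefficient}.

    `atoms` is a list of dictionary columns, assumed unit-norm and
    linearly independent on the selected support.
    """
    residual = list(y)
    support, coeffs = [], []
    for _ in range(sparsity):
        # 1) Select the unused atom most correlated with the residual.
        scores = [abs(_dot(atom, residual)) for atom in atoms]
        for k in support:
            scores[k] = -1.0
        support.append(max(range(len(atoms)), key=lambda k: scores[k]))
        # 2) Least-squares refit on the support via the normal equations.
        gram = [[_dot(atoms[p], atoms[q]) for q in support] for p in support]
        rhs = [_dot(atoms[p], y) for p in support]
        coeffs = _solve(gram, rhs)
        # 3) Recompute the residual with the refitted coefficients.
        residual = [
            y_i - sum(c * atoms[p][i] for c, p in zip(coeffs, support))
            for i, y_i in enumerate(y)
        ]
    return dict(zip(support, coeffs))


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(omp(basis, [0.0, 3.0, -2.0], sparsity=2))
```

The atom-selection step is a hard argmax, which is the non-differentiable part that a learning-based pipeline has to work around.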
- [50] arXiv:2406.05437 (replaced) [pdf, html, other]
-
Title: From Analog to Digital: Multi-Order Digital Joint Coding-Modulation for Semantic Communication
Subjects: Signal Processing (eess.SP)
Recent studies in joint source-channel coding (JSCC) have fostered a fresh paradigm in end-to-end semantic communication. Despite notable performance achievements, present initiatives in building semantic communication systems primarily hinge on the transmission of continuous channel symbols, thus presenting challenges in compatibility with established digital systems. In this paper, we introduce a novel approach to address this challenge by developing a multi-order digital joint coding-modulation (MDJCM) scheme for semantic communications. Initially, we construct a digital semantic communication system by integrating a multi-order modulation/demodulation module into a nonlinear transform source-channel coding (NTSCC) framework. Recognizing the non-differentiable nature of modulation/demodulation, we propose a novel substitution training strategy. Herein, we treat modulation/demodulation as a constrained quantization process and introduce scaling operations alongside manually crafted noise to approximate this process. As a result, semantic communication systems trained with this approximation can be deployed in practical modulation/demodulation scenarios with superior performance. Additionally, we demonstrate the equivalence by analyzing the involved probability distribution. Moreover, to further upgrade the performance, we develop a hierarchical dimension-reduction strategy to provide a gradual information extraction process. Extensive experimental evaluations demonstrate the superiority of our proposed method over existing digital and non-digital JSCC techniques.
- [51] arXiv:2406.14861 (replaced) [pdf, other]
-
Title: Resilience of the Electric Grid through Trustable IoT-Coordinated Assets (Extended version)
Vineet J. Nair, Venkatesh Venkataramanan, Priyank Srivastava, Partha S. Sarker, Anurag Srivastava, Laurentiu D. Marinovici, Jun Zha, Christopher Irwin, Prateek Mittal, John Williams, Jayant Kumar, H. Vincent Poor, Anuradha M. Annaswamy
Comments: Accepted to the Proceedings of the National Academy of Sciences (PNAS) 2024. Extended version with supplementary information included
Subjects: Systems and Control (eess.SY); Emerging Technologies (cs.ET)
The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) including renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. However, they can introduce new vulnerabilities in the form of cyberattacks, which can cause significant challenges in ensuring grid resilience. We propose a framework in this paper for achieving grid resilience through suitably coordinated assets including a network of Internet of Things (IoT) devices. A local electricity market is proposed to identify trustable assets and carry out this coordination. Situational Awareness (SA) of locally available DERs with the ability to inject power or reduce consumption is enabled by the market, together with a monitoring procedure for their trustability and commitment. With this SA, we show that a variety of cyberattacks can be mitigated using local trustable resources without stressing the bulk grid. Multiple demonstrations are carried out using a high-fidelity co-simulation platform, real-time hardware-in-the-loop validation, and a utility-friendly simulator.
- [52] arXiv:2406.17666 (replaced) [pdf, html, other]
-
Title: Improving ovarian cancer segmentation accuracy with transformers through AI-guided labeling
Aneesh Rangnekar, Kevin M. Boehm, Emily A. Aherne, Ines Nikolovski, Natalie Gangai, Ying Liu, Dimitry Zamarin, Kara L. Roche, Sohrab P. Shah, Yulia Lakhman, Harini Veeraraghavan
Subjects: Image and Video Processing (eess.IV)
Transformer models have demonstrated the capability to produce highly accurate segmentation of organs and tumors. However, model training requires high-quality curated datasets to ensure robust generalization to unseen datasets. Hence, we developed an artificial intelligence (AI) guided approach to assist with radiologist tumor delineation of partially segmented computed tomography datasets containing primary (adnexa) tumors and metastatic (omental) implants. AI guidance was implemented by training a 2D multiple resolution residual network on a dataset of 245 contrast-enhanced CTs with partially segmented examples. The same dataset curated through AI guidance was then used to refine two pretrained transformer models called SMIT and Swin UNETR. The models were independently tested on 71 publicly available multi-institutional 3D CT datasets. Segmentation accuracy was computed using the Dice similarity coefficient (DSC), average symmetric surface distance (ASSD), and relative volume difference (RVD) metrics. Radiomic feature reproducibility was assessed using the concordance correlation coefficient (CCC). Training with AI-guided segmentations significantly improved the accuracy of both SMIT (p = 6.2e-5) and Swin UNETR (p = 2e-4) models compared with using a partially delineated training dataset. Furthermore, SMIT-generated segmentations resulted in more reproducible features compared to Swin UNETR under multiple feature categories. Our results show that AI-guided data curation provides a more efficient approach to train AI models and that AI-generated segmentations can provide reproducible radiomics features.
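Two of the overlap metrics named above can be sketched on flattened binary masks as follows. This is a minimal illustration, not the study's evaluation code: the function names are assumptions, and RVD is shown in its signed form (predicted minus reference volume, divided by reference volume), although some works report its absolute value.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient for two equal-length 0/1 masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_volume_difference(pred, truth):
    """Signed RVD: (predicted volume - reference volume) / reference volume."""
    vol_pred = sum(1 for p in pred if p)
    vol_truth = sum(1 for t in truth if t)
    if vol_truth == 0:
        raise ValueError("reference mask is empty; RVD is undefined")
    return (vol_pred - vol_truth) / vol_truth


if __name__ == "__main__":
    predicted = [1, 1, 1, 0]
    reference = [1, 1, 0, 0]
    print(dice_coefficient(predicted, reference))
    print(relative_volume_difference(predicted, reference))
```

DSC rewards spatial overlap, while RVD only compares total volumes, so a segmentation can score well on one and poorly on the other; this is why such metrics are typically reported together.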
- [53] arXiv:2407.04336 (replaced) [pdf, html, other]
-
Title: AI-Driven Mobility Management for High-Speed Railway Communications: Compressed Measurements and Proactive Handover
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
High-speed railway (HSR) communications are pivotal for ensuring rail safety, operations, maintenance, and delivering passenger information services. The high speed of trains creates rapidly time-varying wireless channels, increases the signaling overhead, and reduces the system throughput, making it difficult to meet the growing and stringent needs of HSR applications. In this article, we explore artificial intelligence (AI)-based beam-level and cell-level mobility management suitable for HSR communications. Particularly, we propose a compressed spatial multi-beam measurements scheme via compressive sensing for beam-level mobility management in HSR communications. In comparison to traditional down-sampling spatial beam measurements, this method leads to improved spatial-temporal beam prediction accuracy with the same measurement overhead. Moreover, we propose a novel AI-based proactive handover scheme to predict handover events and reduce radio link failure (RLF) rates in HSR communications. Compared with the traditional event A3-based handover mechanism, the proposed approach significantly reduces RLF rates while saving 50% of the beam measurement overhead.
- [54] arXiv:2408.02549 (replaced) [pdf, html, other]
-
Title: Generative AI as a Service in 6G Edge-Cloud: Generation Task Offloading by In-context Learning
Comments: This paper has been accepted by IEEE Wireless Communications Letters
Subjects: Systems and Control (eess.SY)
Generative artificial intelligence (GAI) is a promising technique towards 6G networks, and generative foundation models such as large language models (LLMs) have attracted considerable interest from academia and telecom industry. This work considers a novel edge-cloud deployment of foundation models in 6G networks. Specifically, it aims to minimize the service delay of foundation models by radio resource allocation and task offloading, i.e., offloading diverse content generation tasks to proper LLMs at the network edge or cloud. In particular, we first introduce the communication system model, i.e., allocating radio resources and calculating link capacity to support generated content transmission, and then we present the LLM inference model to calculate the delay of content generation. After that, we propose a novel in-context learning method to optimize the task offloading decisions. It utilizes LLM's inference capabilities, and avoids the difficulty of dedicated model training or fine-tuning as in conventional machine learning algorithms. Finally, the simulations demonstrate that the proposed edge-cloud deployment and in-context learning task offloading method can achieve satisfactory generation service quality without dedicated model training or fine-tuning.
- [55] arXiv:2408.11227 (replaced) [pdf, other]
-
Title: OCTCube-M: A 3D multimodal optical coherence tomography foundation model for retinal and systemic diseases with cross-cohort and cross-device validation
Zixuan Liu, Hanwen Xu, Addie Woicik, Linda G. Shapiro, Marian Blazes, Yue Wu, Verena Steffen, Catherine Cukras, Cecilia S. Lee, Miao Zhang, Aaron Y. Lee, Sheng Wang
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
We present OCTCube-M, a 3D OCT-based multi-modal foundation model for jointly analyzing OCT and en face images. OCTCube-M first developed OCTCube, a 3D foundation model pre-trained on 26,685 3D OCT volumes encompassing 1.62 million 2D OCT images. It then exploits a novel multi-modal contrastive learning framework COEP to integrate other retinal imaging modalities, such as fundus autofluorescence and infrared retinal imaging, into OCTCube, efficiently extending it into multi-modal foundation models. OCTCube achieves best performance on predicting 8 retinal diseases, demonstrating strong generalizability on cross-cohort, cross-device and cross-modality prediction. OCTCube can also predict cross-organ nodule malignancy (CT) and low cardiac ejection fraction as well as systemic diseases, such as diabetes and hypertension, revealing its wide applicability beyond retinal diseases. We further develop OCTCube-IR using COEP with 26,685 OCT and IR image pairs. OCTCube-IR can accurately retrieve between OCT and IR images, allowing joint analysis between 3D and 2D retinal imaging modalities. Finally, we trained a tri-modal foundation model OCTCube-EF from 4 million 2D OCT images and 400K en face retinal images. OCTCube-EF attains the best performance on predicting the growth rate of geographic atrophy (GA) across datasets collected from 6 multi-center global trials conducted in 23 countries. This improvement is statistically equivalent to running a clinical trial with more than double the size of the original study. Our analysis based on another retrospective case study reveals OCTCube-EF's ability to avoid false positive Phase-III results according to its accurate treatment effect estimation on the Phase-II results. In sum, OCTCube-M is a 3D multi-modal foundation model framework that integrates OCT and other retinal imaging modalities revealing substantial diagnostic and prognostic benefits.
- [56] arXiv:2409.02597 (replaced) [pdf, html, other]
-
Title: Rate-Adaptive Generative Semantic Communication Using Conditional Diffusion Models
Subjects: Signal Processing (eess.SP)
Recent advances in deep learning-based joint source-channel coding (DJSCC) have shown promise for end-to-end semantic image transmission. However, most existing schemes primarily focus on optimizing pixel-wise metrics, which often fail to align with human perception, leading to lower perceptual quality. In this letter, we propose a novel generative DJSCC approach using conditional diffusion models to enhance the perceptual quality of transmitted images. Specifically, by utilizing entropy models, we effectively manage transmission bandwidth based on the estimated entropy of transmitted symbols. These symbols are then used at the receiver as conditional information to guide a conditional diffusion decoder in image reconstruction. Our model is built upon the emerging advanced mamba-like linear attention (MLLA) skeleton, which excels in image processing tasks while also offering fast inference speed. Besides, we introduce a multi-stage training strategy to ensure the stability and improve the overall performance of the model. Simulation results demonstrate that our proposed method significantly outperforms existing approaches in terms of perceptual quality.
- [57] arXiv:2409.07666 (replaced) [pdf, html, other]
-
Title: Design of Distributed Controller for Discrete-Time Systems Via the Integration of Extended LMI and Clique-Wise Decomposition
Subjects: Systems and Control (eess.SY)
This study addresses the centralized synthesis of distributed controllers using linear matrix inequalities (LMIs). Sparsity constraints on control gains of distributed controllers result in conservatism via the convexification of the existing methods such as the extended LMI method. In order to mitigate the conservatism, we introduce a novel LMI formulation for this problem, utilizing the clique-wise decomposition method from our previous work on continuous-time systems. By reformulating the sparsity constraint on the gain matrix within cliques, this method achieves a broader solution set. Also, the analytical superiority of our method is confirmed through numerical examples.
- [58] arXiv:2411.11857 (replaced) [pdf, html, other]
-
Title: Radiance Field Delta Video Compression in Edge-Enabled Vehicular Metaverse
Matúš Dopiriak, Eugen Šlapak, Juraj Gazda, Devendra S. Gurjar, Mohammad Abdullah Al Faruque, Marco Levorato
Comments: 1. III. Formulation of the problem -> refined mathematical notations and equations. this http URL. B Delta Segmentation -> updated Delta segmentation (DS) algorithm using mathematical description, pseudocode and Fig.3. 3. V. D Packet Loss -> added reference. 4. Added biographies. 5. Changed template
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Connected and autonomous vehicles (CAVs) offload computationally intensive tasks to multi-access edge computing (MEC) servers via vehicle-to-infrastructure (V2I) communication, enabling applications within the vehicular metaverse, which transforms the physical environment into a digital space, enabling advanced analysis or predictive modeling. A core challenge is physical-to-virtual (P2V) synchronization through digital twins (DTs), reliant on MEC networks and ultra-reliable low-latency communication (URLLC). To address this, we introduce radiance field (RF) delta video compression (RFDVC), which uses an RF-encoder and RF-decoder architecture with distributed RFs as DTs storing photorealistic 3D urban scenes in compressed form. This method extracts differences between the CAV-frame capturing actual traffic and the RF-frame capturing the empty scene from the same camera pose, in batches that are encoded and transmitted over the MEC network. Experiments show data savings of up to 71% against the H.264 codec and 44% against the H.265 codec under different conditions such as lighting changes and rain. RFDVC also demonstrates resilience to transmission errors, achieving up to +0.29 structural similarity index measure (SSIM) improvement at block error rate (BLER) = 0.35 in non-rainy and +0.25 at BLER = 0.2 in rainy conditions, ensuring superior visual quality compared to standard video coding (VC) methods across various conditions.
- [59] arXiv:2412.10341 (replaced) [pdf, other]
-
Title: Shape error prediction in 5-axis machining using graph neural networks
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
This paper presents an innovative method for predicting shape errors in 5-axis machining using graph neural networks. The graph structure is defined with nodes representing workpiece surface points and edges denoting the neighboring relationships. The dataset encompasses data from a material removal simulation, process data, and post-machining quality information. Experimental results show that the presented approach can generalize the shape error prediction for the investigated workpiece geometry. Moreover, by modelling spatial and temporal connections within the workpiece, the approach handles a low number of labels compared to non-graphical methods such as Support Vector Machines.
- [60] arXiv:2412.11343 (replaced) [pdf, html, other]
-
Title: Temporal Logic Control for Nonlinear Stochastic Systems Under Unknown Disturbances
Subjects: Systems and Control (eess.SY)
In this paper, we present a novel framework to synthesize robust strategies for discrete-time nonlinear systems with random disturbances that are unknown, against temporal logic specifications. The proposed framework is data-driven and abstraction-based: leveraging observations of the system, our approach learns a high-confidence abstraction of the system in the form of an uncertain Markov decision process (UMDP). The uncertainty in the resulting UMDP is used to formally account for both the error in abstracting the system and for the uncertainty coming from the data. Critically, we show that for any given state-action pair in the resulting UMDP, the uncertainty in the transition probabilities can be represented as a convex polytope obtained by a two-layer state discretization and concentration inequalities. This allows us to obtain tighter uncertainty estimates compared to existing approaches, and guarantees efficiency, as we tailor a synthesis algorithm exploiting the structure of this UMDP. We empirically validate our approach on several case studies, showing substantially improved performance compared to the state-of-the-art.
- [61] arXiv:2308.03240 (replaced) [pdf, html, other]
-
Title: Carbon-Aware Optimal Power Flow
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
To facilitate effective decarbonization of the electric power sector, this paper introduces the generic Carbon-aware Optimal Power Flow (C-OPF) method for power system decision-making that considers demand-side carbon accounting and emission management. Built upon the classic optimal power flow (OPF) model, the C-OPF method incorporates carbon emission flow equations and constraints, as well as carbon-related objectives, to jointly optimize power flow and carbon flow. In particular, this paper establishes the feasibility and solution uniqueness of the carbon emission flow equations, and proposes modeling and linearization techniques to address the issues of undetermined power flow directions and bilinear terms in the C-OPF model. Additionally, two novel carbon emission models, together with the carbon accounting schemes, for energy storage systems are developed and integrated into the C-OPF model. Numerical simulations demonstrate the characteristics and effectiveness of the C-OPF method, in comparison with OPF solutions.
- [62] arXiv:2310.12446 (replaced) [pdf, html, other]
-
Title: Electromagnetic Information Theory-Based Statistical Channel Model for Improved Channel Estimation
Comments: Electromagnetic information theory (EIT) is an emerging interdisciplinary subject, aiming at providing a unified analytical framework for wireless systems as well as guiding practical system design. This paper answers the question: "Can we improve wireless communication systems via EIT?"
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Electromagnetic information theory (EIT) is an emerging interdisciplinary subject that integrates classical Maxwell electromagnetics and Shannon information theory. The goal of EIT is to uncover the information transmission mechanisms from an electromagnetic (EM) perspective in wireless systems. Existing works on EIT mainly focus on the analysis of EM channel characteristics, degrees of freedom, and system capacity. However, these works do not clarify how to integrate EIT knowledge into the design and optimization of wireless systems. To fill this gap, we propose an EIT-based statistical channel model with simplified parameterization. Specifically, by averaging the solutions of Maxwell's equations over a tunable von Mises distribution, we obtain a spatio-temporal correlation function (STCF) model of the EM channel, which we name the EMCF. Thanks to its simplified closed-form expression, the EMCF can be readily applied to various channel modeling and inference tasks. Furthermore, by tuning the parameters of the EMCF, we propose an EIT-based covariance estimator (EIT-Cov) to accurately capture the channel covariance. Since classical MMSE estimators can exploit prior information contained in the channel covariance matrix, we further propose the EIT-MMSE channel estimator by substituting the EMCF for the covariance matrix. Simulation results show that both the proposed EIT-Cov covariance estimator and the EIT-MMSE channel estimator outperform their baseline algorithms, demonstrating that EIT is beneficial to wireless communication systems.
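For readers unfamiliar with how a covariance prior enters an MMSE channel estimator, the textbook linear MMSE form is sketched below. The observation model $\mathbf{y} = \mathbf{h} + \mathbf{n}$, the noise variance $\sigma^2$, and the symbol $\mathbf{R}_{\mathrm{EMCF}}$ (a covariance matrix built from the EMCF) are assumptions made for illustration and are not taken from the paper, whose exact estimator may differ.

```latex
% Assumed model: y = h + n, with h ~ CN(0, R_h) and n ~ CN(0, sigma^2 I).
% Classical linear MMSE estimate using the true channel covariance R_h:
\hat{\mathbf{h}}_{\mathrm{MMSE}}
  = \mathbf{R}_{h}\left(\mathbf{R}_{h} + \sigma^{2}\mathbf{I}\right)^{-1}\mathbf{y}
% Illustrative substitution of an EMCF-based covariance for R_h:
\hat{\mathbf{h}}_{\mathrm{EIT\text{-}MMSE}}
  = \mathbf{R}_{\mathrm{EMCF}}\left(\mathbf{R}_{\mathrm{EMCF}} + \sigma^{2}\mathbf{I}\right)^{-1}\mathbf{y}
```

The better the substituted covariance matches the true channel statistics, the closer the second estimate gets to the first.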
- [63] arXiv:2310.14778 (replaced) [pdf, html, other]
-
Title: Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J.B. Jackson, Wenwu Wang
Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide range of applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, Bayesian filters can address the problems of data association, audio-visual fusion, and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.
- [64] arXiv:2312.15638 (replaced) [pdf, html, other]
-
Title: Risk-Aware Control of Discrete-Time Stochastic Systems: Integrating Kalman Filter and Worst-case CVaR in Control Barrier Functions
Comments: This has been presented at IEEE Conference on Decision and Control 2024, pp. 2019-2024. Minor typos in equations (12) - (14) have been fixed
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper proposes control approaches for discrete-time linear systems subject to stochastic disturbances. It employs a Kalman filter to estimate the mean and covariance of the state propagation, and the worst-case conditional value-at-risk (CVaR) to quantify the tail risk using the estimated mean and covariance. The quantified risk is then integrated into a control barrier function (CBF) to derive constraints for controller synthesis, addressing tail risks near safe set boundaries. Two optimization-based control methods are presented using the obtained constraints for half-space and ellipsoidal safe sets, respectively. The effectiveness of the obtained results is demonstrated using numerical simulations.
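To illustrate how a mean and covariance alone can bound tail risk, the sketch below evaluates the well-known closed form for the worst-case CVaR of a linear loss over all distributions sharing a given mean and covariance. It is a minimal sketch under stated assumptions, not the paper's formulation: the function name `worst_case_cvar_linear`, the loss `a^T x + b` (non-positive inside a half-space safe set), and the convention that `eps` is the tail probability are all chosen here for illustration.

```python
import math


def worst_case_cvar_linear(a, b, mean, cov, eps):
    """Worst-case CVaR of the linear loss a^T x + b at tail probability eps.

    The supremum is taken over all distributions of x with the given mean
    and covariance. Closed form (assumed convention, eps in (0, 1)):
        a^T mean + b + sqrt((1 - eps) / eps) * sqrt(a^T cov a)
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    n = len(a)
    # Mean of the loss.
    loss_mean = sum(a[i] * mean[i] for i in range(n)) + b
    # Variance of the loss, a^T cov a (clamped at 0 against round-off).
    loss_var = sum(a[i] * cov[i][j] * a[j] for i in range(n) for j in range(n))
    kappa = math.sqrt((1.0 - eps) / eps)
    return loss_mean + kappa * math.sqrt(max(loss_var, 0.0))


# Hypothetical 2-D example: safe set {x : x_1 <= 2}, i.e. loss x_1 - 2.
risk = worst_case_cvar_linear(
    a=[1.0, 0.0], b=-2.0, mean=[0.0, 0.0],
    cov=[[1.0, 0.0], [0.0, 1.0]], eps=0.5,
)
print(risk)
```

A non-positive value means the tail-risk bound stays inside the half-space; shrinking `eps` inflates the covariance term and makes the constraint more conservative.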
- [65] arXiv:2401.11141 (replaced) [pdf, html, other]
-
Title: Wideband Beamforming for RIS Assisted Near-Field Communications
Journal-ref: in IEEE Transactions on Wireless Communications, vol. 23, no. 11, pp. 16836-16851, Nov. 2024
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with a frequency-dependent hybrid precoding architecture by introducing sub-connected true time delay (TTD) units, while two specific RIS architectures, namely true time delay-based RIS (TTD-RIS) and virtual subarray-based RIS (SA-RIS), are exploited to realize the frequency-dependent passive beamforming at the RIS. Furthermore, efficient E2E beamforming models that do not require explicit channel state information are proposed, which jointly exploit the uplink channel training module and the downlink wideband beamforming module. In the proposed network architecture of the E2E models, the classical communication signal processing methods, i.e., polarized filtering and sparsity transform, are leveraged to develop a signal-guided beamforming network. Numerical results show that the proposed E2E models achieve superior beamforming performance and robustness compared to conventional beamforming benchmarks. In addition, the tradeoff between the beamforming gain and the hardware complexity is investigated for different frequency-dependent RIS architectures, in which the TTD-RIS can achieve better spectral efficiency than the SA-RIS while incurring additional energy consumption and hardware cost.
- [66] arXiv:2401.14898 (replaced) [pdf, html, other]
-
Title: Decentralized real-time iterations for distributed NMPC
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This article presents a Real-Time Iteration (RTI) scheme for distributed Nonlinear Model Predictive Control (NMPC). The scheme transfers the well-known RTI approach, a key enabler for many industrial real-time NMPC implementations, to the setting of cooperative distributed control. At each sampling instant, one outer iteration of a bi-level decentralized Sequential Quadratic Programming (dSQP) method is applied to a centralized optimal control problem. This ensures that real-time requirements are met and it facilitates cooperation between subsystems. Combining novel dSQP convergence results with RTI stability guarantees, we prove local exponential stability under standard assumptions on the MPC design with and without terminal constraints. The proposed scheme only requires neighbor-to-neighbor communication and avoids a central coordinator. A numerical example with coupled inverted pendulums demonstrates the efficacy of the approach.
- [67] arXiv:2403.05974 (replaced) [pdf, html, other]
-
Title: Deep Reinforcement Learning Enhanced Rate-Splitting Multiple Access for Interference Mitigation
Subjects: Information Theory (cs.IT); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
This study explores the application of the rate-splitting multiple access (RSMA) technique, vital for interference mitigation in modern communication systems. It investigates the use of precoding methods in RSMA, especially in complex multiple-antenna interference channels, employing deep reinforcement learning. The aim is to optimize precoders and power allocation for common and private data streams involving multiple decision-makers. A multi-agent deep deterministic policy gradient (MADDPG) framework is employed to address this complexity, where decentralized agents collectively learn to optimize actions in a continuous policy space. We also explore the challenges posed by imperfect channel side information at the transmitter. Additionally, decoding order estimation is addressed to determine the optimal decoding sequence for common and private data sequences. Simulation results demonstrate the effectiveness of the proposed RSMA method based on MADDPG, achieving the upper bound in single-antenna scenarios and closely approaching theoretical limits in multi-antenna scenarios. Comparative analysis shows superiority over other techniques such as MADDPG without rate-splitting, maximal ratio transmission (MRT), zero-forcing (ZF), and leakage-based precoding methods. These findings highlight the potential of deep reinforcement learning-driven RSMA in reducing interference and enhancing system performance in communication systems.
- [68] arXiv:2407.16302 (replaced) [pdf, html, other]
-
Title: DeepClean: Integrated Distortion Identification and Algorithm Selection for Rectifying Image Corruptions
Comments: 7 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Distortion identification and rectification in images and videos is vital for achieving good performance in downstream vision applications. Instead of relying on fixed trial-and-error based image processing pipelines, we propose a two-level sequential planning approach for automated image distortion classification and rectification. The higher level detects the class of corruptions present in the input image, if any. The lower level selects a specific algorithm to be applied from a set of externally provided candidate algorithms. The entire two-level setup runs as a single forward pass during inference and is queried iteratively until the original image is retrieved. We demonstrate improvements over three baselines on the object detection task on the COCO image dataset with a rich set of distortions. The advantages of our approach are its dynamic reconfiguration, conditioned on the input image, and its generalisability to unseen candidate algorithms at inference time, since it relies only on comparing the image embeddings of their outputs.
- [69] arXiv:2408.10561 (replaced) [pdf, html, other]
-
Title: ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
Comments: 11 pages, 6 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The detection and analysis of infant cry and snoring events are crucial tasks within the field of audio signal processing. While existing datasets for general sound event detection are plentiful, they often fall short in providing sufficient, strongly labeled data specific to infant cries and snoring. To provide a benchmark dataset and thus foster the research of infant cry and snoring detection, this paper introduces the Infant Cry and Snoring Detection (ICSD) dataset, a novel, publicly available dataset specially designed for ICSD tasks. The ICSD comprises three types of subsets: a real strongly labeled subset with event-based labels annotated manually, a weakly labeled subset with only clip-level event annotations, and a synthetic subset generated and labeled with strong annotations. This paper provides a detailed description of the ICSD creation process, including the challenges encountered and the solutions adopted. We offer a comprehensive characterization of the dataset, discussing its limitations and key factors for ICSD usage. Additionally, we conduct extensive experiments on the ICSD dataset to establish baseline systems and offer insights into the main factors when using this dataset for ICSD research. Our goal is to develop a dataset that will be widely adopted by the community as a new open benchmark for future ICSD research.
- [70] arXiv:2408.11479 (replaced) [pdf, html, other]
-
Title: Learning Deep Dissipative Dynamics
Comments: AAAI 2025
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
This study tackles the challenge of strictly guaranteeing "dissipativity" of a dynamical system represented by neural networks learned from given time-series data. Dissipativity is a crucial indicator for dynamical systems that generalizes stability and input-output stability, and is known to be valid across various systems including robotics, biological systems, and molecular dynamics. By analytically proving the general solution to the nonlinear Kalman-Yakubovich-Popov (KYP) lemma, which is the necessary and sufficient condition for dissipativity, we propose a differentiable projection that transforms any dynamics represented by neural networks into dissipative ones, together with a learning method for the transformed dynamics. Utilizing the generality of dissipativity, our method strictly guarantees stability, input-output stability, and energy conservation of trained dynamical systems. Finally, we demonstrate the robustness of our method against out-of-domain input through applications to robotic arms and fluid dynamics. Code is available at this https URL
- [71] arXiv:2409.17603 (replaced) [pdf, html, other]
-
Title: Deep CLAS: Deep Contextual Listen, Attend and Spell
Comments: Submitted to JUSTC
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraints, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a Conformer rather than an LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments are conducted on the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative increase in recall and a 53.49% relative increase in F1-score in the named entity recognition scenario.
- [72] arXiv:2410.10913 (replaced) [pdf, html, other]
-
Title: Audio Captioning RAG via Generative Pair-to-Pair Retrieval with Refined Knowledge Base
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs. However, adapting LLMs to learn audio concepts requires massive training data and substantial computational resources. To address these challenges, Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base (KB) and augments them with query audio to generate accurate textual responses. In RAG, the relevance of the retrieved information plays a crucial role in effectively processing the input. In this paper, we analyze how different retrieval methods and knowledge bases impact the relevance of audio-text pairs and the performance of audio captioning with RAG. We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs to the query audio, thereby improving the relevance and accuracy of retrieved information. Additionally, we refine the large-scale knowledge base to retain only audio-text pairs that align with the contextualized intents. Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD, with detailed ablation studies validating the effectiveness of our retrieval and KB construction methods.
- [73] arXiv:2410.13677 (replaced) [pdf, html, other]
-
Title: Beamforming Optimization for Continuous Aperture Array (CAPA)-based Communications
Comments: 14 pages, 9 figures
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The beamforming optimization in continuous aperture array (CAPA)-based multi-user communications is studied. In contrast to conventional spatially discrete antenna arrays, CAPAs can exploit the full spatial degrees of freedom (DoFs) by emitting information-bearing electromagnetic (EM) waves through continuous source current distributed across the aperture. Nevertheless, such an operation renders the beamforming optimization problem a non-convex integral-based functional programming problem, which is challenging for conventional discrete optimization methods. Two low-complexity approaches are proposed to solve the functional programming problem. 1) Calculus of variations (CoV)-based approach: The closed-form structure of the optimal continuous source patterns is derived based on CoV, inspiring a low-complexity integral-free iterative algorithm for solving the functional programming problem. 2) Correlation-based zero-forcing (Corr-ZF) approach: Closed-form ZF source current patterns that completely eliminate the inter-user interference are derived based on the channel correlations. By using these patterns, the original functional programming problem is transformed into a simple power allocation problem, which can be solved using the classical water-filling approach with reduced complexity. Our numerical results validate the effectiveness of the proposed designs and reveal that: i) compared to the state-of-the-art Fourier-based discretization approach, the proposed CoV-based approach not only improves communication performance but also reduces computational complexity by up to hundreds of times for large CAPA apertures and high frequencies, and ii) the proposed Corr-ZF approach achieves asymptotically optimal performance compared to the CoV-based approach.
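The classical water-filling step mentioned in the abstract can be sketched as follows. This is a generic textbook version, not the paper's Corr-ZF derivation: it assumes each user sees an interference-free channel with a hypothetical effective gain `g_k` (already normalized by noise power) and maximizes the sum of `log(1 + g_k * p_k)` under a total power budget. The function name `water_filling` and the example gains are illustrative.

```python
def water_filling(gains, total_power):
    """Classical water-filling: p_k = max(0, level - 1/g_k), sum(p_k) = total_power.

    gains: positive effective channel gains (assumed noise-normalized).
    Returns the per-channel powers in the original channel order.
    """
    # Sort channels by "floor height" 1/g_k, best channel first.
    floors = sorted((1.0 / g, k) for k, g in enumerate(gains))
    powers = [0.0] * len(gains)
    level, floor_sum, active = 0.0, 0.0, 0
    for n, (floor, _) in enumerate(floors, start=1):
        # Water level if exactly the n best channels are active.
        candidate = (total_power + floor_sum + floor) / n
        if candidate <= floor:
            break  # this channel (and all worse ones) would get no power
        floor_sum += floor
        level, active = candidate, n
    for floor, k in floors[:active]:
        powers[k] = level - floor
    return powers


# Hypothetical gains for three users and a unit power budget.
allocation = water_filling([2.0, 1.0, 0.1], total_power=1.0)
print(allocation)
```

Stronger channels receive more power, and channels whose floor `1/g_k` lies above the water level are switched off entirely.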
- [74] arXiv:2412.11795 (replaced) [pdf, html, other]
-
Title: ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
Comments: Accepted by AAAI 2025
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.
- [75] arXiv:2412.14031 (replaced) [pdf, html, other]
-
Title: Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the Euclidean output space. Using tools from Riemannian optimization, we prove last-iterate convergence of the Riemannian gradient flow to the optimal in-class predictor at an exponential rate that is independent of the conditioning of the Gram matrix, without requiring explicit regularization. We further characterize the critical impacts of the neural network scaling factor and the initialization on the convergence behavior. In the overparameterized regime, we show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels, analogous to the underparameterized regime. These findings demonstrate the potential of Gauss-Newton methods for efficiently optimizing neural networks, particularly in ill-conditioned problems where kernel and Gram matrices have small singular values.