Electrical Engineering and Systems Science

Showing new listings for Friday, 21 February 2025

Total of 108 entries

[1] arXiv:2502.13969 [pdf, html, other]: Title: Bridging Simulation and Reality: A 3D Clustering-Based Deep Learning Model for UAV-Based RF Source Localization

Saad Masrur, Ismail Guvenc

Comments: This paper has been submitted to IEEE ICC 2025

Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Localization of radio frequency (RF) sources has critical applications, including search and rescue, jammer detection, and monitoring of hostile activities. Unmanned aerial vehicles (UAVs) offer significant advantages for RF source localization (RFSL) over terrestrial methods, leveraging autonomous 3D navigation and improved signal capture at higher altitudes. Recent advancements in deep learning (DL) have further enhanced localization accuracy, particularly for outdoor scenarios. DL models often face challenges in real-world performance, as they are typically trained on simulated datasets that fail to replicate real-world conditions fully. To address this, we first propose the Enhanced Two-Ray propagation model, reducing the simulation-to-reality gap by improving the accuracy of propagation environment modeling. For RFSL, we propose the 3D Cluster-Based RealAdaptRNet, a DL-based method leveraging 3D clustering-based feature extraction for robust localization. Experimental results demonstrate that the proposed Enhanced Two-Ray model provides superior accuracy in simulating real-world propagation scenarios compared to conventional free-space and two-ray models. Notably, the 3D Cluster-Based RealAdaptRNet, trained entirely on simulated datasets, achieves exceptional performance when validated in real-world environments using the AERPAW physical testbed, with an average localization error of 18.2 m. The proposed approach is computationally efficient, utilizing 33.5 times fewer parameters, and demonstrates strong generalization capabilities across diverse trajectories, making it highly suitable for real-world applications.
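For orientation, the conventional two-ray ground-reflection model that the abstract names as a baseline can be sketched as follows. This is not the authors' Enhanced Two-Ray model, whose details the abstract does not give. The function names, the unit antenna gains, and the default reflection coefficient of -1 are illustrative assumptions.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def free_space_gain(d_los, freq_hz):
    """Friis free-space power gain (Pr/Pt) with unit antenna gains."""
    lam = C / freq_hz
    return (lam / (4 * math.pi * d_los)) ** 2


def two_ray_gain(ground_dist, h_tx, h_rx, freq_hz, reflection_coeff=-1.0):
    """Textbook two-ray power gain (Pr/Pt) with unit antenna gains.

    Sums the line-of-sight ray and one ground-reflected ray coherently.
    reflection_coeff = -1 is an assumed idealised ground reflection.
    """
    lam = C / freq_hz
    k = 2 * math.pi / lam                           # wavenumber
    d_los = math.hypot(ground_dist, h_tx - h_rx)    # direct path length
    d_ref = math.hypot(ground_dist, h_tx + h_rx)    # reflected path length
    los = cmath.exp(-1j * k * d_los) / d_los
    ref = reflection_coeff * cmath.exp(-1j * k * d_ref) / d_ref
    return (lam / (4 * math.pi)) ** 2 * abs(los + ref) ** 2


if __name__ == "__main__":
    # Hypothetical geometry: a 30 m UAV altitude and a 2 m ground receiver.
    for d in (100.0, 500.0, 2000.0):
        fs = free_space_gain(math.hypot(d, 28.0), 3.4e9)
        tr = two_ray_gain(d, 30.0, 2.0, 3.4e9)
        print(f"{d:7.0f} m  free-space {10 * math.log10(fs):7.1f} dB"
              f"  two-ray {10 * math.log10(tr):7.1f} dB")
```

Setting `reflection_coeff=0.0` removes the ground ray and recovers the free-space model, which makes the two baselines easy to compare on the same geometry.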
[2] arXiv:2502.13972 [pdf, html, other]: Title: IncepFormerNet: A multi-scale multi-head attention network for SSVEP classification

Yan Huang, Yongru Chen, Lei Cao, Yongnian Cao, Xuechun Yang, Yilin Dong, Tianyu Liu

Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential (SSVEP)-based Brain-Computer Interface (BCI) systems. DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepFormerNet adeptly extracts multi-scale temporal information from time series data using parallel convolution kernels of varying sizes, accurately capturing the subtle variations and critical features within SSVEP data. The model also integrates the multi-head attention mechanism from the Transformer architecture, which not only provides insights into global dependencies but also significantly enhances the understanding and representation of complex patterns. In addition, it takes advantage of filter bank techniques to extract features based on the spectral characteristics of SSVEP data. To validate the effectiveness of the proposed model, we conducted experiments on two public datasets. The experimental results show that IncepFormerNet achieves an accuracy of 87.41% on Dataset 1 and 71.97% on Dataset 2 using a 1.0-second time window. To further verify the superiority of the proposed model, we compared it with other deep learning models, and the results indicate that our method achieves significantly higher accuracy than the compared models. The source code for this work is available at: this https URL.
[3] arXiv:2502.13974 [pdf, html, other]: Title: Segmentation-free integration of nuclei morphology and spatial transcriptomics for retinal images

Eduard Chelebian, Pratiti Dasgupta, Zainalabedin Samadi, Carolina Wählby, Amjad Askary

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

This study introduces SEFI (SEgmentation-Free Integration), a novel method for integrating morphological features of cell nuclei with spatial transcriptomics data. Cell segmentation poses a significant challenge in the analysis of spatial transcriptomics data, as tissue-specific structural complexities and densely packed cells in certain regions make it difficult to develop a universal approach. SEFI addresses this by utilizing self-supervised learning to extract morphological features from fluorescent nuclear staining images, enhancing the clustering of gene expression data without requiring segmentation. We demonstrate SEFI on spatially resolved gene expression profiles of the developing retina, acquired using multiplexed single molecule Fluorescence In Situ Hybridization (smFISH). SEFI is publicly available at this https URL.
[4] arXiv:2502.13976 [pdf, html, other]: Title: Regularização, aprendizagem profunda e interdisciplinaridade em problemas inversos mal-postos (Regularization, deep learning, and interdisciplinarity in ill-posed inverse problems)

Roberto Gutierrez Beraldo, Ricardo Suyama

Comments: 200 pages, in Portuguese language, 54 figures

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

In this book, written in Portuguese, we discuss what ill-posed problems are and how the regularization method is used to solve them. In the form of questions and answers, we reflect on the origins and future of regularization, relating the similarities and differences of its meaning in different areas, including inverse problems, statistics, machine learning, and deep learning.
[5] arXiv:2502.13982 [pdf, html, other]: Title: Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics

Kabir Kumar

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Natural Language Processing (NLP) and voice recognition agents are rapidly transforming healthcare by enabling efficient, accessible, and professional patient support while automating routine work. This report documents a self-directed project in which models fine-tuned on medical call recordings are analysed through a two-stage system: Automatic Speech Recognition (ASR) for speech transcription and a Large Language Model (LLM) for context-aware, professional responses. The ASR model, fine-tuned on phone call recordings, provides generalised transcription of diverse patient speech over a call, while the LLM matches the transcribed text to a medical diagnosis. A novel audio preprocessing strategy is deployed to provide invariance to incoming recording or call data; it applies noise and clipping augmentation to make the pipeline robust to the type of microphone and the ambient conditions the patient might have while calling or recording.
[6] arXiv:2502.13983 [pdf, html, other]: Title: Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders

Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)

Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, little attention has been paid to integrating non-verbal communication methods, such as gestures, which individuals with language disorders substantially rely on to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experimental results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies specifically designed to meet the unique needs of individuals with language impairments.
[7] arXiv:2502.13985 [pdf, other]: Title: End-to-end pipeline for simultaneous temperature estimation and super resolution of low-cost uncooled infrared camera frames for precision agriculture applications

Navot Oz, Nir Sochen, David Mendlovic, Iftach Klapp

Subjects: Image and Video Processing (eess.IV)

Radiometric infrared (IR) imaging is a valuable technique for remote-sensing applications in precision agriculture, such as irrigation monitoring, crop health assessment, and yield estimation. Low-cost uncooled non-radiometric IR cameras offer new implementations in agricultural monitoring. However, these cameras have inherent drawbacks that limit their usability, such as low spatial resolution, spatially variant nonuniformity, and lack of radiometric calibration. In this article, we present an end-to-end pipeline for temperature estimation and super resolution of frames captured by a low-cost uncooled IR camera. The pipeline consists of two main components: a deep-learning-based temperature-estimation module, and a deep-learning-based super-resolution module. The temperature-estimation module learns to map the raw gray-level IR images to the corresponding temperature maps while also correcting for nonuniformity. The super-resolution module uses a deep-learning network to enhance the spatial resolution of the IR images by scale factors of ×2 and ×4. We evaluated the performance of the pipeline on both simulated and real-world agricultural datasets comprising roughly 20,000 frames of various crops. On the simulated data, the pipeline achieved sub-degree accuracy. On the real data, the proposed pipeline was compared to a high-end radiometric thermal camera and likewise achieved sub-degree accuracy, on par with the simulated results. The proposed pipeline can enable various applications in precision agriculture that require high-quality thermal information from low-cost IR cameras.
[8] arXiv:2502.13986 [pdf, html, other]: Title: Structure-from-Sherds++: Robust Incremental 3D Reassembly of Axially Symmetric Pots from Unordered and Mixed Fragment Collections

Seong Jong Yoo, Sisung Liu, Muhammad Zeeshan Arshad, Jinhyeok Kim, Young Min Kim, Yiannis Aloimonos, Cornelia Fermuller, Kyungdon Joo, Jinwook Kim, Je Hyeong Hong

Comments: 24 pages

Subjects: Image and Video Processing (eess.IV)

Reassembling multiple axially symmetric pots from fragmentary sherds is crucial for cultural heritage preservation, yet it poses significant challenges due to thin and sharp fracture surfaces that generate numerous false positive matches and hinder large-scale puzzle solving. Existing approaches, whether global methods that optimize all potential fragment pairs simultaneously or data-driven models, are prone to local minima and face scalability issues when multiple pots are intermixed. Motivated by Structure-from-Motion (SfM) for 3D reconstruction from multiple images, we propose an efficient reassembly method for axially symmetric pots based on iterative registration of one sherd at a time, called Structure-from-Sherds++ (SfS++). Our method extends beyond simple replication of incremental SfM and leverages multi-graph beam search to explore multiple registration paths. This allows us to effectively filter out indistinguishable false matches and simultaneously reconstruct multiple pots without requiring prior information such as the base or the number of mixed objects. Our approach achieves 87% reassembly accuracy on a dataset of 142 real fragments from 10 different pots, outperforming other methods in handling complex fracture patterns with mixed datasets and achieving state-of-the-art performance. Code and results can be found on our project page: this https URL.
[9] arXiv:2502.13988 [pdf, html, other]: Title: A Lightweight Model for Perceptual Image Compression via Implicit Priors

Hao Wei, Yanhui Zhou, Yiwen Jia, Chenyang Ge, Saeed Anwar, Ajmal Mian

Subjects: Image and Video Processing (eess.IV)

Perceptual image compression has shown strong potential for producing visually appealing results at low bitrates, surpassing classical standards and pixel-wise distortion-oriented neural methods. However, existing methods typically improve compression performance by incorporating explicit semantic priors, such as segmentation maps and textual features, into the encoder or decoder, which increases model complexity by adding parameters and floating-point operations. This limits the model's practicality, as image compression often occurs on resource-limited mobile devices. To alleviate this problem, we propose a lightweight perceptual Image Compression method using Implicit Semantic Priors (ICISP). We first develop an enhanced visual state space block that exploits local and global spatial dependencies to reduce redundancy. Since different frequency information contributes unequally to compression, we develop a frequency decomposition modulation block to adaptively preserve or reduce the low-frequency and high-frequency information. We establish the above blocks as the main modules of the encoder-decoder, and to further improve the perceptual quality of the reconstructed images, we develop a semantic-informed discriminator that uses implicit semantic priors from a pretrained DINOv2 encoder. Experiments on popular benchmarks show that our method achieves competitive compression performance and has significantly fewer network parameters and floating point operations than the existing state-of-the-art.
[10] arXiv:2502.13989 [pdf, html, other]: Title: Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models

Masane Fuchi, Tomohiro Takagi

Comments: 21 pages, 8 figures, 15 tables

Subjects: Image and Video Processing (eess.IV)

Studies have been conducted to prevent specific concepts from being generated from pretrained text-to-image generative models, achieving concept erasure in various ways. However, the performance evaluation of these studies is still largely reliant on visualization, with the superiority of studies often determined by human subjectivity. The metrics of quantitative evaluation also vary, making comprehensive comparisons difficult. We propose EraseEval, an evaluation method that differs from previous evaluation methods in that it involves three fundamental evaluation criteria: (1) how well the prompt containing the target concept is reflected, (2) to what extent the concepts related to the erased concept can reduce the impact of the erased concept, and (3) whether other concepts are preserved. These criteria are evaluated and integrated into a single metric, such that a lower score is given if any of the evaluations are low, leading to a more robust assessment. We experimentally evaluated baseline concept erasure methods, organized their characteristics, and identified challenges with them. Although these are fundamental evaluation criteria, some concept erasure methods failed to achieve high scores, which points toward future research directions for concept erasure methods. Our code is available at this https URL.
[11] arXiv:2502.13990 [pdf, html, other]: Title: Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model

Huiying Shi, Zhihong Tan, Zhihan Zhang, Hongchen Wei, Yaosi Hu, Yingxue Zhang, Zhenzhong Chen

Comments: 16 pages,6 figures

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods of remote sensing imagery (RSI) in supervised real-world scenarios. This makes the evaluation of semantic segmentation quality in such scenarios an issue to be resolved. However, most of the existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such scenarios. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on a vision language model (VLM). This framework leverages a pre-trained RS VLM for semantic understanding and utilizes intermediate features from segmentation methods to extract implicit information about segmentation quality. Specifically, we introduce CLIP-RS, a large-scale pre-trained VLM trained with purified text to reduce textual noise and capture robust semantic information in the RS domain. Feature visualizations confirm that CLIP-RS can effectively differentiate between various levels of segmentation quality. Semantic features and low-level segmentation features are effectively integrated through a semantic-guided approach to enhance evaluation accuracy. To further support the development of RS semantic segmentation quality assessment, we present RS-SQED, a dedicated dataset sampled from four major RS semantic segmentation datasets and annotated with segmentation accuracy derived from the inference results of 8 representative segmentation methods. Experimental results on the established dataset demonstrate that RS-SQA significantly outperforms state-of-the-art quality assessment models. This provides essential support for predicting segmentation accuracy and high-quality semantic segmentation interpretation, offering substantial practical value.
[12] arXiv:2502.13992 [pdf, html, other]: Title: A Synergy Scoring Filter for Unsupervised Anomaly Detection with Noisy Data

Fengjie Wang, Chengming Liu, Pang Haibo, Lei Shi

Subjects: Image and Video Processing (eess.IV)

Noise-inclusive fully unsupervised anomaly detection (FUAD) holds significant practical relevance. Although various methods exist to address this problem, they are limited in both performance and scalability. Our work seeks to overcome these obstacles, enabling broader adaptability of unsupervised anomaly detection (UAD) models to FUAD. To achieve this, we introduce the Synergy Scoring Filter (SSFilter), the first fully unsupervised anomaly detection approach to leverage sample-level filtering. SSFilter facilitates end-to-end robust training and applies filtering to the complete training set post-training, offering a model-agnostic solution for FUAD. Specifically, SSFilter integrates a batch-level anomaly scoring mechanism based on mutual patch comparison and utilizes regression errors in anomalous regions, alongside prediction uncertainty, to estimate sample-level uncertainty scores that calibrate the anomaly scoring mechanism. This design produces a synergistic, robust filtering approach. Furthermore, we propose a realistic anomaly synthesis method and an integrity enhancement strategy to improve model training and mitigate missed noisy samples. Our method establishes state-of-the-art performance on the FUAD benchmark of the recent large-scale industrial anomaly detection dataset, Real-IAD. Additionally, dataset-level filtering enhances the performance of various UAD methods on the FUAD benchmark, and the high scalability of our approach significantly boosts its practical applicability.
[13] arXiv:2502.13998 [pdf, html, other]: Title: A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior

Hengyue Liang, Taihui Li, Ju Sun

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Image watermarks have been considered a promising technique to help detect AI-generated content, which can be used to protect copyright or prevent fake image abuse. In this work, we present a black-box method for removing invisible image watermarks, without needing any dataset of watermarked images or any knowledge of the watermark system. Our approach is simple to implement: given a single watermarked image, we regress it by deep image prior (DIP). We show that from the intermediate steps of DIP one can reliably find an evasion image that can remove invisible watermarks while preserving high image quality. Due to its unique working mechanism and practical effectiveness, we advocate including DIP as a baseline evasion method for benchmarking the robustness of watermarking systems. Finally, by showing the limited ability of DIP and other existing black-box methods in evading training-based visible watermarks, we discuss the positive implications on the practical use of training-based visible watermarks to prevent misinformation abuse.
[14] arXiv:2502.14002 [pdf, other]: Title: A Data-Driven Paradigm-Based Image Denoising and Mosaicking Approach for High-Resolution Acoustic Camera

Xiaoteng Zhou, Yilong Zhang, Katsunori Mizuno, Kenichiro Tsutsumi, Hideki Sugimoto

Comments: Marine acoustic conference

Subjects: Image and Video Processing (eess.IV)

In this work, an approach based on a data-driven paradigm to denoise and mosaic acoustic camera images is proposed. Acoustic cameras, also known as 2D forward-looking sonar, can collect high-resolution acoustic images in dark and turbid water. However, due to the unique sensor imaging mechanism, mainstream vision-based processing methods such as image denoising and mosaicking are still in the early stages. Due to the complex noise interference in acoustic images and the narrow field of view of acoustic cameras, it is difficult to restore the entire detection scene even if enough acoustic images are collected. Relevant research work addressing these issues focuses on the design of handcrafted operators for acoustic image processing based on prior knowledge and sensor models. However, such methods lack robustness due to noise interference and insufficient feature details in acoustic images. This study proposes an acoustic image denoising and mosaicking method based on a data-driven paradigm and conducts experimental testing using collected acoustic camera images. The results demonstrate the effectiveness of the proposal.
[15] arXiv:2502.14009 [pdf, html, other]: Title: Benchmarking Self-Supervised Methods for Accelerated MRI Reconstruction

Andrew Wang, Mike Davies

Comments: Preprint: Work in Progress

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Reconstructing MRI from highly undersampled measurements is crucial for accelerating medical imaging, but is challenging due to the ill-posedness of the inverse problem. While supervised deep learning approaches have shown remarkable success, they rely on fully-sampled ground truth data, which is often impractical or impossible to obtain. Recently, numerous self-supervised methods have emerged that do not require ground truth; however, the lack of systematic comparison and standard experimental setups has hindered research. We present the first comprehensive review of loss functions from all feedforward self-supervised methods and the first benchmark on accelerated MRI reconstruction without ground truth, showing that there is a wide range in performance across methods. In addition, we propose Multi-Operator Equivariant Imaging (MO-EI), a novel framework that builds on the imaging model considered in existing methods to outperform all state-of-the-art methods and approach supervised performance. Finally, to facilitate reproducible benchmarking, we provide implementations of all methods in the DeepInverse library (this https URL) and easy-to-use demo code at this https URL.
[16] arXiv:2502.14014 [pdf, html, other]: Title: SegRet: An Efficient Design for Semantic Segmentation with Retentive Network

Zhiyuan Li, Yi Chang, Yuan Wu

Comments: 12 pages

Subjects: Image and Video Processing (eess.IV)

With the ongoing advancement of autonomous driving technology and intelligent transportation systems, research into semantic segmentation has become increasingly pivotal. Accurate understanding and analysis of real-world scenarios are now essential for these emerging fields. However, traditional semantic segmentation methods often struggle to balance high model accuracy with computational efficiency, particularly in terms of parameter count. To address this challenge, we introduce SegRet, a novel approach that leverages the Retentive Network (RetNet) architecture and integrates a lightweight residual decoder featuring zero-initialization. SegRet exhibits three key characteristics: (1) Lightweight Residual Decoder: We incorporate a zero-initialization layer within the residual network framework, ensuring that the decoder remains computationally efficient while preserving critical information flow; (2) Robust Feature Extraction: Utilizing RetNet as the backbone, our model adeptly extracts hierarchical features from input images, thereby enhancing the depth and breadth of feature representation; (3) Parameter Efficiency: SegRet achieves state-of-the-art performance while significantly reducing the number of parameters, maintaining high accuracy without compromising on computational resources. Empirical evaluations on benchmark datasets such as ADE20K, Cityscapes, and COCO-Stuff10K demonstrate the efficacy of our approach. SegRet delivers impressive results, achieving an mIoU of 52.23% on ADE20K with only 95.81M parameters, 83.36% on Cityscapes, and 46.63% on COCO-Stuff. The code is available at: this https URL.
[17] arXiv:2502.14066 [pdf, html, other]: Title: Experiment Design with Gaussian Process Regression with Applications to Chance-Constrained Control

Sean Anderson, Katie Byl, João P. Hespanha

Comments: 8 pages

Journal-ref: 2023 62nd IEEE Conference on Decision and Control (CDC), Singapore, Singapore, 2023, pp. 3931-3938

Subjects: Systems and Control (eess.SY)

Learning for control in repeated tasks allows for well-designed experiments to gather the most useful data. We consider the setting in which we use a data-driven controller that does not have access to the true system dynamics. Rather, the controller uses inferred dynamics based on the available information. In order to acquire data that is beneficial for this controller, we present an experimental design approach that leverages the current data to improve expected control performance. We focus on the setting in which inference on the unknown dynamics is performed using Gaussian processes. Gaussian processes not only provide uncertainty quantification but also allow us to leverage structures inherent to Gaussian random variables. Through this structure, we design experiments via gradient descent on the expected control performance with respect to the experiment input. In particular, we focus on a chance-constrained minimum expected time control problem. Numerical demonstrations of our approach indicate our experimental design outperforms relevant benchmarks.
[18] arXiv:2502.14090 [pdf, html, other]: Title: MambaLiteSR: Image Super-Resolution with Low-Rank Mamba using Knowledge Distillation

Romina Aalishah, Mozhgan Navardi, Tinoosh Mohsenin

Comments: Special Session: Generative AI on Edge, 26th International Symposium on Quality Electronic Design (ISQED'25)

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Generative Artificial Intelligence (AI) has gained significant attention in recent years, revolutionizing various applications across industries. Among these, advanced vision models for image super-resolution are in high demand, particularly for deployment on edge devices where real-time processing is crucial. However, deploying such models on edge devices is challenging due to limited computing power and memory. In this paper, we present MambaLiteSR, a novel lightweight image Super-Resolution (SR) model that utilizes the architecture of Vision Mamba. It integrates State Space Blocks and a reconstruction module for efficient feature extraction. To optimize efficiency without affecting performance, MambaLiteSR employs knowledge distillation to transfer key insights from a larger Mamba-based teacher model to a smaller student model via hyperparameter tuning. Through mathematical analysis of model parameters and their impact on PSNR, we identify key factors and adjust them accordingly. Our comprehensive evaluation shows that MambaLiteSR outperforms state-of-the-art edge SR methods by reducing power consumption while maintaining competitive PSNR and SSIM scores across benchmark datasets. It also reduces power usage during training via low-rank approximation. Moreover, MambaLiteSR reduces parameters with minimal performance loss, enabling efficient deployment of generative AI models on resource-constrained devices. Deployment on the embedded NVIDIA Jetson Orin Nano confirms the superior balance of MambaLiteSR size, latency, and efficiency. Experiments show that MambaLiteSR achieves performance comparable to both the baseline and other edge models while using 15% fewer parameters. It also improves power consumption by up to 58% compared to state-of-the-art SR edge models, all while maintaining low energy use during training.
[19] arXiv:2502.14111 [pdf, other]: Title: Comprehensive Review on the Control of Heat Pumps for Energy Flexibility in Distribution Networks

Gustavo L. Aschidamini, Mina Pavlovic, Bradley A. Reinholz, Malcolm S. Metcalfe, Taco Niet, Mariana Resener

Subjects: Systems and Control (eess.SY)

Decarbonization plans promote the transition to heat pumps (HPs), creating new opportunities for their energy flexibility in demand response programs, solar photovoltaic integration and optimization of distribution networks. This paper reviews scheduling-based and real-time optimization methods for controlling HPs with a focus on energy flexibility in distribution networks. Scheduling-based methods fall into two categories: rule-based controllers (RBCs), which rely on predefined control rules without explicitly seeking optimal solutions, and optimization models, which are designed to determine the optimal scheduling of operations. Real-time optimization is achieved through model predictive control (MPC), which relies on a predictive model to optimize decisions over a time horizon, and reinforcement learning (RL), which takes a model-free approach by learning optimal strategies through direct interaction with the environment. The paper also examines studies on the impact of HPs on distribution networks, particularly those leveraging energy flexibility strategies. Key takeaways suggest the need to validate control strategies for extreme cold-weather regions that require backup heaters, as well as develop approaches designed for demand charge schemes that integrate HPs with other controllable loads. From a grid impact assessment perspective, studies have focused primarily on RBCs for providing energy flexibility through HP operation, without addressing more advanced methods such as real-time optimization using MPC or RL-based algorithms. Incorporating these advanced control strategies could help identify key limitations, including the impact of varying user participation levels and the cost-benefit trade-offs associated with their implementation.
[20] arXiv:2502.14150 [pdf, html, other]: Title: Risk-Sensitive Security-Constrained Economic Dispatch: Pricing and Algorithm Design

Avinash N. Madavan, Nathan Dahlin, Subhonmesh Bose, Lang Tong

Subjects: Systems and Control (eess.SY); Theoretical Economics (econ.TH)

We propose a risk-sensitive security-constrained economic dispatch (R-SCED) formulation capturing the tradeoff between dispatch cost and resilience against potential line failures, where risk is modeled via the conditional value at risk (CVaR). In the context of our formulation, we analyze revenue adequacy and side payments of two pricing models, one based on nominal generation costs, and another based on total marginal cost including contingencies. In particular, we prove that the system operator's (SO) merchandising surplus (MS) and total revenue are nonnegative under the latter, while under the former the same does not hold in general. We demonstrate that the proposed R-SCED formulation is amenable to decomposition and describe a Benders' decomposition algorithm to solve it. In numerical examples, we illustrate the differences in MS and total revenue under the considered pricing schemes, and the computational efficiency of our decomposition approach.
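As background on the risk measure named above, the sketch below computes an empirical conditional value at risk (CVaR) from equally weighted loss samples using the Rockafellar-Uryasev form, CVaR_alpha(X) = min over t of t + E[(X - t)^+] / (1 - alpha). It illustrates only the risk measure, not the R-SCED formulation or its pricing models; the function name and the sample losses are hypothetical.

```python
import math


def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha (0 <= alpha < 1) of equally weighted losses.

    Evaluates t + mean(max(x - t, 0)) / (1 - alpha) at t equal to an
    empirical alpha-quantile, which minimises that expression.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses)
    n = len(ordered)
    k = max(1, math.ceil(alpha * n))      # 1-based index of the alpha-quantile
    t = ordered[k - 1]                    # empirical value at risk
    excess = sum(max(x - t, 0.0) for x in ordered) / n
    return t + excess / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical dispatch costs under ten sampled contingency scenarios.
    costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    print(empirical_cvar(costs, 0.8))  # average of the worst 20% of scenarios
    print(empirical_cvar(costs, 0.0))  # alpha = 0 reduces to the plain mean
```

Raising `alpha` weights the worst outcomes more heavily, which is how a CVaR term lets a formulation trade expected cost against resilience to rare, expensive contingencies.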
[21] arXiv:2502.14193 [pdf, html, other]: Title: Near-Field Motion Parameter Estimation: A Variational Bayesian Approach

Chunwei Meng, Zhaolin Wang, Zhiqing Wei, Yuanwei Liu, Zhiyong Feng

Subjects: Signal Processing (eess.SP)

A near-field motion parameter estimation method is proposed. In contrast to far-field sensing systems, the near-field sensing system leverages spherical-wave characteristics to enable full-vector location and velocity estimation.
Despite promising advantages, the near-field sensing system faces a significant challenge, where location and velocity parameters are intricately coupled within the signal.
To address this challenge, a novel subarray-based variational message passing (VMP) method is proposed for near-field joint location and velocity estimation. First, a factor graph representation is introduced, employing subarray-level directional and Doppler parameters as intermediate variables to decouple the complex location-velocity dependencies.
Based on this, the variational Bayesian inference is employed to obtain closed-form posterior distributions of subarray-level parameters.
Subsequently, the message passing technique is employed, enabling tractable computation of location and velocity marginal distributions. Two implementation strategies are proposed: 1) system-level fusion, which aggregates all subarray posteriors for centralized estimation, and 2) subarray-level fusion, in which locally processed estimates from subarrays are fused through the Gaussian product rule.
Cramér-Rao bounds for location and velocity estimation are derived, providing theoretical performance limits.
Numerical results demonstrate that the proposed VMP method outperforms existing approaches while achieving an order of magnitude lower complexity.
Specifically, the proposed VMP method achieves centimeter-level location accuracy and sub-m/s velocity accuracy.
It also demonstrates robust performance for high-mobility targets, making the proposed VMP method suitable for real-time near-field sensing and communication applications.
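The Gaussian product rule mentioned for subarray-level fusion can be sketched as follows, assuming each subarray reports an independent Gaussian estimate (mean and covariance) of the same parameter vector. This is a generic illustration, not the authors' VMP implementation; `fuse_gaussians` and the example numbers are hypothetical.

```python
import numpy as np


def fuse_gaussians(means, covs):
    """Fuse independent Gaussian estimates of the same vector.

    The product of Gaussian densities is proportional to a Gaussian whose
    precision (inverse covariance) is the sum of the individual precisions
    and whose mean is the precision-weighted average of the individual means.
    """
    precisions = [np.linalg.inv(np.asarray(c, dtype=float)) for c in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    weighted_sum = sum(p @ np.asarray(m, dtype=float)
                       for p, m in zip(precisions, means))
    return fused_cov @ weighted_sum, fused_cov


# Two hypothetical subarray estimates of a 2D location, in metres.
mean, cov = fuse_gaussians(
    means=[[0.0, 0.0], [2.0, 2.0]],
    covs=[np.eye(2), np.eye(2)],
)
```

With two equally confident estimates, the fused mean is their midpoint and the fused covariance is halved, which shows why fusing more subarrays tightens the estimate.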
[22] arXiv:2502.14203 [pdf, html, other]: Title: AFDM-Enabled Integrated Sensing and Communication: Theoretical Framework and Pilot Design

Fan Zhang, Zhaocheng Wang, Tianqi Mao, Tianyu Jiao, Yinxiao Zhuo, Miaowen Wen, Wei Xiang, Sheng Chen, George K. Karagiannidis

Subjects: Signal Processing (eess.SP)

Integrated sensing and communication (ISAC) has been envisioned as a representative usage scenario of sixth-generation (6G) networks. However, the unprecedented characteristics of 6G, especially the doubly dispersive channel, make it challenging for classical ISAC waveforms to guarantee a desirable performance level. The recently proposed affine frequency division multiplexing (AFDM) can attain full diversity even under doubly dispersive effects, thus becoming a competitive candidate for next-generation ISAC waveforms. Relevant investigations are still at an early stage and involve only straightforward designs lacking explicit theoretical analysis. This paper provides an in-depth investigation of AFDM waveform design for ISAC applications. Specifically, the closed-form Cramér-Rao bounds of target detection for AFDM are derived, followed by a demonstration of its merits over existing counterparts. Furthermore, we formulate the ambiguity function of the pilot-assisted AFDM waveform for the first time, revealing conditions for stable sensing performance. To further enhance both the communication and sensing performance of the AFDM waveform, we propose a novel pilot design by exploiting the characteristics of AFDM signals. The proposed design is analytically validated to be capable of optimizing the ambiguity function property and channel estimation accuracy simultaneously, as well as overcoming the sensing and channel estimation range limitation originating from the pilot spacing. Numerical results have verified the superiority of the proposed pilot design in terms of dual-functional performance.
[23] arXiv:2502.14224 [pdf, html, other]: Title: Adaptive Convolution for CNN-based Speech Enhancement Models

Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism leverages both current and historical information to assign adaptive weights to each candidate kernel, guiding their aggregation. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. Experimental results on various CNN-based models demonstrate that adaptive convolution significantly improves the performance with negligible increases in computational complexity, especially for lightweight models. Furthermore, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.
[24] arXiv:2502.14242 [pdf, html, other]: Title: On the Contraction Analysis of Nonlinear System with Multiple Equilibrium Points

Riddhi Mohan Bora, Bhabani Shankar Dey, Indra Narayan Kar

Comments: 14 pages, 10 figures

Subjects: Systems and Control (eess.SY)

In this work, we leverage the 2-contraction theory, which extends the capabilities of classical contraction theory, to develop a global stability framework. Coupled with powerful geometric tools such as the Poincaré index theory, the 2-contraction theory enables us to analyze the stability of planar nonlinear systems without relying on local equilibrium analysis. By utilizing index theory and 2-contraction results, we efficiently characterize the nature of equilibrium points and delineate regions in 2-dimensional state space where periodic solutions, closed orbits, or stable dynamics may exist. A key focus of this work is the identification of regions in the state space where periodic solutions may occur, as well as 2-contraction regions that guarantee the nonexistence of such solutions. Additionally, we address a critical problem in engineering: the determination of the basin of attraction (BOA) for stable equilibrium points. For systems with multiple equilibria, identifying candidate BOAs becomes highly nontrivial. We propose a novel methodology leveraging the 2-contraction theory to approximate a common BOA for a class of nonlinear systems with multiple stable equilibria. Theoretical findings are substantiated through benchmark examples and numerical simulations, demonstrating the practical utility of the proposed approach. Furthermore, we extend our framework to analyze networked systems, showcasing its efficacy in an opinion dynamics problem.
[25] arXiv:2502.14250 [pdf, html, other]: Title: A Low-Complexity Placement Design of Pinching-Antenna Systems

Ximing Xie, Fang Fang, Zhiguo Ding, Xianbin Wang

Subjects: Signal Processing (eess.SP)

Pinching-antenna systems have recently been proposed as a new candidate for flexible-antenna systems, not only inheriting the reconfiguration capability but also offering a unique feature: establishing line-of-sight links to mitigate large-scale path loss. However, sophisticated optimization of the placement of pinching antennas has very high complexity, which is challenging for practical implementation. This paper proposes a low-complexity placement design, providing the closed-form expression of the placement of pinching antennas, to maximize the sum rate of multiple downlink users. Both orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) are investigated when the pinching-antenna system is equipped with only a single antenna, whereas only the OMA case is studied when the system is equipped with multiple antennas. Simulation results indicate that pinching-antenna systems can outperform conventional fixed-antenna systems and are more suitable for large service areas.
[26] arXiv:2502.14260 [pdf, html, other]: Title: EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Over the past decade, generative models have achieved significant success in enhancing fundus images. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at this https URL
[27] arXiv:2502.14290 [pdf, html, other]: Title: Road to 6G Digital Twin Networks: Multi-Task Adaptive Ray-Tracing as a Key Enabler

Li Yu, Yinghe Miao, Jianhua Zhang, Shaoyi Liu, Yuxiang Zhang, Guangyi Liu

Subjects: Signal Processing (eess.SP)

As a virtual, synchronized replica of the physical network, the digital twin network (DTN) is envisioned to sense, predict, optimize and manage the intricate wireless technologies and architectures brought by 6G. Given that the properties of the wireless channel fundamentally determine system performance from the physical layer to the network layer, it is a critical prerequisite that the invisible wireless channel in the physical world be accurately and efficiently twinned. To support 6G DTN, this paper first proposes a multi-task adaptive ray-tracing platform for 6G (MART-6G) to generate the channel with 6G features, specially designed for DTN online real-time and offline high-accuracy tasks. Specifically, the MART-6G platform comprises three core modules: an environment twin module to enhance the sensing ability in dynamic environments; an RT engine module to incorporate the main algorithms for propagation, acceleration, calibration, and 6G-specific new features; and a channel twin module to generate channel multipath, parameters, statistical distributions, and corresponding three-dimensional (3D) environment information. Moreover, MART-6G is tailored for DTN tasks through the adaptive selection of proper sensing methods, antenna and material libraries, propagation models, calibration strategies, etc. To validate MART-6G performance, we present two real-world case studies to demonstrate the accuracy, efficiency and generality in both offline coverage prediction and online real-time channel prediction. Finally, some open issues and challenges are outlined to further support future diverse DTN tasks.
[28] arXiv:2502.14325 [pdf, html, other]: Title: Joint Waveform and Beamforming Design in RIS-ISAC Systems: A Model-Driven Learning Approach

Peng Jiang, Ming Li, Rang Liu, Wei Wang, Qian Liu

Comments: Accepted by IEEE Transactions on Communications

Subjects: Signal Processing (eess.SP)

Integrated Sensing and Communication (ISAC) has emerged as a key enabler for future wireless systems. The recently developed symbol-level precoding (SLP) technique holds significant potential for ISAC waveform design, as it leverages both temporal and spatial degrees of freedom (DoFs) to enhance multi-user communication and radar sensing capabilities. Concurrently, reconfigurable intelligent surfaces (RIS) offer additional controllable propagation paths, further amplifying interest in their application. However, previous studies have encountered substantial computational challenges due to the complexity of jointly designing SLP-based waveforms and RIS passive beamforming. In this paper, we propose a novel model-driven learning approach that jointly optimizes waveform and beamforming by unfolding the iterative alternating direction method of multipliers (ADMM) algorithm. Two joint design algorithms are developed for radar target detection and direction-of-arrival (DoA) estimation tasks in a cluttered RIS-ISAC system. While ensuring the communication quality-of-service (QoS) requirements, our objectives are: 1) to maximize the radar output signal-to-interference-plus-noise ratio (SINR) for target detection, and 2) to minimize the Cramér-Rao bound (CRB) for DoA estimation. Simulation results verify that our proposed model-driven learning algorithms achieve satisfactory communication and sensing performance, while also offering a substantial reduction in computational complexity, as reflected by the average execution time.
[29] arXiv:2502.14363 [pdf, html, other]: Title: Topology-Aware Wavelet Mamba for Airway Structure Segmentation in Postoperative Recurrent Nasopharyngeal Carcinoma CT Scans

Haishan Huang, Pengchen Liang, Naier Lin, Luxi Wang, Bin Pu, Jianguo Chen, Qing Chang, Xia Shen, Guo Ran

Comments: 20 pages, 11 figures, 6 tables

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Nasopharyngeal carcinoma (NPC) patients often undergo radiotherapy and chemotherapy, which can lead to postoperative complications such as limited mouth opening and joint stiffness, particularly in recurrent cases that require re-surgery. These complications can affect airway function, making accurate postoperative airway risk assessment essential for managing patient care. Accurate segmentation of airway-related structures in postoperative CT scans is crucial for assessing these risks. This study introduces TopoWMamba (Topology-aware Wavelet Mamba), a novel segmentation model specifically designed to address the challenges of postoperative airway risk evaluation in recurrent NPC patients. TopoWMamba combines wavelet-based multi-scale feature extraction, state-space sequence modeling, and topology-aware modules to segment airway-related structures in CT scans robustly. By leveraging the Wavelet-based Mamba Block (WMB) for hierarchical frequency decomposition and the Snake Conv VSS (SCVSS) module to preserve anatomical continuity, TopoWMamba effectively captures both fine-grained boundaries and global structural context, crucial for accurate segmentation in complex postoperative scenarios. Through extensive testing on the NPCSegCT dataset, TopoWMamba achieves an average Dice score of 88.02%, outperforming existing models such as UNet, Attention UNet, and SwinUNet. Additionally, TopoWMamba is tested on the SegRap 2023 Challenge dataset, where it shows a significant improvement in trachea segmentation with a Dice score of 95.26%. The proposed model provides a strong foundation for automated segmentation, enabling more accurate postoperative airway risk evaluation.
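For readers unfamiliar with the metric reported in this abstract, a minimal sketch of the Dice score for binary segmentation masks follows. The function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is one common convention rather than something stated in the abstract.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


# Toy 1D masks: the prediction covers the single true voxel plus one extra.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1 means the masks coincide and 0 means they do not overlap; percentages such as those quoted in the abstract are this value multiplied by 100.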
[30] arXiv:2502.14387 [pdf, html, other]: Title: MPPI-DBaS: Safe Trajectory Optimization with Adaptive Exploration

Fanxin Wang, Yikun Cheng, Chuyuan Tao

Comments: CCC 2025

Subjects: Systems and Control (eess.SY)

In trajectory optimization, Model Predictive Path Integral (MPPI) control is a sampling-based Model Predictive Control (MPC) framework that generates optimal inputs by efficiently simulating numerous trajectories. In practice, however, MPPI often struggles to guarantee safety and to balance efficient sampling in open spaces with the need for more extensive exploration under tight constraints. To address this challenge, we incorporate discrete barrier states (DBaS) into MPPI and propose a novel MPPI-DBaS algorithm that ensures system safety and enables adaptive exploration across diverse scenarios. We evaluate our method in simulation experiments where the vehicle navigates through closely placed obstacles. The results demonstrate that the proposed algorithm significantly outperforms standard MPPI, achieving a higher success rate and lower tracking errors.
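The sampling-based update at the core of standard MPPI can be sketched as below: sampled control perturbations are averaged with weights that decay exponentially in their rollout cost. This covers only the generic MPPI weighting step, not the DBaS augmentation or the adaptive exploration proposed in the paper; `mppi_update`, `temperature`, and the example numbers are hypothetical.

```python
import numpy as np


def mppi_update(nominal_controls, perturbations, costs, temperature=1.0):
    """One MPPI control update.

    nominal_controls: shape (T,), current control sequence over the horizon.
    perturbations:    shape (K, T), sampled noise added to the controls.
    costs:            shape (K,), total cost of each perturbed rollout.
    Returns the updated control sequence and the sample weights.
    """
    nominal_controls = np.asarray(nominal_controls, dtype=float)
    perturbations = np.asarray(perturbations, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Subtracting the minimum cost leaves the normalized weights unchanged
    # and avoids underflow in the exponential.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return nominal_controls + weights @ perturbations, weights


# Two hypothetical rollouts over a three-step horizon with equal cost.
controls, weights = mppi_update(
    nominal_controls=[0.0, 0.0, 0.0],
    perturbations=[[1.0, 1.0, 1.0], [-1.0, 3.0, 1.0]],
    costs=[5.0, 5.0],
)
```

A smaller `temperature` concentrates the weights on the cheapest rollouts, while a larger one moves the update toward a plain average of the samples.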
[31] arXiv:2502.14401 [pdf, html, other]: Title: MedFuncta: Modality-Agnostic Representations Based on Efficient Neural Fields

Paul Friedrich, Florentin Bieder, Philippe C. Cattin

Comments: Code and Dataset: this https URL

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Recent research in medical image analysis with deep learning almost exclusively focuses on grid- or voxel-based data representations. We challenge this common choice by introducing MedFuncta, a modality-agnostic continuous data representation based on neural fields. We demonstrate how to scale neural fields from single instances to large datasets by exploiting redundancy in medical signals and by applying an efficient meta-learning approach with a context reduction scheme. We further address the spectral bias in commonly used SIREN activations, by introducing an $\omega_0$-schedule, improving reconstruction quality and convergence speed. We validate our proposed approach on a large variety of medical signals of different dimensions and modalities (1D: ECG; 2D: Chest X-ray, Retinal OCT, Fundus Camera, Dermatoscope, Colon Histopathology, Cell Microscopy; 3D: Brain MRI, Lung CT) and successfully demonstrate that we can solve relevant downstream tasks on these representations. We additionally release a large-scale dataset of > 550k annotated neural fields to promote research in this direction.
[32] arXiv:2502.14404 [pdf, html, other]: Title: Electromagnetic Degrees of Freedom for Continuous-Aperture Array (CAPA) Systems

Chongjun Ouyang, Boqun Zhao, Xingqi Zhang, Yuanwei Liu

Comments: 4 pages

Subjects: Signal Processing (eess.SP)

The spatial degrees of freedom (DoFs) of a continuous-aperture array (CAPA)-based continuous electromagnetic (EM) channel are analyzed. To this end, a simplified spatial model is derived using the Fresnel approximation. Leveraging this model and Landau's theorem, a closed-form expression for the spatial DoFs is derived. It is demonstrated that the number of DoFs is proportional to the transmit and receive aperture sizes while being inversely proportional to the propagation distance. Numerical results are provided to validate the accuracy of the derived expressions.
[33] arXiv:2502.14416 [pdf, html, other]: Title: Reliable Explainability of Deep Learning Spatial-Spectral Classifiers for Improved Semantic Segmentation in Autonomous Driving

Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Integrating hyperspectral imagery (HSI) with deep neural networks (DNNs) can strengthen the accuracy of intelligent vision systems by combining spectral and spatial information, which is useful for tasks like semantic segmentation in autonomous driving. To advance research in such safety-critical systems, determining the precise contribution of spectral information to complex DNNs' output is needed. To address this, several saliency methods, such as class activation maps (CAM), have been proposed primarily for image classification. However, recent studies have raised concerns regarding their reliability. In this paper, we address their limitations and propose an alternative approach by leveraging the data provided by activations and weights from relevant DNN layers to better capture the relationship between input features and predictions. The study aims to assess the superior performance of HSI compared to 3-channel and single-channel DNNs. We also address the influence of spectral signature normalization for enhancing DNN robustness in real-world driving conditions.
[34] arXiv:2502.14418 [pdf, html, other]: Title: Role of the Pretraining and the Adaptation data sizes for low-resource real-time MRI video segmentation

Masoud Thajudeen Tholan, Vinayaka Hegde, Chetan Sharma, Prasanta Kumar Ghosh

Comments: Accepted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Real-time Magnetic Resonance Imaging (rtMRI) is frequently used in speech production studies as it provides a complete view of the vocal tract during articulation. This study investigates the effectiveness of rtMRI in analyzing vocal tract movements by employing the SegNet and UNet models for Air-Tissue Boundary (ATB) segmentation tasks. We pretrained a few base models using increasing numbers of subjects and videos to assess performance on two datasets. The first consists of unseen subjects with unseen videos from the same data source, on which the models performed 0.33% and 0.91% (Pixel-wise Classification Accuracy (PCA) and Dice Coefficient, respectively) better than the matched condition. The second comprises unseen videos from a new data source, on which we obtained 99.63% and 98.09% (PCA and Dice Coefficient, respectively) of the matched-condition performance. Here, matched-condition performance refers to the performance of a model trained only on the test subjects, which was set as a benchmark for the other models. Our findings highlight the significance of fine-tuning and adapting models with limited data. Notably, we demonstrated that effective model adaptation can be achieved with as few as 15 rtMRI frames from any new dataset.
[35] arXiv:2502.14422 [pdf, html, other]: Title: Towards Routing and Edge Computing in Satellite-Terrestrial Networks: A Column Generation Approach

Yuan Liao, Kan Cheng, Hao Jin

Subjects: Systems and Control (eess.SY)

Edge computing that enables satellites to process raw data locally is expected to bring further timeliness and flexibility to satellite-terrestrial networks (STNs). In this letter, we propose a three-layer edge computing protocol, where raw data collected by satellites can be processed locally, or transmitted to other satellites or the ground station via multi-hop routing for further processing. The overall computing capacity of the proposed framework is maximized by determining the offloading strategy and route formation, subject to channel capacity and hop constraints. Given that the problem scale grows exponentially with the number of satellites and maximum-allowed hops, the column generation approach is employed to obtain the global optimal solution by activating only a subset of variables. Numerical investigations reveal that the proposed three-layer computing protocol improves the computing capacity by 40% compared to the single-layer configuration.
[36] arXiv:2502.14434 [pdf, html, other]: Title: Evaluating Multi-Sensor Placement and Neural Network Architectures for Physical Activity Level Classification

Bo Cui, Xiaowen Song, Tabak Monique, Bert-Jan van Beijnum, Ying Wang

Subjects: Signal Processing (eess.SP)

Accurate physical activity level (PAL) classification could be beneficial for osteoarthritis (OA) management. This study examines the impact of sensor placement and deep learning models on PAL classification using the Metabolic Equivalent of Task values. The results show that the addition of an ankle sensor (WA) significantly improves the classification of intensity activities compared to the wrist-only configuration (53% to 86.2%). The CNN-LSTM model achieves the highest accuracy (95.09%). Statistical analysis confirms that multi-sensor setups outperform single-sensor configurations (p < 0.05). The WA configuration offers a balance between usability and accuracy, making it a cost-effective solution for PAL monitoring, particularly in OA management.
[37] arXiv:2502.14453 [pdf, html, other]: Title: Maximizing Spectrum Efficiency of Data-Carrying Reference Signals via Bayesian Optimization

Taiki Kato, Hiroki Iimori, Chandan Pradhan, Szabolcs Malomsoky, Naoki Ishikawa

Comments: 11 pages, 9 figures

Subjects: Signal Processing (eess.SP)

Data-carrying reference signals are a type of reference signal (RS) constructed on the Grassmann manifold, which allows for simultaneous data transmission and channel estimation to achieve boosted spectral efficiency at high signal-to-noise ratios (SNRs). However, they do not improve spectral efficiency at low to middle SNRs compared with conventional RSs. To address this problem, we propose a numerical optimization-based Grassmann constellation design on the Grassmann manifold that accounts for both data transmission and channel estimation. In our numerical optimization, we derive an upper bound on the normalized mean squared error (NMSE) of estimated channel matrices and a lower bound on the noncoherent average mutual information (AMI), and these bounds are optimized simultaneously by using a Bayesian optimization technique. The proposed objective function outperforms conventional design metrics in obtaining Pareto-optimal constellations for NMSE and AMI. The constellation obtained by our method achieves an NMSE comparable to conventional non-data-carrying RSs while enabling data transmission, resulting in superior AMI performance and improved spectral efficiency even at middle SNRs.
[38] arXiv:2502.14534 [pdf, other]: Title: Poststroke rehabilitative mechanisms in individualized fatigue level-controlled treadmill training -- a Rat Model Study

Yuchen Xu (1,2), Yulong Peng (2), Yuanfa Yao (3), Xiaoman Fan (2), Minmin Wang (2,4), Feng Gao (5), Mohamad Sawan (1), Shaomin Zhang (2), Xiaoling Hu (6) ((1) CenBRAIN Neurotech Center of Excellence, School of Engineering, Westlake University, China (2) Key Laboratory of Biomedical Engineering of Ministry of Education, Qiushi Academy for Advanced Studies, Zhejiang University, China (3) The Affiliated Huizhou Hospital, Guangzhou Medical University, China (4) Westlake Institute for Optoelectronics, Westlake University, China (5) Department of Neurology, The Second Affiliated Hospital, School of Medicine, Zhejiang University, China (6) Department of Biomedical Engineering, The Hong Kong Polytechnic University, China)

Subjects: Signal Processing (eess.SP)

Individualized training improved post-stroke motor function rehabilitation efficiency. However, the mechanisms by which individualized training facilitates recovery are not clear. This study explored the cortical and corticomuscular rehabilitative effects in post-stroke motor function recovery during individualized training. Sprague-Dawley rats with intracerebral hemorrhage (ICH) were randomly distributed into two groups, forced training (FOR-T, n=13) and individualized fatigue-controlled training (FAT-C, n=13), and received the respective training from day 2 to day 14 post-stroke. The FAT-C group exhibited superior motor function recovery and less central fatigue compared to the FOR-T group. EEG PSD slope analysis demonstrated a better inter-hemispheric balance in the FAT-C group compared to the FOR-T group. The dCMC analysis indicated that training-induced fatigue led to a short-term down-regulation of descending corticomuscular coherence (dCMC) and an up-regulation of ascending dCMC. In the long term, excessive fatigue hindered the recovery of descending control in the affected hemisphere. The individualized strategy of peripheral fatigue-controlled training achieved better motor function recovery, which could be attributed to the mitigation of central fatigue, optimization of inter-hemispheric balance and enhancement of descending control in the affected hemisphere.
[39] arXiv:2502.14584 [pdf, html, other]: Title: Vision Foundation Models in Medical Image Analysis: Advances and Challenges

Pengchen Liang, Bin Pu, Haishan Huang, Yiwei Li, Hualiang Wang, Weibo Ma, Qing Chang

Comments: 17 pages, 1 figure

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The rapid development of Vision Foundation Models (VFMs), particularly Vision Transformers (ViT) and Segment Anything Model (SAM), has sparked significant advances in the field of medical image analysis. These models have demonstrated exceptional capabilities in capturing long-range dependencies and achieving high generalization in segmentation tasks. However, adapting these large models to medical image analysis presents several challenges, including domain differences between medical and natural images, the need for efficient model adaptation strategies, and the limitations of small-scale medical datasets. This paper reviews the state-of-the-art research on the adaptation of VFMs to medical image segmentation, focusing on the challenges of domain adaptation, model compression, and federated learning. We discuss the latest developments in adapter-based improvements, knowledge distillation techniques, and multi-scale contextual feature modeling, and propose future directions to overcome these bottlenecks. Our analysis highlights the potential of VFMs, along with emerging methodologies such as federated learning and model compression, to revolutionize medical image analysis and enhance clinical applications. The goal of this work is to provide a comprehensive overview of current approaches and suggest key areas for future research that can drive the next wave of innovation in medical image segmentation.
[40] arXiv:2502.14585 [pdf, html, other]: Title: A Stackelberg Game Approach for Signal Temporal Logic Control Synthesis with Uncontrollable Agents

Bohan Cui, Xinyi Yu, Alessandro Giua, Xiang Yin

Comments: 8 pages

Subjects: Systems and Control (eess.SY)

In this paper, we investigate the control synthesis problem for Signal Temporal Logic (STL) specifications in the presence of uncontrollable agents. Existing works mainly address this problem in a robust control setting by assuming the uncontrollable agents are adversarial and accounting for the worst-case scenario. While this approach ensures safety, it can be overly conservative in scenarios where uncontrollable agents have their own objectives that are not entirely opposed to the system's goals. Motivated by this limitation, we propose a new framework for STL control synthesis within the Stackelberg game setting. Specifically, we assume that the system controller, acting as the leader, first commits to a plan, after which the uncontrollable agents, acting as followers, take a best response based on the committed plan and their own objectives. Our goal is to synthesize a control sequence for the leader such that, for any rational followers producing a best response, the leader's STL task is guaranteed to be satisfied. We present an effective solution to this problem by transforming it into a single-stage optimization problem and leveraging counter-example guided synthesis techniques. We demonstrate that the proposed approach is sound and identify conditions under which it is optimal. Simulation results are also provided to illustrate the effectiveness of the proposed framework.
[41] arXiv:2502.14591 [pdf, html, other]: Title: Data-driven Control of T-Product-based Dynamical Systems

Ziqin He, Yidan Mei, Shenghan Mei, Xin Mao, Anqi Dong, Ren Wang, Can Chen

Comments: 8 pages, 2 tables

Subjects: Systems and Control (eess.SY)

Data-driven control is a powerful tool that enables the design and implementation of control strategies directly from data without explicitly identifying the underlying system dynamics. While various data-driven control techniques, such as stabilization, linear quadratic regulation, and model predictive control, have been extensively developed, these methods are not inherently suited for multi-linear dynamical systems, where the states are represented as higher-order tensors. In this article, we propose a novel framework for data-driven control of T-product-based dynamical systems (TPDSs), where the system evolution is governed by the T-product between a third-order dynamic tensor and a third-order state tensor. In particular, we offer necessary and sufficient conditions to determine the data informativity for system identification, stabilization by state feedback, and T-product quadratic regulation of TPDSs with detailed complexity analyses. Finally, we validate our framework through numerical examples.
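The T-product mentioned above can be illustrated with a minimal sketch. It assumes the standard FFT-based definition of the T-product for third-order tensors (the abstract does not spell out the definition), and the names `t_product`, `t_product_direct`, `identity_tensor`, `A`, and `X0` are hypothetical. One step of a system of the form X_{t+1} = A * X_t would then look like:

```python
import numpy as np

def t_product(A, B):
    """T-product of A (n1 x n2 x n3) and B (n2 x m x n3), computed via FFT along mode 3."""
    assert A.shape[1] == B.shape[0] and A.shape[2] == B.shape[2]
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    # Multiply matching frontal slices in the Fourier domain.
    C_hat = np.einsum("ijk,jlk->ilk", A_hat, B_hat)
    return np.real(np.fft.ifft(C_hat, axis=2))

def t_product_direct(A, B):
    """Same product written as a circular convolution of frontal slices (for checking)."""
    n3 = A.shape[2]
    C = np.zeros((A.shape[0], B.shape[1], n3))
    for k in range(n3):
        for j in range(n3):
            C[:, :, k] += A[:, :, j] @ B[:, :, (k - j) % n3]
    return C

def identity_tensor(n, n3):
    """Identity element of the T-product: first frontal slice is I, the rest are zero."""
    I = np.zeros((n, n, n3))
    I[:, :, 0] = np.eye(n)
    return I

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2, 3))   # hypothetical dynamic tensor
X0 = rng.standard_normal((2, 1, 3))  # hypothetical state tensor
X1 = t_product(A, X0)                # one step of X_{t+1} = A * X_t
```

The FFT route turns the T-product into independent matrix products per frequency slice, which is why it is usually preferred over the direct circular convolution.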
[42] arXiv:2502.14630 [pdf, html, other]: Title: Understanding long-term energy use in off-grid solar home systems in sub-Saharan Africa

Rebecca Perriment, Vasco Mergulhao, Volkan Kumtepeli, Priti Parikh, Malcolm McCulloch, David Howey

Subjects: Systems and Control (eess.SY)

Solar home systems provide low-cost electricity access for rural off-grid communities. As access to them increases, more long-term data becomes available on how these systems are used throughout their lifetime. This work analyses a dataset of 1,000 systems across sub-Saharan Africa. Dynamic time warping clustering was applied to the load demand data from the systems, identifying five distinct archetypal daily load profiles and their occurrence across the dataset. Temporal analysis reveals a general decline in daily energy consumption over time, with 57% of households reducing their usage after the first year of ownership. On average, there is a 33% decrease in daily consumption by the end of the second year compared to the peak demand, which occurs on the 96th day. Combining the load demand analysis with payment data shows that this decrease in energy consumption is observed even in households that are not experiencing economic hardship, indicating there are reasons beyond financial constraints for decreasing energy use once energy access is obtained.
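As a rough illustration of why dynamic time warping suits daily load profiles, the sketch below implements the textbook DTW distance with an absolute-difference cost. It is not the authors' pipeline; the function name and the toy profiles (`evening_peak`, `shifted_peak`, `flat_load`) are made up for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance with |a_i - b_j| as the local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Hypothetical toy load profiles (arbitrary units, six time slots).
evening_peak = [0, 0, 1, 3, 1, 0]
shifted_peak = [0, 0, 0, 1, 3, 1]   # same shape, one slot later
flat_load = [1, 1, 1, 1, 1, 1]

d_shifted = dtw_distance(evening_peak, shifted_peak)
d_flat = dtw_distance(evening_peak, flat_load)
pointwise = float(np.sum(np.abs(np.array(evening_peak) - np.array(shifted_peak))))
```

Under DTW the time-shifted peak stays close to the original profile, whereas a pointwise comparison penalizes the shift heavily; a clustering step (not shown) would operate on such pairwise distances.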
[43] arXiv:2502.14707 [pdf, html, other]: Title: TRUSWorthy: Toward Clinically Applicable Deep Learning for Confident Detection of Prostate Cancer in Micro-Ultrasound

Mohamed Harmanani, Paul F.R. Wilson, Minh Nguyen Nhat To, Mahdi Gilany, Amoon Jamzad, Fahimeh Fooladgar, Brian Wodlinger, Purang Abolmaesumi, Parvin Mousavi

Comments: accepted to IJCARS. This preprint has not undergone post-submission improvements or corrections. To access the Version of Record of this article, see the journal reference below

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)

While deep learning methods have shown great promise in improving the effectiveness of prostate cancer (PCa) diagnosis by detecting suspicious lesions from trans-rectal ultrasound (TRUS), they must overcome multiple simultaneous challenges. There is high heterogeneity in tissue appearance, significant class imbalance in favor of benign examples, and scarcity in the number and quality of ground truth annotations available to train models. Failure to address even a single one of these problems can result in unacceptable clinical outcomes. We propose TRUSWorthy, a carefully designed, tuned, and integrated system for reliable PCa detection. Our pipeline integrates self-supervised learning, multiple-instance learning aggregation using transformers, random-undersampled boosting and ensembling: these address label scarcity, weak labels, class imbalance, and overconfidence, respectively. We train and rigorously evaluate our method using a large, multi-center dataset of micro-ultrasound data. Our method outperforms previous state-of-the-art deep learning methods in terms of accuracy and uncertainty calibration, with AUROC and balanced accuracy scores of 79.9% and 71.5%, respectively. On the top 20% of predictions with the highest confidence, we can achieve a balanced accuracy of up to 91%. The success of TRUSWorthy demonstrates the potential of integrated deep learning solutions to meet clinical needs in a highly challenging deployment setting, and is a significant step towards creating a trustworthy system for computer-assisted PCa diagnosis.
[44] arXiv:2502.14729 [pdf, html, other]: Title: Leveraging Error Resilience of Iterative Algorithms for Energy Efficiency: from Concept to Implementation

G.A. Gillani, A. Krapukhin, A.B.J. Kokkeler

Comments: 22 pages, 13 figures

Subjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR)

Iterative algorithms are widely used in digital signal processing applications. With the case study of radio astronomy calibration processing, this work contributes towards revealing and exploiting the intrinsic error resilience of iterative algorithms for energy efficiency benefits. We consider iterative methods that use a convergence criterion as a quality metric to terminate the iterative computations. We propose an adaptive statistical approximation model for high-level resilience analysis that provides an opportunity to divide an iterative algorithm into exact and approximate iterations. We realize an energy-efficient accelerator based on a heterogeneous architecture, where the heterogeneity is introduced using accurate and approximate processing cores. Our proposed methodology exploits the error-resilience of the algorithm, where initial iterations are processed on approximate modules while the later ones on accurate modules. The proposed accelerator design does not increase the number of iterations as compared to that of an accurate counterpart and provides sufficient precision to converge to an acceptable solution. Our implementation using TSMC 40nm Low Power (TCBN40LP) technology shows 23% savings in electrical energy consumption.
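The idea of running early iterations approximately and later ones accurately can be sketched in software. The example below is only an analogy: it uses half-precision arithmetic as a stand-in for approximate processing cores and a Jacobi solver as a stand-in for the calibration algorithm, neither of which is taken from the paper. The function name and parameters are hypothetical.

```python
import numpy as np

def jacobi_mixed(A, b, approx_iters=5, tol=1e-10, max_iters=500):
    """Jacobi iteration: the first `approx_iters` sweeps run in float16, the rest in float64."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    d = np.diag(A)               # diagonal entries
    R = A - np.diag(d)           # off-diagonal part
    x = np.zeros_like(b)
    for k in range(max_iters):
        # Assumption: float16 emulates an approximate core, float64 an accurate one.
        dtype = np.float16 if k < approx_iters else np.float64
        x_new = (b.astype(dtype) - R.astype(dtype) @ x.astype(dtype)) / d.astype(dtype)
        x_new = x_new.astype(np.float64)
        # The convergence criterion is only trusted in the accurate phase.
        if k >= approx_iters and np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iters

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])   # strictly diagonally dominant, so Jacobi converges
b = np.array([1.0, 2.0, 3.0])
x_mixed, iters = jacobi_mixed(A, b)
```

Because the iteration is self-correcting, errors introduced in the approximate sweeps are removed by the accurate ones, and the final answer still meets the convergence criterion.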
[45] arXiv:2502.14730 [pdf, html, other]: Title: Reconfigurable Intelligent Surface for OFDM Radar Interference Mitigation

Ali Parchekani, Milad Johnny, Shahrokh Valaee

Subjects: Signal Processing (eess.SP)

This paper introduces a method to reduce interference in OFDM radar systems through the use of reconfigurable intelligent surfaces (RIS). The method involves adjusting the RIS elements to diminish interference effects and improve the clarity of the desired signal. A neural network framework is established to optimize the configurations of the RIS, aiming to lower the power from unwanted sources while enhancing the target signal. The network produces settings that focus on maximizing the signal at the intended angle. Utilizing a convolution-based approach, we illustrate the effective tuning of RIS elements for interference mitigation and the creation of nulls in the direction of interference, resulting in a better signal-to-interference-and-noise ratio (SINR). Simulations confirm the effectiveness of the proposed method in a radar context, demonstrating its capability to enhance target detection while reducing interference.
[46] arXiv:2502.14753 [pdf, html, other]: Title: MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders

Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at this https URL.
[47] arXiv:2502.14807 [pdf, html, other]: Title: FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representation of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.
[48] arXiv:2502.14853 [pdf, html, other]: Title: On the $H$-property for Step-graphons: Residual Case

Wanting Gao, Xudong Chen

Subjects: Systems and Control (eess.SY)

We sample graphs $G_n$ on $n$ nodes from a step-graphon and evaluate the probability that $G_n$ has a Hamiltonian decomposition in the asymptotic regime as $n\to\infty$. It has recently been shown that for almost all step-graphons, this probability converges to either zero or one. In this paper, we focus on the class of step-graphons such that the zero-one property does not hold. We show in this case that the limit of the probability still exists and provide an explicit expression of it.

[49] arXiv:2502.12778 (cross-list from math.CO) [pdf, html, other]: Title: Toeplitz Unlabeled Sensing

Xin Hong, Manolis C. Tsakiris

Comments: 10 pages

Subjects: Combinatorics (math.CO); Signal Processing (eess.SP)

Unlabeled sensing is the problem of recovering an element of a vector subspace of R^n, from its image under an unknown permutation of the coordinates and knowledge of the subspace. Here we study this problem for the special class of subspaces that admit a Toeplitz basis.
[50] arXiv:2502.13987 (cross-list from cs.GR) [pdf, html, other]: Title: SelfAge: Personalized Facial Age Transformation Using Self-reference Images

Taishi Ito, Yuki Endo, Yoshihiro Kanamori

Subjects: Graphics (cs.GR); Image and Video Processing (eess.IV)

Age transformation of facial images is a technique that edits a person's age-related appearance while preserving their identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-based method for personalized age transformation. Our diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. To reflect individual-specific features, we incorporate additional supervision using self-reference images, which are facial images of the same person at different ages. Specifically, we fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. Additionally, we design an effective prompt to enhance the performance of age editing and identity preservation. Experiments demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to existing methods. The code and the pretrained model are available at this https URL.
[51] arXiv:2502.14007 (cross-list from cs.GR) [pdf, html, other]: Title: d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Comments: Accepted in The International Conference on Pattern Recognition (ICPR) 2024

Subjects: Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
[52] arXiv:2502.14013 (cross-list from cs.GR) [pdf, html, other]: Title: Appeal prediction for AI up-scaled Images

Steve Göring, Rasmus Merten, Alexander Raake

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

DNN- or AI-based up-scaling algorithms are gaining in popularity due to the improvements in machine learning. Various up-scaling models using CNNs, GANs or mixed approaches have been published. The majority of models are evaluated using PSNR and SSIM or only a few example images. However, a performance evaluation with a wide range of real-world images and subjective evaluation is missing, which we tackle in this paper. For this reason, we describe our developed dataset, which uses 136 base images and five different up-scaling methods, namely Real-ESRGAN, BSRGAN, waifu2x, KXNet, and Lanczos. Overall the dataset consists of 1496 annotated images. The labeling of our dataset focused on image appeal and has been performed using crowd-sourcing employing our open-source tool AVRate Voyager. We evaluate the appeal of the different methods, and the results indicate that Real-ESRGAN and BSRGAN are the best. Furthermore, we train a DNN to detect which up-scaling method has been used; the trained models have a good overall performance in our evaluation. In addition, we evaluate state-of-the-art image appeal and quality models; none of the models showed a high prediction performance, so we also trained two approaches of our own. The first uses transfer learning and has the best performance, and the second model uses signal-based features and a random forest model with good overall performance. We share the data and implementation to allow further research in the context of open science.
[53] arXiv:2502.14068 (cross-list from cs.CV) [pdf, html, other]: Title: A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing

Shreya Ghosh, Yi-Huan Chen, Ching-Hsiang Huang, Abu Shafin Mohammad Mahdee Jameel, Chien Chou Ho, Aly El Gamal, Samuel Labi

Comments: Currently Under Review

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at this http URL.
[54] arXiv:2502.14092 (cross-list from cs.RO) [pdf, html, other]: Title: Hybrid Visual Servoing of Tendon-driven Continuum Robots

Rana Danesh, Farrokh Janabi-Sharifi, Farhad Aghili

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

This paper introduces a novel Hybrid Visual Servoing (HVS) approach for controlling tendon-driven continuum robots (TDCRs). The HVS system combines Image-Based Visual Servoing (IBVS) with Deep Learning-Based Visual Servoing (DLBVS) to overcome the limitations of each method and improve overall performance. IBVS offers higher accuracy and faster convergence in feature-rich environments, while DLBVS enhances robustness against disturbances and offers a larger workspace. By enabling smooth transitions between IBVS and DLBVS, the proposed HVS ensures effective control in dynamic, unstructured environments. The effectiveness of this approach is validated through simulations and real-world experiments, demonstrating that HVS achieves reduced iteration time, faster convergence, lower final error, and smoother performance compared to DLBVS alone, while maintaining DLBVS's robustness in challenging conditions such as occlusions, lighting changes, actuator noise, and physical impacts.
[55] arXiv:2502.14110 (cross-list from cs.SD) [pdf, html, other]: Title: On the application of Visibility Graphs in the Spectral Domain for Speaker Recognition

Hernan Bocaccio, Sergio Iglesias-Pérez, Miguel Romance, Regino Criado, Gabriel B. Mindlin

Comments: 13 pages, 5 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

In this study, we explore the potential of visibility graphs in the spectral domain for speaker recognition. Adult participants were instructed to record vocalizations of the five Spanish vowels. For each vocalization, we computed the frequency spectrum considering the source-filter model of speech production, where formants are shaped by the vocal tract acting as a passive filter with resonant frequencies. Spectral profiles exhibited consistent intra-speaker characteristics, reflecting individual vocal tract anatomies, while showing variation between speakers. We then constructed visibility graphs from these spectral profiles and extracted various graph-theoretic metrics to capture their topological features. These metrics were assembled into feature vectors representing the five vowels for each speaker. Using an ensemble of decision trees trained on these features, we achieved high accuracy in speaker identification. Our analysis identified key topological features that were critical in distinguishing between speakers. This study demonstrates the effectiveness of visibility graphs for spectral analysis and their potential in speaker recognition. We also discuss the robustness of this approach, offering insights into its applicability for real-world speaker recognition systems. This research contributes to expanding the feature extraction toolbox for speaker recognition by leveraging the topological properties of speech signals in the spectral domain.
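To make the graph construction concrete, here is a minimal sketch that builds a visibility graph from a sampled profile and extracts node degrees as one possible graph-theoretic feature. It assumes the natural visibility criterion with sample indices as the horizontal axis; the abstract does not state which visibility variant or which metrics were used, and the function names and the toy `spectrum` are hypothetical.

```python
def natural_visibility_edges(y):
    """Edges (i, j) such that every sample between i and j lies strictly below the line joining them."""
    n = len(y)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                y[k] < y[i] + (y[j] - y[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges

def node_degrees(n, edges):
    """Degree of each node, a simple topological feature of the graph."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

# Hypothetical toy spectral profile (magnitude per frequency bin).
spectrum = [1.0, 3.0, 2.0, 4.0]
edges = natural_visibility_edges(spectrum)
degree_features = node_degrees(len(spectrum), edges)
```

Peaks in the profile tend to become high-degree nodes because they see many other samples, which is how spectral shape ends up encoded in the graph topology.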
[56] arXiv:2502.14120 (cross-list from cs.LG) [pdf, html, other]: Title: A Supervised Machine-Learning Approach For Turboshaft Engine Dynamic Modeling Under Real Flight Conditions

Damiano Paniccia, Francesco Aldo Tucci, Joel Guerrero, Luigi Capone, Nicoletta Sanguini, Tommaso Benacchio, Luigi Bottasso

Comments: 26 pages, 14 figures, submitted to the Aeronautical Journal

Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Rotorcraft engines are highly complex, nonlinear thermodynamic systems that operate under varying environmental and flight conditions. Simulating their dynamics is crucial for design, fault diagnostics, and deterioration control phases, and requires robust and reliable control systems to estimate engine performance throughout the flight envelope. However, the development of detailed physical models of the engine based on numerical simulations is a very challenging task due to the complex and entangled physics driving the engine. In this scenario, data-driven machine-learning techniques are of great interest to the aircraft engine community, due to their ability to describe nonlinear systems' dynamic behavior and enable online performance estimation, achieving excellent results with accuracy competitive with the state of the art. In this work, we explore different Neural Network architectures to model the turboshaft engine of Leonardo's AW189P4 prototype, aiming to predict the engine torque. The models are trained on an extensive database of real flight tests featuring a variety of operational maneuvers performed under different flight conditions, providing a comprehensive representation of the engine's performance. To complement the neural network approach, we apply Sparse Identification of Nonlinear Dynamics (SINDy) to derive a low-dimensional dynamical model from the available data, describing the relationship between fuel flow and engine torque. The resulting model showcases SINDy's capability to recover the actual physics underlying the engine dynamics and demonstrates its potential for investigating more complex aspects of the engine. The results prove that data-driven engine models can exploit a wider range of parameters than standard transfer function-based approaches, enabling the use of trained schemes to simulate nonlinear effects in different engines and helicopters.
[57] arXiv:2502.14145 (cross-list from cs.CL) [pdf, html, other]: Title: LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

Comments: In submission to INTERSPEECH 2025

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
[58] arXiv:2502.14178 (cross-list from cs.GR) [pdf, html, other]: Title: NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis

Xiaoxing Liu, Zhilei Liu, Chongke Bi

Comments: Accepted by ICASSP 2025

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Talking head synthesis aims to synthesize a lip-synchronized talking head video from audio. Recently, the capability of NeRF to enhance the realism and texture details of synthesized talking heads has attracted the attention of researchers. However, most current NeRF methods based on audio are exclusively concerned with the rendering of frontal faces. These methods are unable to generate clear talking heads in novel views. Another prevalent challenge in current 3D talking head synthesis is the difficulty in aligning acoustic and visual spaces, which often results in suboptimal lip-syncing of the generated talking heads. To address these issues, we propose Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis (NeRF-3DTalker). Specifically, the proposed method employs 3D prior information to synthesize clear talking heads with free views. Additionally, we propose a 3D Prior Aided Audio Disentanglement module, which is designed to disentangle the audio into two distinct categories: features related to 3D-aware speech movements and features related to speaking style. Moreover, to reposition the generated frames that are distant from the speaker's motion space in the real space, we have devised a local-global Standardized Space. This method normalizes the irregular positions in the generated frames from both global and local semantic perspectives. Through comprehensive qualitative and quantitative experiments, we demonstrate that NeRF-3DTalker outperforms state-of-the-art methods in synthesizing realistic talking head videos, exhibiting superior image quality and lip synchronization. Project page: this https URL.
[59] arXiv:2502.14190 (cross-list from cs.CV) [pdf, html, other]: Title: Stereo Image Coding for Machines with Joint Visual Feature Compression

Dengchao Jin, Jianjun Lei, Bo Peng, Zhaoqing Pan, Nam Ling, Qingming Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo images. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSFC-Net) is proposed for SICM, where the stereo visual features are effectively extracted, compressed, and transmitted for 3D visual tasks. To efficiently compress stereo visual features in MVSFC-Net, a stereo multi-scale feature compression (SMFC) module is designed to gradually transform sparse stereo multi-scale features into compact joint visual representations by removing spatial, inter-view, and cross-scale redundancies simultaneously. Experimental results show that the proposed MVSFC-Net obtains superior compression efficiency as well as 3D visual task performance when compared with the existing ICM anchors recommended by MPEG and the state-of-the-art SIC method.
[60] arXiv:2502.14198 (cross-list from cs.IT) [pdf, html, other]: Title: Antenna Position and Beamforming Optimization for Movable Antenna Enabled ISAC: Optimal Solutions and Efficient Algorithms

Lebin Chen, Ming-Min Zhao, Min-Jian Zhao, Rui Zhang

Comments: 13 pages, 7 figures

Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

In this paper, we propose an integrated sensing and communication (ISAC) system enabled by movable antennas (MAs), which can dynamically adjust antenna positions to enhance both sensing and communication performance for future wireless networks. To characterize the benefits of MA-enabled ISAC systems, we first derive the Cramér-Rao bound (CRB) for angle estimation error, which is then minimized for optimizing the antenna position vector (APV) and beamforming design, subject to a pre-defined signal-to-noise ratio (SNR) constraint to ensure the communication performance. In particular, for the case with receive MAs only, we provide a closed-form optimal antenna position solution, and show that employing MAs over conventional fixed-position antennas (FPAs) can achieve a sensing performance gain upper-bounded by 4.77 dB. On the other hand, for the case with transmit MAs only, we develop a boundary traversal breadth-first search (BT-BFS) algorithm to obtain the global optimal solution in the line-of-sight (LoS) channel scenario, along with a lower-complexity boundary traversal depth-first search (BT-DFS) algorithm to find a local optimal solution efficiently. In the scenario with non-LoS (NLoS) channels, a majorization-minimization (MM) based Rosen's gradient projection (RGP) algorithm with an efficient initialization method is proposed to obtain stationary solutions for the considered problem, which can be extended to the general case with both transmit and receive MAs. Extensive numerical results are presented to verify the effectiveness of the proposed algorithms, and demonstrate the superiority of the considered MA-enabled ISAC system over conventional ISAC systems with FPAs in terms of the sensing and communication performance trade-off.
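For readers who want the 4.77 dB bound in linear terms, the short conversion below may help. It assumes the gain is a power-like ratio (for example, a ratio of CRB values), which is the usual convention for decibel figures but is not stated explicitly in the abstract; the variable names are illustrative.

```python
import math

gain_db = 4.77                                # upper bound quoted in the abstract
linear_gain = 10 ** (gain_db / 10)            # dB -> linear ratio, assuming a power-like quantity
db_of_factor_three = 10 * math.log10(3)       # reverse check for a factor of exactly 3
```

Under that assumption, the bound corresponds to roughly a threefold improvement of the sensing metric over fixed-position antennas.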
[61] arXiv:2502.14210 (cross-list from math.OC) [pdf, html, other]: Title: Sample Complexity of Linear Quadratic Regulator Without Initial Stability

Amirreza Neshaei Moghaddam, Alex Olshevsky, Bahman Gharesifard

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)

Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown parameters. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity. Furthermore, it eliminates the restrictive requirement of starting with a stable initial policy, broadening its applicability. Beyond these improvements, we introduce a refined analysis of error propagation through the contraction of the Riemannian distance over the Riccati operator. This refinement leads to a better sample complexity and ensures improved convergence guarantees. Numerical simulations validate the theoretical results, demonstrating the method's practical feasibility and performance in realistic scenarios.
[62] arXiv:2502.14222 (cross-list from cs.DB) [pdf, other]: Title: Enhancing Pavement Sensor Data Acquisition for AI-Driven Transportation Research

Manish Kumar Krishne Gowda, Andrew Balmos, Shin Boonam, James V. Krogmeier

Comments: This paper was accepted for presentation at the 104th TRB Annual Meeting, held on January 5-9, 2025, in Washington, D.C., and was presented during the poster session on January 8, 2025

Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Effective strategies for sensor data management are essential for advancing transportation research, especially in the current data-driven era, due to the advent of novel applications in artificial intelligence. This paper presents comprehensive guidelines for managing transportation sensor data, encompassing both archived static data and real-time data streams. The real-time system architecture integrates various applications with data acquisition systems (DAQ). By deploying the in-house designed, open-source Avena software platform alongside the NATS messaging system as a secure communication broker, reliable data exchange is ensured. While robust databases like TimescaleDB facilitate organized storage, visualization platforms like Grafana provide real-time monitoring capabilities.
In contrast, static data standards address the challenges in handling unstructured, voluminous datasets. The standards advocate for a combination of cost-effective bulk cloud storage for unprocessed sensor data and relational databases for recording summarized analyses. They highlight the role of cloud data transfer tools like FME for efficient migration of sensor data from local storage to the cloud. Further, integration of robust visualization tools into the framework helps in deriving patterns and trends from these complex datasets.
The proposals were applied to INDOT's real-world case studies involving the I-65 and I-69 Greenfield districts. For real-time data collection, Campbell Scientific DAQ systems were used, enabling continuous generation and monitoring of sensor metrics. In the case of the archived I-69 database, summary data was compiled in Oracle, while the unprocessed data was stored in SharePoint. The results underline the effectiveness of the proposed guidelines and motivate their adoption in research projects.
[63] arXiv:2502.14226 (cross-list from cs.CV) [pdf, html, other]: Title: Designing Parameter and Compute Efficient Diffusion Transformers using Distillation

Vignesh Sundaresha

Comments: 4 pages

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models like DALL·E, Stable Diffusion and Sora. Though these models are necessary in many low-latency applications like Augmented/Virtual Reality, they cannot be deployed on resource-constrained Edge devices (like Apple Vision Pro or Meta Ray-Ban glasses) due to their huge computational complexity. To overcome this, we turn to knowledge distillation and perform a thorough design-space exploration to achieve the best DiT for a given parameter size. In particular, we provide principles for how to choose design knobs such as depth, width, attention heads and distillation setup for a DiT. During the process, a three-way trade-off emerges between model performance, size and speed that is crucial for Edge implementation of diffusion. We also propose two distillation approaches, the Teaching Assistant (TA) method and the Multi-In-One (MI1) method, to perform feature distillation in the DiT context. Unlike existing solutions, we demonstrate and benchmark the efficacy of our approaches on practical Edge devices such as NVIDIA Jetson Orin Nano.
[64] arXiv:2502.14231 (cross-list from cs.RO) [pdf, html, other]: Title: Real-Time Sampling-based Online Planning for Drone Interception

Gilhyun Ryou, Lukas Lao Beyer, Sertac Karaman

Comments: Accepted at ICRA 2025. Supplementary video: this https URL

Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

This paper studies high-speed online planning in dynamic environments. The problem requires finding time-optimal trajectories that conform to system dynamics, meeting computational constraints for real-time adaptation, and accounting for uncertainty from environmental changes. To address these challenges, we propose a sampling-based online planning algorithm that leverages neural network inference to replace time-consuming nonlinear trajectory optimization, enabling rapid exploration of multiple trajectory options under uncertainty. The proposed method is applied to the drone interception problem, where a defense drone must intercept a target while avoiding collisions and handling imperfect target predictions. The algorithm efficiently generates trajectories toward multiple potential target drone positions in parallel. It then assesses trajectory reachability by comparing traversal times with the target drone's predicted arrival time, ultimately selecting the minimum-time reachable trajectory. Through extensive validation in both simulated and real-world environments, we demonstrate our method's capability for high-rate online planning and its adaptability to unpredictable movements in unstructured settings.
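The final selection step described above (compare each trajectory's traversal time with the target's predicted arrival time, then take the fastest reachable option) can be sketched as follows. This is a minimal illustration, not the authors' planner: the `Candidate` fields and the function name are hypothetical, and the traversal times are assumed to have been produced already by whatever trajectory generator is in use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Candidate:
    """One sampled interception option (all fields are illustrative)."""
    label: str
    traversal_time: float   # time the interceptor needs to reach the point [s]
    target_arrival: float   # predicted time the target reaches the point [s]

def select_min_time_reachable(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the fastest candidate the interceptor can reach before the target.

    A candidate counts as reachable when the interceptor's traversal time does
    not exceed the target's predicted arrival time. Returns None if nothing
    is reachable.
    """
    reachable = [c for c in candidates if c.traversal_time <= c.target_arrival]
    if not reachable:
        return None
    return min(reachable, key=lambda c: c.traversal_time)

if __name__ == "__main__":
    options = [
        Candidate("A", traversal_time=2.0, target_arrival=1.5),  # too late
        Candidate("B", traversal_time=3.0, target_arrival=3.5),
        Candidate("C", traversal_time=2.5, target_arrival=4.0),
    ]
    print(select_min_time_reachable(options))
```

Candidate A is the quickest to fly but is discarded because the target would pass that point first, so the selection falls to C.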
[65] arXiv:2502.14238 (cross-list from cs.RO) [pdf, html, other]: Title: No Minima, No Collisions: Combining Modulation and Control Barrier Function Strategies for Feasible Dynamical Collision Avoidance

Yifan Xue, Nadia Figueroa

Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

As prominent real-time safety-critical reactive control techniques, Control Barrier Function Quadratic Programs (CBF-QPs) work for control affine systems in general but result in local minima in the generated trajectories and consequently cannot ensure convergence to the goals. In contrast, Modulation of Dynamical Systems (Mod-DSs), including normal, reference, and on-manifold Mod-DS, achieve obstacle avoidance with few and even no local minima but have trouble optimally minimizing the difference between the constrained and the unconstrained controller outputs, and their applications are limited to fully-actuated systems. We dive into the theoretical foundations of CBF-QP and Mod-DS, proving that despite their distinct origins, normal Mod-DS is a special case of CBF-QP, and reference Mod-DS's solutions are mathematically connected to those of the CBF-QP through one equation. Building on top of the unveiled theoretical connections between CBF-QP and Mod-DS, reference Mod-based CBF-QP and on-manifold Mod-based CBF-QP controllers are proposed to combine the strengths of CBF-QP and Mod-DS approaches and realize local-minimum-free reactive obstacle avoidance for control affine systems in general. We validate our methods in both simulated hospital environments and real-world experiments using Ridgeback for fully-actuated systems and Fetch robots for underactuated systems. Mod-based CBF-QPs outperform CBF-QPs as well as the optimally constraint-enforcing Mod-DS approaches we proposed in all experiments.
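For readers unfamiliar with the baseline, a CBF-QP with a single barrier constraint has a closed-form solution: it projects the nominal control onto the half-space defined by the barrier condition. The sketch below is a textbook-style illustration, not the paper's Mod-based controllers. It assumes a single-integrator system (x' = u), one circular obstacle with barrier h(x) = ||x - c||^2 - r^2, and a linear class-K term alpha·h; the function name and parameters are hypothetical.

```python
import numpy as np

def cbf_qp_single_integrator(x, u_nom, center, radius, alpha=1.0):
    """Solve min ||u - u_nom||^2  s.t.  grad_h(x) . u + alpha * h(x) >= 0.

    Assumed dynamics: x' = u. Assumed barrier: h(x) = ||x - center||^2 - radius^2.
    With one affine constraint the QP reduces to a half-space projection.
    """
    x = np.asarray(x, dtype=float)
    u_nom = np.asarray(u_nom, dtype=float)
    center = np.asarray(center, dtype=float)

    h = float(np.dot(x - center, x - center) - radius ** 2)
    a = 2.0 * (x - center)          # gradient of h, so the constraint is a . u >= b
    b = -alpha * h

    violation = b - float(a @ u_nom)
    if violation <= 0.0:
        return u_nom                # nominal control already satisfies the constraint
    if np.allclose(a, 0.0):
        raise ValueError("gradient of h vanishes; constraint cannot be enforced")
    return u_nom + (violation / float(a @ a)) * a

if __name__ == "__main__":
    # Robot left of a unit-radius obstacle at the origin, driving straight at it.
    print(cbf_qp_single_integrator([-2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0))
```

In this head-on configuration the filter only slows the robot along the line to the obstacle and never steers around it, which is one way to see the local-minimum behavior that the abstract attributes to plain CBF-QPs.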
[66] arXiv:2502.14405 (cross-list from cs.SD) [pdf, html, other]: Title: Differentiable Black-box and Gray-box Modeling of Nonlinear Audio Effects

Marco Comunità, Christian J. Steinmetz, Joshua D. Reiss

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Audio effects are extensively used at every stage of audio and music content creation. The majority of differentiable audio effects modeling approaches fall into the black-box or gray-box paradigms, and most models have been proposed and applied to nonlinear effects like guitar amplifiers, overdrive, distortion, fuzz and compressor. Although a plethora of architectures have been introduced for the task at hand, there is still a lack of understanding of the state of the art, since most publications experiment with one type of nonlinear audio effect and a very small number of devices.
In this work we aim to shed light on the audio effects modeling landscape by comparing black-box and gray-box architectures on a large number of nonlinear audio effects, identifying the most suitable for a wide range of devices. In the process, we also: introduce time-varying gray-box models and propose models for compressor, distortion and fuzz, publish a large dataset for audio effects research - ToneTwist AFx this https URL - that is also the first open to community contributions, evaluate models on a variety of metrics and conduct extensive subjective evaluation. Code this https URL and supplementary material this https URL are also available.
[67] arXiv:2502.14514 (cross-list from cs.RO) [pdf, html, other]: Title: A Mobile Robotic Approach to Autonomous Surface Scanning in Legal Medicine

Sarah Grube, Sarah Latus, Martin Fischer, Vidas Raudonis, Axel Heinemann, Benjamin Ondruschka, Alexander Schlaefer

Comments: Submitted and accepted for presentation at CARS 2025. This preprint has not undergone peer review or post-submission revisions. The final version of this work will appear in the official CARS 2025 proceedings

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Purpose: Comprehensive legal medicine documentation includes both an internal and an external examination of the corpse. Typically, this documentation is conducted manually during conventional autopsy. A systematic digital documentation would be desirable, especially for the external examination of wounds, which is becoming more relevant for legal medicine analysis. For this purpose, RGB surface scanning has been introduced. While a manual full surface scan using a handheld camera is time-consuming and operator dependent, floor or ceiling mounted robotic systems require substantial space and a dedicated room. Hence, we consider whether a mobile robotic system can be used for external documentation. Methods: We develop a mobile robotic system that enables full-body RGB-D surface scanning. Our work includes a detailed configuration space analysis to identify the environmental parameters that need to be considered to successfully perform a surface scan. We validate our findings through an experimental study in the lab and demonstrate the system's application in a legal medicine environment. Results: Our configuration space analysis shows that a good trade-off between coverage and time is reached with three robot base positions, leading to a coverage of 94.96 %. Experiments validate the effectiveness of the system in accurately capturing body surface geometry with an average surface coverage of 96.90 ± 3.16 % and 92.45 ± 1.43 % for a body phantom and actual corpses, respectively. Conclusion: This work demonstrates the potential of a mobile robotic system to automate RGB-D surface scanning in legal medicine, complementing the use of post-mortem CT scans for inner documentation. Our results indicate that the proposed system can contribute to more efficient and autonomous legal medicine documentation, reducing the need for manual intervention.
[68] arXiv:2502.14627 (cross-list from cs.SD) [pdf, html, other]: Title: ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors

Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies in instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and propose a theoretical weight error upper bound for quantifying the inconsistency. Based on the analysis of the weight error upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at this https URL.
[69] arXiv:2502.14673 (cross-list from cs.SD) [pdf, html, other]: Title: ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

Khanh Le, Tuan Vu Ho, Dung Tran, Duc Thanh Chau

Comments: Accepted to ICASSP 2025

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction on word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.
[70] arXiv:2502.14685 (cross-list from cs.SD) [pdf, html, other]: Title: SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition

Khanh Le, Tuan Vu Ho, Dung Tran, Duc Thanh Chau

Comments: Accepted to ICASSP 2025

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

RNN-Transducer (RNN-T) is a widely adopted architecture in speech recognition, integrating acoustic and language modeling in an end-to-end framework. However, the RNN-T predictor tends to over-rely on consecutive word dependencies in training data, leading to high deletion error rates, particularly with less common or out-of-domain phrases. Existing solutions, such as regularization and data augmentation, often compromise other aspects of performance. We propose SegAug, an alignment-based augmentation technique that generates contextually varied audio-text pairs with low sentence-level semantics. This method encourages the model to focus more on acoustic features while diversifying the learned textual patterns of its internal language model, thereby reducing deletion errors and enhancing overall performance. Evaluations on the LibriSpeech and Tedlium-v3 datasets demonstrate a relative WER reduction of up to 12.5% on small-scale and 6.9% on large-scale settings. Notably, most of the improvement stems from reduced deletion errors, with relative reductions of 45.4% and 18.5%, respectively. These results highlight SegAug's effectiveness in improving RNN-T's robustness, offering a promising solution for enhancing speech recognition performance across diverse and challenging scenarios.
[71] arXiv:2502.14720 (cross-list from physics.app-ph) [pdf, html, other]: Title: Advancing Measurement Capabilities in Lithium-Ion Batteries: Exploring the Potential of Fiber Optic Sensors for Thermal Monitoring of Battery Cells

Florian Krause, Felix Schweizer, Alexandra Burger, Franziska Ludewig, Marcus Knips, Katharina Quade, Andreas Wuersig, Dirk Uwe Sauer

Subjects: Applied Physics (physics.app-ph); Systems and Control (eess.SY)

This work demonstrates the potential of fiber optic sensors for measuring thermal effects in lithium-ion batteries, using a fiber optic measurement method of Optical Frequency Domain Reflectometry (OFDR). The innovative application of fiber sensors allows for spatially resolved temperature measurement, particularly emphasizing the importance of monitoring not just the exterior but also the internal conditions within battery cells. Utilizing inert glass fibers as sensors, which exhibit minimal sensitivity to electric fields, opens up new pathways for their implementation in a wide range of applications, such as battery monitoring. The sensors used in this work provide real-time information along the entire length of the fiber, unlike commonly used Fiber Bragg Grating (FBG) sensors. It is shown that using the herein presented novel sensors in a temperature range of 0 to 80 degrees Celsius reveals a linear thermal dependency with high sensitivity and a local resolution of a few centimeters. Furthermore, this study presents preliminary findings on the potential application of fiber optic sensors in lithium-ion battery (LIB) cells, demonstrating that the steps required for battery integration do not impose any restrictive effects on thermal measurements.
[72] arXiv:2502.14726 (cross-list from cs.SD) [pdf, html, other]: Title: Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor

Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)

Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low-level audio features or optimization black-box model training, focusing on the features that humans use to recognize speech will likely be a more long-term robust approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter) as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic features-based approach over existing models by applying an adaptive adversary using an $L_{\infty}$ norm attack against the detectors and using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have the highest impact on the model's decision (Jitter, Shimmer and Mean Fundamental Frequency) and that other models are extremely susceptible to simple $L_{\infty}$ norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits of a prosody feature approach to audio deepfake detection.
[73] arXiv:2502.14727 (cross-list from cs.SD) [pdf, html, other]: Title: WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models

Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
[74] arXiv:2502.14738 (cross-list from stat.ML) [pdf, html, other]: Title: Robust Information Selection for Hypothesis Testing with Misclassification Penalties

Jayanth Bhargav, Shreyas Sundaram, Mahsa Ghasemi

Comments: 23 pages, 2 figures

Subjects: Machine Learning (stat.ML); Signal Processing (eess.SP); Systems and Control (eess.SY); Combinatorics (math.CO); Optimization and Control (math.OC)

We study the problem of robust information selection for a Bayesian hypothesis testing / classification task, where the goal is to identify the true state of the world from a finite set of hypotheses based on observations from the selected information sources. We introduce a novel misclassification penalty framework, which enables non-uniform treatment of different misclassification events. Extending the classical subset selection framework, we study the problem of selecting a subset of sources that minimize the maximum penalty of misclassification under a limited budget, despite deletions or failures of a subset of the selected sources. We characterize the curvature properties of the objective function and propose an efficient greedy algorithm with performance guarantees. Next, we highlight certain limitations of optimizing for the maximum penalty metric and propose a submodular surrogate metric to guide the selection of the information set. We propose a greedy algorithm with near-optimality guarantees for optimizing the surrogate metric. Finally, we empirically demonstrate the performance of our proposed algorithms in several instances of the information set selection problem.
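As a rough illustration of budget-constrained greedy selection, the sketch below repeatedly adds the affordable source with the best marginal gain per unit cost. It is a generic cost-benefit greedy, not the authors' algorithm: it ignores the robustness to deletions or failures that the paper addresses, and the set function `value`, the `costs` dictionary, and the function name are all hypothetical. Costs are assumed to be strictly positive.

```python
from typing import Callable, Dict, FrozenSet, Set

def greedy_select(
    costs: Dict[str, float],
    budget: float,
    value: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Greedily pick sources by marginal gain per unit cost within a budget."""
    chosen: Set[str] = set()
    spent = 0.0
    while True:
        base = value(frozenset(chosen))
        best, best_ratio = None, 0.0
        for source, cost in costs.items():
            if source in chosen or spent + cost > budget:
                continue                      # already picked or unaffordable
            gain = value(frozenset(chosen | {source})) - base
            ratio = gain / cost
            if ratio > best_ratio:            # keep only strictly improving picks
                best, best_ratio = source, ratio
        if best is None:
            return chosen                     # nothing affordable improves the value
        chosen.add(best)
        spent += costs[best]

if __name__ == "__main__":
    # Toy submodular objective: number of distinct items covered.
    covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}

    def coverage(selected: FrozenSet[str]) -> float:
        return float(len(set().union(*(covers[s] for s in selected)))) if selected else 0.0

    print(greedy_select({"a": 2.0, "b": 1.0, "c": 2.0}, budget=3.0, value=coverage))
```

The toy coverage objective stands in for a submodular surrogate. The paper's surrogate metric and its guarantees are not reproduced here.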
[75] arXiv:2502.14741 (cross-list from cs.NI) [pdf, html, other]: Title: Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse

Michael Doherty, Alejandra Beghelli

Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Many works have investigated reinforcement learning (RL) for routing and spectrum assignment on flex-grid networks but only one work to date has examined RL for fixed-grid with flex-rate transponders, despite production systems using this paradigm. Flex-rate transponders allow existing lightpaths to accommodate new services, a task we term routing and wavelength assignment with lightpath reuse (RWA-LR). We re-examine this problem and present a thorough benchmarking of heuristic algorithms for RWA-LR, which are shown to have 6% increased throughput when candidate paths are ordered by number of hops, rather than total length. We train an RL agent for RWA-LR with graph attention networks for the policy and value functions to exploit the graph-structured data. We provide details of our methodology and open source all of our code for reproduction. We outperform the previous state-of-the-art RL approach by 2.5% (17.4 Tbps mean additional throughput) and the best heuristic by 1.2% (8.5 Tbps mean additional throughput). This marginal gain highlights the difficulty in learning effective RL policies on long horizon resource allocation tasks.
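The two candidate-path orderings compared in the heuristic benchmark (by number of hops versus by total length) can be illustrated with a small sketch. The topology, link lengths, and function names below are hypothetical, and breaking hop-count ties by length is an assumption rather than something stated in the abstract.

```python
from typing import Dict, FrozenSet, List

Path = List[str]
LinkLengths = Dict[FrozenSet[str], float]

def path_length_km(path: Path, link_km: LinkLengths) -> float:
    """Sum the lengths of the undirected links along a node sequence."""
    return sum(link_km[frozenset((u, v))] for u, v in zip(path, path[1:]))

def order_by_hops(paths: List[Path], link_km: LinkLengths) -> List[Path]:
    """Fewest hops first; ties broken by total length (an assumption)."""
    return sorted(paths, key=lambda p: (len(p) - 1, path_length_km(p, link_km)))

def order_by_length(paths: List[Path], link_km: LinkLengths) -> List[Path]:
    """Shortest total length first."""
    return sorted(paths, key=lambda p: path_length_km(p, link_km))

if __name__ == "__main__":
    link_km = {
        frozenset(("A", "B")): 100.0,
        frozenset(("B", "C")): 100.0,
        frozenset(("C", "D")): 100.0,
        frozenset(("A", "E")): 200.0,
        frozenset(("E", "D")): 200.0,
    }
    paths = [["A", "B", "C", "D"], ["A", "E", "D"]]
    print(order_by_hops(paths, link_km)[0])    # the 2-hop, 400 km path
    print(order_by_length(paths, link_km)[0])  # the 3-hop, 300 km path
```

The two orderings disagree whenever a path with fewer hops is physically longer, as in this toy topology.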
[76] arXiv:2502.14783 (cross-list from cs.IT) [pdf, html, other]: Title: Tracking and Assigning Jobs to a Markov Machine

Subhankar Banerjee, Sennur Ulukus

Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)

We consider a time-slotted communication system with a machine, a cloud server, and a sampler. Job requests from the users are queued on the server to be completed by the machine. The machine has two states, namely, a busy state and a free state. The server can assign a job to the machine in a first-in-first-served manner. If the machine is free, it completes the job request from the server; otherwise, it drops the request. Upon dropping a job request, the server is penalized. When the machine is in the free state, the machine can get into the busy state with an internal job. When the server does not assign a job request to the machine, the state of the machine evolves as a symmetric Markov chain. If the machine successfully accepts the job request from the server, the state of the machine goes to the busy state and follows different dynamics from those that apply when the machine goes to the busy state due to an internal job. The sampler samples the state of the machine and sends it to the server via an error-free channel. Thus, the server can estimate the state of the machine upon receiving an update from the source. If the machine is in the free state but the estimated state at the server is busy, the sampler pays a cost. We incorporate the concept of the age of incorrect information to model the cost of the sampler. We aim to find an optimal sampling policy such that the cost of the sampler plus the penalty on the machine is minimized. We formulate this problem in a Markov decision process framework and find how an optimal policy changes with several associated parameters. We show that a threshold policy is optimal for this problem. We show a necessary and sufficient condition for a threshold policy to be optimal. Finally, we find the optimal threshold without bounding the state space.
[77] arXiv:2502.14784 (cross-list from cs.NI) [pdf, html, other]: Title: Online Resource Management for the Uplink of Wideband Hybrid Beamforming System

Yuan Quan, Haseen Rahman, Catherine Rosenberg

Comments: This paper has been accepted by 2025 IEEE International Conference on Communications

Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)

This paper studies the radio resource management (RRM) for the uplink (UL) of a cellular system with codebook-based hybrid beamforming. We consider the often neglected but highly practical multi-channel case with fewer radio frequency chains in the base station than user equipment (UEs) in the cell, assuming one RF chain per UE. As for any UL RRM, a per-time slot solution is needed as the allocation of power to subchannels by a UE can only be done once it knows which subchannels it has been allocated. The RRM in this system comprises beam selection, user selection and power allocation, three steps that are intricately coupled, and we will show that the order in which they are performed does impact performance and so does the amount of coupling that we take into account. Specifically, we propose four online sequential solutions with different orders in which the steps are called and of different complexities, i.e., different levels of coupling between the steps. Our extensive numerical campaign for a mmWave system shows how a well-designed heuristic that takes into account some level of coupling between the steps can make the performance exceedingly better than a benchmark.
[78] arXiv:2502.14803 (cross-list from cs.RO) [pdf, html, other]: Title: Planning, scheduling, and execution on the Moon: the CADRE technology demonstration mission

Gregg Rabideau, Joseph Russino, Andrew Branch, Nihal Dhamani, Tiago Stegun Vaquero, Steve Chien, Jean-Pierre de la Croix, Federico Rossi

Comments: To be presented at AAMAS 2025

Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

NASA's Cooperative Autonomous Distributed Robotic Exploration (CADRE) mission, slated for flight to the Moon's Reiner Gamma region in 2025/2026, is designed to demonstrate multi-agent autonomous exploration of the Lunar surface and sub-surface. A team of three robots and a base station will autonomously explore a region near the lander, collecting the data required for 3D reconstruction of the surface with no human input; and then autonomously perform distributed sensing with multi-static ground penetrating radars (GPR), driving in formation while performing coordinated radar soundings to create a map of the subsurface. At the core of CADRE's software architecture is a novel autonomous, distributed planning, scheduling, and execution (PS&E) system. The system coordinates the robots' activities, planning and executing tasks that require multiple robots' participation while ensuring that each individual robot's thermal and power resources stay within prescribed bounds, and respecting ground-prescribed sleep-wake cycles. The system uses a centralized-planning, distributed-execution paradigm, and a leader election mechanism ensures robustness to failures of individual agents. In this paper, we describe the architecture of CADRE's PS&E system; discuss its design rationale; and report on verification and validation (V&V) testing of the system on CADRE's hardware in preparation for deployment on the Moon.

[79] arXiv:1804.02980 (replaced) [pdf, other]: Title: Compact Formulation of the First Evolution Equation for Optimal Control Computation

Sheng Zhang, Fei Liao, Wei-Qi Qian

Comments: arXiv admin note: substantial text overlap with arXiv:1802.04663, arXiv:1801.10486, arXiv:1801.01383, arXiv:1802.02140, arXiv:1801.07395

Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)

The first evolution equation is derived under the Variation Evolving Method (VEM) that seeks optimal solutions with the variation evolution principle. To improve the performance, its compact form is developed. By replacing the states and costates variation evolution with that of the controls, the dimension-reduced Evolution Partial Differential Equation (EPDE) only solves the control variables along the variation time to get the optimal solution, and its definite conditions may be arbitrary. With this equation, the scale of the resulting Initial-value Problem (IVP), transformed via the semi-discrete method, is significantly reduced. Illustrative examples are solved and it is shown that the compact form evolution equation outperforms the primary form in the precision, and the efficiency may be higher for the dense discretization. Moreover, in discussing the connections to the classic iteration methods, it is uncovered that the computation scheme of the gradient method is the discrete implementation of the third evolution equation, and the compact form of the first evolution equation is a continuous realization of the Newton type iteration mechanism.
[80] arXiv:2203.07655 (replaced) [pdf, html, other]: Title: Joint Time-Vertex Fractional Fourier Transform

Tuna Alikaşifoğlu, Bünyamin Kartal, Eray Özgünay, Aykut Koç

Subjects: Signal Processing (eess.SP); Social and Information Networks (cs.SI)

Graph signal processing (GSP) facilitates the analysis of high-dimensional data on non-Euclidean domains by utilizing graph signals defined on graph vertices. In addition to static data, each vertex can provide continuous time-series signals, transforming graph signals into time-series signals on each vertex. The joint time-vertex Fourier transform (JFT) framework offers spectral analysis capabilities to analyze these joint time-vertex signals. Analogous to the fractional Fourier transform (FRT) extending the ordinary Fourier transform (FT), we introduce the joint time-vertex fractional Fourier transform (JFRT) as a generalization of JFT. The JFRT enables fractional analysis for joint time-vertex processing by extending Fourier analysis to fractional orders in both temporal and vertex domains. We theoretically demonstrate that JFRT generalizes JFT and maintains properties such as index additivity, reversibility, reduction to identity, and unitarity for specific graph topologies. Additionally, we derive Tikhonov regularization-based denoising in the JFRT domain, ensuring robust and well-behaved solutions. Comprehensive numerical experiments on synthetic and real-world datasets highlight the effectiveness of JFRT in denoising and clustering tasks that outperform state-of-the-art approaches.
[81] arXiv:2204.03077 (replaced) [pdf, html, other]: Title: Control Barrier Function based Attack-Recovery with Provable Guarantees

Kunal Garg, Ricardo G. Sanfelice, Alvaro A. Cardenas

Comments: V1: Conference version (IEEE CDC'2022); V2: Journal version (submitted to IEEE Transactions on Automatic Control)

Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)

This paper studies provable security guarantees for cyber-physical systems (CPS) under actuator attacks. In particular, we consider CPS safety and propose a new attack detection mechanism based on zeroing control barrier function (ZCBF) conditions. In addition, we design an adaptive recovery mechanism based on how close the system is to violating safety. We show that under certain conditions, the attack-detection mechanism is sound, i.e., there are no false negatives for adversarial attacks. We propose sufficient conditions for the initial conditions and input constraints so that the resulting CPS is secure by design. We also propose a novel hybrid control to account for attack detection delays and avoid Zeno behavior. Next, to efficiently compute the set of initial conditions, we propose a sampling-based method to verify whether a set is a viability domain. Specifically, we devise a method for checking a modified barrier function condition on a finite set of points to assess whether a set can be rendered forward invariant. Then, we propose an iterative algorithm to compute the set of initial conditions and the input constraint set to limit the effect of an adversary if it compromises vulnerable inputs. Finally, we use a Quadratic Programming (QP) approach for online recovery (as well as nominal) control synthesis. We demonstrate the effectiveness of the proposed method in a simulation case study involving a quadrotor with an attack on its motors.
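A generic ZCBF check and a single-constraint QP can be sketched as follows. This is not the paper's detector or recovery law: it assumes the textbook ZCBF condition L_f h + L_g h · u + α h ≥ 0 with a linear class-K term, ignores input constraints and detection delays, and uses a hypothetical single-integrator example with safe set h(x) = 1 − ‖x‖². The function names are illustrative.

```python
import numpy as np

def zcbf_residual(h, lf_h, lg_h, u, alpha=1.0):
    """Value of Lf h + Lg h . u + alpha * h; nonnegative means the condition holds."""
    return lf_h + float(np.dot(lg_h, u)) + alpha * h

def flag_attack(h, lf_h, lg_h, u_applied, alpha=1.0):
    """Flag the applied input if it violates the ZCBF condition (assumed rule)."""
    return zcbf_residual(h, lf_h, lg_h, u_applied, alpha) < 0.0

def cbf_qp(u_nom, h, lf_h, lg_h, alpha=1.0):
    """Solve min ||u - u_nom||^2 s.t. Lf h + Lg h . u + alpha * h >= 0.

    With one affine constraint and no input bounds, the QP solution is the
    projection of u_nom onto a half-space (requires lg_h != 0).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    lg_h = np.asarray(lg_h, dtype=float)
    r = zcbf_residual(h, lf_h, lg_h, u_nom, alpha)
    if r >= 0.0:
        return u_nom
    return u_nom - r * lg_h / np.dot(lg_h, lg_h)

# Hypothetical single integrator x' = u with h(x) = 1 - ||x||^2.
x = np.array([0.9, 0.0])
h = 1.0 - float(x @ x)          # 0.19
lf_h, lg_h = 0.0, -2.0 * x      # Lie derivatives of h for x' = u
u_bad = np.array([1.0, 0.0])    # pushes the state toward the boundary
attacked = flag_attack(h, lf_h, lg_h, u_bad)
u_safe = cbf_qp(u_bad, h, lf_h, lg_h)
```

The filtered input `u_safe` sits exactly on the constraint boundary, which is the expected behavior when the nominal input is infeasible.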
[82] arXiv:2210.06330 (replaced) [pdf, html, other]: Title: CoRRECT: A Deep Unfolding Framework for Motion-Corrected Quantitative R2* Mapping

Xiaojian Xu, Weijie Gan, Satya V.V.N. Kothapalli, Dmitriy A. Yablonskiy, Ulugbek S. Kamilov

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Quantitative MRI (qMRI) refers to a class of MRI methods for quantifying the spatial distribution of biological tissue parameters. Traditional qMRI methods usually deal separately with artifacts arising from accelerated data acquisition, involuntary physical motion, and magnetic-field inhomogeneities, leading to suboptimal end-to-end performance. This paper presents CoRRECT, a unified deep unfolding (DU) framework for qMRI consisting of a model-based end-to-end neural network, a method for motion-artifact reduction, and a self-supervised learning scheme. The network is trained to produce R2* maps whose k-space data matches the real data by also accounting for motion and field inhomogeneities. When deployed, CoRRECT only uses the k-space data without any pre-computed parameters for motion or inhomogeneity correction. Our results on experimentally collected multi-Gradient-Recalled Echo (mGRE) MRI data show that CoRRECT recovers motion and inhomogeneity artifact-free R2* maps in highly accelerated acquisition settings. This work opens the door to DU methods that can integrate physical measurement models, biophysical signal models, and learned prior models for high-quality qMRI.
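For context on what an R2* map encodes, the sketch below fits the standard mono-exponential mGRE decay S(TE) = S0 · exp(−R2* · TE) for a single voxel by log-linear least squares. This is a conventional baseline shown under simplifying assumptions (noise-free magnitude data, no motion, no field inhomogeneity), not the CoRRECT network; the echo times and parameter values are hypothetical.

```python
import numpy as np

def fit_r2star(te, signal):
    """Log-linear least-squares fit of S(TE) = S0 * exp(-R2* * TE).

    te in seconds, signal strictly positive; returns (R2* in 1/s, S0).
    """
    te = np.asarray(te, dtype=float)
    log_s = np.log(np.asarray(signal, dtype=float))
    slope, intercept = np.polyfit(te, log_s, 1)
    return -slope, float(np.exp(intercept))

# Hypothetical single-voxel data: five echoes, R2* = 25 1/s, S0 = 100.
te = np.array([4.0, 8.0, 12.0, 16.0, 20.0]) * 1e-3
signal = 100.0 * np.exp(-25.0 * te)
r2star, s0 = fit_r2star(te, signal)
```

Applying this fit voxel by voxel gives a naive R2* map; the artifacts the abstract lists are exactly what such a fit does not account for.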
[83] arXiv:2212.12322 (replaced) [pdf, html, other]: Title: Infrared Image Super-Resolution: Systematic Review, and Future Trends

Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi

Comments: This work has been submitted to Pattern Recognition for possible publication

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Image Super-Resolution (SR) is essential for a wide range of computer vision and image processing tasks. Investigating infrared (IR), or thermal, image super-resolution is a continuing concern within the development of deep learning. This survey aims to provide a comprehensive perspective of IR image super-resolution, including its applications, hardware imaging system dilemmas, and taxonomy of image processing methodologies. In addition, the datasets and evaluation metrics in IR image super-resolution tasks are also discussed. Furthermore, the deficiencies in current technologies and possible promising directions for the community to explore are highlighted. To cope with the rapid development in this field, we intend to regularly update the relevant excellent work at this https URL.
[84] arXiv:2311.08816 (replaced) [pdf, html, other]: Title: Texture and Noise Dual Adaptation for Infrared Image Super-Resolution

Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Yafei Dong, Shinichiro Omachi

Comments: Accepted by Pattern Recognition

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Recent efforts have explored leveraging visible light images to enrich texture details in infrared (IR) super-resolution. However, this direct adaptation approach often becomes a double-edged sword, as it improves texture at the cost of introducing noise and blurring artifacts. To address these challenges, we propose the Target-oriented Domain Adaptation SRGAN (DASRGAN), an innovative framework specifically engineered for robust IR super-resolution model adaptation. DASRGAN operates on the synergy of two key components: 1) Texture-Oriented Adaptation (TOA) to refine texture details meticulously, and 2) Noise-Oriented Adaptation (NOA), dedicated to minimizing noise transfer. Specifically, TOA uniquely integrates a specialized discriminator, incorporating a prior extraction branch, and employs a Sobel-guided adversarial loss to align texture distributions effectively. Concurrently, NOA utilizes a noise adversarial loss to distinctly separate the generative and Gaussian noise pattern distributions during adversarial training. Our extensive experiments confirm DASRGAN's superiority. Comparative analyses against leading methods across multiple benchmarks and upsampling factors reveal that DASRGAN sets new state-of-the-art performance standards. Code is available at this https URL.
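The Sobel operator referenced by the Sobel-guided loss is a fixed pair of 3×3 gradient filters. The sketch below computes a Sobel gradient-magnitude map with plain NumPy; how DASRGAN feeds such maps into its adversarial loss is not reproduced here, and the helper names are hypothetical.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1.0, 0.0, 1.0],
               [-2.0, 0.0, 2.0],
               [-1.0, 0.0, 1.0]])
KY = KX.T

def correlate_valid(img, kernel):
    """3x3 cross-correlation over the valid region (output shrinks by 2)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def sobel_magnitude(img):
    """Gradient magnitude sqrt(gx^2 + gy^2) of a 2-D grayscale image."""
    img = np.asarray(img, dtype=float)
    gx = correlate_valid(img, KX)
    gy = correlate_valid(img, KY)
    return np.hypot(gx, gy)

# Vertical step edge: left half 0, right half 1.
step = np.zeros((6, 6))
step[:, 3:] = 1.0
edges = sobel_magnitude(step)
```

The response is nonzero only in the two columns straddling the step, which is why such maps are a convenient proxy for texture and edge content.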
[85] arXiv:2311.11782 (replaced) [pdf, html, other]: Title: Robust Tumor Segmentation with Hyperspectral Imaging and Graph Neural Networks

Mayar Lotfy Mostafa, Anna Alperovich, Tommaso Giannantonio, Bjorn Barz, Xiaohan Zhang, Felix Holm, Nassir Navab, Felix Boehm, Carolin Schwamborn, Thomas K. Hoffmann, Patrick J. Schuler

Comments: 18 pages, 5 figures, The German Conference on Pattern Recognition (GCPR) 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Segmenting the boundary between tumor and healthy tissue during surgical cancer resection poses a significant challenge. In recent years, Hyperspectral Imaging (HSI) combined with Machine Learning (ML) has emerged as a promising solution. However, due to the extensive information contained within the spectral domain, most ML approaches primarily classify individual HSI (super-)pixels, or tiles, without taking into account their spatial context. In this paper, we propose an improved methodology that leverages the spatial context of tiles for more robust and smoother segmentation. To address the irregular shapes of tiles, we utilize Graph Neural Networks (GNNs) to propagate context information across neighboring regions. The features for each tile within the graph are extracted using a Convolutional Neural Network (CNN), which is trained simultaneously with the subsequent GNN. Moreover, we incorporate local image quality metrics into the loss function to enhance the training procedure's robustness against low-quality regions in the training images. We demonstrate the superiority of our proposed method using a clinical ex vivo dataset consisting of 51 HSI images from 30 patients. Despite the limited dataset, the GNN-based model significantly outperforms context-agnostic approaches, accurately distinguishing between healthy and tumor tissues, even in images from previously unseen patients. Furthermore, we show that our carefully designed loss function, accounting for local image quality, results in additional improvements. Our findings demonstrate that context-aware GNN algorithms can robustly find tumor demarcations on HSI images, ultimately contributing to better surgery success and patient outcome.
[86] arXiv:2312.11061 (replaced) [pdf, html, other]: Title: Stability Analysis of Compartmental and Cooperative Systems

Sondre Wiersdalen, Mike Pereira, Annika Lang, Gabor Szederkenyi, Jean Auriol, Balazs Kulcsar

Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)

The present article considers stability of the solutions to nonlinear and nonautonomous compartmental systems governed by ordinary differential equations (ODEs). In particular, it considers compartmental systems whose right-hand side can be written as a product of a matrix function and a vector function. Sufficient, and on occasion necessary, conditions on the matrix function are provided to conclude exponential stability of the null solution. The conditions involve verifying that the matrix function takes its values in a set of compartmental matrices in a certain canonical form, and are easy to check. Similar conditions are provided to establish incremental exponential stability for compartmental systems governed by cooperative systems of ODEs. The solutions to such systems satisfy a so-called ordering. Systems that are cooperative in a box are shown to be incrementally asymptotically stable if and only if every pair of initially ordered solutions converge to each other. Traffic Reaction Models, which are numerical schemes to solve conservation laws in one spatial dimension, are used to illustrate the results. Suitable conditions on the flux function of the conservation law are given such that the numerical scheme gives rise to an incrementally exponentially stable system.
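The basic membership test behind such conditions can be sketched in a few lines. The check below uses a common textbook definition of a compartmental matrix (nonnegative off-diagonal entries and nonpositive column sums); the paper's canonical form is not specified in the abstract and is not reproduced, and some references use row sums instead. The function name is hypothetical.

```python
import numpy as np

def is_compartmental(A, tol=1e-12):
    """Check nonnegative off-diagonals and nonpositive column sums.

    Assumption: column-sum convention, i.e. x' = A x with mass leaving
    compartment j recorded in column j.
    """
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        return False
    off_diag = A - np.diag(np.diag(A))
    return bool(np.all(off_diag >= -tol) and np.all(A.sum(axis=0) <= tol))

A_ok = [[-2.0, 1.0], [1.0, -3.0]]        # column sums -1 and -2
A_gain = [[-1.0, 2.0], [1.0, -1.0]]      # second column sums to +1
A_negflow = [[-1.0, -0.5], [0.5, -1.0]]  # negative off-diagonal entry
```

For a matrix-valued function, the same test would be applied pointwise over the arguments of interest.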
[87] arXiv:2403.17338 (replaced) [pdf, html, other]: Title: Reinforcement Learning-based Receding Horizon Control using Adaptive Control Barrier Functions for Safety-Critical Systems

Ehsan Sabouni, H.M. Sabbir Ahmad, Vittorio Giammarino, Christos G. Cassandras, Ioannis Ch. Paschalidis, Wenchao Li

Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Optimal control methods provide solutions to safety-critical problems but easily become intractable. Control Barrier Functions (CBFs) have emerged as a popular technique that facilitates their solution by provably guaranteeing safety, through their forward invariance property, at the expense of some performance loss. This approach involves defining a performance objective alongside CBF-based safety constraints that must always be enforced. Unfortunately, both performance and solution feasibility can be significantly impacted by two key factors: (i) the selection of the cost function and associated parameters, and (ii) the calibration of parameters within the CBF-based constraints, which capture the trade-off between performance and conservativeness. To address these challenges, we propose a Reinforcement Learning (RL)-based Receding Horizon Control (RHC) approach leveraging Model Predictive Control (MPC) with CBFs (MPC-CBF). In particular, we parameterize our controller and use bilevel optimization, where RL is used to learn the optimal parameters while MPC computes the optimal control input. We validate our method by applying it to the challenging automated merging control problem for Connected and Automated Vehicles (CAVs) at conflicting roadways. Results demonstrate improved performance and a significant reduction in the number of infeasible cases compared to traditional heuristic approaches used for tuning CBF-based controllers, showcasing the effectiveness of the proposed method.
[88] arXiv:2406.00341 (replaced) [pdf, html, other]: Title: DSCA: A Digital Subtraction Angiography Sequence Dataset and Spatio-Temporal Model for Cerebral Artery Segmentation

Jiong Zhang, Qihang Xie, Lei Mou, Dan Zhang, Da Chen, Caifeng Shan, Yitian Zhao, Ruisheng Su, Mengguo Guo

Comments: Published by TMI

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Cerebrovascular diseases (CVDs) remain a leading cause of global disability and mortality. Digital Subtraction Angiography (DSA) sequences, recognized as the gold standard for diagnosing CVDs, can clearly visualize the dynamic flow and reveal pathological conditions within the cerebrovasculature. Therefore, precise segmentation of cerebral arteries (CAs) and classification between their main trunks and branches are crucial for physicians to accurately quantify diseases. However, achieving accurate CA segmentation in DSA sequences remains a challenging task due to small vessels with low contrast, and ambiguity between vessels and residual skull structures. Moreover, the lack of publicly available datasets limits exploration in the field. In this paper, we introduce a DSA Sequence-based Cerebral Artery segmentation dataset (DSCA), a publicly accessible dataset designed specifically for pixel-level semantic segmentation of CAs. Additionally, we propose DSANet, a spatio-temporal network for CA segmentation in DSA sequences. Unlike existing DSA segmentation methods that focus only on a single frame, the proposed DSANet introduces a separate temporal encoding branch to capture dynamic vessel details across multiple frames. To enhance small vessel segmentation and improve vessel connectivity, we design a novel TemporalFormer module to capture global context and correlations among sequential frames. Furthermore, we develop a Spatio-Temporal Fusion (STF) module to effectively integrate spatial and temporal features from the encoder. Extensive experiments demonstrate that DSANet outperforms other state-of-the-art methods in CA segmentation, achieving a Dice of 0.9033.
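The Dice score quoted above measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). A minimal sketch for binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact evaluation protocol (per class, per sequence, averaging) is not stated in the abstract.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A n B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

# Toy 1-D masks: one pixel overlaps, sizes are 2 and 1.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the overlap is one pixel and the mask sizes sum to three, so the score is 2/3.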
[89] arXiv:2406.12323 (replaced) [pdf, html, other]: Title: Hybrid Beamforming Design for Near-Field ISAC with Modular XL-MIMO

Chunwei Meng, Dingyou Ma, Zhaolin Wang, Yuanwei Liu, Zhiqing Wei, Zhiyong Feng

Subjects: Signal Processing (eess.SP)

A novel modular extremely large-scale multiple-input-multiple-output (XL-MIMO) integrated sensing and communication (ISAC) framework is proposed in this paper. We consider a downlink ISAC scenario and exploit the modular array architecture to enhance the communication spectral efficiency and sensing resolution while reducing the channel modeling complexity by employing the hybrid spherical and planar wavefront model. Considering the hybrid digital-analog structure inherent to modular arrays, we formulate a joint analog-digital beamforming design problem based on the communication spectral efficiency and sensing signal-to-clutter-plus-noise ratio (SCNR). By exploring the structural similarity of the communication and sensing channels, it is proved that the optimal transmit covariance matrix lies in the subspace spanned by the subarray response vectors, yielding a closed-form solution for the optimal analog beamformer. Consequently, the joint design problem is transformed into a low-dimensional rank-constrained digital beamformer optimization. We first propose a manifold optimization method that directly optimizes the digital beamformer on the rank-constrained Stiefel manifold. Additionally, we develop a semidefinite relaxation (SDR)-based approach that relaxes the rank constraint and employs the randomization technique to obtain a near-optimal solution. Simulation results demonstrate the effectiveness of the proposed modular XL-MIMO ISAC framework and algorithms.
[90] arXiv:2408.16340 (replaced) [pdf, html, other]: Title: Learned Image Transmission with Hierarchical Variational Autoencoder

Guangyi Zhang, Hanlei Li, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise. The source code will be made available upon acceptance.
[91] arXiv:2409.17010 (replaced) [pdf, html, other]: Title: MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events

Xiaoyu Yang, Qiujia Li, Chao Zhang, Phil Woodland

Comments: This work has been submitted to the IEEE for possible publication

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

With the advances in deep learning, the performance of end-to-end (E2E) single-task models for speech and audio processing has been constantly improving. However, it is still challenging to build a general-purpose model with high performance on multiple tasks, since different speech and audio processing tasks usually require different training data, input features, or model architectures to achieve optimal performance. In this work, MT2KD, a novel two-stage multi-task learning framework, is proposed to build a general-purpose speech and audio encoder that jointly performs three fundamental tasks: automatic speech recognition (ASR), audio tagging (AT) and speaker verification (SV). In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three single-task high-performance teacher encoders into a single student encoder using the same unlabelled data. In the second stage, multi-task supervised fine-tuning is carried out by initialising the model from the first stage and training on the separate labelled data of each single task. Experiments demonstrate that the proposed multi-task training pipeline significantly outperforms a baseline model trained with multi-task learning from scratch. The final system achieves good performance on ASR, AT and SV: with less than 4% relative word-error-rate increase on ASR, only 1.9 lower mean average precision on AT and 0.23% absolute higher equal error rate on SV compared to the best-performing single-task encoders, using only 66M total model parameters.
[92] arXiv:2409.18734 (replaced) [pdf, html, other]: Title: On Adaptive Frequency Sampling for Data-driven Model Order Reduction Applied to Antenna Responses

Lucas Åkerstedt, Darwin Blanco, B. L. G. Jonsson

Comments: 10 pages, 9 figures

Subjects: Systems and Control (eess.SY); Computational Physics (physics.comp-ph)

Frequency domain sweeps of array antennas are well-known to be time-intensive, and different surrogate models have been used to improve the performance. Data-driven model order reduction algorithms, such as the Loewner framework and vector fitting, can be integrated with adaptive error estimates in an iterative algorithm to reduce the number of full-wave simulations required to accurately capture the requested frequency behavior of multiport array antennas. In this work, we propose two novel adaptive methods exploiting a block matrix function which is a key part of the Loewner framework generating system approach. The first algorithm leverages an inherent matrix parameter freedom in the block matrix function to identify frequency points with large errors, whereas the second utilizes the condition number of the block matrix function. Both methods effectively provide frequency domain error estimates, which are essential for improved performance. Numerical experiments on multiport array antenna S-parameters demonstrate the effectiveness of our proposed algorithms within the Loewner framework, where the proposed algorithms reach the smallest errors for the smallest number of frequency points chosen.
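As background on the Loewner framework, the sketch below builds a scalar Loewner matrix from frequency samples split into a left set (μ, v) and a right set (λ, w), with entries (v_i − w_j) / (μ_i − λ_j). It only illustrates the well-known fact that the numerical rank of this matrix reveals the order of the underlying rational function; the block matrix function and the adaptive error estimates of the paper are not reproduced. The transfer function `H` and the sample points are hypothetical.

```python
import numpy as np

def loewner_matrix(mu, v, lam, w):
    """Scalar Loewner matrix with entries (v_i - w_j) / (mu_i - lam_j)."""
    mu, v = np.asarray(mu), np.asarray(v)
    lam, w = np.asarray(lam), np.asarray(w)
    return (v[:, None] - w[None, :]) / (mu[:, None] - lam[None, :])

def H(s):
    """Hypothetical order-2 response standing in for a simulated quantity."""
    return 1.0 / (s + 1.0) + 2.0 / (s + 3.0)

# Interleaved sample points on the imaginary axis (left and right sets disjoint).
mu = 1j * np.array([1.0, 3.0, 5.0, 7.0])
lam = 1j * np.array([2.0, 4.0, 6.0, 8.0])
L = loewner_matrix(mu, H(mu), lam, H(lam))

# Numerical rank from the singular values, relative to the largest one.
sing = np.linalg.svd(L, compute_uv=False)
rank = int(np.sum(sing > 1e-10 * sing[0]))
```

With four samples on each side, the 4×4 Loewner matrix has only two significant singular values, matching the two poles of `H`.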
[93] arXiv:2409.18847 (replaced) [pdf, html, other]: Title: Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Annie Chu, Patrick O'Reilly, Julia Barnett, Bryan Pardo

Comments: Accepted to ICASSP 2025. Source code and audio examples: this https URL

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at this http URL.
[94] arXiv:2410.11097 (replaced) [pdf, html, other]: Title: DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at this https URL.
[95] arXiv:2410.12815 (replaced) [pdf, html, other]: Title: Expanding Over-the-Air Computation with Frequency Modulations

Marc Martinez-Gost, Ana Pérez-Neira, Miguel Ángel Lagunas

Comments: Journal paper submitted to IEEE Transactions on Communications. arXiv admin note: text overlap with arXiv:2402.15461

Subjects: Signal Processing (eess.SP)

In this study we introduce Logarithmic Frequency Shift Keying (Log-FSK), a novel frequency modulation for over-the-air computation (AirComp). Log-FSK leverages non-linear signal processing to produce AirComp in the frequency domain, that is, the maximum frequency of the received signal corresponds to the sum of the individual transmitted frequencies. The demodulation procedure relies on the inverse Discrete Cosine Transform (DCT) and the extraction of the maximum frequency component. Log-FSK enables the computation of functions beyond the sum by incorporating nomographic function representation. Furthermore, unlike existing AirComp modulations, Log-FSK allows several functions to be computed in a single transmission. We evaluate the capabilities of the scheme in additive white Gaussian noise (AWGN) and flat-fading channels. To demonstrate its practicality, we present specific applications and experimental results showcasing the effectiveness of Log-FSK AirComp within linear Wireless Sensor Networks (WSN). Our numerical results show that Log-FSK outperforms linear analog modulations in terms of MSE and power consumption.
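One elementary way a "maximum frequency equals the sum of the frequencies" property can arise is through products of cosines, since cos(a)cos(b) = ½[cos(a − b) + cos(a + b)]. The sketch below demonstrates only this identity on noise-free samples; it does not reproduce the Log-FSK modulator, the channel, or the DCT-based demodulator (an FFT is used here instead), and the function name and parameters are hypothetical.

```python
import numpy as np

def max_frequency_of_product(freqs, n_samples=64, rel_tol=1e-6):
    """Highest active frequency bin in the product of unit cosines.

    freqs are integer frequencies in Hz, sampled over exactly one second;
    their sum must stay below the Nyquist bin n_samples / 2.
    """
    t = np.arange(n_samples) / n_samples
    product = np.ones(n_samples)
    for f in freqs:
        product *= np.cos(2.0 * np.pi * f * t)
    spectrum = np.abs(np.fft.rfft(product))
    active = np.nonzero(spectrum > rel_tol * spectrum.max())[0]
    return int(active.max())

# Three hypothetical transmitters at 3, 5 and 7 Hz.
f_max = max_frequency_of_product([3, 5, 7])
```

The product contains components at 1, 5, 9 and 15 Hz, so the highest active bin equals 3 + 5 + 7.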
[96] arXiv:2411.06612 (replaced) [pdf, html, other]: Title: An exact active sensing strategy for a class of bio-inspired systems

Debojyoti Biswas, Eduardo D. Sontag, Noah J. Cowan

Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)

We consider a general class of translation-invariant systems with a specific category of output nonlinearities motivated by biological sensing. We show that no dynamic output feedback can stabilize this class of systems to an isolated equilibrium point. To overcome this fundamental limitation, we propose a simple control scheme that includes a low-amplitude periodic forcing function akin to so-called "active sensing" in biology, together with nonlinear output feedback. Our analysis shows that this approach leads to the emergence of an exponentially stable limit cycle. These findings offer a provably stable active sensing strategy and may thus help to rationalize the active sensing movements made by animals as they perform certain motor behaviors.
[97] arXiv:2412.07175 (replaced) [pdf, html, other]: Title: Robust Feature Engineering Techniques for Designing Efficient Motor Imagery-Based BCI-Systems

Syed Saim Gardezi, Soyiba Jawed, Mahnoor Khan, Muneeba Bukhari, Rizwan Ahmed Khan

Comments: 26 pages

Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

A multitude of individuals across the globe grapple with motor disabilities. Neural prosthetics utilizing Brain-Computer Interface (BCI) technology exhibit promise for improving motor rehabilitation outcomes. The intricate nature of EEG data poses a significant hurdle for current BCI systems. Recently, a qualitative repository of EEG signals tied to both upper and lower limb execution of motor and motor imagery tasks has been unveiled. Despite this, the performance of the Machine Learning (ML) models trained on this dataset was poor, and the evaluation framework seemed insufficient. To improve outcomes, robust feature engineering (signal processing) methodologies are implemented. A collection of time domain, frequency domain, and wavelet-derived features was obtained from 16-channel EEG signals, and the Maximum Relevance Minimum Redundancy (MRMR) approach was employed to identify the four most significant features. For classification, K Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB) models were implemented with these selected features, and their effectiveness was evaluated through metrics such as testing accuracy, precision, recall, and F1 Score. By leveraging SVM with a Gaussian kernel, a maximum testing accuracy of 92.50% for motor activities and 95.48% for imagery activities is achieved. These results are notably more dependable than those of the previous study, where the peak accuracy was recorded at 74.36%. This research work provides an in-depth analysis of the MI Limb EEG dataset and will help in designing and developing simple, cost-effective and reliable BCI systems for neuro-rehabilitation.
[98] arXiv:2501.00990 (replaced) [pdf, html, other]: Title: Cyber-physical Defense for Heterogeneous Multi-agent Systems Against Exponentially Unbounded Attacks on Signed Digraphs

Yichao Wang, Mohamadamin Rajabinezhad, Yi Zhang, Shan Zuo

Subjects: Systems and Control (eess.SY)

Cyber-physical systems (CPSs) are subjected to attacks on both cyber and physical spaces. In reality, the attackers could launch exponentially unbounded false data injection (EU-FDI) attacks, which are more destructive and could lead to the system's collapse or instability. Existing literature generally addresses bounded attack signals and/or bounded-first-order-derivative attack signals, which exposes the CPSs to significant threats. In contrast, this paper proposes a fully-distributed attack-resilient bi-layer defense framework to address the bipartite output containment problem for heterogeneous multi-agent systems on signed digraphs, in the presence of EU-FDI attacks on both the cyber-physical layer (CPL) and the observer layer (OL). First, we design attack-resilient dynamic compensators that utilize data communicated on the OL to estimate the convex combinations of the states and negative states of the leaders. The attack-resilient compensators address the EU-FDI attacks on the OL and guarantee the uniformly ultimately bounded (UUB) estimation of the leaders' states. Then, by using the compensators' states, fully-distributed attack-resilient controllers are designed on the CPL to further address the EU-FDI attacks on the actuators. Rigorous mathematical proof based on Lyapunov stability analysis is provided, establishing the theoretical soundness of the proposed bi-layer resilient defense framework, by preserving the UUB consensus and stability against EU-FDI attacks on both CPL and OL. Finally, a comparative case study for heterogeneous multi-agent systems validates the enhanced resilience of the proposed defense strategies.
[99] arXiv:2502.06980 (replaced) [pdf, html, other]: Title: Electromagnetic Channel Statistics for Continuous-Aperture Array (CAPA) Systems

Chongjun Ouyang, Boqun Zhao, Xingqi Zhang, Yuanwei Liu

Comments: 4 pages

Subjects: Signal Processing (eess.SP)

The channel statistics of a continuous-aperture array (CAPA)-based channel are analyzed using its continuous electromagnetic (EM) properties. The received signal-to-noise ratio (SNR) is discussed under isotropic scattering conditions. Using Landau's theorem, the eigenvalues of the autocorrelation of the EM fading channel are shown to exhibit a step-like behavior. Building on this, closed-form expressions for the probability distribution of the SNR and the average channel capacity are derived. Numerical results are provided to validate the accuracy of the derivations.
[100] arXiv:2502.06997 (replaced) [pdf, html, other]: Title: Conditional diffusion model with spatial attention and latent embedding for medical image segmentation

Behzad Hejrati, Soumyanil Banerjee, Carri Glide-Hurst, Ming Dong

Comments: 13 pages, 5 figures, 3 tables, Accepted in MICCAI 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at this https URL.
[101] arXiv:2201.13375 (replaced) [pdf, other]: Title: Structural Stability Properties of Antithetic Integral (Rein) Control with Output Inhibition

Corentin Briat, Mustafa Khammash

Comments: 72 pages, 22 figures

Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Molecular Networks (q-bio.MN)

Perfect adaptation is a well-studied biochemical homeostatic behavior lying at the core of biochemical regulation. While the concepts of homeostasis and perfect adaptation are not new, their underlying mechanisms and associated biochemical regulation motifs are not yet fully understood. Insights from control theory have unraveled the connections between perfect adaptation and integral control, a prevalent engineering control strategy. In particular, the recently introduced Antithetic Integral Controller (AIC) has been shown to successfully ensure perfect adaptation properties for the network it is connected to. The complementary structure of the two molecules the AIC relies upon allows for a versatile way to control biochemical networks, a property which gave rise to an important body of literature pertaining to mathematically elucidating its properties, generalizing its structure, and developing experimental methods for its implementation. The Antithetic Integral Rein Controller (AIRC), an extension of the AIC in which both controller molecules are used for control, holds much promise as it supposedly overcomes certain limitations of the AIC. We focus here on an AIRC structure with output inhibition that combines two AICs in a single structure. We demonstrate that this controller ensures structural stability and structural perfect adaptation properties for the controlled network under mild assumptions, meaning that these properties are independent of the parameters of the network and the controller. The results are very general and valid for the class of unimolecular mass-action networks as well as more general networks, including cooperative and Michaelis-Menten networks. We also provide a systematic and accessible computational way for verifying whether a given network satisfies the conditions under which the structural property would hold.
[102] arXiv:2312.04398 (replaced) [pdf, other]: Title: Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning

Yongqi Dong, Xingmin Lu, Ruohan Li, Wei Song, Bart van Arem, Haneen Farah

Comments: 26 pages, 7 figures, accepted by the 103rd Transportation Research Board (TRB) Annual Meeting, under review by Transportation Research Record: Journal of the Transportation Research Board

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.
[103] arXiv:2410.16505 (replaced) [pdf, html, other]: Title: Do Audio-Language Models Understand Linguistic Variations?

Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha

Comments: Accepted to NAACL 2025

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi-view contrastive learning objective, in which paraphrases are treated as different views of the same audio scene and used for training. Our proposed approach improves the text-to-audio retrieval performance of CLAP by 0.8%-13% across benchmarks and enhances robustness to linguistic variation.
[104] arXiv:2410.23773 (replaced) [pdf, other]: Title: Towards Generative Ray Path Sampling for Faster Point-to-Point Ray Tracing

Jérome Eertmans, Nicola Di Cicco, Claude Oestges, Laurent Jacques, Enrico M. Vittuci, Vittorio Degli-Esposti

Comments: 6 pages, 6 figures, accepted at IEEE ICMLCN 2025

Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Radio propagation modeling is essential in telecommunication research, as radio channels result from complex interactions with environmental objects. Recently, Machine Learning has been attracting attention as a potential alternative to computationally demanding tools, like Ray Tracing, which can model these interactions in detail. However, existing Machine Learning approaches often attempt to directly learn specific channel characteristics, such as the coverage map, making them highly specific to the frequency and material properties and unable to fully capture the underlying propagation mechanisms. Hence, Ray Tracing, particularly the Point-to-Point variant, remains popular for accurately identifying all possible paths between transmitter and receiver nodes. Still, path identification is computationally intensive because the number of paths to be tested grows exponentially while only a small fraction is valid. In this paper, we propose a Machine Learning-aided Ray Tracing approach to efficiently sample potential ray paths, significantly reducing the computational load while maintaining high accuracy. Our model dynamically learns to prioritize potentially valid paths among all possible paths and scales linearly with scene complexity. Unlike recent alternatives, our approach is invariant to translation, scaling, or rotation of the geometry, and avoids dependency on specific environment characteristics.
[105] arXiv:2411.00570 (replaced) [pdf, other]: Title: Incentive-based Platoon Formation: Optimizing the Personal Benefit for Drivers

Julian Heinovski, Doğanalp Ergenç, Kirsten Thommes, Falko Dressler

Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Platooning or cooperative adaptive cruise control (CACC) has been investigated for decades, but debate about its lasting impact is still ongoing. While the benefits of platooning and the formation of platoons are well understood for trucks, they are less clear for passenger cars, which have a higher heterogeneity in trips and drivers' preferences. Most importantly, it remains unclear how to form platoons of passenger cars in order to optimize the personal benefit for the individual driver. To this end, in this paper, we propose a novel platoon formation algorithm that optimizes the personal benefit for drivers of individual passenger cars. For computing vehicle-to-platoon assignments, the algorithm utilizes a new metric that we propose to evaluate the personal benefits of various driving systems, including platooning. By combining fuel and travel time costs into a single monetary value, drivers can estimate overall trip costs according to a personal monetary value for time spent. This provides an intuitive way for drivers to understand and compare the benefits of driving systems like human driving, adaptive cruise control (ACC), and, of course, platooning. Unlike previous similarity-based methods, our proposed algorithm forms platoons only when beneficial for the driver, rather than solely for platooning. We demonstrate the new metric for the total trip cost in a numerical analysis and explain its interpretation. Results of a large-scale simulation study demonstrate that our proposed platoon formation algorithm outperforms normal ACC as well as previous similarity-based platooning approaches by balancing fuel savings and travel time, independent of traffic and drivers' time cost.
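The idea of combining fuel and travel time costs into a single monetary value can be illustrated with a minimal sketch. This is not the authors' metric: the linear form of the cost, the names `TripEstimate`, `total_trip_cost`, and `best_option`, and all numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TripEstimate:
    """Hypothetical per-trip estimate for one driving system."""
    fuel_liters: float    # estimated fuel consumed over the trip
    travel_time_h: float  # estimated travel time in hours


def total_trip_cost(trip: TripEstimate,
                    fuel_price_per_liter: float,
                    time_value_per_hour: float) -> float:
    # Assumed linear model: monetary fuel cost plus the driver's
    # personal monetary value for the time spent travelling.
    fuel_cost = trip.fuel_liters * fuel_price_per_liter
    time_cost = trip.travel_time_h * time_value_per_hour
    return fuel_cost + time_cost


def best_option(options: dict[str, TripEstimate],
                fuel_price_per_liter: float,
                time_value_per_hour: float) -> str:
    # Pick the driving system with the lowest total trip cost.
    return min(
        options,
        key=lambda name: total_trip_cost(
            options[name], fuel_price_per_liter, time_value_per_hour
        ),
    )


# Illustrative (made-up) estimates for three driving systems.
options = {
    "human": TripEstimate(fuel_liters=6.0, travel_time_h=1.0),
    "acc": TripEstimate(fuel_liters=5.5, travel_time_h=1.05),
    "platooning": TripEstimate(fuel_liters=4.8, travel_time_h=1.2),
}

# A driver with a low time value is steered towards platooning,
# one with a high time value towards the fastest option.
print(best_option(options, fuel_price_per_liter=2.0, time_value_per_hour=5.0))
print(best_option(options, fuel_price_per_liter=2.0, time_value_per_hour=40.0))
```

Under these assumed numbers, the preferred driving system changes with the driver's personal value of time, which is the kind of trade-off between fuel savings and travel time that the abstract describes.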
[106] arXiv:2502.13433 (replaced) [pdf, html, other]: Title: MATS: An Audio Language Model under Text-only Supervision

Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

Comments: 19 pages, 11 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, the training of LALMs demands a large corpus of audio-language pairs, which incurs substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the Strongly-related noisy text with audio (Santa) mechanism. Santa maps audio embeddings into the CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs.
[107] arXiv:2502.13713 (replaced) [pdf, html, other]: Title: TALKPLAY: Multimodal Music Recommendation with Large Language Models

Seungheon Doh, Keunwoo Choi, Juhan Nam

Subjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

We present TalkPlay, a multimodal music recommendation system that reformulates the recommendation task as large language model token generation. TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities: audio, lyrics, metadata, semantic tags, and playlist co-occurrence. Using these rich representations, the model learns to generate recommendations through next-token prediction on music recommendation conversations, which requires learning the associations between natural language queries and responses, as well as music items. In other words, the formulation transforms music recommendation into a natural language understanding task, where the model's ability to predict conversation tokens directly optimizes query-item relevance. Our approach eliminates traditional recommendation-dialogue pipeline complexity, enabling end-to-end learning of query-aware music recommendations. In the experiment, TalkPlay is successfully trained and outperforms baseline methods in various aspects, demonstrating strong context understanding as a conversational music recommender.
[108] arXiv:2502.13777 (replaced) [pdf, html, other]: Title: Herglotz-NET: Implicit Neural Representation of Spherical Data with Harmonic Positional Encoding

Théo Hanon, Nicolas Mil-Homens Cavaco, John Kiely, Laurent Jacques

Comments: Keywords: Herglotz, spherical harmonics, spectral analysis, implicit neural representation. Remarks: 4 pages + 1 reference page, 4 figures (submitted to SAMPTA2025)

Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Representing and processing data in spherical domains presents unique challenges, primarily due to the curvature of the domain, which complicates the application of classical Euclidean techniques. Implicit neural representations (INRs) have emerged as a promising alternative for high-fidelity data representation; however, to effectively handle spherical domains, these methods must be adapted to the inherent geometry of the sphere to maintain both accuracy and stability. In this context, we propose Herglotz-NET (HNET), a novel INR architecture that employs a harmonic positional encoding based on complex Herglotz mappings. This encoding yields a well-posed representation on the sphere with interpretable and robust spectral properties. Moreover, we present a unified expressivity analysis showing that any spherical-based INR satisfying a mild condition exhibits a predictable spectral expansion that scales with network depth. Our results establish HNET as a scalable and flexible framework for accurate modeling of spherical data.

New submissions (showing 48 of 48 entries)

Cross submissions (showing 30 of 30 entries)

Replacement submissions (showing 30 of 30 entries)