Computer Science
Showing new listings for Friday, 25 April 2025
- [1] arXiv:2504.16933 [pdf, html, other]
Title: Predictive Process Monitoring: a comparison survey between different type of event logs
Subjects: Software Engineering (cs.SE)
The application of Predictive Process Monitoring (PPM) techniques is becoming increasingly widespread due to their capacity to provide organizations with accurate predictions regarding the future behavior of business processes, thereby facilitating more informed decision-making. A plethora of solutions have been proposed in the literature employing these techniques, yet they differ from one another due to a number of factors. However, in light of the growing recognition of the value of object-centric event logs, including in the context of PPM, this survey focuses on the differences among PPM techniques employed with different event logs, namely traditional event logs and object-centric event logs. In addition, the reviewed methods are classified according to the prediction task they address and the specific methodologies they employ.
- [2] arXiv:2504.16934 [pdf, html, other]
Title: Finding Important Stack Frames in Large Systems
Comments: 2 pages, 1 figure
Subjects: Software Engineering (cs.SE)
In this work, we developed, integrated, and tested a feature that automatically highlights potentially important frames in stack traces. The feature was implemented in the internal bug-processing tool at JetBrains that processes tens of millions of stack traces. We surveyed 18 developers at JetBrains who provided valuable feedback on the idea and the implementation.
- [3] arXiv:2504.16936 [pdf, html, other]
Title: Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
- [4] arXiv:2504.16937 [pdf, html, other]
Title: A Framework for the Assurance of AI-Enabled Systems
Authors: Ariel S. Kapusta (1), David Jin (2), Peter M. Teague (2), Robert A. Houston (2), Jonathan B. Elliott (2), Grace Y. Park (2), Shelby S. Holdren (3) ((1) The MITRE Corporation, (2) Office of the Chief Digital and Artificial Intelligence Officer, (3) Johns Hopkins University Applied Physics Laboratory)
Comments: 12 pages, 2 figures, published in conference proceedings of SPIE Defense and Commercial Sensing conference on Assurance and Security for AI-enabled Systems 2025
Subjects: Artificial Intelligence (cs.AI)
The United States Department of Defense (DOD) looks to accelerate the development and deployment of AI capabilities across a wide spectrum of defense applications to maintain strategic advantages. However, many common features of AI algorithms that make them powerful, such as capacity for learning, large-scale data ingestion, and problem-solving, raise new technical, security, and ethical challenges. These challenges may hinder adoption due to uncertainty in development, testing, assurance, processes, and requirements. Trustworthiness through assurance is essential to achieve the expected value from AI.
This paper proposes a claims-based framework for risk management and assurance of AI systems that addresses the competing needs for faster deployment, successful adoption, and rigorous evaluation. The framework helps programs across all acquisition pathways provide grounds for sufficient confidence that an AI-enabled system (AIES) meets its intended mission goals without introducing unacceptable risks throughout its lifecycle. The paper's contributions are a framework process for AI assurance, a set of relevant definitions to enable constructive conversations on the topic of AI assurance, and a discussion of important considerations in AI assurance. The framework aims to provide the DOD a robust yet efficient mechanism for swiftly fielding effective AI capabilities without overlooking critical risks or undermining stakeholder trust.
- [5] arXiv:2504.16938 [pdf, html, other]
Title: Rational Inference in Formal Concept Analysis
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Defeasible conditionals are a form of non-monotonic inference which enable the expression of statements like "if $\phi$ then normally $\psi$". The KLM framework defines a semantics for the propositional case of defeasible conditionals by construction of a preference ordering over possible worlds. The pattern of reasoning induced by these semantics is characterised by consequence relations satisfying certain desirable properties of non-monotonic reasoning. In FCA, implications are used to describe dependencies between attributes. However, these implications are unsuitable to reason with erroneous data or data prone to exceptions. Until recently, the topic of non-monotonic inference in FCA has remained largely uninvestigated. In this paper, we provide a construction of the KLM framework for defeasible reasoning in FCA and show that this construction remains faithful to the principle of non-monotonic inference described in the original framework. We present an additional argument that, while remaining consistent with the original ideas around non-monotonic reasoning, the defeasible reasoning we propose in FCA offers a more contextual view on inference, providing the ability for more relevant conclusions to be drawn when compared to the propositional case.
- [6] arXiv:2504.16939 [pdf, html, other]
Title: A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Authors: Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: this https URL.
- [7] arXiv:2504.16942 [pdf, html, other]
Title: S2Vec: Self-Supervised Geospatial Embeddings
Authors: Shushman Choudhury, Elad Aharoni, Chandrakumari Suvarna, Iveel Tsogsuren, Abdul Rahman Kreidieh, Chun-Ta Lu, Neha Arora
Comments: To be submitted to ACM Transactions on Spatial Algorithms and Systems
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on three large-scale socioeconomic prediction tasks, showing its competitive performance against state-of-the-art image-based embeddings. We also explore the benefits of combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our results highlight how S2Vec can learn effective general-purpose geospatial representations and how it can complement other data modalities in geospatial artificial intelligence.
- [8] arXiv:2504.16943 [pdf, html, other]
Title: Flexibility of German gas-fired generation: evidence from clustering empirical operation
Comments: 29 pages, 6 figures, 6 tables
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Key inputs to energy models are assumptions about the flexibility of power generation units, i.e., how quickly and how often they can start up. These assumptions are usually calibrated on the technical characteristics of the units, such as installed capacity or technology type. However, even if power generation units technically can dispatch flexibly, service obligations and market incentives may constrain their operation. Here, we cluster over 60% of German national gas generation (generation units of 100 MWp or above) based on their empirical flexibility. We process the hourly dispatch of sample units between 2019 and 2023 using a novel deep learning approach that transforms time series into easy-to-cluster representations. We identify two clusters of peaker units and two clusters of non-peaker units, whose different empirical flexibility is quantified by cluster-level ramp rates. Non-peaker units, around half of the sample, are empirically less flexible than peakers and account for more than 83% of sample must-run generation. Regulatory changes addressing the low market responsiveness of non-peakers are needed to unlock their flexibility.
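To make the notion of empirical flexibility concrete, the following minimal sketch computes hour-to-hour ramps and start-up counts from an hourly dispatch series. The function names, the toy series, and the definition of a start-up as a transition from zero to positive output are illustrative assumptions; the abstract does not specify how the cluster-level ramp rates are calculated.

```python
def hourly_ramps(dispatch_mw):
    """Return the hour-to-hour change in output (MW per hour)."""
    return [b - a for a, b in zip(dispatch_mw, dispatch_mw[1:])]


def count_startups(dispatch_mw, off_threshold_mw=0.0):
    """Count transitions from off (<= threshold) to on (> threshold).

    Treating "output above a threshold" as "on" is an assumption made
    for this sketch, not a definition taken from the paper.
    """
    return sum(
        1
        for a, b in zip(dispatch_mw, dispatch_mw[1:])
        if a <= off_threshold_mw < b
    )


def max_ramp_up(dispatch_mw):
    """Largest hourly increase in output, or 0.0 for very short series."""
    return max(hourly_ramps(dispatch_mw), default=0.0)


# Hypothetical hourly dispatch of one unit, in MW.
toy_dispatch = [0, 0, 120, 200, 200, 0, 0, 150]
ramps = hourly_ramps(toy_dispatch)
startups = count_startups(toy_dispatch)
```

A unit that starts often and shows large hourly ramps would look like a peaker under this simple view; a unit running at near-constant output would not.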
- [9] arXiv:2504.16944 [pdf, html, other]
Title: Burning some myths on privacy properties of social networks against active attacks
Subjects: Social and Information Networks (cs.SI); Combinatorics (math.CO)
This work presents arguments aimed at dismantling the widespread idea that social networks completely lack privacy properties. We consider the so-called active attacks on the privacy of social networks and the corresponding $(k,\ell)$-anonymity measure, which is used to quantify the privacy a social network offers against active attacks. To this end, we make use of the graph-theoretical concept of $k$-metric antidimensional graphs, for which the case $k=1$ represents those graphs achieving the worst privacy scenario under the $(k,\ell)$-anonymity measure.
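The idea behind these measures can be sketched in a few lines. The sketch assumes the definition commonly used in the literature on metric antidimension: a vertex set $S$ (the attacker nodes) is $k$-antiresolving when $k$ is the largest integer such that every vertex outside $S$ shares its vector of distances to $S$ with at least $k-1$ other vertices outside $S$. The function names and the toy graph are hypothetical, and the graph is assumed to be connected with $S$ a proper subset of the vertices.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def antiresolving_k(adj, attacker_nodes):
    """Size of the smallest group of non-attacker vertices that share
    the same distance vector to the attacker nodes."""
    ordered = sorted(attacker_nodes)
    dist = {s: bfs_distances(adj, s) for s in ordered}
    groups = Counter(
        tuple(dist[s].get(v) for s in ordered)
        for v in adj
        if v not in attacker_nodes
    )
    return min(groups.values())


# Toy path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
k_middle = antiresolving_k(path, {1})  # both endpoints look alike
k_end = antiresolving_k(path, {0})     # every other vertex is unique
```

A value of 1 means that some vertex can be singled out purely by its distances to the attacker nodes, which is the worst case for privacy.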
As a product of our investigation, we present a large number of computational results indicating that social networks might not be as insecure as one often thinks. In particular, we carry out a large number of experiments on random graphs, which show that the number of $1$-metric antidimensional graphs is indeed ridiculously small relative to the total number of graphs that can be considered. Moreover, we examine several real networks to check whether they are $1$-metric antidimensional, and find that none of them is. Along the way, we present some theoretical studies on the mathematical properties of $k$-metric antidimensional graphs for any suitable $k\ge 1$. In addition, we describe some operations on $1$-metric antidimensional graphs that embed them into larger graphs that are not $1$-metric antidimensional, in order to obscure their privacy properties against active attacks.
- [10] arXiv:2504.16946 [pdf, html, other]
Title: MobileCity: An Efficient Framework for Large-Scale Urban Behavior Simulation
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Generative agents offer promising capabilities for simulating realistic urban behaviors. However, existing methods oversimplify transportation choices in modern cities, and require prohibitive computational resources for large-scale population simulation. To address these limitations, we first present a virtual city that features multiple functional buildings and transportation modes. Then, we conduct extensive surveys to model behavioral choices and mobility preferences among population groups. Building on these insights, we introduce a simulation framework that captures the complexity of urban mobility while remaining scalable, enabling the simulation of over 4,000 agents. To assess the realism of the generated behaviors, we perform a series of micro and macro-level analyses. Beyond mere performance comparison, we explore insightful experiments, such as predicting crowd density from movement patterns and identifying trends in vehicle preferences across agent demographics.
- [11] arXiv:2504.16947 [pdf, html, other]
Title: SCRAG: Social Computing-Based Retrieval Augmented Generation for Community Response Forecasting in Social Media Environments
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
This paper introduces SCRAG, a prediction framework inspired by social computing, designed to forecast community responses to real or hypothetical social media posts. SCRAG can be used by public relations specialists (e.g., to craft messaging in ways that avoid unintended misinterpretations) or public figures and influencers (e.g., to anticipate social responses), among other applications related to public sentiment prediction, crisis management, and social what-if analysis. While large language models (LLMs) have achieved remarkable success in generating coherent and contextually rich text, their reliance on static training data and susceptibility to hallucinations limit their effectiveness at response forecasting in dynamic social media environments. SCRAG overcomes these challenges by integrating LLMs with a Retrieval-Augmented Generation (RAG) technique rooted in social computing. Specifically, our framework retrieves (i) historical responses from the target community to capture their ideological, semantic, and emotional makeup, and (ii) external knowledge from sources such as news articles to inject time-sensitive context. This information is then jointly used to forecast the responses of the target community to new posts or narratives. Extensive experiments across six scenarios on the X platform (formerly Twitter), tested with various embedding models and LLMs, demonstrate over 10% improvements on average in key evaluation metrics. A concrete example further shows its effectiveness in capturing diverse ideologies and nuances. Our work provides a social computing tool for applications where accurate and concrete insights into community responses are crucial.
- [12] arXiv:2504.16948 [pdf, html, other]
Title: Intrinsic Barriers to Explaining Deep Foundation Models
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Deep Foundation Models (DFMs) offer unprecedented capabilities, but their increasing complexity presents profound challenges to understanding their internal workings, a critical need for ensuring trust, safety, and accountability. As we grapple with explaining these systems, a fundamental question emerges: Are the difficulties we face merely temporary hurdles, awaiting more sophisticated analytical techniques, or do they stem from intrinsic barriers deeply rooted in the nature of these large-scale models themselves? This paper delves into this critical question by examining the fundamental characteristics of DFMs and scrutinizing the limitations encountered by current explainability methods when confronted with this inherent challenge. We probe the feasibility of achieving satisfactory explanations and consider the implications for how we must approach the verification and governance of these powerful technologies.
- [13] arXiv:2504.16956 [pdf, html, other]
Title: Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
- [14] arXiv:2504.16960 [pdf, html, other]
Title: A Coding-Enhanced Jamming Approach for Secure Semantic Communication over Wiretap Channels
Subjects: Information Theory (cs.IT); Image and Video Processing (eess.IV)
As semantic communication (SemCom) gains increasing attention as a novel communication paradigm, ensuring the security of transmitted semantic information over open wireless channels becomes crucial. Existing secure SemCom solutions often lack explicit control over security. To address this, we propose a coding-enhanced jamming approach for secure SemCom over wiretap channels. This approach integrates deep joint source and channel coding (DeepJSCC) with neural network-based digital modulation, enabling controlled jamming through two-layer superposition coding. The outer constellation sequence encodes the source image, while the inner constellation sequence, derived from a secret image, acts as the jamming signal. By minimizing the mutual information between the outer and inner constellation sequences, the jamming effect is enhanced. The jamming signal is superposed on the outer constellation sequence, preventing the eavesdropper from recovering the source image. The power allocation coefficient (PAC) in the superposition coding can be adjusted to control system security. Experiments show that our approach matches existing methods in security while significantly improving reconstruction performance across varying channel signal-to-noise ratios (SNRs) and compression ratios.
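As a rough illustration of how a power allocation coefficient can steer two-layer superposition coding, the sketch below mixes an outer and an inner constellation sequence symbol by symbol. The function name, the convention that the coefficient weights the outer (source) layer, and the toy symbols are assumptions made here; the abstract does not state the exact mixing rule used in the paper.

```python
import math


def superpose(outer_symbols, inner_symbols, pac):
    """Mix two constellation sequences with power split pac : (1 - pac).

    pac is assumed to be the share of transmit power given to the outer
    (source) layer; the remainder goes to the inner (jamming) layer.
    """
    if not 0.0 <= pac <= 1.0:
        raise ValueError("pac must lie in [0, 1]")
    a = math.sqrt(pac)
    b = math.sqrt(1.0 - pac)
    return [a * s + b * j for s, j in zip(outer_symbols, inner_symbols)]


# Hypothetical unit-power symbols for the two layers.
outer = [1 + 0j, -1 + 0j]
inner = [0 + 1j, 0 - 1j]
mixed = superpose(outer, inner, pac=0.25)
```

Under this convention, lowering the coefficient gives more power to the jamming layer, which is one way the security level could be tuned.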
- [15] arXiv:2504.16961 [pdf, html, other]
Title: A Novel Graph Transformer Framework for Gene Regulatory Network Inference
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Genomics (q-bio.GN); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
The inference of gene regulatory networks (GRNs) is a foundational stride towards deciphering the fundamentals of complex biological systems. Inferring a possible regulatory link between two genes can be formulated as a link prediction problem. Inference of GRNs from gene coexpression profiling data may not always reflect true biological interactions, as it is susceptible to noise and can misrepresent true biological regulatory relationships. Most GRN inference methods face several challenges in the network reconstruction phase. Therefore, it is important to encode gene expression values and to leverage the prior knowledge gained from the available inferred network structures and the positional information of the input network nodes in order to infer a better and more confident GRN reconstruction. In this paper, we explore the integration of multiple inferred networks to enhance the inference of GRNs. First, we employ autoencoder embeddings to capture gene expression patterns directly from raw data, preserving intricate biological signals. Then, we embed the prior knowledge from GRN structures by transforming them into a text-like representation using random walks, which are then encoded with a masked language model, BERT, to generate global embeddings for each gene across all networks. Additionally, we embed the positional encodings of the input gene networks to better identify the position of each unique gene within the graph. These embeddings are integrated into a graph transformer-based model, termed GT-GRN, for GRN inference. The GT-GRN model effectively utilizes the topological structure of the ground truth network while incorporating the enriched encoded information. Experimental results demonstrate that GT-GRN significantly outperforms existing GRN inference methods, achieving superior accuracy and highlighting the robustness of our approach.
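The step of turning a network into a text-like representation with random walks can be sketched as follows. This is a minimal illustration under assumptions: uniform random walks over a directed adjacency list, gene names as tokens, and a hypothetical toy network. The walk strategy and parameters used for GT-GRN are not given in the abstract.

```python
import random


def random_walk_sentences(adj, walk_length, walks_per_node, seed=0):
    """Generate space-separated node sequences ("sentences") by uniform
    random walks; each sentence could then be fed to a language model."""
    rng = random.Random(seed)
    sentences = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            # Stop early at nodes without outgoing edges.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            sentences.append(" ".join(walk))
    return sentences


# Hypothetical regulatory network: regulator -> list of targets.
toy_grn = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneC"],
    "geneC": [],
}
corpus = random_walk_sentences(toy_grn, walk_length=4, walks_per_node=2)
```

Each sentence lists genes in the order the walk visits them, so genes that are close in the network tend to co-occur, which is what a masked language model can then pick up on.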
- [16] arXiv:2504.16964 [pdf, other]
Title: Social sustainability through engagement in a training context with tools such as the Native Podcast and Facebook social network
Authors: Danielle Mbambe Bebey (DICEN-IDF)
Comments: in French language
Journal-ref: Humanisme numérique et durabilité sociale, EUTIC; Maison des Sciences Humaines Bordeaux (MSHBORDEAUX), Oct 2023, Bordeaux & Online, France
Subjects: Computers and Society (cs.CY); Multimedia (cs.MM); Social and Information Networks (cs.SI)
The social dimension of sustainability seems to have been a notion rarely addressed in the literature (Dubois et al., 2001) until the early 2000s. The EUTIC 2023 symposium provides an opportunity to take up this topical issue. To this end, we are presenting an engagement process that is part of a sustainable development dynamic, based on digital tools inspired by everyday life, for applications in the context of training, with a view to lifelong learning. Our work, which stems from the information and communication sciences, is rooted in a multi-disciplinary approach that we believe can be echoed in a variety of disciplines, but which it is interesting to challenge, hence the purpose of this contribution.
- [17] arXiv:2504.16966 [pdf, html, other]
Title: Structuring Competency-Based Courses Through Skill Trees
Comments: Submitted for publication to Koli Calling '25
Subjects: Computers and Society (cs.CY)
Computer science education has seen two important trends. One has been a shift from raw theory towards skills: competency-based teaching. Another has been increasing student numbers, resulting in more automation in teaching. When automating education, it is crucial to properly structure courses, both to manage digitalized educational resources and to facilitate automated coaching algorithms. Existing structuring methodologies focus on theory rather than skills, and are incapable of modeling the dependency links between skills. Because of this, a new didactic framework is needed.
This paper presents a new method of structuring educational content around skills, where a skill is something that a student is expected to be able to do. It defines Skill Trees that show dependencies between skills, and subsequently couples these to Concept Trees that contain intuitive ideas/notional machines. Due to the algorithmic nature of computer science, this step-wise approach is especially well-suited to this field of education. In addition to formal definitions of Skill Trees and Concept Trees, guidelines are given on how to design them and how to plan a course using them.
The Skill Trees framework has been applied to improve the structure of a university database course. Student interviews indicated reduced confusion/stress and less study time required for students to meet their desired skill level.
- [18] arXiv:2504.16968 [pdf, html, other]
Title: Backslash: Rate Constrained Optimized Training of Large Language Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across various architectures and tasks demonstrate that Backslash can reduce memory usage by 60% to 90% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.
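As a rough illustration of what rate-distortion optimization means at training time, a generic Lagrangian objective can be written as below. The symbols are assumptions introduced here for illustration ($\theta$ for the model parameters, $D$ for the task loss, $R$ for an estimate of the bits needed to encode the parameters, $\lambda$ for the Lagrange multiplier); the abstract does not give the exact objective used by Backslash.

```latex
\min_{\theta} \; J(\theta) = D(\theta) + \lambda \, R(\theta), \qquad \lambda \ge 0
```

In this generic form, a larger $\lambda$ puts more weight on the rate term and trades accuracy for a smaller model, while $\lambda = 0$ recovers ordinary training on the task loss alone.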
- [19] arXiv:2504.16969 [pdf, html, other]
Title: Engineering the Law-Machine Learning Translation Problem: Developing Legally Aligned Models
Comments: 16 pages, 1 figure
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Organizations developing machine learning-based (ML) technologies face the complex challenge of achieving high predictive performance while respecting the law. This intersection between ML and the law creates new complexities. As ML model behavior is inferred from training data, legal obligations cannot be operationalized in source code directly. Rather, legal obligations require "indirect" operationalization. However, choosing context-appropriate operationalizations presents two compounding challenges: (1) laws often permit multiple valid operationalizations for a given legal obligation, each with varying degrees of legal adequacy; and (2) each operationalization creates unpredictable trade-offs among the different legal obligations and with predictive performance. Evaluating these trade-offs requires metrics (or heuristics), which are in turn difficult to validate against legal obligations. Current methodologies fail to fully address these interwoven challenges as they either focus on legal compliance for traditional software or on ML model development without adequately considering legal complexities. In response, we introduce a five-stage interdisciplinary framework that integrates legal and ML-technical analysis during ML model development. This framework facilitates designing ML models in a legally aligned way and identifying high-performing models that are legally justifiable. Legal reasoning guides choices for operationalizations and evaluation metrics, while ML experts ensure technical feasibility, performance optimization and an accurate interpretation of metric values. This framework bridges the gap between more conceptual analysis of law and ML models' need for deterministic specifications. We illustrate its application using a case study in the context of anti-money laundering.
- [20] arXiv:2504.16970 [pdf, html, other]
Title: STFM: A Spatio-Temporal Information Fusion Model Based on Phase Space Reconstruction for Sea Surface Temperature Prediction
Comments: 19 pages, 14 figures
Subjects: Machine Learning (cs.LG)
The sea surface temperature (SST), a key environmental parameter, is crucial to optimizing production planning, making its accurate prediction a vital research topic. However, the inherent nonlinearity of the marine dynamic system presents significant challenges. Current forecasting methods mainly include physics-based numerical simulations and data-driven machine learning approaches. The former, while describing SST evolution through differential equations, suffers from high computational complexity and limited applicability, whereas the latter, despite its computational benefits, requires large datasets and faces interpretability challenges. This study presents a prediction framework based solely on data-driven techniques. Using phase space reconstruction, we construct initial-delay attractor pairs with a mathematical homeomorphism and design a Spatio-Temporal Fusion Mapping (STFM) to uncover their intrinsic connections. Unlike conventional models, our method captures SST dynamics efficiently through phase space reconstruction and achieves high prediction accuracy with minimal training data in comparative tests.
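Phase space reconstruction is commonly done with a time-delay embedding, which the sketch below illustrates on a toy series. The function name, the embedding dimension, the delay, and the sample values are illustrative assumptions; how the paper builds its initial-delay attractor pairs is not described in the abstract.

```python
def delay_embed(series, dim, tau):
    """Build delay vectors (x[i], x[i + tau], ..., x[i + (dim - 1) * tau]).

    Returns an empty list when the series is too short for one vector.
    """
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(series) - (dim - 1) * tau
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(max(n_vectors, 0))
    ]


# Hypothetical daily SST readings in degrees Celsius.
sst = [18.2, 18.4, 18.9, 19.1, 19.0, 18.7]
vectors = delay_embed(sst, dim=3, tau=1)
```

Each delay vector is one point of the reconstructed trajectory; a predictor can then be trained to map a point to a later value of the series.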
- [21] arXiv:2504.16972 [pdf, other]
Title: Unsupervised Time-Series Signal Analysis with Autoencoders and Vision Transformers: A Review of Architectures and Applications
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers for unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.
- [22] arXiv:2504.16974 [pdf, html, other]
Title: Seeing The Words: Evaluating AI-generated Biblical Art
Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Recent years have witnessed a significant number of Artificial Intelligence (AI) tools that can generate images from text. This raises the question of whether AI can generate accurate images using text from the Bible with respect to the corresponding biblical contexts and backgrounds. Despite some existing attempts at a small scale, little work has been done to systematically evaluate these generated images. In this work, we provide a large dataset of over 7K images using biblical text as prompts. These images were evaluated with multiple neural network-based tools on various aspects. We provide an assessment of accuracy and some analysis from the perspective of religion and aesthetics. Finally, we discuss the use of the generated images and reflect on the performance of the AI generators.
- [23] arXiv:2504.16977 [pdf, html, other]
-
Title: Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Tokenization is a critical component of Natural Language Processing (NLP), especially for low-resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low-resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and character-level tokenization strategies using IndicBERT for NER tasks in low-resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low-resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties (tokenization efficiency, out-of-vocabulary (OOV) rates, and morphological preservation) and extrinsic downstream performance, including fine-tuning and zero-shot cross-lingual transfer.
Our experiments show that SentencePiece consistently outperforms BPE for NER in low-resource Indic languages, particularly in zero-shot cross-lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization, it generalizes poorly: it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece better preserves linguistic structure, which benefits extremely low-resource and morphologically rich Indic languages such as Santali and Manipuri with superior entity recognition, and it generalizes well across scripts, as with Sindhi, which is written in the Arabic script. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low-resource Indic NLP applications.
- [24] arXiv:2504.16980 [pdf, other]
-
Title: Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter
Subjects: Machine Learning (cs.LG)
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens) generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer inference away from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.
- [25] arXiv:2504.17004 [pdf, html, other]
-
Title: (Im)possibility of Automated Hallucination Detection in Large Language Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations.
First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language.
Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections.
These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.
- [26] arXiv:2504.17006 [pdf, html, other]
-
Title: A Systematic Approach to Design Real-World Human-in-the-Loop Deep Reinforcement Learning: Salient Features, Challenges and Trade-offs
Jalal Arabneydi, Saiful Islam, Srijita Das, Sai Krishna Gottipati, William Duguay, Cloderic Mars, Matthew E. Taylor, Matthew Guzdial, Antoine Fagette, Younes Zerouali
Comments: This is a result of the collaboration by JACOBB, AMII (Alberta Machine Intelligence Institute), Thales and AI Redefined (AIR) in 2021-2023
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
With the growing popularity of deep reinforcement learning (DRL), the human-in-the-loop (HITL) approach has the potential to revolutionize the way we approach decision-making problems and create new opportunities for human-AI collaboration. In this article, we introduce a novel multi-layered hierarchical HITL DRL algorithm that comprises three types of learning: self learning, imitation learning and transfer learning. In addition, we consider three forms of human input: reward, action and demonstration. Furthermore, we discuss the main challenges, trade-offs and advantages of HITL in solving complex problems and how human information can be integrated into the AI solution systematically. To verify our technical results, we present a real-world unmanned aerial vehicle (UAV) problem wherein a number of enemy drones attack a restricted area. The objective is to design a scalable HITL DRL algorithm for ally drones to neutralize the enemy drones before they reach the area. To this end, we first implement our solution using an award-winning open-source HITL software called Cogment. We then demonstrate several interesting results such as (a) HITL leads to faster training and higher performance, (b) advice acts as a guiding direction for gradient methods and lowers variance, and (c) the amount of advice should neither be too large nor too small to avoid over-training and under-training. Finally, we illustrate the role of human-AI cooperation in solving two real-world complex scenarios, i.e., overloaded and decoy attacks.
- [27] arXiv:2504.17008 [pdf, html, other]
-
Title: Relationship between Hölder Divergence and Functional Density Power Divergence: Intersection and Generalization
Comments: 20 pages, 1 figure
Subjects: Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this study, we discuss the relationship between two families of density-power-based divergences with functional degrees of freedom -- the Hölder divergence and the functional density power divergence (FDPD) -- based on their intersection and generalization. These divergence families include the density power divergence and the $\gamma$-divergence as special cases. First, we prove that the intersection of the Hölder divergence and the FDPD is limited to a general divergence family introduced by Jones et al. (Biometrika, 2001). Subsequently, motivated by the fact that Hölder's inequality is used in the proofs of nonnegativity for both the Hölder divergence and the FDPD, we define a generalized divergence family, referred to as the $\xi$-Hölder divergence. The nonnegativity of the $\xi$-Hölder divergence is established through a combination of the inequalities used to prove the nonnegativity of the Hölder divergence and the FDPD. Furthermore, we derive an inequality between the composite scoring rules corresponding to different FDPDs based on the $\xi$-Hölder divergence. Finally, we prove that imposing the mathematical structure of the Hölder score on a composite scoring rule results in the $\xi$-Hölder divergence.
- [28] arXiv:2504.17012 [pdf, html, other]
-
Title: Universal Methods for Nonlinear Spectral Problems
Subjects: Numerical Analysis (math.NA); Spectral Theory (math.SP)
Nonlinear spectral problems arise across a range of fields, including mechanical vibrations, fluid-solid interactions, and photonic crystals. Discretizing infinite-dimensional nonlinear spectral problems often introduces significant computational challenges, particularly spectral pollution and invisibility, which can distort or obscure the true underlying spectrum. We present the first general, convergent computational method for computing the spectra and pseudospectra of nonlinear spectral problems. Our approach uses new results on nonlinear injection moduli and requires only minimal continuity assumptions: specifically, continuity with respect to the gap metric on operator graphs, making it applicable to a broad class of problems. We use the Solvability Complexity Index (SCI) hierarchy, which has recently been used to resolve the classical linear problem, to systematically classify the computational complexity of nonlinear spectral problems. Our results establish the optimality of the method and reveal that Hermiticity does not necessarily simplify the computational complexity of these nonlinear problems. Comprehensive examples -- including nonlinear shifts, Klein--Gordon equations, wave equations with acoustic boundary conditions, time-fractional beam equations, and biologically inspired delay differential equations -- demonstrate the robustness, accuracy, and broad applicability of our methodology.
- [29] arXiv:2504.17017 [pdf, html, other]
-
Title: Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification
Comments: Accepted to the Proceedings of the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)
Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Formally verifying properties of software code has been a highly desirable task, especially with the emergence of LLM-generated code. In the same vein, LLMs provide an interesting avenue for the exploration of formal verification and mechanistic interpretability. Since the introduction of code-specific models, despite their successes in generating code in Lean4 and Isabelle, the task of generalized theorem proving still remains far from being fully solved and will be a benchmark for reasoning capability in LLMs. In this work, we introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. Our framework includes 3 components: generating natural language statements of the code to be verified, an LLM that generates formal proofs for the given statement, and a module employing heuristics for building the final proof. To train the LLM, we employ a 2-stage fine-tuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code and then RL-based training that encourages the model to generate proofs verified by a theorem prover. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the AWS S3 bucket access policy code. We also curate a dataset based on the FVEL$_{\text{ER}}$ dataset for future training tasks.
- [30] arXiv:2504.17018 [pdf, html, other]
-
Title: LLM impact on BLV programming
Comments: Submitted to ASSETS 2025
Subjects: Human-Computer Interaction (cs.HC)
Large Language Models (LLMs) are rapidly becoming integral to a wide range of tools, tasks, and problem-solving processes, especially in software development. Originally designed for natural language processing tasks such as text generation, LLMs are increasingly being used to assist both professionals and students in writing code. This growing reliance on LLM-based tools is reshaping programming workflows and task execution. In this study, we explore the impact of these technologies on blind and low-vision (BLV) developers. Our review of existing literature indicates that while LLMs help mitigate some of the challenges faced by BLV programmers, they also introduce new forms of inaccessibility. We conducted an evaluation of five popular LLM-powered integrated development environments (IDEs), assessing their performance across a comprehensive set of programming tasks. Our findings highlight several unsupported scenarios, instances of incorrect model output, and notable limitations in interaction support for specific tasks. Through observing BLV developers as they engaged in coding activities, we uncovered key interaction barriers that go beyond model accuracy or code generation quality. This paper outlines the challenges and corresponding opportunities for improving accessibility in the context of generative AI-assisted programming. Addressing these issues can meaningfully enhance the programming experience for BLV developers. As the generative AI revolution continues to unfold, it must also address the unique burdens faced by this community.
- [31] arXiv:2504.17019 [pdf, html, other]
-
Title: Identifying Approximate Minimizers under Stochastic Uncertainty
Comments: 26 pages, 7 figures
Subjects: Data Structures and Algorithms (cs.DS)
We study a fundamental stochastic selection problem involving $n$ independent random variables, each of which can be queried at some cost. Given a tolerance level $\delta$, the goal is to find a value that is $\delta$-approximately minimum (or maximum) over all the random variables, at minimum expected cost. A solution to this problem is an adaptive sequence of queries, where the choice of the next query may depend on previously-observed values. Two variants arise, depending on whether the goal is to find a $\delta$-minimum value or a $\delta$-minimizer. When all query costs are uniform, we provide a $4$-approximation algorithm for both variants. When query costs are non-uniform, we provide a $5.83$-approximation algorithm for the $\delta$-minimum value and a $7.47$-approximation for the $\delta$-minimizer. All our algorithms rely on non-adaptive policies (that perform a fixed sequence of queries), so we also upper bound the corresponding "adaptivity" gaps. Our analysis relates the stopping probabilities in the algorithm and optimal policies, where a key step is in proving and using certain stochastic dominance properties.
- [32] arXiv:2504.17020 [pdf, html, other]
-
Title: Analyzing Value Functions of States in Parametric Markov Chains
Comments: Published as part of the book "Principles of Verification: Cycling the Probabilistic Landscape: Essays Dedicated to Joost-Pieter Katoen on the Occasion of His 60th Birthday, Part II"
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Parametric Markov chains (pMC) are used to model probabilistic systems with unknown or partially known probabilities. Although (universal) pMC verification for reachability properties is known to be coETR-complete, there have been efforts to approach it using potentially easier-to-check properties such as asking whether the pMC is monotonic in certain parameters. In this paper, we first reduce monotonicity to asking whether the reachability probability from a given state is never less than that of another given state. Recent results for the latter property imply an efficient algorithm to collapse same-value equivalence classes, which in turn preserves verification results and monotonicity. We implement our algorithm to collapse "trivial" equivalence classes in the pMC and show empirical evidence for the following: First, the collapse gives reductions in size for some existing benchmarks and significant reductions on some custom benchmarks; Second, the collapse speeds up existing algorithms to check monotonicity and parameter lifting, and hence can be used as a fast pre-processing step in practice.
- [33] arXiv:2504.17022 [pdf, html, other]
-
Title: Molecular Communication Channel as a Physical Reservoir Computer
Subjects: Emerging Technologies (cs.ET)
Molecular Communication (MC) channels inherently possess significant memory and nonlinear dynamics due to diffusion and receptor kinetics, often posing challenges for reliable data transmission. This work reconceptualizes these intrinsic properties as computational resources by framing a canonical point-to-point MC channel, consisting of ligand diffusion and reversible ligand-receptor binding at a spherical receiver, as a physical reservoir computer (PRC). We utilize the time-varying fraction of bound receptors as the reservoir's internal state, employing time-multiplexing to generate high-dimensional virtual nodes without explicit recurrence. Only a linear readout layer is trained via ridge regression. Through deterministic mean-field modeling and particle-based spatial stochastic simulations, we demonstrate the MC system's capability for complex temporal processing by successfully performing next-step prediction on standard chaotic time-series benchmarks (Mackey-Glass and NARMA10). Performance, quantified by Normalized Root Mean Square Error (NRMSE), exhibits a non-monotonic dependence on key system parameters (receptor kinetic rates, diffusion coefficient, transmitter-receiver distance), revealing optimal operational regimes. These findings validate the potential of using an MC channel as an effective, low-complexity computational substrate.
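The NRMSE metric named in this abstract can be illustrated with a minimal sketch. It assumes one common normalization, RMSE divided by the standard deviation of the targets; the abstract does not state which normalization the authors use, and the function name `nrmse` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def nrmse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """RMSE divided by the population standard deviation of the targets.

    Assumption: this normalization is one common convention; others divide
    by the target range or the target mean instead.
    """
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(targets)
    mean_t = sum(targets) / n
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    if var_t == 0.0:
        raise ValueError("targets have zero variance")
    return math.sqrt(mse / var_t)


targets = [1.0, 2.0, 3.0, 4.0]
perfect = nrmse(targets, targets)                  # exact predictions
mean_only = nrmse(targets, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
```

Under this convention a perfect predictor scores 0 and a predictor that always outputs the target mean scores 1, which gives a scale-free reference point when comparing reservoir configurations.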
- [34] arXiv:2504.17023 [pdf, html, other]
-
Title: What Makes for a Good Saliency Map? Comparing Strategies for Evaluating Saliency Maps in Explainable AI (XAI)
Comments: 27 pages, 7 figures, 4 tables
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Saliency maps are a popular approach for explaining classifications of (convolutional) neural networks. However, it remains an open question as to how best to evaluate saliency maps, with three families of evaluation methods commonly being used: subjective user measures, objective user measures, and mathematical metrics. We examine three of the most popular saliency map approaches (viz., LIME, Grad-CAM, and Guided Backpropagation) in a between-subjects study (N=166) across these families of evaluation methods. We test 1) for subjective measures, if the maps differ with respect to user trust and satisfaction; 2) for objective measures, if the maps increase users' abilities and thus understanding of a model; 3) for mathematical metrics, which map achieves the best ratings across metrics; and 4) whether the mathematical metrics can be associated with objective user measures. To our knowledge, our study is the first to compare several saliency maps across all these evaluation methods, with the finding that they do not agree in their assessment (i.e., there was no difference concerning trust and satisfaction, Grad-CAM improved users' abilities best, and Guided Backpropagation had the most favorable mathematical metrics). Additionally, we show that some mathematical metrics were associated with user understanding, although this relationship was often counterintuitive. We discuss these findings in light of general debates concerning the complementary use of user studies and mathematical metrics in the evaluation of explainable AI (XAI) approaches.
- [35] arXiv:2504.17025 [pdf, html, other]
-
Title: Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
Subjects: Computation and Language (cs.CL)
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
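Token fertility, as used in this abstract, is commonly measured as the average number of tokens produced per word. The sketch below illustrates that measurement only; it assumes whitespace-delimited words and uses a hypothetical `toy_tokenize` function (fixed-length chunks) as a stand-in for a real subword tokenizer, so the numbers say nothing about Mistral, Llama, or SAVA.

```python
from typing import Callable, List


def toy_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-length chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)


sample = "la tokenizzazione efficiente riduce i costi"
# A vocabulary with shorter pieces splits each word into more tokens.
fertility_short = fertility(sample, lambda w: toy_tokenize(w, 3))
fertility_long = fertility(sample, lambda w: toy_tokenize(w, 6))
```

With the toy tokenizer, longer pieces lower the fertility of the same sentence, which is the effect a vocabulary better matched to the target language is meant to achieve: fewer tokens per word means shorter sequences at inference time.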
- [36] arXiv:2504.17028 [pdf, html, other]
-
Title: Democracy of AI Numerical Weather Models: An Example of Global Forecasting with FourCastNetv2 Made by a University Research Lab Using GPU
Iman Khadir, Shane Stevenson, Henry Li, Kyle Krick, Abram Burrows, David Hall, Stan Posey, Samuel S.P. Shen
Comments: 12 pages, 8 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
This paper demonstrates the feasibility of democratizing AI-driven global weather forecasting models among university research groups by leveraging Graphics Processing Units (GPUs) and freely available AI models, such as NVIDIA's FourCastNetv2. FourCastNetv2 is NVIDIA's advanced neural network for weather prediction and is trained on a 73-channel subset of the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset at single levels and different pressure levels. Although the training specifications for FourCastNetv2 are not released to the public, the training documentation of the model's first generation, FourCastNet, is available to all users. The training used 64 A100 GPUs and took 16 hours to complete. Although NVIDIA's models offer significant reductions in both time and cost compared to traditional Numerical Weather Prediction (NWP), reproducing published forecasting results presents ongoing challenges for resource-constrained university research groups with limited GPU availability. We demonstrate both (i) leveraging FourCastNetv2 to create predictions through the designated application programming interface (API) and (ii) utilizing NVIDIA hardware to train the original FourCastNet model. Further, this paper demonstrates the capabilities and limitations of NVIDIA A100 GPUs for resource-limited research groups in universities. We also explore data management, training efficiency, and model validation, highlighting the advantages and challenges of using limited high-performance computing resources. Consequently, this paper and its corresponding GitHub materials may serve as an initial guide for other university research groups and courses related to machine learning, climate science, and data science to develop research and education programs on AI weather forecasting, and hence help democratize AI NWP in the digital economy.
- [37] arXiv:2504.17033 [pdf, html, other]
-
Title: Breaking the Sorting Barrier for Directed Single-Source Shortest Paths
Comments: 17 pages
Subjects: Data Structures and Algorithms (cs.DS)
We give a deterministic $O(m\log^{2/3}n)$-time algorithm for single-source shortest paths (SSSP) on directed graphs with real non-negative edge weights in the comparison-addition model. This is the first result to break the $O(m+n\log n)$ time bound of Dijkstra's algorithm on sparse graphs, showing that Dijkstra's algorithm is not optimal for SSSP.
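For context, the baseline this abstract compares against can be sketched as follows. This is the classical Dijkstra algorithm, not the paper's new method, and it uses Python's binary heap (`heapq`), which gives $O((m+n)\log n)$ time; the $O(m+n\log n)$ bound quoted above requires a Fibonacci heap or a similar priority queue. The graph and node names are illustrative only.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, weight)]


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Shortest-path distances from source; edge weights must be non-negative."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: a shorter path to u was found earlier
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


example: Graph = {
    "s": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 1.0)],
    "c": [],
}
distances = dijkstra(example, "s")
```

Dijkstra's algorithm settles vertices in increasing order of distance, which implicitly sorts them; that ordering is the source of the logarithmic factor per vertex that the title refers to as the sorting barrier.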
- [38] arXiv:2504.17038 [pdf, html, other]
-
Title: SCALAR: A Part-of-speech Tagger for Identifiers
Christian D. Newman, Brandon Scholten, Sophia Testa, Joshua A. C. Behler, Syreen Banabilah, Michael L. Collard, Michael J. Decker, Mohamed Wiem Mkaouer, Marcos Zampieri, Eman Abdullah AlOmar, Reem Alsuhaibani, Anthony Peruma, Jonathan I. Maletic
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually-curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
- [39] arXiv:2504.17039 [pdf, html, other]
-
Title: Dense Air Pollution Estimation from Sparse in-situ Measurements and Satellite Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper addresses the critical environmental challenge of estimating ambient Nitrogen Dioxide (NO$_2$) concentrations, a key issue in public health and environmental policy. Existing methods for satellite-based air pollution estimation model the relationship between satellite and in-situ measurements at select point locations. While these approaches have advanced our ability to provide air quality estimations on a global scale, they come with inherent limitations. The most notable limitation is the computational intensity required for generating comprehensive estimates over extensive areas. Motivated by these limitations, this study introduces a novel dense estimation technique. Our approach seeks to balance the accuracy of high-resolution estimates with the practicality of computational constraints, thereby enabling efficient and scalable global environmental assessment. By utilizing a uniformly random offset sampling strategy, our method disperses the ground truth data pixel location evenly across a larger patch. At inference, the dense estimation method can then generate a grid of estimates in a single step, significantly reducing the computational resources required to provide estimates for larger areas. Notably, our approach also surpasses the results of existing point-wise methods by a significant margin of $9.45\%$, achieving a Mean Absolute Error (MAE) of $4.98\ \mu\text{g}/\text{m}^3$. This demonstrates both high accuracy and computational efficiency, highlighting the applicability of our method for global environmental assessment. Furthermore, we showcase the method's adaptability and robustness by applying it to diverse geographic regions. Our method offers a viable solution to the computational challenges of large-scale environmental monitoring.
- [40] arXiv:2504.17040 [pdf, html, other]
-
Title: DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: this https URL.
- [41] arXiv:2504.17044 [pdf, html, other]
-
Title: Approaches to Responsible Governance of GenAI in Organizations
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The rapid evolution of Generative AI (GenAI) has introduced unprecedented opportunities while presenting complex challenges around ethics, accountability, and societal impact. This paper draws on a literature review, established governance frameworks, and industry roundtable discussions to identify core principles for integrating responsible GenAI governance into diverse organizational structures. Our objective is to provide actionable recommendations for a balanced, risk-based governance approach that enables both innovation and oversight. Findings emphasize the need for adaptable risk assessment tools, continuous monitoring practices, and cross-sector collaboration to establish trustworthy GenAI. These insights provide a structured foundation and Responsible GenAI Guide (ResAI) for organizations to align GenAI initiatives with ethical, legal, and operational best practices.
- [42] arXiv:2504.17045 [pdf, html, other]
-
Title: Dynamic Superblock Pruning for Fast Learned Sparse Retrieval
Comments: 6 pages, 3 figures, SIGIR 25
Subjects: Information Retrieval (cs.IR)
This paper proposes superblock pruning (SP) during top-k online document retrieval for learned sparse representations. SP structures the sparse index as a set of superblocks on a sequence of document blocks and conducts a superblock-level selection to decide if some superblocks can be pruned before visiting their child blocks. SP generalizes the previous flat block or cluster-based pruning, allowing the early detection of groups of documents that cannot or are less likely to appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.
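The two-level pruning idea described in this abstract can be sketched as follows. This is a toy, rank-safe illustration under strong assumptions, not the authors' implementation: each document carries a ready-made query score, score upper bounds are derived inline (a real index would precompute them offline), every block is non-empty, and all names (`Block`, `Superblock`, `top_k_with_pruning`) are hypothetical.

```python
import heapq
from typing import List, Tuple

Block = List[Tuple[int, float]]   # (doc_id, query score) pairs; toy stand-in
Superblock = List[Block]          # a superblock groups consecutive blocks


def top_k_with_pruning(index: List[Superblock], k: int):
    """Return (top-k as (score, doc_id) pairs, number of blocks scanned)."""
    heap: List[Tuple[float, int]] = []   # min-heap holding the current top-k
    scanned = 0
    for superblock in index:
        # Upper bounds; computed here only for brevity (normally precomputed).
        block_bounds = [max(s for _, s in block) for block in superblock]
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if max(block_bounds) <= threshold:
            continue  # no document under this superblock can enter the top-k
        for block, bound in zip(superblock, block_bounds):
            threshold = heap[0][0] if len(heap) == k else float("-inf")
            if bound <= threshold:
                continue  # block-level pruning inside a surviving superblock
            scanned += 1
            for doc_id, score in block:
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scanned


index = [
    [[(1, 0.9), (2, 0.4)], [(3, 0.8), (4, 0.1)]],
    [[(5, 0.3), (6, 0.2)], [(7, 0.5), (8, 0.1)]],
    [[(9, 0.95), (10, 0.2)], [(11, 0.6), (12, 0.3)]],
]
results, blocks_scanned = top_k_with_pruning(index, k=2)
```

In this toy index the second superblock is skipped with a single bound comparison, so neither of its child blocks is visited; only three of the six blocks are scanned while the top-2 result is unchanged.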
- [43] arXiv:2504.17046 [pdf, other]
-
Title: Enhanced load balancing technique for SDN controllers: A multi-threshold approach with migration of switches
Subjects: Networking and Internet Architecture (cs.NI)
Deploying multiple controllers in the control plane of software-defined networks increases scalability, availability, and performance, but it also brings challenges, such as controller overload. To address this, load-balancing techniques are employed in software-defined networks. Controller load balancing can be categorized into two main approaches: (1) single-level thresholds and (2) multi-level thresholds. However, previous studies have either relied predominantly on single-level thresholds, which result in an imprecise classification of controllers, or assumed uniform controller capacities in multi-level threshold methods. This study explores controller load balancing with a focus on utilizing multi-level thresholds to accurately assess controller status. Switch migration operations are utilized to achieve load balancing, considering factors such as the degree of load imbalance of the target controller and migration efficiency. This includes evaluating the post-migration status of the target controller and the distance between the migrating switch and the target controller to select the appropriate target controller and migrating switch. The proposed scheme reduces controller response time, migration costs, and communication overhead, and improves the throughput rate. Results demonstrate that our scheme outperforms others regarding response time and overall performance.
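A minimal sketch of how multi-level thresholds and target selection might fit together is given below. It is only an illustration under stated assumptions: the threshold values (0.3 and 0.7), the function names `classify` and `pick_target`, and the rule "choose the nearest controller that is not overloaded after migration" are hypothetical and are not taken from the paper.

```python
def classify(load, capacity, low=0.3, high=0.7):
    """Classify a controller by utilization (load / capacity).

    The thresholds are hypothetical; using per-controller capacity
    allows for non-uniform controllers.
    """
    utilization = load / capacity
    if utilization < low:
        return "underloaded"
    if utilization < high:
        return "normal"
    return "overloaded"

def pick_target(controllers, switch_load, distances):
    """Pick the closest controller that would not be overloaded
    after receiving the migrating switch; None if there is none."""
    candidates = []
    for name, (load, capacity) in controllers.items():
        post_migration_status = classify(load + switch_load, capacity)
        if post_migration_status != "overloaded":
            candidates.append((distances[name], name))
    return min(candidates)[1] if candidates else None

# Made-up controllers: name -> (current load, capacity).
controllers = {"c1": (90, 100), "c2": (20, 100), "c3": (10, 200)}
# Made-up distances from the migrating switch to each controller.
distances = {"c1": 1, "c2": 3, "c3": 2}
target = pick_target(controllers, switch_load=15, distances=distances)
```

Here `c1` is ruled out because it would be overloaded after the migration, and `c3` is preferred over `c2` because it is closer to the switch.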
- [44] arXiv:2504.17050 [pdf, html, other]
-
Title: Mapping Trafficking Networks: A Data-Driven Approach to Disrupt Human Trafficking Post Russia-Ukraine Conflict
Journal-ref: The 2024 International Conference on Computational Science and Computational Intelligence (December 11-13, 2024)
Subjects: Computers and Society (cs.CY)
This study proposes a prototype for locating important individuals and financial exchanges in human trafficking networks that have grown during the conflict between Russia and Ukraine. It focuses on the role of digital platforms, cryptocurrencies, and the dark web in facilitating these operations. The research maps trafficking networks and identifies key players and financial flows by utilizing open-source intelligence (OSINT), social network analysis (SNA), and blockchain analysis. The results show how cryptocurrencies are used for anonymous transactions and imply that disrupting central coordinators may destabilize wider networks. To combat human trafficking, the study emphasizes the significance of real-time data sharing among international law enforcement agencies. It also identifies future directions for the development of improved monitoring tools and cooperative platforms.
- [45] arXiv:2504.17051 [pdf, html, other]
-
Title: Exploring the Untapped: Student Perceptions and Participation in OSS
Authors: Italo Santos, Katia Romero Felizardo, Bianca Trinkereinch, Daniel M. German, Igor Steinmacher, Marco A. Gerosa
Subjects: Software Engineering (cs.SE)
Open Source Software (OSS) projects offer valuable opportunities to train the next generation of software engineers while benefiting projects and society as a whole. While research has extensively explored student participation in OSS and its use in software engineering education, student participation in OSS is still low, and the perspectives of students who have never contributed remain underexplored. This study aims to investigate the relationship between students' interest in contributing to OSS and their perceptions of barriers and motivational factors. We developed a theoretical model to understand the relationship between students' perceptions of OSS and their interest in contributing. We then surveyed students majoring in computer science and related fields (N=241). Using structural equation modeling techniques, we tested the model and found that intrinsic and internalized extrinsic motivations are positively associated with interest in contributing to OSS projects, while the impact of extrinsic motivation varies by gender. Comparatively, we found no significant relationship between barriers and interest in contributing. Students suggested several ways to make projects more attractive, including increasing awareness of the importance of OSS. Our findings can help communities better prepare to integrate students and encourage educators to enhance interest in OSS by linking participation to specific motivational factors.
- [46] arXiv:2504.17052 [pdf, html, other]
-
Title: Do Words Reflect Beliefs? Evaluating Belief Depth in Large Language Models
Comments: 20 pages, 9 figures
Subjects: Computation and Language (cs.CL)
Large Language Models (LLMs) are increasingly shaping political discourse, yet their responses often display inconsistency when subjected to scrutiny. While prior research has primarily categorized LLM outputs as left- or right-leaning to assess their political stances, a critical question remains: Do these responses reflect genuine internal beliefs or merely surface-level alignment with training data? To address this, we propose a novel framework for evaluating belief depth by analyzing (1) argumentative consistency and (2) uncertainty quantification. We evaluate 12 LLMs on 19 economic policies from the Political Compass Test, challenging their belief stability with both supportive and opposing arguments. Our analysis reveals that LLMs exhibit topic-specific belief stability rather than a uniform ideological stance. Notably, up to 95% of left-leaning models' responses and 89% of right-leaning models' responses remain consistent under the challenge, enabling semantic entropy to achieve high accuracy (AUROC=0.78), effectively distinguishing surface-level alignment from genuine belief. These findings call into question the assumption that LLMs maintain stable, human-like political ideologies, emphasizing the importance of conducting topic-specific reliability assessments for real-world applications.
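For readers unfamiliar with semantic entropy, the sketch below shows the core computation in a simplified form: Shannon entropy over groups of sampled responses that share the same meaning. It assumes the grouping has already been done, so each response arrives with a hypothetical cluster label; how the paper clusters responses or sets decision thresholds is not specified here and is not reproduced.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the distribution of sampled
    responses over meaning clusters.

    cluster_labels: one label per sampled response; responses with the
    same meaning are assumed to share a label (clustering not shown).
    """
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# All sampled responses agree: no uncertainty about the stance.
stable = semantic_entropy(["support", "support", "support", "support"])
# Responses split evenly between two stances: maximal uncertainty for two clusters.
unstable = semantic_entropy(["support", "oppose", "support", "oppose"])
```

Lower values indicate that repeated samples converge on one meaning, while higher values indicate that the model's stance varies between samples.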
- [47] arXiv:2504.17054 [pdf, html, other]
-
Title: Cyber Value At Risk Model for IoT Ecosystems
Journal-ref: The 2024 International Conference on Computational Science and Computational Intelligence (December, 2024)
Subjects: Computers and Society (cs.CY)
The Internet of Things (IoT) presents unique cybersecurity challenges due to its interconnected nature and diverse application domains. This paper explores the application of Cyber Value-at-Risk (Cy-VaR) models to assess and mitigate cybersecurity risks in IoT environments. Cy-VaR, rooted in Value at Risk principles, provides a framework to quantify the potential financial impacts of cybersecurity incidents. Cy-VaR was initially developed to evaluate overall risk exposure across scenarios; our approach extends it to consider specific IoT layers: perception, network, and application. Each layer encompasses distinct functionalities and vulnerabilities, from sensor data acquisition (perception layer) to secure data transmission (network layer) and application-specific services (application layer). By calculating Cy-VaR for each layer and scenario, organizations can prioritize security investments effectively. This paper discusses methodologies and models, including scenario-based Cy-VaR and layer-specific risk assessments, emphasizing their application in enhancing IoT cybersecurity resilience.
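As a rough illustration of layer-specific Value at Risk, the sketch below takes the empirical quantile of simulated losses for each IoT layer and ranks the layers by the result. The loss figures, the 90% confidence level, and the function name `value_at_risk` are invented for the example; the paper's actual models and scenarios are not reproduced.

```python
def value_at_risk(losses, confidence_pct=95):
    """Empirical VaR: the smallest simulated loss L such that at least
    confidence_pct percent of the simulated losses are <= L."""
    ordered = sorted(losses)
    # Integer ceiling of confidence_pct * n / 100, avoiding float rounding.
    rank = (confidence_pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical simulated losses per layer (arbitrary currency units).
simulated_losses = {
    "perception": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "network": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "application": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
}

layer_var = {
    layer: value_at_risk(losses, confidence_pct=90)
    for layer, losses in simulated_losses.items()
}
# Layers ordered from highest to lowest VaR, as one possible priority list.
priority = sorted(layer_var, key=layer_var.get, reverse=True)
```

Note that the perception layer's single extreme loss of 100 falls outside the 90% level in this toy data, which is a known limitation of VaR as a tail-risk measure.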
- [48] arXiv:2504.17055 [pdf, other]
-
Title: Psychological Effect of AI driven marketing tools for beauty/facial feature enhancement
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
AI-powered facial assessment tools are reshaping how individuals evaluate appearance and internalize social judgments. This study examines the psychological impact of such tools on self-objectification, self-esteem, and emotional responses, with attention to gender differences. Two samples used distinct versions of a facial analysis tool: one overtly critical (N=75; M=22.9 years), and another more neutral (N=51; M=19.9 years). Participants completed validated self-objectification and self-esteem scales and custom items measuring emotion, digital/physical appearance enhancement (DAE, PAEE), and perceived social emotion (PSE). Results revealed consistent links between high self-objectification, low self-esteem, and increased appearance enhancement behaviors across both versions. Despite softer framing, the newer tool still evoked negative emotional responses (U=1466.5, p=0.013), indicating implicit feedback may reinforce appearance-related insecurities. Gender differences emerged in DAE (p=0.025) and PSE (p<0.001), with females more prone to digital enhancement and less likely to perceive emotional impact in others. These findings reveal how AI tools may unintentionally reinforce and amplify existing social biases and underscore the critical need for responsible AI design and development. Future research will investigate how human ideologies embedded in the training data of such tools shape their evaluative outputs, and how these, in turn, influence user attitudes and decisions.
- [49] arXiv:2504.17056 [pdf, other]
-
Title: Evaluating energy inefficiency in energy-poor households in India: A frontier analysis approach
Comments: 42 pages, 7 figures, 5 tables. Arnab Jana led and supervised the study. Vallary Gupta analyzed the dataset, executed the SFA model, prepared graphics and wrote the manuscript. Dr. Ahana Sarkar coordinated the data collection, interpretation of model results and design of policy implications. Dr. Chirag Deb provided technical support. All authors reviewed and approved the final manuscript
Subjects: Computers and Society (cs.CY)
Energy-poor households often compromise their thermal comfort and refrain from operating mechanical cooling devices to avoid high electricity bills. This is compounded by certain behavioral practices like retention of older, less efficient appliances, resulting in missed energy savings. Thus, the need to enhance efficiency becomes critical in these households. However, due to a lack of comprehensive data in India, little is understood about their electricity consumption patterns and usage efficiency. Estimating inefficiency and assessing its determinants is crucial for improving their quality of life. This study measures the inefficiency in electricity consumption due to household practices and appliances in social housing in Mumbai, India. It considers technological determinants in addition to socio-economic variables. The study employs primary data collected from rehabilitation housing and slums in Mumbai. Stochastic frontier analysis, a parametric approach, is applied to estimate indicators of electricity consumption and inefficiency. While household size and workforce participation significantly affect consumption behavior in rehabilitation housing, only workforce participation does so in slums. The ownership of appliances, except for washing machines in slums, also exhibits considerable impacts. The mean efficiency scores of 83% and 91% for rehabilitation housing and slums, respectively, empirically quantify the potential savings achievable. Factors that positively influence inefficiency include the duration of operating refrigerators, washing machines, irons, and ACs. These results hold implications for enhancing the uptake of efficient appliances in addition to accelerating energy efficiency retrofits in the region. Policies should focus on awareness and the development of appliance markets through incentives.
- [50] arXiv:2504.17058 [pdf, html, other]
-
Title: Statistical Guarantees in Synthetic Data through Conformal Adversarial Generation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The generation of high-quality synthetic data presents significant challenges in machine learning research, particularly regarding statistical fidelity and uncertainty quantification. Existing generative models produce compelling synthetic samples but lack rigorous statistical guarantees about their relation to the underlying data distribution, limiting their applicability in critical domains requiring robust error bounds. We address this fundamental limitation by presenting a novel framework that incorporates conformal prediction methodologies into Generative Adversarial Networks (GANs). By integrating multiple conformal prediction paradigms including Inductive Conformal Prediction (ICP), Mondrian Conformal Prediction, Cross-Conformal Prediction, and Venn-Abers Predictors, we establish distribution-free uncertainty quantification in generated samples. This approach, termed Conformalized GAN (cGAN), demonstrates enhanced calibration properties while maintaining the generative power of traditional GANs, producing synthetic data with provable statistical guarantees. We provide rigorous mathematical proofs establishing finite-sample validity guarantees and asymptotic efficiency properties, enabling the reliable application of synthetic data in high-stakes domains including healthcare, finance, and autonomous systems.
- [51] arXiv:2504.17059 [pdf, html, other]
-
Title: Integrating Graph Theoretical Approaches in Cybersecurity Education CSCI-RTED
Journal-ref: The 2024 International Conference on Computational Science and Computational Intelligence
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
As cybersecurity threats continue to evolve, the need for advanced tools to analyze and understand complex cyber environments has become increasingly critical. Graph theory offers a powerful framework for modeling relationships within cyber ecosystems, making it highly applicable to cybersecurity. This paper focuses on the development of an enriched version of the widely recognized NSL-KDD dataset, incorporating graph-theoretical concepts to enhance its practical value. The enriched dataset provides a resource for students and professionals to engage in hands-on analysis, enabling them to explore graph-based methodologies for identifying network behavior and vulnerabilities. To validate the effectiveness of this dataset, we employed IBM Auto AI, demonstrating its capability in real-world applications such as classification and threat prediction. By addressing the need for graph-theoretical datasets, this study provides a practical tool for equipping future cybersecurity professionals with the skills necessary to confront complex cyber challenges.
- [52] arXiv:2504.17062 [pdf, html, other]
-
Title: ePBR: Extended PBR Materials in Image Synthesis
Comments: 8 pages without references, 7 figures, accepted in CVPRW 2025
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Realistic indoor or outdoor image synthesis is a core challenge in computer vision and graphics. The learning-based approach is easy to use but lacks physical consistency, while traditional Physically Based Rendering (PBR) offers high realism but is computationally expensive. Intrinsic image representation offers a well-balanced trade-off, decomposing images into fundamental components (intrinsic channels) such as geometry, materials, and illumination for controllable synthesis. However, existing PBR materials struggle with complex surface models, particularly high-specular and transparent surfaces. In this work, we extend intrinsic image representations to incorporate both reflection and transmission properties, enabling the synthesis of transparent materials such as glass and windows. We propose an explicit intrinsic compositing framework that provides deterministic, interpretable image synthesis. With the Extended PBR (ePBR) Materials, we can effectively edit the materials with precise controls.
- [53] arXiv:2504.17065 [pdf, html, other]
-
Title: Antenna Near-Field Reconstruction from Far-Field Data Using Convolutional Neural Networks
Subjects: Machine Learning (cs.LG)
Electromagnetic field reconstruction is crucial in many applications, including antenna diagnostics, electromagnetic interference analysis, and system modeling. This paper presents a deep learning-based approach for Far-Field to Near-Field (FF-NF) transformation using Convolutional Neural Networks (CNNs). The goal is to reconstruct near-field distributions from the far-field data of an antenna without relying on explicit analytical transformations. The CNNs are trained on paired far-field and near-field data and evaluated using mean squared error (MSE). The best model achieves a training error of 0.0199 and a test error of 0.3898. Moreover, visual comparisons between the predicted and true near-field distributions demonstrate the model's effectiveness in capturing complex electromagnetic field behavior, highlighting the potential of deep learning in electromagnetic field reconstruction.
- [54] arXiv:2504.17066 [pdf, html, other]
-
Title: Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE); Machine Learning (stat.ML)
Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
- [55] arXiv:2504.17067 [pdf, html, other]
-
Title: PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at this https URL.
- [56] arXiv:2504.17068 [pdf, html, other]
-
Title: In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Language models have emerged as powerful predictors of the viability of biological sequences. During training, these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reverse-complement motifs in RNA sequences that fold into hairpin structures.
- [57] arXiv:2504.17069 [pdf, html, other]
-
Title: Distilling semantically aware orders for autoregressive image generation
Authors: Rishav Pramanik, Antoine Poupon, Juan A. Rodriguez, Masih Aminbeidokhti, David Vazquez, Christopher Pal, Zhaozheng Yin, Marco Pedersoli
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the order in which words are dictated makes sense for text generation, no inherent generation order exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we first train a model to generate patches in any given order, which lets us infer both the content and the location (order) of each patch during generation. Second, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.
- [58] arXiv:2504.17070 [pdf, html, other]
-
Title: Robo-Troj: Attacking LLM-based Task Planners
Authors: Mohaiminul Al Nahian, Zainab Altaweel, David Reitano, Sabbir Ahmed, Saumitra Lohokare, Shiqi Zhang, Adnan Siraj Rakin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Robots need task planning methods to achieve goals that require more than individual actions. Recently, large language models (LLMs) have demonstrated impressive performance in task planning. LLMs can generate a step-by-step solution using a description of actions and the goal. Despite the successes in LLM-based task planning, there is limited research studying the security aspects of those systems. In this paper, we develop Robo-Troj, the first multi-trigger backdoor attack for LLM-based task planners, which is the main contribution of this work. As a multi-trigger attack, Robo-Troj is trained to accommodate the diversity of robot application domains. For instance, one can use unique trigger words, e.g., "herical", to activate a specific malicious behavior, e.g., cutting a hand in the case of a kitchen robot. In addition, we develop an optimization method for selecting the trigger words that are most effective. By demonstrating the vulnerability of LLM-based planners, we aim to promote the development of secure robot systems.
- [59] arXiv:2504.17073 [pdf, html, other]
-
Title: Sparse Phased Array Optimization Using Deep Learning
Subjects: Machine Learning (cs.LG)
Antenna arrays are widely used in wireless communication, radar systems, radio astronomy, and military defense to enhance signal strength, directivity, and interference suppression. We introduce a deep learning-based optimization approach that enhances the design of sparse phased arrays by reducing grating lobes. This approach begins by generating sparse array configurations to address the non-convex challenges and extensive degrees of freedom inherent in array design. We use neural networks to approximate the non-convex cost function that estimates the energy ratio between the main and side lobes. This differentiable approximation facilitates cost function minimization through gradient descent, optimizing the antenna elements' coordinates and leading to an improved layout. Additionally, we incorporate a tailored penalty mechanism that includes various physical and design constraints into the optimization process, enhancing its robustness and practical applicability. We demonstrate the effectiveness of our method by applying it to the ten array configurations with the lowest initial costs, achieving further cost reductions ranging from 411% to 643%, with an impressive average improvement of 552%. By significantly reducing side lobe levels in antenna arrays, this breakthrough paves the way for ultra-precise beamforming, enhanced interference mitigation, and next-generation wireless and radar systems with unprecedented efficiency and clarity.
- [60] arXiv:2504.17074 [pdf, html, other]
-
Title: Conditional Diffusion-Based Retrieval of Atmospheric CO2 from Earth Observing Spectroscopy
Authors: William R. Keely, Otto Lamminpää, Steffen Mauceri, Sean M. R. Crowell, Christopher W. O'Dell, Gregory R. McGarragh
Comments: Published as a workshop paper in "Tackling Climate Change with Machine Learning", ICLR 2025. this https URL
Journal-ref: William Keely. Conditional diffusion-based retrieval of atmospheric co2 from earth observing spectroscopy. In ICLR 2025 Workshop on Tackling Climate Change with Machine Learning, 2025
Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Satellite-based estimates of greenhouse gas (GHG) properties from observations of reflected solar spectra are integral for understanding and monitoring complex terrestrial systems and their impact on the carbon cycle due to their near-global coverage. Known as retrieval, making GHG concentration estimations from these observations is a non-linear Bayesian inverse problem, which is operationally solved using a computationally expensive algorithm called Optimal Estimation (OE), providing a Gaussian approximation to a non-Gaussian posterior. This leads to issues in solver algorithm convergence, and to unrealistically confident uncertainty estimates for the retrieved quantities. Upcoming satellite missions will provide orders of magnitude more data than the current constellation of GHG observers. Development of fast and accurate retrieval algorithms with robust uncertainty quantification is critical. Doing so stands to provide substantial climate impact by moving towards the goal of near-continuous real-time global monitoring of carbon sources and sinks, which is essential for policy making. To achieve this goal, we propose a diffusion-based approach to flexibly retrieve a Gaussian or non-Gaussian posterior, for NASA's Orbiting Carbon Observatory-2 spectrometer, while providing a substantial computational speed-up over the current operational state-of-the-art.
- [61] arXiv:2504.17075 [pdf, html, other]
-
Title: Agree to Disagree? A Meta-Evaluation of LLM Misgendering
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
- [62] arXiv:2504.17076 [pdf, html, other]
-
Title: Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene-aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to $2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$ mAP boost). We also demonstrate significant improvements for instance segmentation.
- [63] arXiv:2504.17079 [pdf, html, other]
-
Title: A Novel Hybrid Approach Using an Attention-Based Transformer + GRU Model for Predicting Cryptocurrency Prices
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
In this article, we introduce a novel deep learning hybrid model that integrates attention-based Transformer and Gated Recurrent Unit (GRU) architectures to improve the accuracy of cryptocurrency price predictions. By combining the Transformer's strength in capturing long-range patterns with the GRU's ability to model short-term and sequential trends, the hybrid model provides a well-rounded approach to time series forecasting. We apply the model to predict the daily closing prices of Bitcoin and Ethereum based on historical data that include past prices, trading volumes, and the Fear and Greed index. We evaluate the performance of our proposed model by comparing it with four other machine learning models: two non-sequential feedforward models, the Radial Basis Function Network (RBFN) and the General Regression Neural Network (GRNN), and two bidirectional sequential memory-based models, the Bidirectional Long Short-Term Memory (BiLSTM) and the Bidirectional Gated Recurrent Unit (BiGRU). The performance of the model is assessed using several metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), along with statistical validation through the nonparametric Friedman test followed by a post hoc Wilcoxon signed-rank test. The results demonstrate that our hybrid model consistently achieves superior accuracy, highlighting its effectiveness for financial prediction tasks. These findings provide valuable insights for improving real-time decision making in cryptocurrency markets and support the growing use of hybrid deep learning models in financial analytics.
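The four error metrics named above have standard definitions, sketched here in plain Python. The price values are made up for illustration and are not results from the paper; the function name `regression_errors` is likewise hypothetical.

```python
import math

def regression_errors(actual, predicted):
    """Return MSE, RMSE, MAE, and MAPE (in percent) for paired series.

    MAPE divides by the actual values, so they must be non-zero.
    """
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape}

# Made-up closing prices and forecasts.
metrics = regression_errors(actual=[100.0, 200.0], predicted=[110.0, 180.0])
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE is scale-free, which makes it easier to compare assets with very different price levels such as Bitcoin and Ethereum.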
- [64] arXiv:2504.17080 [pdf, other]
-
Title: Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators
Authors: Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz
Comments: Submitted to Control Decision Conference (CDC) 2025
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at this https URL.
- [65] arXiv:2504.17083 [pdf, html, other]
-
Title: How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
Comments: Accepted at GenAICHI 2025 @ ACM CHI 2025
Subjects: Computation and Language (cs.CL)
What makes an interaction with an LLM preferable for the user? While it is intuitive to assume that information accuracy in the LLM's responses would be one of the influential variables, recent studies have found that inaccurate LLM responses could still be preferable when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interestingly fall under the broader category of language style, implying that the style of the LLM's responses might meaningfully influence users' preferences. This hypothesized dynamic could have double-edged consequences: enhancing the overall user experience while simultaneously increasing users' susceptibility to risks such as LLM misinformation or hallucinations. In this short paper, we present our preliminary studies exploring this subject. Through a series of exploratory and experimental user studies, we found that the LLM's language style does indeed influence users' preferences, but how and which language styles influence the preference varied across different user populations and, more interestingly, was moderated by the users' own individual traits. As a preliminary work, the findings in our studies should be interpreted with caution, particularly given the limitations in our samples, which still need wider demographic diversity and larger sample sizes. Our future directions will first aim to address these limitations, which would enable a more comprehensive joint effect analysis between the language style, individual traits, and preferences, and further investigate the potential causal relationship between and beyond these variables.
- [66] arXiv:2504.17087 [pdf, html, other]
-
Title: Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
Comments: 12 pages, 5 figures, 6 tables
Subjects: Artificial Intelligence (cs.AI)
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, the use of LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs' judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two aforementioned issues, we propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Compared to methods using a single LLM as both judge and meta-judge, our pipeline introduces multi-agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline. Our work demonstrates the potential of LLMs as meta-judges and lays the foundation for future research on constructing preference datasets for LLM-as-a-judge reinforcement learning.
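Stages 2 and 3 of the pipeline described in this abstract (multi-agent scoring followed by threshold filtering) can be sketched roughly as follows. The abstract does not say how the three agents' scores are combined, so averaging is an assumption here, and the function name, score scale, and threshold value are all hypothetical.

```python
from statistics import mean

def select_judgments(scores_by_judgment, threshold):
    """Keep judgments whose mean agent score meets the threshold.

    Averaging the agents' scores is an assumed aggregation rule,
    not one stated in the abstract.
    """
    kept = {}
    for judgment_id, agent_scores in scores_by_judgment.items():
        avg = mean(agent_scores)
        if avg >= threshold:
            kept[judgment_id] = avg
    return kept

# Hypothetical rubric scores from three agents for three candidate judgments.
scores = {
    "j1": [4.0, 5.0, 4.5],
    "j2": [2.0, 3.0, 2.5],
    "j3": [3.5, 4.0, 3.0],
}
selected = select_judgments(scores, threshold=3.5)
print(selected)
```

Other aggregation rules (for example a majority vote or a weighted mean) would slot into the same structure by replacing the call to `mean`.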
- [67] arXiv:2504.17091 [pdf, other]
-
Title: Co-CoT: A Prompt-Based Framework for Collaborative Chain-of-Thought Reasoning
Comments: 5 pages
Subjects: Computation and Language (cs.CL)
Due to the proliferation of short-form content and the rapid adoption of AI, opportunities for deep, reflective thinking have significantly diminished, undermining users' critical thinking and reducing engagement with the reasoning behind AI-generated outputs. To address this issue, we propose an Interactive Chain-of-Thought (CoT) Framework that enhances human-centered explainability and responsible AI usage by making the model's inference process transparent, modular, and user-editable. The framework decomposes reasoning into clearly defined blocks that users can inspect, modify, and re-execute, encouraging active cognitive engagement rather than passive consumption. It further integrates a lightweight edit-adaptation mechanism inspired by preference learning, allowing the system to align with diverse cognitive styles and user intentions. Ethical transparency is ensured through explicit metadata disclosure, built-in bias checkpoint functionality, and privacy-preserving safeguards. This work outlines the design principles and architecture necessary to promote critical engagement, responsible interaction, and inclusive adaptation in AI systems aimed at addressing complex societal challenges.
- [68] arXiv:2504.17096 [pdf, html, other]
-
Title: Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
Authors: Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, Ana Klimovic
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
- [69] arXiv:2504.17097 [pdf, html, other]
-
Title: Parallelizing the Approximate Minimum Degree Ordering Algorithm: Strategies and Evaluation
Comments: 11 pages, 7 figures, 5 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
The approximate minimum degree algorithm is widely used before numerical factorization to reduce fill-in for sparse matrices. While considerable attention has been given to the numerical factorization process, less focus has been placed on parallelizing the approximate minimum degree algorithm itself. In this paper, we explore different parallelization strategies, and introduce a novel parallel framework that leverages multiple elimination on distance-2 independent sets. Our evaluation shows that parallelism within individual elimination steps is limited due to low computational workload and significant memory contention. In contrast, our proposed framework overcomes these challenges by parallelizing the work across elimination steps. To the best of our knowledge, our implementation is the first scalable shared memory implementation of the approximate minimum degree algorithm. Experimental results show that we achieve up to an 8.30x speedup using 64 threads over the state-of-the-art sequential implementation in SuiteSparse.
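To illustrate what a distance-2 independent set is, here is a minimal greedy sketch. It is not the authors' parallel framework; it only shows the property that makes simultaneous elimination plausible: no two selected vertices are adjacent or share a neighbor, so their neighborhoods do not overlap. The function name, the visiting order, and the example graph are assumptions for illustration.

```python
def distance2_independent_set(adjacency, order):
    """Greedily pick vertices so that no two picked vertices are within distance 2.

    adjacency: dict mapping each vertex to a set of neighbors (undirected graph).
    order: iterable giving the order in which candidates are considered.
    """
    blocked = set()
    chosen = []
    for v in order:
        if v in blocked:
            continue
        chosen.append(v)
        # Block v, its neighbors, and the neighbors of its neighbors.
        blocked.add(v)
        for u in adjacency[v]:
            blocked.add(u)
            blocked.update(adjacency[u])
    return chosen

# Example: a path graph 0 - 1 - 2 - 3 - 4 - 5 - 6.
path = {i: set() for i in range(7)}
for i in range(6):
    path[i].add(i + 1)
    path[i + 1].add(i)

picked = distance2_independent_set(path, order=range(7))
print(picked)
```

In a minimum degree setting the candidates would typically be visited in order of increasing (approximate) degree; the plain index order above is used only to keep the example small.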
- [70] arXiv:2504.17099 [pdf, html, other]
-
Title: GeoRDF2Vec Learning Location-Aware Entity Representations in Knowledge Graphs
Comments: 18 pages, ESWC 2025
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Many knowledge graphs contain a substantial number of spatial entities, such as cities, buildings, and natural landmarks. For many of these entities, exact geometries are stored within the knowledge graphs. However, most existing approaches for learning entity representations do not take these geometries into account. In this paper, we introduce a variant of RDF2Vec that incorporates geometric information to learn location-aware embeddings of entities. Our approach expands different nodes by flooding the graph from geographic nodes, ensuring that each reachable node is considered. Based on the resulting flooded graph, we apply a modified version of RDF2Vec that biases graph walks using spatial weights. Through evaluations on multiple benchmark datasets, we demonstrate that our approach outperforms both non-location-aware RDF2Vec and GeoTransE.
- [71] arXiv:2504.17103 [pdf, html, other]
-
Title: Subframework-based Bearing Rigidity Maintenance Control in Multirobot Networks
Comments: 6 pages
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
This work presents a novel approach for analyzing and controlling bearing rigidity in multi-robot networks with dynamic topology. By decomposing the system's framework into subframeworks, we express bearing rigidity, a global property, as a set of local properties, with rigidity eigenvalues serving as natural local rigidity metrics. We propose a decentralized, scalable, gradient-based controller that uses only bearing measurements to execute mission-specific commands. The controller preserves bearing rigidity by maintaining rigidity eigenvalues above a threshold, and also avoids inter-robot collisions. Simulations confirm the scheme's effectiveness, with information exchange confined to subframeworks, underscoring its scalability and practicality.
- [72] arXiv:2504.17106 [pdf, html, other]
-
Title: Transactional Cloud Applications: Status Quo, Challenges, and Opportunities
Comments: Version accepted as a tutorial in SIGMOD'25
Subjects: Databases (cs.DB); Software Engineering (cs.SE)
Transactional cloud applications such as payment, booking, reservation systems, and complex business workflows are currently being rewritten for deployment in the cloud. This migration to the cloud is happening mainly for reasons of cost and scalability. Over the years, application developers have used different migration approaches, such as microservice frameworks, actors, and stateful dataflow systems.
The migration to the cloud has brought back data management challenges traditionally handled by database management systems. Those challenges include ensuring state consistency, maintaining durability, and managing the application lifecycle. At the same time, the shift to a distributed computing infrastructure introduced new issues, such as message delivery, task scheduling, containerization, and (auto)scaling.
Although the data management community has made progress in developing analytical and transactional database systems, transactional cloud applications have received little attention in database research. This tutorial aims to highlight recent trends in the area and discuss open research challenges for the data management community.
- [73] arXiv:2504.17109 [pdf, html, other]
-
Title: Discovering the Precursors of Traffic Breakdowns Using Spatiotemporal Graph Attribution Networks
Subjects: Machine Learning (cs.LG)
Understanding and predicting the precursors of traffic breakdowns is critical for improving road safety and traffic flow management. This paper presents a novel approach combining spatiotemporal graph neural networks (ST-GNNs) with Shapley values to identify and interpret traffic breakdown precursors. By extending Shapley explanation methods to a spatiotemporal setting, our proposed method bridges the gap between black-box neural network predictions and interpretable causes. We demonstrate the method on the Interstate-24 data, and identify that road topology and abrupt braking are major factors that lead to traffic breakdowns.
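As background on the Shapley values mentioned in this abstract, the sketch below computes them exactly for a tiny toy game by averaging each player's marginal contribution over all orderings. It does not reproduce the paper's spatiotemporal extension; the feature names and the `risk` function are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for p in ordering:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def risk(coalition):
    """Hypothetical breakdown-risk score given which features are 'present'."""
    score = 0.0
    if "braking" in coalition:
        score += 0.4
    if "merge" in coalition:
        score += 0.2
    if "braking" in coalition and "merge" in coalition:
        score += 0.2  # interaction term, split equally between the two features
    return score

features = ["braking", "merge", "weather"]
attribution = shapley_values(features, risk)
print(attribution)
```

The attributions sum to the score of the full feature set, which is the efficiency property that makes Shapley values attractive for explaining a single prediction. Exact enumeration grows factorially with the number of players, so practical explainers rely on sampling or model-specific approximations.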
- [74] arXiv:2504.17110 [pdf, other]
-
Title: An Entropy Stable Formulation of Two-equation Turbulence Models with Particular Reference to the k-epsilon Model
Comments: 50 pages, 13 figures
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Consistency and stability are two essential ingredients in the design of numerical algorithms for partial differential equations. Robust algorithms can be developed by incorporating nonlinear physical stability principles in their design, such as the entropy production inequality (i.e., the Clausius-Duhem inequality or second law of thermodynamics), rather than by simply adding artificial viscosity (a common approach). This idea is applied to the k-epsilon and two-equation turbulence models by introducing space-time averaging. Then, a set of entropy variables can be defined which leads to a symmetric system of advective-diffusive equations. Positivity and symmetry of the equations require certain constraints on the turbulence diffusivity coefficients and the turbulence source terms. With these, we are able to design entropy producing two-equation turbulence models and, in particular, the k-epsilon model.
- [75] arXiv:2504.17111 [pdf, html, other]
-
Title: Transferring Spatial Filters via Tangent Space Alignment in Motor Imagery BCIs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
We propose a method to improve subject transfer in motor imagery BCIs by aligning covariance matrices on a Riemannian manifold, followed by computing a new common spatial patterns (CSP) based spatial filter. We explore various ways to integrate information from multiple subjects and show improved performance compared to standard CSP. Across three datasets, our method shows marginal improvements over standard CSP; however, when training data are limited, the improvements become more significant.
- [76] arXiv:2504.17113 [pdf, html, other]
-
Title: Cybernetic Governance in a Coliving House
Comments: 19 pages, 5 figures, earlier working version at this https URL
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
We report an 18-month field experiment in distributed digital institutions: a nine-bedroom Los Angeles coliving house that runs without managers, while sustaining 98% occupancy and below-market rents.
Drawing on Elinor Ostrom's commons theory, we outline design principles and three digital mechanisms that form the institutional core: 1) A continuous-auction chore scheduler turns regenerative labor into a time-indexed points market; residents meet a 100-point monthly obligation by claiming tasks whose value rises linearly with neglect. 2) A pairwise-preference layer lets participants asynchronously reprioritize tasks, translating meta-governance into low-cognition spot inputs. 3) A symbolic "hearts" ledger tracks norm compliance through automated enforcement, lightweight challenges, and peer-awarded karma. Together, these mechanisms operationalize cybernetic principles--human sensing, machine bookkeeping, real-time feedback--while minimizing dependence on privileged roles.
Our exploratory data (567 chore claims, 255 heart events, and 551 group purchases) show that such tooling can sustain reliable commons governance without continuous leadership, offering a transferable design palette for online communities, coliving houses, and other digitally mediated collectives.
- [77] arXiv:2504.17117 [pdf, html, other]
-
Title: AI for Accessible Education: Personalized Audio-Based Learning for Blind Students
Comments: 4 pages, CHI 2025 Workshop on Augmented Educators and AI: Shaping the Future of Human and AI Cooperation in Learning
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Blind and visually impaired (BVI) students face significant challenges in traditional educational settings. While screen readers and braille materials offer some accessibility, they often lack interactivity and real-time adaptability to individual learning needs. This paper presents Audemy, an AI-powered audio-based learning platform designed to provide personalized, accessible, and engaging educational experiences for BVI students. Audemy uses adaptive learning techniques to customize content based on student accuracy, pacing preferences, and engagement patterns. The platform has been iteratively developed with input from over 20 educators specializing in accessibility and currently serves over 2,000 BVI students. Educator insights show key considerations for accessible AI, including the importance of engagement, intuitive design, compatibility with existing assistive technologies, and the role of positive reinforcement in maintaining student motivation. Beyond accessibility, this paper explores the ethical implications of AI in education, emphasizing data privacy, security, and transparency. Audemy demonstrates how AI can empower BVI students with personalized and equitable learning opportunities, advancing the broader goal of inclusive education.
- [78] arXiv:2504.17118 [pdf, html, other]
-
Title: Path Integral Methods for Synthesizing and Preventing Stealthy Attacks in Nonlinear Cyber-Physical Systems
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT)
This paper studies the synthesis and mitigation of stealthy attacks in nonlinear cyber-physical systems (CPS). To quantify stealthiness, we employ the Kullback-Leibler (KL) divergence, a measure rooted in hypothesis testing and detection theory, which captures the trade-off between an attacker's desire to remain stealthy and her goal of degrading system performance. First, we synthesize the worst-case stealthy attack in nonlinear CPS using the path integral approach. Second, we consider how a controller can mitigate the impact of such stealthy attacks by formulating a minimax KL control problem, yielding a zero-sum game between the attacker and the controller. Again, we leverage a path integral-based solution that computes saddle-point policies for both players through Monte Carlo simulations. We validate our approach using unicycle navigation and cruise control problems, demonstrating how an attacker can covertly drive the system into unsafe regions, and how the controller can adapt her policy to combat the worst-case attacks.
- [79] arXiv:2504.17119 [pdf, html, other]
-
Title: The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Comments: 35 pages, 7 tables, 5 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Despite substantial progress in healthcare applications driven by large language models (LLMs), concerns around data privacy and limited resources are growing; small language models (SLMs) offer a scalable and clinically viable solution for efficient performance in resource-constrained environments for next-generation healthcare informatics. Our comprehensive survey presents a taxonomic framework to identify and categorize them for healthcare professionals and informaticians. The timeline of healthcare SLM contributions establishes a foundational framework for analyzing models across three dimensions: NLP tasks, stakeholder roles, and the continuum of care. We present a taxonomic framework to identify the architectural foundations for building models from scratch; adapting SLMs to clinical precision through prompting, instruction fine-tuning, and reasoning; and accessibility and sustainability through compression techniques. Our primary objective is to offer a comprehensive survey for healthcare professionals, introducing recent innovations in model optimization and equipping them with curated resources to support future research and development in the field. Aiming to showcase the groundbreaking advancements in SLMs for healthcare, we present a comprehensive compilation of experimental results across widely studied NLP tasks in healthcare to highlight the transformative potential of SLMs in healthcare. The updated repository is available on GitHub.
- [80] arXiv:2504.17121 [pdf, html, other]
-
Title: Evaluating Argon2 Adoption and Effectiveness in Real-World Software
Comments: 22 pages, 4 figures, 6 tables. Submitted to ARES 2025 conference
Subjects: Cryptography and Security (cs.CR)
Modern password hashing remains a critical defense against credential cracking, yet the transition from theoretically secure algorithms to robust real-world implementations remains fraught with challenges. This paper presents a dual analysis of Argon2, the Password Hashing Competition winner, combining attack simulations quantifying how parameter configurations impact guessing costs under realistic budgets, with the first large-scale empirical study of Argon2 adoption across public GitHub software repositories. Our economic model, validated against cryptocurrency mining benchmarks, demonstrates that OWASP's recommended 46 MiB configuration reduces compromise rates by 42.5% compared to SHA-256 at \$1/account attack budgets for strong user passwords. However, memory-hardness exhibits diminishing returns: increasing the allocation to RFC 9106's 2048 MiB provides just 23.3% (\$1) and 17.7% (\$20) additional protection despite 44.5 times greater memory demands. Crucially, both configurations fail to mitigate risks from weak passwords, with 96.9-99.8% compromise rates for RockYou-like credentials regardless of algorithm choice. Our repository analysis shows accelerating Argon2 adoption, yet weak configuration practices: 46.6% of deployments use weaker-than-OWASP parameters. Surprisingly, sensitive applications (password managers, encryption tools) show no stronger configurations than general software. Our findings highlight that a secure algorithm alone cannot ensure security; effective parameter guidance and developer education remain essential for realizing Argon2's theoretical advantages.
- [81] arXiv:2504.17128 [pdf, html, other]
-
Title: PACE: A Framework for Learning and Control in Linear Incomplete-Information Differential Games
Comments: Accepted to 7th Annual Conference on Learning for Dynamics and Control (L4DC) 2025. Camera-ready version using the official PMLR template. The full version including appendix and proofs
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
In this paper, we address the problem of a two-player linear quadratic differential game with incomplete information, a scenario commonly encountered in multi-agent control, human-robot interaction (HRI), and approximation methods for solving general-sum differential games. While solutions to such linear differential games are typically obtained through coupled Riccati equations, the complexity increases when agents have incomplete information, particularly when neither is aware of the other's cost function. To tackle this challenge, we propose a model-based Peer-Aware Cost Estimation (PACE) framework for learning the cost parameters of the other agent. In PACE, each agent treats its peer as a learning agent rather than a stationary optimal agent, models their learning dynamics, and leverages this dynamic to infer the cost function parameters of the other agent. This approach enables agents to infer each other's objective function in real time based solely on their previous state observations and dynamically adapt their control policies. Furthermore, we provide a theoretical guarantee for the convergence of parameter estimation and the stability of system states in PACE. Additionally, in our numerical studies, we demonstrate how modeling the learning dynamics of the other agent benefits PACE, compared to approaches that approximate the other agent as having complete information, particularly in terms of stability and convergence speed.
- [82] arXiv:2504.17129 [pdf, html, other]
-
Title: Peer-Aware Cost Estimation in Nonlinear General-Sum Dynamic Games for Mutual Learning and Intent Inference
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Robotics (cs.RO)
Human-robot interactions can be modeled as incomplete-information general-sum dynamic games since the objective functions of both agents are not explicitly known to each other. However, solving for equilibrium policies for such games presents a major challenge, especially if the games involve nonlinear underlying dynamics. To simplify the problem, existing work often assumes that one agent is an expert with complete information about its peer, which can lead to biased estimates and failures in coordination. To address this challenge, we propose a nonlinear peer-aware cost estimation (N-PACE) algorithm for general-sum dynamic games. In N-PACE, using iterative linear quadratic (LQ) approximation of the nonlinear general-sum game, each agent explicitly models the learning dynamics of its peer agent while inferring their objective functions, leading to unbiased fast learning in inferring the unknown objective function of the peer agent, which is critical for task completion and safety assurance. Additionally, we demonstrate how N-PACE enables intent communication in such multi-agent systems by explicitly modeling the peer's learning dynamics.
- [83] arXiv:2504.17130 [pdf, html, other]
-
Title: Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal--compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector.
- [84] arXiv:2504.17132 [pdf, html, other]
-
Title: Latent Video Dataset Distillation
Comments: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase.
- [85] arXiv:2504.17137 [pdf, html, other]
-
Title: MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
Comments: Accepted to NAACL2025 Findings
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at this https URL.
- [86] arXiv:2504.17139 [pdf, html, other]
-
Title: Opt-ODENet: A Neural ODE Framework with Differentiable QP Layers for Safe and Stable Control Design (longer version)
Comments: 19 pages
Subjects: Systems and Control (eess.SY)
Designing controllers that achieve task objectives while ensuring safety is a key challenge in control systems. This work introduces Opt-ODENet, a Neural ODE framework with a differentiable Quadratic Programming (QP) optimization layer to enforce constraints as hard requirements. Eliminating the reliance on nominal controllers or large datasets, our framework solves the optimal control problem directly using Neural ODEs. Stability and convergence are ensured through Control Lyapunov Functions (CLFs) in the loss function, while Control Barrier Functions (CBFs) embedded in the QP layer enforce real-time safety. By integrating the differentiable QP layer with Neural ODEs, we demonstrate compatibility with the adjoint method for gradient computation, enabling the learning of the CBF class-$\mathcal{K}$ function and control network parameters. Experiments validate its effectiveness in balancing safety and performance.
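To make the idea of a CBF-constrained QP concrete, the sketch below solves the simplest case in closed form: minimize the squared distance to a nominal input subject to a single linear constraint. This is a simplification and not the paper's differentiable QP layer. The function name `safety_filter`, the single-integrator system, the barrier $h(x) = x_{\max} - x$, and the fixed linear class-$\mathcal{K}$ function with gain `alpha` are all assumptions made for illustration.

```python
def safety_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a . u >= b (one linear constraint).

    The minimizer is the Euclidean projection of u_nom onto the half-space.
    """
    def dot(x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

    norm_sq = dot(a, a)
    if norm_sq == 0.0:
        raise ValueError("constraint vector a must be nonzero")
    violation = b - dot(a, u_nom)
    if violation <= 0.0:
        return list(u_nom)  # the nominal input already satisfies the constraint
    scale = violation / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]

# Assumed toy system: dx/dt = u, safe set h(x) = x_max - x >= 0.
# CBF condition dh/dt >= -alpha * h becomes -u >= -alpha * (x_max - x).
x, x_max, alpha = 0.9, 1.0, 2.0
a = [-1.0]
b = -alpha * (x_max - x)

u_safe = safety_filter([1.0], a, b)     # aggressive nominal input gets clipped
u_unchanged = safety_filter([0.1], a, b)  # gentle nominal input passes through
print(u_safe, u_unchanged)
```

With several constraints or input bounds there is no such one-line projection, which is one reason a numerical QP solver is used inside the network in settings like the one the abstract describes.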
- [87] arXiv:2504.17140 [pdf, html, other]
-
Title: Scalable Permutation-Aware Modeling for Temporal Set Prediction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Temporal set prediction involves forecasting the elements that will appear in the next set, given a sequence of prior sets, each containing a variable number of elements. Existing methods often rely on intricate architectures with substantial computational overhead, which hampers their scalability. In this work, we introduce a novel and scalable framework that leverages permutation-equivariant and permutation-invariant transformations to efficiently model set dynamics. Our approach significantly reduces both training and inference time while maintaining competitive performance. Extensive experiments on multiple public benchmarks show that our method achieves results on par with or superior to state-of-the-art models across several evaluation metrics. These results underscore the effectiveness of our model in enabling efficient and scalable temporal set prediction.
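The two properties named in this abstract can be illustrated with a very small example. A permutation-invariant function returns the same output however the set's elements are ordered, while a permutation-equivariant function reorders its outputs in the same way its inputs were reordered. The functions below are generic toy examples with hypothetical names, not the paper's architecture.

```python
def invariant_pool(elements):
    """Permutation-invariant: sums per-element features (x, x^2)."""
    return (sum(elements), sum(x * x for x in elements))

def equivariant_layer(elements):
    """Permutation-equivariant: each output mixes its own element
    with a symmetric summary (the total) of the whole set."""
    total = sum(elements)
    return [2 * x + total for x in elements]

original = [3, 1, 2]
order = [2, 0, 1]                       # a reordering of the indices
shuffled = [original[i] for i in order]  # [2, 3, 1]

pooled_original = invariant_pool(original)
pooled_shuffled = invariant_pool(shuffled)

layer_original = equivariant_layer(original)
layer_shuffled = equivariant_layer(shuffled)

print(pooled_original, pooled_shuffled)
print(layer_original, layer_shuffled)
```

Integer inputs are used so that the equality checks are exact; with floating-point features, sums taken in different orders can differ in the last bits.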
- [88] arXiv:2504.17146 [pdf, html, other]
-
Title: Utilizing Dynamic Time Warping for Pandemic Surveillance: Understanding the Relationship between Google Trends Network Metrics and COVID-19 IncidencesComments: Submitted and currently under review at the IEEE AMLDS 2025Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
The use of network statistics derived from Google Trends data to foresee COVID-19 disease progression is gaining momentum in infodemiology. This approach was applied in Metro Manila, National Capital Region, Philippines. Through dynamic time warping (DTW), the temporal alignment between network metrics and COVID-19 case trajectories was quantified, and 320 parameter configurations were systematically explored, including two network metrics (network density and clustering coefficient), two data preprocessing methods (Rescaling Daily Data and MSV), multiple thresholds, two correlation window sizes, and Sakoe-Chiba band constraints. Results from the Kruskal-Wallis tests revealed that five of the six parameters significantly influenced alignment quality, with the disease comparison type (active cases vs. confirmed cases) demonstrating the strongest effect. The optimal configuration, which uses the network density statistic with a Rescaling Daily Data transformation, a threshold of 0.8, a 15-day window, and a 50-day radius constraint, achieved a DTW score of 36.30, indicating substantial temporal alignment with the COVID-19 confirmed cases data. The findings demonstrate that network metrics derived from online search behavior can serve as complementary indicators for epidemic surveillance in urban locations like Metro Manila. This strategy leverages the Philippines' extensive online usage during the pandemic to provide potentially valuable early signals of disease spread, and offers a supplementary tool for public health monitoring in resource-limited situations.
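For readers unfamiliar with the technique, the following is a minimal sketch of DTW with a Sakoe-Chiba band, assuming an absolute-difference local cost. The function name `dtw_distance` and the toy series are hypothetical. The study's actual preprocessing, cost function and scoring are not specified in the abstract.

```python
import math


def dtw_distance(a, b, radius):
    """DTW cost between sequences a and b, restricted to cells with |i - j| <= radius."""
    n, m = len(a), len(b)
    r = max(radius, abs(n - m))  # widen the band so the end cell stays reachable
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Toy series: the second is the first delayed by one step.
metric = [0, 1, 2, 3, 2, 1]
cases = [0, 0, 1, 2, 3, 2]
no_warp = dtw_distance(metric, cases, radius=0)   # band of 0 forces point-by-point matching
one_step = dtw_distance(metric, cases, radius=1)  # band of 1 can absorb the one-step lag
```

A lower cost means closer alignment. The radius bounds how far apart in time two matched points may be, which is the role the 50-day radius constraint plays in the configuration reported above.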
- [89] arXiv:2504.17150 [pdf, html, other]
-
Title: DashGuide: Authoring Interactive Dashboard Tours for Guiding Dashboard UsersSubjects: Human-Computer Interaction (cs.HC)
Dashboard guidance helps dashboard users better navigate interactive features, understand the underlying data, and assess insights they can potentially extract from dashboards. However, authoring dashboard guidance is a time consuming task, and embedding guidance into dashboards for effective delivery is difficult to realize. In this work, we contribute DashGuide, a framework and system to support the creation of interactive dashboard guidance with minimal authoring input. Given a dashboard and a communication goal, DashGuide captures a sequence of author-performed interactions to generate guidance materials delivered as playable step-by-step overlays, a.k.a., dashboard tours. Authors can further edit and refine individual tour steps while receiving generative assistance. We also contribute findings from a formative assessment with 9 dashboard creators, which helped inform the design of DashGuide; and findings from an evaluation of DashGuide with 12 dashboard creators, suggesting it provides an improved authoring experience that balances efficiency, expressiveness, and creative freedom.
- [90] arXiv:2504.17156 [pdf, other]
-
Title: Waveform-Logmel Audio Neural Networks for Respiratory Sound ClassificationSubjects: Sound (cs.SD)
Auscultatory analysis using an electronic stethoscope has attracted increasing attention in the clinical diagnosis of respiratory diseases. Recently, neural networks have been applied with some success to assist in respiratory sound classification. However, the task remains challenging due to the scarcity of abnormal respiratory sounds. In this paper, we propose a novel architecture, namely Waveform-Logmel audio neural networks (WLANN), which uses both the waveform and the log-mel spectrogram as input features and uses Bidirectional Gated Recurrent Units (Bi-GRU) to model the context of the fused features. Experimental results of WLANN on the SPRSound respiratory dataset show that the proposed framework can effectively distinguish pathological respiratory sound classes, outperforming previous studies, with 90.3% sensitivity and a total score of 93.6%. Our study demonstrates the high effectiveness of WLANN in the diagnosis of respiratory diseases.
- [91] arXiv:2504.17160 [pdf, html, other]
-
Title: OUI Need to Talk About Weight Decay: A New Perspective on Overfitting DetectionComments: 10 pages, 3 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at this https URL.
- [92] arXiv:2504.17162 [pdf, other]
-
Title: A Comprehensive Review on RNA Subcellular Localization PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Subcellular Processes (q-bio.SC)
The subcellular localization of RNAs, including long non-coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time-consuming, resource-demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large-scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI-based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence-based, image-based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.
- [93] arXiv:2504.17163 [pdf, html, other]
-
Title: PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion RecognitionComments: The source code will be publicly available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
- [94] arXiv:2504.17164 [pdf, html, other]
-
Title: Range and Topology Mutation Based Wireless AgilitySubjects: Emerging Technologies (cs.ET)
In this paper, we present formal foundations for two wireless agility techniques: (1) Random Range Mutation (RNM) that allows for periodic changes of AP coverage range randomly, and (2) Random Topology Mutation (RTM) that allows for random motion and placement of APs in the wireless infrastructure. The goal of these techniques is to proactively defend against targeted attacks (e.g., DoS and eavesdropping) by forcing the wireless clients to change their AP association randomly. We apply Satisfiability Modulo Theories (SMT) and Answer Set Programming (ASP) based constraint solving methods that allow for optimizing wireless AP mutation while maintaining service requirements including coverage, security and energy properties under incomplete information about the adversary strategies. Our evaluation validates the feasibility, scalability, and effectiveness of the formal methods based technical approaches.
- [95] arXiv:2504.17170 [pdf, html, other]
-
Title: Improving Human-Autonomous Vehicle Interaction in Complex SystemsComments: PhD Dissertation from University of California, San Diego; 175 pagesSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Unresolved questions about how autonomous vehicles (AVs) should meet the informational needs of riders hinder real-world adoption. Complicating our ability to satisfy rider needs is that different people, goals, and driving contexts have different criteria for what constitutes interaction success. Unfortunately, most human-AV research and design today treats all people and situations uniformly. It is crucial to understand how an AV should communicate to meet rider needs, and how communications should change when the human-AV complex system changes. I argue that understanding the relationships between different aspects of the human-AV system can help us build improved and adaptable AV communications. I support this argument using three empirical studies. First, I identify optimal communication strategies that enhance driving performance, confidence, and trust for learning in extreme driving environments. Findings highlight the need for task-sensitive, modality-appropriate communications tuned to learner cognitive limits and goals. Next, I highlight the consequences of deploying faulty communication systems and demonstrate the need for context-sensitive communications. Third, I use machine learning (ML) to illuminate personal factors predicting trust in AVs, emphasizing the importance of tailoring designs to individual traits and concerns. Together, this dissertation supports the necessity of transparent, adaptable, and personalized AV systems that cater to individual needs, goals, and contextual demands. By considering the complex system within which human-AV interactions occur, we can deliver valuable insights for designers, researchers, and policymakers. This dissertation also provides a concrete domain to study theories of human-machine joint action and situational awareness, and can be used to guide future human-AI interaction research. [shortened for arxiv]
- [96] arXiv:2504.17171 [pdf, html, other]
-
Title: Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible CommunicationSubjects: Human-Computer Interaction (cs.HC)
This paper introduces an augmented reality (AR) captioning framework designed to support Deaf and Hard of Hearing (DHH) learners in STEM classrooms by integrating non-verbal emotional cues into live transcriptions. Unlike conventional captioning systems that offer only plain text, our system fuses real-time speech recognition with affective and visual signal interpretation, including facial movements, gestures, and vocal tone, to produce emotionally enriched captions. These enhanced captions are rendered in an AR interface developed with Unity and provide contextual annotations such as speaker tone markers (e.g., "concerned") and gesture indicators (e.g., "nods"). The system leverages live camera and microphone input, processed through AI models to detect multimodal cues. Findings from preliminary evaluations suggest that this AR-based captioning approach significantly enhances comprehension and reduces cognitive effort compared to standard captions. Our work emphasizes the potential of immersive environments for inclusive, emotion-aware educational accessibility.
- [97] arXiv:2504.17173 [pdf, html, other]
-
Title: Lessons from Deploying Learning-based CSI Localization on a Large-Scale ISAC PlatformSubjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
In recent years, Channel State Information (CSI), recognized for its fine-grained spatial characteristics, has attracted increasing attention in WiFi-based indoor localization. However, despite its potential, CSI-based approaches have yet to achieve the same level of deployment scale and commercialization as those based on Received Signal Strength Indicator (RSSI). A key limitation lies in the fact that most existing CSI-based systems are developed and evaluated in controlled, small-scale environments, limiting their generalizability. To bridge this gap, we explore the deployment of a large-scale CSI-based localization system involving over 400 Access Points (APs) in a real-world building under the Integrated Sensing and Communication (ISAC) paradigm. We highlight two critical yet often overlooked factors: the underutilization of unlabeled data and the inherent heterogeneity of CSI measurements. To address these challenges, we propose a novel CSI-based learning framework for WiFi localization, tailored for large-scale ISAC deployments on the server side. Specifically, we employ a novel graph-based structure to model heterogeneous CSI data and reduce redundancy. We further design a pretext pretraining task that incorporates spatial and temporal priors to effectively leverage large-scale unlabeled CSI data. Complementarily, we introduce a confidence-aware fine-tuning strategy to enhance the robustness of localization results. In a leave-one-smartphone-out experiment spanning five floors and 25,600 m², we achieve a median localization error of 2.17 meters and a floor accuracy of 99.49%. This performance corresponds to an 18.7% reduction in mean absolute error (MAE) compared to the best-performing baseline.
- [98] arXiv:2504.17177 [pdf, html, other]
-
Title: A Genealogy of Multi-Sensor Foundation Models in Remote SensingComments: 20 pages, submitted to ACM SigSpatial, currently under peer reviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Foundation models have garnered increasing attention for representation learning in remote sensing, primarily adopting approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches that each come with significant benefits and drawbacks. This paper examines these approaches along with their roots in the computer vision field in order to characterize potential advantages and pitfalls while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We place emphasis on the multi-sensor aspect of Earth observations, and the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
- [99] arXiv:2504.17178 [pdf, html, other]
-
Title: How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and PracticeComments: Accepted by SIGMOD 2025Subjects: Databases (cs.DB)
LSM-tree based key-value stores are widely adopted as the data storage backend in modern big data applications. The LSM-tree grows with data ingestion, by either adding levels with fixed level capacities (dubbed the vertical scheme) or increasing level capacities with a fixed number of levels (dubbed the horizontal scheme). The vertical scheme leads the trend in recent system designs in RocksDB, LevelDB, and WiredTiger, whereas adoption of the horizontal scheme in industry is declining. The growth scheme profoundly impacts LSM system performance in various aspects such as read, write and space costs. This paper attempts to give a new insight into a fundamental design question -- how to grow an LSM-tree to attain more desirable performance?
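The difference between the two growth schemes described above can be sketched with a deliberately simplified capacity model. It assumes, for illustration only, that level i holds `buffer_size * ratio**i` entries, and it ignores compaction policy, Bloom filters and everything else a real engine does. The function names are hypothetical and are not taken from the paper or from RocksDB.

```python
def vertical_levels(n_entries, buffer_size, size_ratio):
    """Vertical growth: capacities are fixed; levels are added until the data fits."""
    capacities = []
    while sum(capacities) < n_entries:
        capacities.append(buffer_size * size_ratio ** (len(capacities) + 1))
    return capacities


def horizontal_levels(n_entries, buffer_size, num_levels):
    """Horizontal growth: the level count is fixed; the size ratio is raised until the data fits."""
    ratio = 2
    while True:
        capacities = [buffer_size * ratio ** (i + 1) for i in range(num_levels)]
        if sum(capacities) >= n_entries:
            return ratio, capacities
        ratio += 1


# Growing from 10 thousand to 1 million entries under each scheme.
small_v = vertical_levels(10_000, buffer_size=10, size_ratio=10)
large_v = vertical_levels(1_000_000, buffer_size=10, size_ratio=10)
small_ratio, _ = horizontal_levels(10_000, buffer_size=10, num_levels=2)
large_ratio, _ = horizontal_levels(1_000_000, buffer_size=10, num_levels=2)
```

Under this toy model the vertical scheme goes from 3 to 5 levels while keeping its size ratio at 10, whereas the horizontal scheme stays at 2 levels and its size ratio rises from 32 to 316.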
Our analysis highlights the limitations of the vertical scheme in achieving an optimal read-write trade-off and the horizontal scheme in managing space cost effectively. Building on the analysis, we present a novel approach, Vertiorizon, which combines the strengths of both the vertical and horizontal schemes to achieve a superior balance between lookup, update, and space costs. Its adaptive design makes it highly compatible with a wide spectrum of workloads. Compared to the vertical scheme, Vertiorizon significantly improves the read-write performance trade-off. In contrast to the horizontal scheme, Vertiorizon greatly extends the trade-off range by a non-trivial generalization of Bentley and Saxe's theory, while substantially reducing space costs. When integrated with RocksDB, Vertiorizon demonstrates better write performance than the vertical scheme, while incurring about six times less additional space cost compared to the horizontal scheme.
- [100] arXiv:2504.17179 [pdf, html, other]
-
Title: AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion ModelsComments: 8 pages, 10 figures. Accepted to IEEE Conference on Artificial Intelligence (CAI), 2025Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately detect objects and interpret their surroundings. However, even when trained using millions of miles of real-world data, AVs are often unable to detect rare failure modes (RFMs). The problem of RFMs is commonly referred to as the "long-tail challenge", due to the distribution of data including many instances that are very rarely seen. In this paper, we present a novel approach that utilizes advanced generative and explainable AI techniques to aid in understanding RFMs. Our methods can be used to enhance the robustness and reliability of AVs when combined with both downstream model training and testing. We extract segmentation masks for objects of interest (e.g., cars) and invert them to create environmental masks. These masks, combined with carefully crafted text prompts, are fed into a custom diffusion model. We leverage the Stable Diffusion inpainting model guided by adversarial noise optimization to generate images containing diverse environments designed to evade object detection models and expose vulnerabilities in AI systems. Finally, we produce natural language descriptions of the generated RFMs that can guide developers and policymakers to improve the safety and reliability of AV systems.
- [101] arXiv:2504.17180 [pdf, html, other]
-
Title: We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic FeedbackSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce \(\projectname\), a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that \(\projectname\) significantly enhances temporal and logical alignment across diverse prompts by almost $40\%$.
- [102] arXiv:2504.17181 [pdf, html, other]
-
Title: Evaluating Learned Query Performance Prediction Models at LinkedIn: Challenges, Opportunities, and FindingsSubjects: Databases (cs.DB)
Recent advancements in learning-based query performance prediction models have demonstrated remarkable efficacy. However, these models are predominantly validated using synthetic datasets focused on cardinality or latency estimations. This paper explores the application of these models to LinkedIn's complex real-world OLAP queries executed on Trino, addressing four primary research questions: (1) How do these models perform on real-world industrial data with limited information? (2) Can these models generalize to new tasks, such as CPU time prediction and classification? (3) What additional information available from the query plan could be utilized by these models to enhance their performance? (4) What are the theoretical performance limits of these models given the available data? To address these questions, we evaluate several models, including TLSTM, TCNN, QueryFormer, and XGBoost, against the industrial query workload at LinkedIn, and extend our analysis to CPU time regression and classification tasks. We also propose a multi-task learning approach to incorporate underutilized operator-level metrics that could enhance model understanding. Additionally, we empirically analyze the inherent upper bound that can be achieved from the models.
- [103] arXiv:2504.17185 [pdf, other]
-
Title: P$_\ell$-Kyber: Packing $\ell$ Plaintexts and Lattice Coding for KyberComments: 8 Tables, 1 FigureSubjects: Cryptography and Security (cs.CR); Information Theory (cs.IT)
In this work, we propose a joint design of encoding and encryption processes for KEMs like Kyber, without assuming the independence of the decoding noise entries. Our design features two techniques: ciphertext packing and lattice packing. First, we extend the Peikert-Vaikuntanathan-Waters (PVW) method to the Kyber: $\ell$ plaintexts are packed into a single ciphertext. This scheme is referred to as P$_\ell$-Kyber. We prove that the P$_\ell$-Kyber is IND-CCA secure under the M-LWE hardness assumption. We show that the decryption decoding noise entries across the $\ell$ plaintexts (also known as layers) are mutually independent. Second, we propose a cross-layer lattice encoding scheme for the P$_\ell$-Kyber, where every $\ell$ cross-layer information symbols are encoded to a lattice point. This way we obtain a \emph{coded} P$_\ell$-Kyber, where the decoding noise entries for each lattice point are mutually independent. Therefore, the decryption failure rate (DFR) analysis does not require the assumption of independence among the decryption decoding noise entries. Both DFR and communication cost (CER) are greatly decreased thanks to ciphertext packing and lattice packing. Finally, we demonstrate that with $\ell=24$ and Leech lattice encoder, the proposed coded P$_\ell$-KYBER1024 achieves DFR $<2^{-281}$ and CER $ = 4.6$, i.e., a decrease of CER by $90\%$ compared to KYBER1024.
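The reported 90% figure can be sanity-checked with a short calculation. This sketch assumes that CER denotes the ratio of ciphertext bits to plaintext bits, which the abstract does not spell out. It uses the standard Kyber1024 parameters: a 1568-byte ciphertext carrying a 256-bit message.

```python
# Standard Kyber1024 (ML-KEM-1024) sizes.
KYBER1024_CIPHERTEXT_BITS = 1568 * 8
KYBER1024_PLAINTEXT_BITS = 256

# Assumption: CER = ciphertext bits / plaintext bits.
baseline_cer = KYBER1024_CIPHERTEXT_BITS / KYBER1024_PLAINTEXT_BITS

reported_cer = 4.6  # value quoted in the abstract for the coded scheme with l = 24
relative_reduction = 1.0 - reported_cer / baseline_cer

print(f"baseline CER = {baseline_cer:.1f}, reduction = {relative_reduction:.1%}")
```

Under that assumption the baseline CER is 49, and moving to 4.6 is a reduction of a little over 90%, consistent with the figure quoted above.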
- [104] arXiv:2504.17186 [pdf, html, other]
-
Title: MAT-DiSMech: A Discrete Differential Geometry-based Computational Tool for Simulation of Rods, Shells, and Soft RobotsComments: Total 25 pages, 8 figures, open-source code available at this https URLSubjects: Robotics (cs.RO)
Accurate and efficient simulation tools are essential in robotics, enabling the visualization of system dynamics and the validation of control laws before committing resources to physical experimentation. Developing physically accurate simulation tools is particularly challenging in soft robotics, largely due to the prevalence of geometrically nonlinear deformation. A variety of robot simulators tackle this challenge by using simplified modeling techniques -- such as lumped mass models -- which lead to physical inaccuracies in real-world applications. On the other hand, high-fidelity simulation methods for soft structures, like finite element analysis, offer increased accuracy but lead to higher computational costs. In light of this, we present a Discrete Differential Geometry-based simulator that provides a balance between physical accuracy and computational speed. Building on an extensive body of research on rod and shell-based representations of soft robots, our tool provides a pathway to accurately model soft robots in a computationally tractable manner. Our open-source MATLAB-based framework is capable of simulating the deformations of rods, shells, and their combinations, primarily utilizing implicit integration techniques. The software design is modular for the user to customize the code, for example, add new external forces and impose custom boundary conditions. The implementations for prevalent forces encountered in robotics, including gravity, contact, kinetic and viscous friction, and aerodynamic drag, have been provided. We provide several illustrative examples that showcase the capabilities and validate the physical accuracy of the simulator. The open-source code is available at this https URL. We anticipate that the proposed simulator can serve as an effective digital twin tool, enhancing the Sim2Real pathway in soft robotics research.
- [105] arXiv:2504.17189 [pdf, other]
-
Title: Metadata Augmentation using NLP, Machine Learning and AI chatbots: A comparisonSubjects: Digital Libraries (cs.DL)
Recent advances in machine learning and artificial intelligence have provided more alternatives for the implementation of repetitive or monotonous tasks. However, the development of AI tools has not been straightforward, and use case exploration and workflow integration are still ongoing challenges. In this work, we present a detailed qualitative analysis of the performance and user experience of popular commercial AI chatbots when used for document classification with limited data. We report the results for a real-world example of metadata augmentation in an academic library environment. We compare the results of AI chatbots with other machine learning and natural language processing methods such as XGBoost and BERT-based fine tuning, and share insights from our experience. We found that AI chatbots perform similarly to one another while outperforming the machine learning methods we tested, showing their advantage when the method relies on local data for training. We also found that while working with AI chatbots is easier than with code, getting useful results from them still represents a challenge for the user. Furthermore, we encountered alarming conceptual errors in the output of some chatbots, such as not being able to count the number of lines of our inputs and explaining the mistake as "human error". Although this is not complete evidence that AI chatbots can be effectively used for metadata classification, we believe that the information provided in this work can be useful to librarians and data curators in developing pathways for the integration and use of AI tools for data curation or metadata augmentation tasks.
- [106] arXiv:2504.17192 [pdf, html, other]
-
Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine LearningSubjects: Computation and Language (cs.CL)
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
- [107] arXiv:2504.17194 [pdf, other]
-
Title: Developing a Blockchain-Based Secure Digital Contents Distribution SystemComments: 4 pages, 5 figuresSubjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
As digital content distribution expands rapidly through online platforms, securing digital media and protecting intellectual property has become increasingly complex. Traditional centralized systems, while widely adopted, suffer from vulnerabilities such as single points of failure and limited traceability of unauthorized access. This paper presents a blockchain-based secure digital content distribution system that integrates Sia, a decentralized storage network, and Skynet, a content delivery network, to enhance content protection and distribution. The proposed system employs a dual-layer architecture: off-chain for user authentication and on-chain for transaction validation using smart contracts and asymmetric encryption. By introducing a license issuance and secret block mechanism, the system ensures content authenticity, privacy, and controlled access. Experimental results demonstrate the feasibility and scalability of the system in securely distributing multimedia files. The proposed platform not only improves content security but also paves the way for future enhancements with decentralized applications and integrated royalty payment mechanisms.
- [108] arXiv:2504.17196 [pdf, html, other]
-
Title: A Double-Norm Aggregated Tensor Latent Factorization Model for Temporal-Aware Traffic Speed Imputation
Comments: 11 pages, 3 figures
Subjects: Machine Learning (cs.LG)
In intelligent transportation systems (ITS), traffic management departments rely on sensors, cameras, and GPS devices to collect real-time traffic data. Traffic speed data is often incomplete due to sensor failures, data transmission delays, or occlusions, resulting in missing speed data in certain road segments. Although tensor decomposition-based methods are extensively used, they mostly rely on the $L_2$-norm to construct their learning objectives, which reduces their robustness. To address this, we propose Temporal-Aware Traffic Speed Imputation (TATSI), which combines the $L_2$-norm and smooth $L_1$ (${SL}_1$)-norm in its loss function, thereby achieving both high accuracy and robust performance in imputing missing time-varying traffic speed data. TATSI adopts a single latent factor-dependent, nonnegative, and multiplicative update (SLF-NMU) approach, which serves as an efficient solver for performing nonnegative latent factor analysis (LFA) on a tensor. Empirical studies on three real-world time-varying traffic speed datasets demonstrate that, compared with state-of-the-art traffic speed predictors, TATSI more precisely captures temporal patterns, thereby yielding the most accurate imputations for missing traffic speed data.
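As a rough illustration of what combining an $L_2$ term with a smooth $L_1$ term can look like, the sketch below sums both penalties over the observed entries of a speed series. The mixing weight `alpha`, the threshold `beta`, and the use of `None` for missing readings are assumptions made for this example; the abstract does not state how TATSI weights the two norms, and the SLF-NMU solver is not shown.

```python
def smooth_l1(residual, beta=1.0):
    """Smooth L1 penalty: quadratic for |residual| < beta, linear beyond it."""
    a = abs(residual)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def combined_loss(observed, predicted, alpha=0.5, beta=1.0):
    """Weighted sum of squared-error and smooth-L1 terms.

    Only observed entries contribute; missing entries are marked None.
    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    total = 0.0
    for y, y_hat in zip(observed, predicted):
        if y is None:  # missing speed reading: excluded from the objective
            continue
        r = y - y_hat
        total += alpha * r * r + (1.0 - alpha) * smooth_l1(r, beta)
    return total
```

The linear tail of the smooth $L_1$ term grows more slowly than the squared term for large residuals, which is the usual motivation for mixing the two when outliers are expected.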
- [109] arXiv:2504.17198 [pdf, html, other]
-
Title: Automatically Generating Rules of Malicious Software Packages via Large Language Model
Comments: 14 pages, 11 figures
Journal-ref: the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2025
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Today's security tools predominantly rely on predefined rules crafted by experts, making them poorly adapted to the emergence of software supply chain attacks. To tackle this limitation, we propose a novel tool, RuleLLM, which leverages large language models (LLMs) to automate rule generation for OSS ecosystems. RuleLLM extracts metadata and code snippets from malware as its input, producing YARA and Semgrep rules that can be directly deployed in software development. Specifically, the rule generation task involves three subtasks: crafting rules, refining rules, and aligning rules. To validate RuleLLM's effectiveness, we implemented a prototype system and conducted experiments on a dataset of 1,633 malicious packages. The results are promising: RuleLLM generated 763 rules (452 YARA and 311 Semgrep) with a precision of 85.2% and a recall of 91.8%, outperforming state-of-the-art (SOTA) tools and score-based approaches. We further analyzed the generated rules and proposed a rule taxonomy with 11 categories and 38 subcategories.
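For readers who want a single summary number, the reported precision and recall can be combined into an F1 score (their harmonic mean). The abstract does not report F1; the value computed below is derived arithmetic from the two stated figures only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract (85.2% and 91.8%).
print(round(f1_score(0.852, 0.918), 3))
```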
- [110] arXiv:2504.17200 [pdf, html, other]
-
Title: A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Authors: Yangxinyu Xie, Bowen Jiang, Tanwi Mallick, Joshua David Bergerson, John K. Hutchison, Duane R. Verner, Jordan Branham, M. Ross Alexander, Robert B. Ross, Yan Feng, Leslie-Anne Levy, Weijie Su, Camillo J. Taylor
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through an RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.
- [111] arXiv:2504.17201 [pdf, html, other]
-
Title: Simultaneous Collision Detection and Force Estimation for Dynamic Quadrupedal Locomotion
Subjects: Robotics (cs.RO)
In this paper we address the simultaneous collision detection and force estimation problem for quadrupedal locomotion using joint encoder information and the robot dynamics only. We design an interacting multiple-model Kalman filter (IMM-KF) that estimates the external force exerted on the robot and multiple possible contact modes. The method is invariant to any gait pattern design. Our approach leverages pseudo-measurement information of the external forces based on the robot dynamics and encoder information. Based on the estimated contact mode and external force, we design a reflex motion and an admittance controller for the swing leg to avoid collisions by adjusting the leg's reference motion. Additionally, we implement a force-adaptive model predictive controller to enhance balancing. Simulation ablation studies and experiments show the efficacy of the approach.
- [112] arXiv:2504.17203 [pdf, html, other]
-
Title: High-Fidelity And Complex Test Data Generation For Real-World SQL Code Generation Services
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low fidelity and an inability to model the complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically "meaningful" mock data for complex schemas that include columns with nested structures, which we frequently encounter in Google SQL code generation workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex schemas, as well as the lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate realistic high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the test targets (SQL queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant services. Our results demonstrate the practical utility of out-of-the-box LLM-based (gemini) test data generation for industrial SQL code generation services where generating realistic test data is essential due to the frequent unavailability of production datasets.
- [113] arXiv:2504.17204 [pdf, html, other]
-
Title: Factually: Exploring Wearable Fact-Checking for Augmented Truth Discernment
Authors: Chitralekha Gupta, Hanjun Wu, Praveen Sasikumar, Shreyas Sridhar, Priambudi Bagaskara, Suranga Nanayakkara
Comments: Presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning, Report Number: CHI25-WS-AUGMENTED-REASONING
Journal-ref: Proceedings of the 2025 ACM CHI Workshop on Human-AI Interaction for Augmented Reasoning
Subjects: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
Wearable devices are transforming human capabilities by seamlessly augmenting cognitive functions. In this position paper, we propose a voice-based, interactive learning companion designed to amplify and extend cognitive abilities through informal learning. Our vision is threefold: (1) to enable users to discover new knowledge on-the-go through contextual interactive quizzes, fostering critical thinking and mindfulness, (2) to proactively detect misinformation, empowering users to critically assess information in real time, and (3) to provide spoken language correction and prompting hints for second language learning and effective communication. As an initial step toward this vision, we present Factually - a proactive, wearable fact-checking system integrated into devices like smartwatches or rings. Factually discreetly alerts users to potential falsehoods via vibrotactile feedback, helping them assess information critically. We demonstrate its utility through three illustrative scenarios, highlighting its potential to extend cognitive abilities for real-time misinformation detection. Early qualitative feedback suggests that Factually can enhance users' fact-checking capabilities, offering both practical and experiential benefits.
- [114] arXiv:2504.17207 [pdf, other]
-
Title: Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
- [115] arXiv:2504.17210 [pdf, html, other]
-
Title: Synthetic Power Flow Data Generation Using Physics-Informed Denoising Diffusion Probabilistic Models
Comments: Submitted to IEEE SmartGridComm Conference 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Many data-driven modules in smart grid rely on access to high-quality power flow data; however, real-world data are often limited due to privacy and operational constraints. This paper presents a physics-informed generative framework based on Denoising Diffusion Probabilistic Models (DDPMs) for synthesizing feasible power flow data. By incorporating auxiliary training and physics-informed loss functions, the proposed method ensures that the generated data exhibit both statistical fidelity and adherence to power system feasibility. We evaluate the approach on the IEEE 14-bus and 30-bus benchmark systems, demonstrating its ability to capture key distributional properties and generalize to out-of-distribution scenarios. Comparative results show that the proposed model outperforms three baseline models in terms of feasibility, diversity, and accuracy of statistical features. This work highlights the potential of integrating generative modelling into data-driven power system applications.
- [116] arXiv:2504.17211 [pdf, html, other]
-
Title: Breaking the Flow and the Bank: Stealthy Cyberattacks on Water Network Hydraulics
Subjects: Systems and Control (eess.SY); Cryptography and Security (cs.CR)
As water distribution networks (WDNs) become increasingly connected with digital infrastructures, they face greater exposure to cyberattacks that threaten their operational integrity. Stealthy False Data Injection Attacks (SFDIAs) are particularly concerning, as they manipulate sensor data to compromise system operations while avoiding detection. While existing studies have focused on either detection methods or specific attack formulations, the relationship between attack sophistication, system knowledge requirements, and achievable impact remains unexplored. This paper presents a systematic analysis of sensor attacks against WDNs, investigating different combinations of physical constraints, state monitoring requirements, and intrusion detection evasion conditions. We propose several attack formulations that range from tailored strategies satisfying both physical and detection constraints to simpler measurement manipulations. The proposed attacks are simple and local -- requiring knowledge only of targeted sensors and their hydraulic connections -- making them scalable and practical. Through case studies on Net1 and Net3 benchmark networks, we demonstrate how these attacks can persistently increase operational costs and alter water flows while remaining undetected by monitoring systems for extended periods. The analysis provides utilities with insights for vulnerability assessment and motivates the development of protection strategies that combine physical and statistical security mechanisms.
- [117] arXiv:2504.17213 [pdf, html, other]
-
Title: MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Even in the era of rapid advances in large models, video understanding, particularly of long videos, remains highly challenging. Compared with textual or image-based information, videos commonly contain more information with redundancy, requiring large models to strategically allocate attention at a global level for accurate comprehension. To address this, we propose MCAF, an agent-based, training-free framework that performs video understanding through Multimodal Coarse-to-fine Attention Focusing. The key innovation lies in its ability to sense and prioritize segments of the video that are highly relevant to the understanding task. First, MCAF hierarchically concentrates on highly relevant frames through multimodal information, enhancing the correlation between the acquired contextual information and the query. Second, it employs a dilated temporal expansion mechanism to mitigate the risk of missing crucial details when extracting information from these concentrated frames. In addition, our framework incorporates a self-reflection mechanism utilizing the confidence level of the model's responses as feedback. By iteratively applying these two creative focusing strategies, it adaptively adjusts attention to capture highly query-connected context and thus improves response accuracy. MCAF outperforms comparable state-of-the-art methods on average. On the EgoSchema dataset, it achieves a remarkable 5% performance gain over the leading approach. Meanwhile, on the Next-QA and IntentQA datasets, it outperforms the current state of the art by 0.2% and 0.3%, respectively. On the Video-MME dataset, which features videos averaging nearly an hour in length, MCAF also outperforms other agent-based methods.
- [118] arXiv:2504.17216 [pdf, html, other]
-
Title: Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives
Subjects: Robotics (cs.RO)
Learning grinding skills from human craftsmen via imitation learning has become a key research topic in robotic machining. Due to their strong generalization and robustness to external disturbances, Dynamical Movement Primitives (DMPs) offer a promising approach for robotic grinding skill learning. However, directly applying DMPs to grinding tasks faces challenges, such as low orientation accuracy, unsynchronized position-orientation-force, and limited generalization for surface trajectories. To address these issues, this paper proposes a robotic grinding skill learning method based on geodesic length DMPs (Geo-DMPs). First, a normalized 2D weighted Gaussian kernel and intrinsic mean clustering algorithm are developed to extract geometric features from multiple demonstrations. Then, an orientation manifold distance metric removes the time dependency in traditional orientation DMPs, enabling accurate orientation learning via Geo-DMPs. A synchronization encoding framework is further proposed to jointly model position, orientation, and force using a geodesic length-based phase function. This framework enables robotic grinding actions to be generated between any two surface points. Experiments on robotic chamfer grinding and free-form surface grinding validate that the proposed method achieves high geometric accuracy and generalization in skill encoding and generation. To our knowledge, this is the first attempt to use DMPs for jointly learning and generating grinding skills in position, orientation, and force on model-free surfaces, offering a novel path for robotic grinding.
- [119] arXiv:2504.17219 [pdf, html, other]
-
Title: Enhancing Variational Autoencoders with Smooth Robust Latent Encoding
Comments: Under review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degradation by the nature of trade-offs between performance and robustness. In this work, we challenge this presumption, introducing Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. In contrast to conventional adversarial training, which focuses on robustness only, our approach smooths the latent space via adversarial perturbations, promoting more generalizable representations while regularizing with originality representation to sustain original fidelity. Applied as a post-training step on pre-trained VAEs, SRL-VAE improves image robustness and fidelity with minimal computational overhead. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks. These results establish a new paradigm, showing that adversarial training, once thought to be detrimental to generative models, can instead enhance both fidelity and robustness.
- [120] arXiv:2504.17220 [pdf, other]
-
Title: Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution, transferring expertise from large teacher models to compact student models. This study systematically investigates knowledge distillation approaches for bundle generation, aiming to minimize computational demands while preserving performance. We explore three critical research questions: (1) how does the format of KD impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence performance? and (3) how do different ways of utilizing the distilled knowledge affect performance? We propose a comprehensive KD framework that (i) progressively extracts knowledge (patterns, rules, deep thoughts); (ii) captures varying quantities of distilled knowledge through different strategies; and (iii) exploits complementary LLM adaptation techniques (in-context learning, supervised fine-tuning, combination) to leverage distilled knowledge in small student models for domain-specific adaptation and enhanced efficiency. Extensive experiments provide valuable insights into how knowledge format, quantity, and utilization methodologies collectively shape LLM-based bundle generation performance, exhibiting KD's significant potential for more efficient yet effective LLM-based bundle generation.
- [121] arXiv:2504.17222 [pdf, other]
-
Title: Optimal Distribution of Solutions for Crowding Distance on Linear Pareto Fronts of Two-Objective Optimization Problems
Subjects: Neural and Evolutionary Computing (cs.NE)
Characteristics of an evolutionary multi-objective optimization (EMO) algorithm can be explained using its best solution set. For example, the best solution set for SMS-EMOA is the same as the optimal distribution of solutions for hypervolume maximization. For NSGA-III, if the Pareto front has intersection points with all reference lines, all of those intersection points are the best solution set. For MOEA/D, the best solution set is the set of the optimal solution of each sub-problem. Whereas these EMO algorithms can be analyzed in this manner, the best solution set for the most well-known and frequently-used EMO algorithm NSGA-II has not been discussed in the literature. This is because NSGA-II is not based on any clear criterion to be optimized (e.g., hypervolume maximization, distance minimization to the nearest reference line). As the first step toward the best solution set analysis for NSGA-II, we discuss the optimal distribution of solutions for the crowding distance under the simplest setting: the maximization of the minimum crowding distance on linear Pareto fronts of two-objective optimization problems. That is, we discuss the optimal distribution of solutions on a straight line. Our theoretical analysis shows that the uniformly distributed solutions are not the best solution set. However, it is also shown by computational experiments that the uniformly distributed solutions (except for the duplicated two extreme solutions at each edge of the Pareto front) are obtained from a modified NSGA-II with the ($\mu$ + 1) generation update scheme.
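To make the quantity under study concrete, the sketch below computes the standard NSGA-II crowding distance for a small set of points on the linear front $f_1 + f_2 = 1$. It follows the textbook definition (boundary solutions get infinite distance; interior solutions sum the normalized gaps between their sorted neighbors in each objective). The five example points are illustrative assumptions, not data from the paper, and the code does not search for the optimal distribution the abstract analyzes.

```python
import math


def crowding_distance(points):
    """Return the NSGA-II crowding distance of each point, in input order.

    points: list of equal-length tuples of objective values.
    """
    n = len(points)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(points[0])):
        # Indices of the points sorted by the k-th objective.
        order = sorted(range(n), key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        # Extreme solutions in each objective are always kept.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # degenerate objective: no spread to normalize by
        for j in range(1, n - 1):
            gap = points[order[j + 1]][k] - points[order[j - 1]][k]
            dist[order[j]] += gap / (hi - lo)
    return dist


# Five uniformly spaced solutions on the line f1 + f2 = 1.
front = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
print(crowding_distance(front))
```

Because each interior value depends only on a solution's two neighbors, moving one solution changes the distances of its neighbors rather than its own, which is part of why the max-min problem described above is not trivially solved by uniform spacing.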
- [122] arXiv:2504.17223 [pdf, html, other]
-
Title: Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
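The block-wise discrete cosine transform mentioned in component (1) can be sketched as follows. This is a plain, unoptimized orthonormal 2D DCT-II applied to non-overlapping blocks; the block size of 8 and the assumption that image dimensions are multiples of the block size are choices made for this example, and the cascaded multi-scale convolutions that follow in the paper's pipeline are not shown.

```python
import math


def dct2_block(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * s
    return out


def blockwise_dct(image, size=8):
    """Map each non-overlapping size x size block to its DCT coefficients.

    Returns a dict keyed by the (row, col) of each block's top-left pixel.
    Assumes image height and width are multiples of size.
    """
    h, w = len(image), len(image[0])
    coeffs = {}
    for r in range(0, h, size):
        for c in range(0, w, size):
            block = [row[c:c + size] for row in image[r:r + size]]
            coeffs[(r, c)] = dct2_block(block)
    return coeffs
```

For a flat block, all energy lands in the top-left (DC) coefficient; energy in the remaining coefficients reflects local spatial variation within the block.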
- [123] arXiv:2504.17224 [pdf, html, other]
-
Title: Visual and textual prompts for enhancing emotion recognition in video
Authors: Zhifeng Wang, Qixuan Zhang, Peter Zhang, Wenjia Niu, Kaihao Zhang, Ramesh Sankaranarayana, Sabrina Caldwell, Tom Gedeon
Comments: 12 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others' emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs' video emotion recognition capabilities.
- [124] arXiv:2504.17226 [pdf, html, other]
-
Title: FLAG: Formal and LLM-assisted SVA Generation for Formal Specifications of On-Chip Communication Protocols
Comments: 9 pages, 3 figures
Subjects: Hardware Architecture (cs.AR); Software Engineering (cs.SE)
Formal specifications of on-chip communication protocols are crucial for system-on-chip (SoC) design and verification. However, manually constructing these formal specifications from informal documents remains a tedious and error-prone task. Although recent efforts have used Large Language Models (LLMs) to generate SystemVerilog Assertion (SVA) properties from design documents for Register-Transfer Level (RTL) design verification, in our experience these approaches have not shown promise in generating SVA properties for communication protocols. Since protocol specification documents are unstructured and ambiguous in nature, LLMs often fail to extract the necessary information and end up generating irrelevant or even incorrect properties. We propose FLAG, a two-stage framework to help construct formal protocol specifications from informal documents. In the first stage, a predefined template set is used to generate candidate SVA properties. To avoid missing necessary properties, we develop a grammar-based approach to generate comprehensive template sets that capture critical signal behaviors for various communication protocols. In the second stage, we utilize unambiguous timing diagrams in conjunction with textual descriptions from the specification documents to filter out incorrect properties. A formal approach is first implemented to check the candidate properties and filter out those inconsistent with the timing diagrams. An LLM is then consulted to further remove incorrect properties with respect to the textual description, obtaining the final property set. Experiments on various open-source communication protocols demonstrate the effectiveness of FLAG in generating SVA properties from informal documents.
- [125] arXiv:2504.17229 [pdf, html, other]
-
Title: Range Image-Based Implicit Neural Compression for LiDAR Point Clouds
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents a novel scheme to efficiently compress Light Detection and Ranging (LiDAR) point clouds, enabling high-precision 3D scene archives that pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images (RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation (INR)-based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates and decoding latency.
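The split of a range image into a depth image and a mask image can be illustrated with a simple spherical projection of LiDAR points. The image size, the vertical field of view, and the rule of keeping the nearest return per pixel are assumptions chosen for this sketch, not parameters from the paper, and the INR compression stage itself is not shown.

```python
import math


def to_range_image(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) points to a depth image and a binary validity mask.

    depth[row][col] holds the range of the nearest point hitting that pixel;
    mask[row][col] is 1 where a point was recorded and 0 elsewhere.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    depth = [[0.0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no defined direction
        yaw = math.atan2(y, x)    # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)  # vertical angle
        if not (fov_down <= pitch <= fov_up):
            continue  # outside the assumed vertical field of view
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # Keep the nearest return when several points fall in one pixel.
        if mask[row][col] == 0 or r < depth[row][col]:
            depth[row][col] = r
            mask[row][col] = 1
    return depth, mask
```

Storing validity separately means empty pixels do not have to be encoded as special depth values, so the depth image contains only floating-point ranges.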
- [126] arXiv:2504.17232 [pdf, html, other]
-
Title: Multi-Modal Traffic Analysis: Integrating Time-Series Forecasting, Accident Prediction, and Image Classification
Comments: 5 pages, 10 figures
Subjects: Machine Learning (cs.LG)
This study proposes an integrated machine learning framework for advanced traffic analysis, combining time-series forecasting, classification, and computer vision techniques. The system utilizes an ARIMA(2,0,1) model for traffic prediction (MAE: 2.1), an XGBoost classifier for accident severity classification (100% accuracy on balanced data), and a Convolutional Neural Network (CNN) for traffic image classification (92% accuracy). Tested on diverse datasets, the framework outperforms baseline models and identifies key factors influencing accident severity, including weather and road infrastructure. Its modular design supports deployment in smart city systems for real-time monitoring, accident prevention, and resource optimization, contributing to the evolution of intelligent transportation systems.
- [127] arXiv:2504.17233 [pdf, html, other]
-
Title: An Adaptive Finite Element DtN Method for the Acoustic-Elastic Interaction Problem in Periodic Structures
Comments: 28 pages, 9 figures
Subjects: Numerical Analysis (math.NA)
Consider a time-harmonic acoustic plane wave incident onto an elastic body with an unbounded periodic surface. The medium above the surface is supposed to be filled with a homogeneous compressible inviscid air/fluid of constant mass density, while the elastic body is assumed to be isotropic and linear. By introducing the Dirichlet-to-Neumann (DtN) operators for acoustic and elastic waves simultaneously, the model is formulated as an acoustic-elastic interaction problem in periodic structures. Based on a duality argument, an a posteriori error estimate is derived for the associated truncated finite element approximation. The a posteriori error estimate consists of the finite element approximation error and the truncation error of two different DtN operators, where the latter decays exponentially with respect to the truncation parameter. Based on the a posteriori error, an adaptive finite element algorithm is proposed for solving the acoustic-elastic interaction problem in periodic structures. Numerical experiments are presented to demonstrate the effectiveness of the proposed algorithm.
- [128] arXiv:2504.17234 [pdf, html, other]
-
Title: Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid advancement of artificial intelligence and widespread use of smartphones have resulted in an exponential growth of image data, both real (camera-captured) and virtual (AI-generated). This surge underscores the critical need for robust image quality assessment (IQA) methods that accurately reflect human visual perception. Traditional IQA techniques primarily rely on spatial features - such as signal-to-noise ratio, local structural distortions, and texture inconsistencies - to identify artifacts. While effective for unprocessed or conventionally altered images, these methods fall short in the context of modern image post-processing powered by deep neural networks (DNNs). The rise of DNN-based models for image generation, enhancement, and restoration has significantly improved visual quality, yet made accurate assessment increasingly complex. To address this, we propose a novel IQA approach that bridges the gap between deep learning methods and human perception. Our model disentangles deep features into high-level semantic information and low-level perceptual details, treating each stream separately. These features are then combined with conventional IQA metrics to provide a more comprehensive evaluation framework. This hybrid design enables the model to assess both global context and intricate image details, better reflecting the human visual process, which first interprets overall structure before attending to fine-grained elements. The final stage employs a multilayer perceptron (MLP) to map the integrated features into a concise quality score. Experimental results demonstrate that our method achieves improved consistency with human perceptual judgments compared to existing IQA models.
- [129] arXiv:2504.17236 [pdf, html, other]
-
Title: Rate-Distortion-Perception Theory for the Quadratic Wasserstein Space
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
We establish a single-letter characterization of the fundamental distortion-rate-perception tradeoff with limited common randomness under the squared error distortion measure and the squared Wasserstein-2 perception measure. Moreover, it is shown that this single-letter characterization can be explicitly evaluated for the Gaussian source. Various notions of universal representation are also clarified.
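For reference, the squared Wasserstein-2 distance named above as the perception measure has a standard definition, together with a well-known closed form for scalar Gaussians. The notation here ($p_X$, $p_{\hat X}$, $\mu_i$, $\sigma_i$) is introduced only for illustration and is not taken from the paper.

```latex
W_2^2\left(p_X, p_{\hat X}\right)
  = \inf_{\pi \in \Pi(p_X,\, p_{\hat X})}
    \mathbb{E}_{(X,\hat X)\sim\pi}\left[(X-\hat X)^2\right],
\qquad
W_2^2\left(\mathcal{N}(\mu_1,\sigma_1^2),\, \mathcal{N}(\mu_2,\sigma_2^2)\right)
  = (\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2 .
```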
- [130] arXiv:2504.17238 [pdf, html, other]
-
Title: Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues
Authors: Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Yongkang Huang, Yihan Shi, Xikun Zhang, Libiao Peng, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual's negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from an LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.
- [131] arXiv:2504.17243 [pdf, html, other]
-
Title: NeuralGrok: Accelerate Grokking by Neural Gradient Transformation
Comments: Preprint, 16 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Grokking is proposed and widely studied as an intricate phenomenon in which generalization is achieved after a long-lasting period of overfitting. In this work, we propose NeuralGrok, a novel gradient-based approach that learns an optimal gradient transformation to accelerate the generalization of transformers in arithmetic tasks. Specifically, NeuralGrok trains an auxiliary module (e.g., an MLP block) in conjunction with the base model. This module dynamically modulates the influence of individual gradient components based on their contribution to generalization, guided by a bilevel optimization algorithm. Our extensive experiments demonstrate that NeuralGrok significantly accelerates generalization, particularly in challenging arithmetic tasks. We also show that NeuralGrok promotes a more stable training paradigm, constantly reducing the model's complexity, while traditional regularization methods, such as weight decay, can introduce substantial instability and impede generalization. We further investigate the intrinsic model complexity leveraging a novel Absolute Gradient Entropy (AGE) metric, which explains that NeuralGrok effectively facilitates generalization by reducing the model complexity. We offer valuable insights on the grokking phenomenon of Transformer models, which encourages a deeper understanding of the fundamental principles governing generalization ability.
- [132] arXiv:2504.17244 [pdf, html, other]
-
Title: Service Rate Regions of MDS Codes & Fractional Matchings in Quasi-uniform Hypergraphs
Subjects: Information Theory (cs.IT); Combinatorics (math.CO)
The service rate region (SRR) has emerged as a critical performance metric for distributed systems that store data redundantly. It measures the system's ability to serve multiple users concurrently. Mathematically, the SRR is a polytope in R^k where each dimension corresponds to the service request rate of one of the k data objects. This paper focuses on systems employing a class of Maximum Distance Separable (MDS) codes. For each code in the class, we characterize the k axes intercept points of its SRR, and the smallest standard simplex that includes the SRR. We use these results to show that the SRR grows with the increasing number of systematic columns in the generator matrices. We establish a graph-theoretic framework associating this SRR problem with fractional matchings in quasi-uniform hypergraphs. Identifying the SRR polytope is equivalent to determining a particular image of the fractional-matching polytope. We introduce a notion of Greedy Matching and show that it is sufficient to focus on these matchings to characterize the SRR rather than the entire matching polytope. With these tools, we determine the SRR of a large subset of the considered class of codes. Our results generalize previous characterizations of systematic and non-systematic MDS-coded systems, offering a unified framework for analyzing service rate regions of codes.
- [133] arXiv:2504.17247 [pdf, html, other]
-
Title: Targeted AMP generation through controlled diffusion with efficient embeddings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as low experimental hit rates as well as the need for nuanced controllability and efficient modeling of peptide properties. To address these challenges, we introduce OmegAMP, a framework that leverages a diffusion-based generative model with efficient low-dimensional embeddings, precise controllability mechanisms, and novel classifiers with drastically reduced false positive rates for candidate filtering. OmegAMP enables the targeted generation of AMPs with specific physicochemical properties, activity profiles, and species-specific effectiveness. Moreover, it maximizes sample diversity while ensuring faithfulness to the underlying data distribution during generation. We demonstrate that OmegAMP achieves state-of-the-art performance across all stages of the AMP discovery pipeline, significantly advancing the potential of computational frameworks in combating antimicrobial resistance.
- [134] arXiv:2504.17248 [pdf, html, other]
-
Title: How Jungian Cognitive Functions Explain MBTI Type Prevalence in Computer Industry Careers
Subjects: Computers and Society (cs.CY)
This study investigates the relationship between Carl Jung's cognitive functions and success in computer industry careers by analyzing the distribution of Myers-Briggs Type Indicator (MBTI) types among professionals in the field. Building on Carl Jung's theory of psychological types, which categorizes human cognition into four primary functions, Sensing, Intuition, Thinking, and Feeling, this study investigates how these functions, when combined with the attitudes of Extraversion and Introversion, influence personality types and career choices in the tech sector. Through a comprehensive analysis of data from 30 studies spanning multiple countries and decades, encompassing 18,264 individuals in computer-related professions, we identified the most prevalent cognitive functions and their combinations. After normalizing the data against general population distributions, our findings showed that individual Jungian functions (Te, Ni, Ti, Ne), dual function combinations (Ni-Te, Ti-Ne, Si-Te, Ni-Fe), and MBTI types (INTJ, ENTJ, INTP, ENTP, ISTJ, INFJ, ESTJ, ESTP) had significantly higher representation compared to general population norms. The paper addresses gaps in the existing literature by providing a more nuanced understanding of how cognitive functions impact job performance and team dynamics, offering insights for career guidance, team composition, and professional development in the computer industry, and a deeper understanding of how cognitive preferences influence career success in technology-related fields.
- [135] arXiv:2504.17249 [pdf, html, other]
-
Title: Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot
Authors: Yufeng Chi, Qiayuan Liao, Junfeng Long, Xiaoyu Huang, Sophia Shao, Borivoje Nikolic, Zhongyu Li, Koushil Sreenath
Comments: Accepted in Robotics: Science and Systems (RSS) 2025
Subjects: Robotics (cs.RO)
Despite significant interest and advancements in humanoid robotics, most existing commercially available hardware remains high-cost, closed-source, and non-transparent within the robotics community. This lack of accessibility and customization hinders the growth of the field and the broader development of humanoid technologies. To address these challenges and promote democratization in humanoid robotics, we demonstrate Berkeley Humanoid Lite, an open-source humanoid robot designed to be accessible, customizable, and beneficial for the entire community. The core of this design is a modular 3D-printed gearbox for the actuators and robot body. All components can be sourced from widely available e-commerce platforms and fabricated using standard desktop 3D printers, keeping the total hardware cost under $5,000 (based on U.S. market prices). The design emphasizes modularity and ease of fabrication. To address the inherent limitations of 3D-printed gearboxes, such as reduced strength and durability compared to metal alternatives, we adopted a cycloidal gear design, which provides an optimal form factor in this context. Extensive testing was conducted on the 3D-printed actuators to validate their durability and alleviate concerns about the reliability of plastic components. To demonstrate the capabilities of Berkeley Humanoid Lite, we conducted a series of experiments, including the development of a locomotion controller using reinforcement learning. These experiments successfully showcased zero-shot policy transfer from simulation to hardware, highlighting the platform's suitability for research validation. By fully open-sourcing the hardware design, embedded code, and training and deployment frameworks, we aim for Berkeley Humanoid Lite to serve as a pivotal step toward democratizing the development of humanoid robotics. All resources are available at this https URL.
- [136] arXiv:2504.17252 [pdf, html, other]
-
Title: Low-Resource Neural Machine Translation Using Recurrent Neural Networks and Transfer Learning: A Case Study on English-to-Igbo
Comments: 25 pages, 14 combined figures (19 total), includes horizontal layouts. Submitted to arXiv for open access
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
In this study, we develop Neural Machine Translation (NMT) and Transformer-based transfer learning models for English-to-Igbo translation; Igbo is a low-resource African language spoken by over 40 million people across Nigeria and West Africa. Our models are trained on a curated and benchmarked dataset compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl, all verified by native language experts. We leverage Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), enhanced with attention mechanisms to improve translation accuracy. To further enhance performance, we apply transfer learning using MarianNMT pre-trained models within the SimpleTransformers framework. Our RNN-based system achieves competitive results, closely matching existing English-Igbo benchmarks. With transfer learning, we observe a performance gain of +4.83 BLEU points, reaching an estimated translation accuracy of 70%. These findings highlight the effectiveness of combining RNNs with transfer learning to address the performance gap in low-resource language translation tasks.
- [137] arXiv:2504.17253 [pdf, html, other]
-
Title: DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
Comments: Accepted by IEEE Transactions on Multimedia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach that replaces the heavy prediction enumeration process, and a prior distribution model that makes more accurate use of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at this https URL.
- [138] arXiv:2504.17256 [pdf, html, other]
-
Title: A Comment on "e-PoS: Making PoS Decentralized and Fair"
Comments: Comment on arXiv:2101.00330
Subjects: Cryptography and Security (cs.CR)
Proof-of-Stake (PoS) is a prominent Sybil control mechanism for blockchain-based systems. In "e-PoS: Making PoS Decentralized and Fair," Saad et al. (TPDS'21) introduced a new Proof-of-Stake protocol, e-PoS, to enhance PoS applications' decentralization and fairness. In this comment paper, we address a misunderstanding in the work of Saad et al. The conventional Proof-of-Stake model that causes the fairness problem does not align with the general concept of Proof-of-Stake nor the Proof-of-Stake cryptocurrencies mentioned in their paper.
- [139] arXiv:2504.17258 [pdf, other]
-
Title: Group Downsampling with Equivariant Anti-aliasing
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Group Theory (math.GR)
Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features and reduce the amount of memory/computation in the model. In this work, we study the generalization of the uniform downsampling layer for group equivariant architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature maps) on general finite groups with anti-aliasing. This involves the following: (a) Given a finite group and a downsampling rate, we present an algorithm to form a suitable choice of subgroup. (b) Given a group and a subgroup, we study the notion of bandlimited-ness and propose how to perform anti-aliasing. Notably, our method generalizes the notion of downsampling based on classical sampling theory. When the signal is on a cyclic group, i.e., periodic, our method recovers the standard downsampling of an ideal low-pass filter followed by a subsampling operation. Finally, we conducted experiments on image classification tasks demonstrating that the proposed downsampling operation improves accuracy, better preserves equivariance, and reduces model size when incorporated into G-equivariant networks.
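The cyclic-group special case mentioned in the abstract (ideal low-pass filter, then subsampling) can be sketched in a few lines. This is only an illustration of the classical operation, not the paper's group-general algorithm; the function names, the naive O(n^2) DFT, and the choice to zero the Nyquist bin are assumptions made here.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a periodic signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def downsample_cyclic(x, rate):
    """Ideal low-pass filter on Z_n followed by keeping every `rate`-th sample."""
    n = len(x)
    if n % rate != 0:
        raise ValueError("signal length must be a multiple of the rate")
    m = n // rate  # length of the downsampled signal
    spectrum = dft(x)
    for k in range(n):
        signed = k if k <= n // 2 else k - n  # signed frequency index
        if 2 * abs(signed) >= m:              # at or above the new Nyquist limit
            spectrum[k] = 0
    smooth = idft(spectrum)
    return [smooth[t].real for t in range(0, n, rate)]


constant = downsample_cyclic([1.0] * 8, 2)
alternating = downsample_cyclic([1.0, -1.0] * 4, 2)
```

Without the filter, subsampling the alternating signal by 2 would return a constant sequence, which is the aliasing the low-pass step removes.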
- [140] arXiv:2504.17261 [pdf, html, other]
-
Title: Symbolic Representation for Any-to-Any Generative Tasks
Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.
- [141] arXiv:2504.17263 [pdf, html, other]
-
Title: Precision Neural Network Quantization via Learnable Adaptive Modules
Authors: Wenqiang Zhou, Zhendong Yu, Xinyu Liu, Jiaming Yang, Rong Xiao, Tao Wang, Chenwei Tang, Jiancheng Lv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Complexity (cs.CC)
Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency while effectively maintaining model performance. The paradigm of QAT is to introduce fake quantization operators during the training process, allowing the model to autonomously compensate for information loss caused by quantization. Making quantization parameters trainable can significantly improve the performance of QAT, but at the cost of compromising the flexibility during inference, especially when dealing with activation values with substantially different distributions. In this paper, we propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ), to resolve this conflict. Specifically, the proposed ASQ method first dynamically adjusts quantization scaling factors through a trained module capable of accommodating different activations. Then, to address the rigid resolution issue inherent in Power of Two (POT) quantization, we propose an efficient non-uniform quantization scheme. We utilize the Power Of Square root of Two (POST) as the basis for exponential quantization, effectively handling the bell-shaped distribution of neural network weights across various bit-widths while maintaining computational efficiency through a Look-Up Table method (LUT). Extensive experimental results demonstrate that the proposed ASQ method is superior to the state-of-the-art QAT approaches. Notably, ASQ is even competitive with full-precision baselines, with its 4-bit quantized ResNet34 model improving accuracy by 1.2% on ImageNet.
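To make the contrast between Power of Two and Power Of Square root of Two level sets concrete, here is a minimal sketch. It is not the ASQ implementation: the symmetric level construction, the normalization to [-1, 1], and the names `exponential_levels` and `quantize` are assumptions chosen for illustration.

```python
import math


def exponential_levels(base, num_exponents):
    """Return 0 and +/- base**(-i) for i = 0..num_exponents-1, sorted ascending."""
    positive = [base ** (-i) for i in range(num_exponents)]
    return sorted([-v for v in positive] + [0.0] + positive)


def quantize(x, levels):
    """Project x onto the nearest level, as a lookup table would."""
    return min(levels, key=lambda v: abs(v - x))


pot_levels = exponential_levels(2.0, 4)              # 1, 1/2, 1/4, 1/8
post_levels = exponential_levels(math.sqrt(2.0), 4)  # 1, 1/sqrt(2), 1/2, 1/(2*sqrt(2))

pot_value = quantize(0.7, pot_levels)
post_value = quantize(0.7, post_levels)
```

With the same number of exponents, the square-root-of-two base spaces its levels more finely near the top of the range at the price of a narrower dynamic range, so an input such as 0.7 lands much closer to a representable value.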
- [142] arXiv:2504.17264 [pdf, html, other]
-
Title: JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning
Comments: Accepted in International Joint Conference on Neural Networks (IJCNN) 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In recent years, Unsupervised Domain Adaptation (UDA) has gained significant attention in the field of Natural Language Processing (NLP) owing to its ability to enhance model generalization across diverse domains. However, its application for knowledge transfer between distinct legal domains remains largely unexplored. To address the challenges posed by lengthy and complex legal texts and the limited availability of large-scale annotated datasets, we propose JurisCTC, a novel model designed to improve the accuracy of Legal Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC facilitates effective knowledge transfer across various legal domains and employs contrastive learning to distinguish samples from different domains. Specifically, for the LJP task, we enable knowledge transfer between civil and criminal law domains. Compared to other models and specific large language models (LLMs), JurisCTC demonstrates notable advancements, achieving peak accuracies of 76.59% and 78.83%, respectively.
- [143] arXiv:2504.17267 [pdf, html, other]
-
Title: MV-Crafter: An Intelligent System for Music-guided Video Generation
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Music videos, as a prevalent form of multimedia entertainment, deliver engaging audio-visual experiences to audiences and have gained immense popularity among singers and fans. Creators can express their interpretations of music naturally through visual elements. However, the creation process of music video demands proficiency in script design, video shooting, and music-video synchronization, posing significant challenges for non-professionals. Previous work has designed automated music video generation frameworks. However, they suffer from complexity in input and poor output quality. In response, we present MV-Crafter, a system capable of producing high-quality music videos with synchronized music-video rhythm and style. Our approach involves three technical modules that simulate the human creation process: the script generation module, video generation module, and music-video synchronization module. MV-Crafter leverages a large language model to generate scripts considering the musical semantics. To address the challenge of synchronizing short video clips with music of varying lengths, we propose a dynamic beat matching algorithm and visual envelope-induced warping method to ensure precise, monotonic music-video synchronization. Besides, we design a user-friendly interface to simplify the creation process with intuitive editing features. Extensive experiments have demonstrated that MV-Crafter provides an effective solution for improving the quality of generated music videos.
- [144] arXiv:2504.17268 [pdf, html, other]
-
Title: Parameter Estimation in ODE Models with Certified Polynomial System Solving
Comments: 3 pages
Subjects: Symbolic Computation (cs.SC); Mathematical Software (cs.MS); Systems and Control (eess.SY); Dynamical Systems (math.DS)
We consider dynamical models given by rational ODE systems. Parameter estimation is an important and challenging task of recovering parameter values from observed data. Recently, a method based on differential algebra and rational interpolation was proposed to express parameter estimation in terms of polynomial system solving. Typically, polynomial system solving is a bottleneck, hence the choice of the polynomial solver is crucial. In this contribution, we compare two polynomial system solvers applied to parameter estimation: homotopy continuation solver from this http URL and our new implementation of a certified solver based on rational univariate representation (RUR) and real root isolation. We show how the new RUR solver can tackle examples that are out of reach for the homotopy methods and vice versa.
- [145] arXiv:2504.17269 [pdf, html, other]
-
Title: Towards Generalized and Training-Free Text-Guided Semantic Manipulation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel GTF for text-guided semantic manipulation, which has the following attractive capabilities: 1) Generalized: our GTF supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., plug-and-play) across different modalities (i.e., modality-agnostic); and 2) Training-free: GTF produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantic manipulation.
- [146] arXiv:2504.17271 [pdf, html, other]
-
Title: Contrastive Learning for Continuous Touch-Based Authentication
Subjects: Cryptography and Security (cs.CR)
Smart mobile devices have become indispensable in modern daily life, where sensitive information is frequently processed, stored, and transmitted, posing critical demands for robust security controls. Given that touchscreens are the primary medium for human-device interaction, continuous user authentication based on touch behavior presents a natural and seamless security solution. While existing methods predominantly adopt binary classification under single-modal learning settings, we propose a unified contrastive learning framework for continuous authentication in a non-disruptive manner. Specifically, the proposed method leverages a Temporal Masked Autoencoder (TMAE) to extract temporal patterns from raw multi-sensor data streams, capturing continuous motion and gesture dynamics. The pre-trained TMAE is subsequently integrated into a Siamese Temporal-Attentive Convolutional Network within a contrastive learning paradigm to model both sequential and cross-modal patterns. To further enhance performance, we incorporate multi-head attention and channel attention mechanisms to capture long-range dependencies and optimize inter-channel feature integration. Extensive experiments on public benchmarks and a self-collected dataset demonstrate that our approach outperforms state-of-the-art methods, offering a reliable and effective solution for user authentication on mobile devices.
- [147] arXiv:2504.17274 [pdf, other]
-
Title: Signal Recovery from Random Dot-Product Graphs Under Local Differential Privacy
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the problem of recovering latent information from graphs under $\varepsilon$-edge local differential privacy where the presence of relationships/edges between two users/vertices remains confidential, even from the data curator. For the class of generalized random dot-product graphs, we show that a standard local differential privacy mechanism induces a specific geometric distortion in the latent positions. Leveraging this insight, we show that consistent recovery of the latent positions is achievable by appropriately adjusting the statistical inference procedure for the privatized graph. Furthermore, we prove that our procedure is nearly minimax-optimal under local edge differential privacy constraints. Lastly, we show that this framework allows for consistent recovery of geometric and topological information underlying the latent positions, as encoded in their persistence diagrams. Our results extend previous work from the private community detection literature to a substantially richer class of models and inferential tasks.
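The abstract does not name the privacy mechanism. As a hedged illustration, the sketch below assumes symmetric randomized response on each edge indicator, a common choice for edge local differential privacy, in which every bit is flipped with probability 1 / (1 + e^ε). Under that assumption the expected privatized entry is an affine function of the true one, which hints at why the statistical procedure has to be adjusted. All function names are hypothetical, and flipping each undirected pair once is a simplification of the fully local setting.

```python
import math
import random


def flip_probability(epsilon):
    """Flip probability of symmetric randomized response at privacy level epsilon."""
    return 1.0 / (1.0 + math.exp(epsilon))


def privatize_edges(adj, epsilon, rng):
    """Flip each off-diagonal edge indicator independently (assumed mechanism)."""
    n = len(adj)
    p = flip_probability(epsilon)
    noisy = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < p:
                bit = 1 - bit
            noisy[i][j] = noisy[j][i] = bit  # keep the graph undirected
    return noisy


def debias(value, epsilon):
    """Invert E[noisy] = p + (1 - 2p) * true; requires epsilon > 0."""
    p = flip_probability(epsilon)
    return (value - p) / (1.0 - 2.0 * p)


graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
noisy_graph = privatize_edges(graph, epsilon=1.0, rng=random.Random(0))
```

For example, at ε = ln 3 the flip probability is 0.25, so a true edge has expected privatized value 0.75, and `debias(0.75, math.log(3))` maps it back to 1.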
- [148] arXiv:2504.17276 [pdf, html, other]
-
Title: HeRB: Heterophily-Resolved Structure Balancer for Graph Neural Networks
Subjects: Machine Learning (cs.LG)
Recent research has witnessed the remarkable progress of Graph Neural Networks (GNNs) in the realm of graph data representation. However, GNNs still encounter the challenge of structural imbalance. Prior solutions to this problem did not take graph heterophily into account, namely that connected nodes possess distinct labels or features, thus resulting in a deficiency in effectiveness. Upon verifying the impact of heterophily on solving the structural imbalance problem, we propose to rectify the heterophily first and then transfer homophilic knowledge. To this end, we devise a method named HeRB (Heterophily-Resolved Structure Balancer) for GNNs. HeRB consists of two innovative components: 1) A heterophily-lessening augmentation module which serves to reduce inter-class edges and increase intra-class edges; 2) A homophilic knowledge transfer mechanism to convey homophilic information from head nodes to tail nodes. Experimental results demonstrate that HeRB achieves superior performance on two homophilic and six heterophilic benchmark datasets, and the ablation studies further validate the efficacy of the two proposed components.
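Heterophily, as described above, is often quantified by the edge homophily ratio: the fraction of edges whose endpoints share a label. The sketch below computes that standard ratio; it is not part of HeRB, and the function name and toy graph are made up for illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges joining nodes with the same label (intra-class edges)."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Path graph 0-1-2-3 with two classes: one of three edges is inter-class.
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_labels = [0, 0, 1, 1]
ratio = edge_homophily(toy_edges, toy_labels)
```

Removing inter-class edges or adding intra-class ones, as the augmentation module is said to do, raises this ratio toward 1.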
- [149] arXiv:2504.17277 [pdf, html, other]
-
Title: ExOSITO: Explainable Off-Policy Learning with Side Information for Intensive Care Unit Blood Test Orders
Comments: Accepted to the Conference on Health, Inference, and Learning (CHIL) 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Ordering a minimal subset of lab tests for patients in the intensive care unit (ICU) can be challenging. Care teams must balance between ensuring the availability of the right information and reducing the clinical burden and costs associated with each lab test order. Most in-patient settings experience frequent over-ordering of lab tests, but are now aiming to reduce this burden on both hospital resources and the environment. This paper develops a novel method that combines off-policy learning with privileged information to identify the optimal set of ICU lab tests to order. Our approach, EXplainable Off-policy learning with Side Information for ICU blood Test Orders (ExOSITO) creates an interpretable assistive tool for clinicians to order lab tests by considering both the observed and predicted future status of each patient. We pose this problem as a causal bandit trained using offline data and a reward function derived from clinically-approved rules; we introduce a novel learning framework that integrates clinical knowledge with observational data to bridge the gap between the optimal and logging policies. The learned policy function provides interpretable clinical information and reduces costs without omitting any vital lab orders, outperforming both a physician's policy and prior approaches to this practical problem.
- [150] arXiv:2504.17279 [pdf, html, other]
-
Title: Evaluating and Mitigating Bias in AI-Based Medical Text GenerationComments: 12 pages, 8 figures, published in Nature Computational ScienceJournal-ref: Nature Computational Science 2025Subjects: Computation and Language (cs.CL)
Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperformed groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in medical domain.
Our code is publicly available to facilitate further research at this https URL.
- [151] arXiv:2504.17280 [pdf, html, other]
-
Title: EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks pose challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded systems. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.
- [152] arXiv:2504.17281 [pdf, html, other]
-
Title: Building Sustainable and Trustworthy Indigenous Knowledge Preservation Ecosystem
Subjects: Computers and Society (cs.CY); Emerging Technologies (cs.ET)
This paper focuses on the essential global issue of protecting and transmitting indigenous knowledge. It reveals the challenges in this area and proposes a sustainable supply chain framework for indigenous knowledge. The paper reviews existing technological solutions and identifies technical challenges and gaps. It then introduces cutting-edge technologies to protect and disseminate indigenous knowledge more effectively. The paper also discusses how the proposed framework can address real-world challenges in protecting and transmitting indigenous knowledge, and explores future research applications of the proposed solutions. Finally, it addresses open issues and provides a detailed analysis, offering promising research directions for the protection and transmission of indigenous knowledge worldwide.
- [153] arXiv:2504.17282 [pdf, html, other]
-
Title: Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning
Subjects: Artificial Intelligence (cs.AI)
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through $\textit{intent-based affordances}$ -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose $\textbf{Code as Generative Affordances}$ $(\textbf{$\texttt{CoGA}$})$, a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions and using a fully-automated program generation and verification pipeline. These programs are then used in-the-loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: $\textbf{1)}$ $\texttt{CoGA}$ is orders of magnitude more sample efficient than its RL agent, $\textbf{2)}$ $\texttt{CoGA}$'s programs can generalize within a family of tasks, and $\textbf{3)}$ $\texttt{CoGA}$ performs better or on par compared with behavior cloning when a small number of expert demonstrations is available.
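The idea of restricting a policy to an affordance set can be illustrated with plain action masking. The sketch below is not CoGA itself: `toy_affordances` is a hypothetical stand-in for a generated affordance program, and the observation is simplified to a list of booleans rather than pixels. It only shows how a set of allowed action indices might be used to renormalize a policy's action distribution.

```python
import math
import random


def masked_softmax(logits, allowed):
    """Softmax over `logits`, with zero probability outside `allowed`."""
    if not allowed:
        raise ValueError("affordance set must not be empty")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(logits[i] for i in allowed)
    weights = {i: math.exp(logits[i] - peak) for i in allowed}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


def sample_action(logits, allowed, rng=random):
    """Sample one action index, considering only the afforded actions."""
    probs = masked_softmax(logits, allowed)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def toy_affordances(observation):
    """Hypothetical affordance program: action i is afforded if element i is active.

    A real program would inspect a pixel observation; this one is an assumption
    made purely for illustration.
    """
    return {i for i, active in enumerate(observation) if active}


if __name__ == "__main__":
    obs = [False, True, False, True]
    allowed = toy_affordances(obs)
    print(masked_softmax([1.0, 2.0, 3.0, 4.0], allowed))
    print(sample_action([1.0, 2.0, 3.0, 4.0], allowed))
```

Masking before normalization means the agent never spends exploration on actions outside the affordance set, which is the mechanism the abstract credits for improved sample efficiency.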
- [154] arXiv:2504.17287 [pdf, html, other]
-
Title: Combining Static and Dynamic Approaches for Mining and Testing Constraints for RESTful API Testing
Subjects: Software Engineering (cs.SE)
In API testing, deriving logical constraints on API response bodies is crucial in generating the test cases to cover various aspects of RESTful APIs. However, existing approaches are limited to dynamic analysis in which constraints are extracted from the execution of APIs as part of the system under test. The key limitation of such a dynamic approach is its under-estimation in which inputs in API executions are not sufficiently diverse to uncover actual constraints on API response bodies. In this paper, we propose to combine a novel static analysis approach (in which the constraints for API response bodies are mined from API specifications), with the dynamic approach (which relies on API execution data). We leverage large language models (LLMs) to comprehend the API specifications, mine constraints for response bodies, and generate test cases. To reduce LLMs' hallucination, we apply an Observation-Confirmation (OC) scheme which uses initial prompts to contextualize constraints. Our empirical results show that LLMs with OC prompting achieve high precision in constraint mining, with an average of 91.2%. When combining static and dynamic analysis, our tool, RBCTest, achieves a precision of 78.5%. RBCTest detects 107 constraints that the dynamic approach misses and 46 more precise constraints. We also use its generated test cases to detect 21 mismatches between the API specification and actual response data for 8 real-world APIs. Four of the mismatches were, in fact, reported in developers' forums.
- [155] arXiv:2504.17289 [pdf, html, other]
-
Title: Separating Two Points with Obstacles in the Plane: Improved Upper and Lower Bounds
Comments: 32 pages, 16 figures
Subjects: Computational Geometry (cs.CG)
Given two points in the plane, and a set of "obstacles" given as curves through the plane with assigned weights, we consider the point-separation problem, which asks for the minimum-weight subset of the obstacles separating the two points. A few computational models for this problem have been previously studied. We give a unified approach to this problem in all models via a reduction to a particular shortest-path problem, and obtain improved running times in essentially all cases. In addition, we also give fine-grained lower bounds for many cases.
- [156] arXiv:2504.17295 [pdf, html, other]
-
Title: AI-Enhanced Business Process Automation: A Case Study in the Insurance Domain Using Object-Centric Process Mining
Subjects: Artificial Intelligence (cs.AI)
Recent advancements in Artificial Intelligence (AI), particularly Large Language Models (LLMs), have enhanced organizations' ability to reengineer business processes by automating knowledge-intensive tasks. This automation drives digital transformation, often through gradual transitions that improve process efficiency and effectiveness. To fully assess the impact of such automation, a data-driven analysis approach is needed - one that examines how traditional and AI-enhanced process variants coexist during this transition. Object-Centric Process Mining (OCPM) has emerged as a valuable method that enables such analysis, yet real-world case studies are still needed to demonstrate its applicability. This paper presents a case study from the insurance sector, where an LLM was deployed in production to automate the identification of claim parts, a task previously performed manually and identified as a bottleneck for scalability. To evaluate this transformation, we apply OCPM to assess the impact of AI-driven automation on process scalability. Our findings indicate that while LLMs significantly enhance operational capacity, they also introduce new process dynamics that require further refinement. This study also demonstrates the practical application of OCPM in a real-world setting, highlighting its advantages and limitations.
- [157] arXiv:2504.17297 [pdf, html, other]
-
Title: Knapsack on Graphs with Relaxed Neighborhood Constraints
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)
In the knapsack problems with neighborhood constraints that were studied before, the input is a graph $\mathcal{G}$ on a set $\mathcal{V}$ of items, where each item $v \in \mathcal{V}$ has a weight $w_v$ and profit $p_v$, the size $s$ of the knapsack, and the demand $d$. The goal is to decide whether there exists a feasible solution whose total weight is at most $s$ and total profit is at least $d$. Here, feasible solutions are all subsets $\mathcal{S}$ of the items such that, for every item in $\mathcal{S}$, at least one of its neighbors in $\mathcal{G}$ is also in $\mathcal{S}$ for \hor, and all its neighbors in $\mathcal{G}$ are also in $\mathcal{S}$ for \hand~\cite{borradaile2012knapsack}. We study a relaxation of the above problems. Specifically, we allow all possible subsets of items to be feasible solutions. However, only those items for which we pick at least one or all of their neighbors (out-neighbors for directed graphs) contribute to the profit, whereas every item picked contributes to the weight; we call the corresponding problems \sor and \sand. We show that both \sor and \sand are strongly \NPC even on undirected graphs. Regarding parameterized complexity, we show that both \sor and \hor are \WTH parameterized by the knapsack size $s$. Interestingly, both \sand and \hand are \WOH parameterized by the knapsack size $s$ plus the profit demand $d$, and also parameterized by the solution size $b$. For \sor and \hor, we present a randomized color-coding-based pseudo-\FPT algorithm, parameterized by the solution size $b$, and consequently by the demand $d$. We then consider the treewidth of the input graph as our parameter and design pseudo fixed-parameter tractable (\FPT) algorithms parameterized by the treewidth $\text{tw}$ for all variants. Finally, we present an additive $1$ approximation for \sor when both the weight and the profit of every vertex are $1$.
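To make the relaxed problem definitions concrete, here is a small exhaustive-search sketch for undirected graphs. It is not an algorithm from the paper and runs in exponential time. It assumes that the demand $d$ is a lower bound on the collected profit, that an isolated item never earns profit in the "at least one neighbor" variant, and that it always earns profit in the "all neighbors" variant (vacuous truth); the function names and the `mode` flag are hypothetical.

```python
from itertools import combinations


def soft_profit(subset, adj, profit, mode):
    """Profit of `subset` when only items meeting the neighbor rule count.

    mode == "or":  an item counts if at least one of its neighbors is picked.
    mode == "and": an item counts if all of its neighbors are picked.
    """
    chosen = set(subset)
    total = 0
    for v in chosen:
        neighbors = adj.get(v, ())
        if mode == "or":
            counts = any(u in chosen for u in neighbors)
        else:
            counts = all(u in chosen for u in neighbors)
        if counts:
            total += profit[v]
    return total


def exists_solution(adj, weight, profit, s, d, mode="or"):
    """Brute force: is there a subset with weight <= s and soft profit >= d?"""
    items = list(weight)
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            # Every picked item contributes to the weight, regardless of the rule.
            if sum(weight[v] for v in subset) > s:
                continue
            if soft_profit(subset, adj, profit, mode) >= d:
                return True
    return False


if __name__ == "__main__":
    # Path a - b - c with unit weights and unit profits.
    adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    unit = {"a": 1, "b": 1, "c": 1}
    print(exists_solution(adj, unit, unit, s=2, d=2, mode="or"))
    print(exists_solution(adj, unit, unit, s=2, d=2, mode="and"))
```

On the path example, picking `a` and `b` earns profit 2 under the "or" rule, but at most 1 under the "and" rule because `b` would also need `c`.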
- [158] arXiv:2504.17299 [pdf, html, other]
-
Title: Approximate Problems for Finite Transducers
Subjects: Formal Languages and Automata Theory (cs.FL)
Finite (word) state transducers extend finite state automata by defining a binary relation over finite words, called a rational relation. If the rational relation is the graph of a function, this function is said to be rational. The class of sequential functions is a strict subclass of rational functions, defined as the functions recognised by input-deterministic finite state transducers. The class membership problems between those classes are known to be decidable. We consider approximate versions of these problems and show they are decidable as well. This includes the approximate functionality problem, which asks whether a rational relation (given by a transducer) is close to a rational function, and the approximate determinisation problem, which asks whether a given rational function is close to a sequential function. We prove decidability results for several classical distances, including Hamming and Levenshtein edit distance. Finally, we investigate the approximate uniformisation problem, which asks, given a rational relation $R$, whether there exists a sequential function that is close to some function uniformising $R$. As for its exact version, we prove that this problem is undecidable.
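For reference, the two word distances named above can be computed as follows. This is a minimal sketch of the standard definitions on a pair of words; how the paper lifts a word distance to a notion of closeness between relations or functions is not specified in the abstract and is not modeled here. Treating the Hamming distance of words of different lengths as infinite is an assumption of this sketch.

```python
def hamming(u, v):
    """Number of positions where u and v differ; infinite if lengths differ."""
    if len(u) != len(v):
        return float("inf")
    return sum(a != b for a, b in zip(u, v))


def levenshtein(u, v):
    """Minimum number of insertions, deletions, and substitutions turning u into v."""
    # prev[j] holds the distance between the processed prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```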
- [159] arXiv:2504.17300 [pdf, html, other]
-
Title: The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes
Comments: Accepted at SaTML 2025
Subjects: Machine Learning (cs.LG)
Backdoor attacks on text classifiers can cause them to predict a predefined label when a particular "trigger" is present. Prior attacks often rely on triggers that are ungrammatical or otherwise unusual, leading to conspicuous attacks. As a result, human annotators, who play a critical role in curating training data in practice, can easily detect and filter out these unnatural texts during manual inspection, reducing the risk of such attacks. We argue that a key criterion for a successful attack is for text with and without triggers to be indistinguishable to humans. However, prior work neither directly nor comprehensively evaluated attack subtlety and invisibility with human involvement. We bridge the gap by conducting thorough human evaluations to assess attack subtlety. We also propose \emph{AttrBkd}, consisting of three recipes for crafting subtle yet effective trigger attributes, such as extracting fine-grained attributes from existing baseline backdoor attacks. Our human evaluations find that AttrBkd with these baseline-derived attributes is often more effective (higher attack success rate) and more subtle (fewer instances detected by humans) than the original baseline backdoor attacks, demonstrating that backdoor attacks can bypass detection by being inconspicuous and appearing natural even upon close inspection, while still remaining effective. Our human annotation also provides information not captured by automated metrics used in prior work, and demonstrates the misalignment of these metrics with human judgment.
- [160] arXiv:2504.17304 [pdf, html, other]
-
Title: You Are What You Bought: Generating Customer Personas for E-commerce Applications
Comments: SIGIR 2025
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
In e-commerce, user representations are essential for various applications. Existing methods often use deep learning techniques to convert customer behaviors into implicit embeddings. However, these embeddings are difficult to understand and integrate with external knowledge, limiting the effectiveness of applications such as customer segmentation, search navigation, and product recommendations. To address this, our paper introduces the concept of the customer persona. Condensed from a customer's numerous purchasing histories, a customer persona provides a multi-faceted and human-readable characterization of specific purchase behaviors and preferences, such as Busy Parents or Bargain Hunters.
This work then focuses on representing each customer by multiple personas from a predefined set, achieving readable and informative explicit user representations. To this end, we propose an effective and efficient solution GPLR. To ensure effectiveness, GPLR leverages pre-trained LLMs to infer personas for customers. To reduce overhead, GPLR applies LLM-based labeling to only a fraction of users and utilizes a random walk technique to predict personas for the remaining customers. We further propose RevAff, which provides an absolute error $\epsilon$ guarantee while improving the time complexity of the exact solution by a factor of at least $O(\frac{\epsilon\cdot|E|N}{|E|+N\log N})$, where $N$ represents the number of customers and products, and $E$ represents the interactions between them. We evaluate the performance of our persona-based representation in terms of accuracy and robustness for recommendation and customer segmentation tasks using three real-world e-commerce datasets. Most notably, we find that integrating customer persona representations improves the state-of-the-art graph convolution-based recommendation model by up to 12% in terms of NDCG@K and F1-Score@K.
- [161] arXiv:2504.17305 [pdf, other]
-
Title: Machine learning-based condition monitoring of powertrains in modern electric drives
Comments: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Journal-ref: IEEE Power Electronics Magazine (Volume: 10, Issue: 1, March 2023)
Subjects: Machine Learning (cs.LG)
The recent technological advances in digitalization have revolutionized the industrial sector. Leveraging data analytics has now enabled the collection of deep insights into the performance and, as a result, the optimization of assets. Industrial drives, for example, already accumulate all the necessary information to control electric machines. These signals include but are not limited to currents, frequency, and temperature. Integrating machine learning (ML) models responsible for predicting the evolution of those directly collected or implicitly derived parameters enhances the smartness of industrial systems even further. In this article, data already residing in most modern electric drives has been used to develop a data-driven thermal model of a power module. A test bench has been designed and used specifically for training and validating the thermal digital twin undergoing various static and dynamic operating profiles. Different approaches, from traditional linear models to deep neural networks, have been implemented to identify the best ML model for estimating the case temperature of a power module. Several evaluation metrics were then used to assess the investigated methods' performance and implementation in industrial embedded systems.
- [162] arXiv:2504.17306 [pdf, html, other]
-
Title: Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+
Comments: This work was accepted at the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
To improve the segmentation of diabetic retinopathy lesions (microaneurysms, hemorrhages, exudates, and soft exudates), we implemented a binary segmentation method specific to each type of lesion. As a post-segmentation step, we combined the individual model outputs into a single image to better analyze the lesion types. This approach facilitated parameter optimization and improved accuracy, effectively overcoming challenges related to dataset limitations and annotation complexity. Specific preprocessing steps included cropping and applying contrast-limited adaptive histogram equalization to the L channel of the LAB image. Additionally, we employed targeted data augmentation techniques to further refine the model's efficacy. Our methodology utilized the DeepLabv3+ model, achieving a segmentation accuracy of 99%. These findings highlight the efficacy of innovative strategies in advancing medical image analysis, particularly in the precise segmentation of diabetic retinopathy lesions. The IDRID dataset was utilized to validate and demonstrate the robustness of our approach.
- [163] arXiv:2504.17307 [pdf, html, other]
-
Title: An Extensible Software Transport Layer for GPU Networking
Authors: Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, Fengyuan Ren, Zhiying Xu, Costin Raiciu, Ion Stoica
Subjects: Networking and Internet Architecture (cs.NI)
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility brings in transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport to resolve flow collisions. ML collectives atop UCCL achieve up to 3.3x higher performance compared to an industry solution.
- [164] arXiv:2504.17309 [pdf, html, other]
-
Title: CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality
Comments: Published at the 1st workshop on GenAI Watermarking, collocated with ICLR 2025
Subjects: Computation and Language (cs.CL)
Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
- [165] arXiv:2504.17310 [pdf, other]
-
Title: An All-Optical Metro Network Architecture and QoS-Aware Wavelength Allocation Study for Converged Fixed, Mobile, and Edge Computing Multi-Granular Traffic
Comments: ONDM 2025
Subjects: Networking and Internet Architecture (cs.NI)
In this paper, we introduce an all-optical metro network architecture, called MOON, to serve converged multigranular traffic from fixed, mobile, and edge computing services. Since traffic is characterized by high dynamicity and diverse access requirements, MOON uses network slicing to provide quality of service (QoS) aware wavelength allocation to fulfill the various applications' traffic demands. MOON incorporates hybrid optical switching (HOS), which combines optical circuit switching (OCS) and optical time-slotted switching (OTS) capabilities and maps each traffic type to the appropriate one. Specifically, the OCS network slice explicitly serves aggregated traffic of long duration and high volume, while the OTS network slice serves short bursty traffic. In order to provide flexibility, separate sets of wavelengths are used for OCS and OTS traffic service, both within a metro-access network (MAN) (intra-MAN) and between different MANs (inter-MAN). We extensively study the number of wavelengths required to efficiently serve OCS and OTS traffic for intra- and inter-MAN communication scenarios, taking into account their specific traffic access requirements in an effort to optimize wavelength utilization. In our study, we assume nonblocking OCS communication for immediate access; therefore the number of required OCS wavelengths depends only on the number of nodes, while the number of OTS wavelengths required to obtain a desired QoS and latency level is independent of the number of OCS wavelengths. Simulation results show that within an OTS intra-MAN we achieve end-to-end (E2E) latency on a sub-millisecond scale, suitable for dynamic bursty traffic, and that this latency is a decreasing function of the number of OTS wavelengths used.
- [166] arXiv:2504.17311 [pdf, other]
-
Title: FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Authors: Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels - from orthography to dialect and style varieties - and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
- [167] arXiv:2504.17313 [pdf, html, other]
-
Title: Tokenizing Stock Prices for Enhanced Multi-Step Forecast and Prediction
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
Effective stock price forecasting (estimating future prices) and prediction (estimating future price changes) are pivotal for investors, regulatory agencies, and policymakers. These tasks enable informed decision-making, risk management, strategic planning, and superior portfolio returns. Despite their importance, forecasting and prediction are challenging due to the dynamic nature of stock price data, which exhibit significant temporal variations in distribution and statistical properties. Additionally, while both forecasting and prediction targets are derived from the same dataset, their statistical characteristics differ significantly. Forecasting targets typically follow a log-normal distribution, characterized by significant shifts in mean and variance over time, whereas prediction targets adhere to a normal distribution. Furthermore, although multi-step forecasting and prediction offer a broader perspective and richer information compared to single-step approaches, they are much more challenging due to factors such as cumulative errors and long-term temporal variance. As a result, many previous works have tackled either single-step stock price forecasting or prediction instead. To address these issues, we introduce a novel model, termed Patched Channel Integration Encoder (PCIE), to tackle both stock price forecasting and prediction. In this model, we utilize multiple stock channels that cover both historical prices and price changes, and design a novel tokenization method to effectively embed these channels in a cross-channel and temporally efficient manner. Specifically, the tokenization process involves univariate patching and temporal learning with a channel-mixing encoder to reduce cumulative errors. Comprehensive experiments validate that PCIE outperforms current state-of-the-art models in forecast and prediction tasks.
- [168] arXiv:2504.17314 [pdf, html, other]
-
Title: Class-Conditional Distribution Balancing for Group Robust Classification
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of reducing the mutual information between spurious factors and label information, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.
- [169] arXiv:2504.17315 [pdf, html, other]
-
Title: DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Authors: Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang
Comments: 7 pages, 1 figure, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
- [170] arXiv:2504.17327 [pdf, html, other]
-
Title: Simple Universally Optimal Dijkstra
Subjects: Data Structures and Algorithms (cs.DS)
Let G be a weighted (directed) graph with n vertices and m edges. Given a source vertex s, Dijkstra's algorithm computes the shortest path lengths from s to all other vertices in O(m + n log n) time. This bound is known to be worst-case optimal via a reduction to sorting. Theoretical computer science has developed numerous fine-grained frameworks for analyzing algorithmic performance beyond standard worst-case analysis, such as instance optimality and output sensitivity. Haeupler et al. [FOCS '24] consider the notion of universal optimality, a refined complexity measure that accounts for both the graph topology and the edge weights. For a fixed graph topology, the universal running time of a weighted graph algorithm is defined as its worst-case running time over all possible edge weightings of G. An algorithm is universally optimal if no other algorithm achieves a better asymptotic universal running time on any particular graph topology. They show that Dijkstra's algorithm can be made universally optimal by replacing the heap with a custom data structure.
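As a baseline for the discussion above, the following is a minimal sketch of textbook Dijkstra using Python's `heapq` binary heap with lazy deletion of stale entries. It is not the universally optimal variant: a binary heap gives $O((m + n) \log n)$ time rather than the $O(m + n \log n)$ bound, and it does not use the custom data structure of Haeupler et al. The adjacency-list input format is an assumption of this sketch, and edge weights are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Shortest path lengths from `source` to every reachable vertex.

    `graph` maps each vertex to a list of (neighbor, weight) pairs with
    non-negative weights.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        # Skip stale heap entries left behind by earlier, longer paths.
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


if __name__ == "__main__":
    graph = {
        "s": [("a", 1), ("b", 4)],
        "a": [("b", 2), ("c", 6)],
        "b": [("c", 3)],
        "c": [],
    }
    print(dijkstra(graph, "s"))
```

The only part of this loop that changes between heap implementations is the cost of the push and pop operations, which is why the choice of heap determines the running-time guarantee.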
We revisit their result. We introduce a simple heap property called timestamp optimality, where the cost of popping an element x is logarithmic in the number of elements inserted between pushing and popping x. We show that timestamp optimal heaps are not only easier to define but also easier to implement. Using these timestamps, we provide a significantly simpler proof that Dijkstra's algorithm, with the right kind of heap, is universally optimal.
- [171] arXiv:2504.17329 [pdf, html, other]
-
Title: On Runge-Kutta methods of order 10
Comments: 21 pages, 5 figures, 3 tables
Subjects: Numerical Analysis (math.NA)
A family of explicit 15-stage Runge-Kutta methods of order 10 is derived.
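As background on what an explicit s-stage Runge-Kutta method is, the sketch below performs one step from a Butcher tableau (A, b, c). The coefficients of the 15-stage, order-10 family are not given in this listing, so the classical 4-stage, order-4 tableau stands in purely for illustration; the function names are hypothetical and the ODE is scalar to keep the example short.

```python
import math

def explicit_rk_step(f, t, y, h, A, b, c):
    """One step of an explicit Runge-Kutta method defined by a Butcher tableau.

    A is strictly lower triangular (list of rows), b holds the weights and
    c the nodes. y is a float (scalar ODE).
    """
    k = []
    for i in range(len(b)):
        # Stage i only uses the previously computed stages 0..i-1.
        yi = y + h * sum(A[i][j] * k[j] for j in range(i))
        k.append(f(t + c[i] * h, yi))
    return y + h * sum(bi * ki for bi, ki in zip(b, k))

def integrate(f, t0, y0, t1, steps, A, b, c):
    """Integrate y' = f(t, y) from t0 to t1 with a fixed number of steps."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y = explicit_rk_step(f, t, y, h, A, b, c)
        t += h
    return y

# Classical RK4 tableau (a stand-in, not the order-10 method).
A4 = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
b4 = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
c4 = [0, 0.5, 0.5, 1]

approx_e = integrate(lambda t, y: y, 0.0, 1.0, 1.0, 100, A4, b4, c4)
print(approx_e, math.e)
```

A method of order 10 would plug into the same routine with a 15-row tableau; only A, b, and c change.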
- [172] arXiv:2504.17331 [pdf, html, other]
-
Title: Exploring Context-aware and LLM-driven Locomotion for Immersive Virtual Reality
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Locomotion plays a crucial role in shaping the user experience within virtual reality environments. In particular, hands-free locomotion offers a valuable alternative by supporting accessibility and freeing users from reliance on handheld controllers. However, traditional speech-based methods often depend on rigid command sets, limiting the naturalness and flexibility of interaction. In this study, we propose a novel locomotion technique powered by large language models (LLMs), which allows users to navigate virtual environments using natural language with contextual awareness. We evaluate three locomotion methods: controller-based teleportation, voice-based steering, and our language model-driven approach. Our evaluation measures include eye-tracking data analysis, with explainable machine learning through SHAP analysis, as well as standardized questionnaires for usability, presence, cybersickness, and cognitive load to examine user attention and engagement. Our findings indicate that the LLM-driven locomotion achieves usability, presence, and cybersickness scores comparable to established methods like teleportation, demonstrating its potential as a comfortable, natural language-based, hands-free alternative. In addition, it enhances user attention within the virtual environment, suggesting greater engagement. Complementary to these findings, SHAP analysis revealed that fixation, saccade, and pupil-related features vary across techniques, indicating distinct patterns of visual attention and cognitive processing. Overall, we conclude that our method can facilitate hands-free locomotion in virtual spaces, especially in supporting accessibility.
- [173] arXiv:2504.17332 [pdf, html, other]
-
Title: Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection
Subjects: Computation and Language (cs.CL)
In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial role of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives. By examining creators' cognitive strategies and emotional appeals, as well as simulating readers' cognitive judgments and emotional responses using Large Language Models (LLMs), DAE offers a more comprehensive and human-centric approach to misinformation detection. We further introduce an empathy-aware filtering mechanism to enhance response authenticity and diversity. Experimental results on benchmark datasets demonstrate that DAE outperforms existing methods, providing a novel paradigm for multimodal misinformation detection.
- [174] arXiv:2504.17333 [pdf, html, other]
-
Title: Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
Subjects: Hardware Architecture (cs.AR)
State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first effort to accelerate SSMs through a dedicated hardware accelerator, achieves great speedup over high-end GPUs, an analysis into the broader accelerator design space is lacking. This work systematically analyzes SSM acceleration opportunities both from the scheduling perspective through fine-grained operator fusion and the hardware perspective through design space exploration, using an extended version of the Stream modeling framework.
Our results demonstrate that the improved data locality stemming from our optimized fusion and scheduling strategy enables a speedup of up to 4.8x over unfused execution, while our adaptive memory-aware fusion approach reduces on-chip memory requirements by an order of magnitude without sacrificing performance. We further explore accelerator design trade-offs, showing that a fusion-aware hardware architecture can achieve 1.78x higher performance than the state-of-the-art MARCA accelerator, within the same area budget. These results establish operator fusion as a key enabler for next-generation SSM accelerators.
- [175] arXiv:2504.17334 [pdf, html, other]
-
Title: DataScout: Automatic Data Fact Retrieval for Statement Augmentation with an LLM-Based Agent
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
A data story typically integrates data facts from multiple perspectives and stances to construct a comprehensive and objective narrative. However, retrieving these facts demands time for data search and challenges the creator's analytical skills. In this work, we introduce DataScout, an interactive system that automatically performs reasoning and stance-based data fact retrieval to augment the user's statement. In particular, DataScout leverages an LLM-based agent to construct a retrieval tree, enabling collaborative control of its expansion between users and the agent. The interface visualizes the retrieval tree as a mind map that helps users intuitively steer the retrieval direction and effectively engage in reasoning and analysis. We evaluate the proposed system through case studies and in-depth expert interviews. Our evaluation demonstrates that DataScout can effectively retrieve multifaceted data facts from different stances, helping users verify their statements and enhance the credibility of their stories.
- [176] arXiv:2504.17336 [pdf, other]
-
Title: Operational Semantics for Crystality: A Smart Contract Language for Parallel EVMs
Subjects: Programming Languages (cs.PL)
The increasing demand for scalable blockchain has driven research into parallel execution models for smart contracts. Crystality is a novel smart contract programming language designed for parallel Ethereum Virtual Machines (EVMs), enabling fine-grained concurrency through Programmable Contract Scopes and Asynchronous Functional Relay. This paper presents the first formal structural operational semantics for Crystality, providing a rigorous framework to reason about its execution. We mechanize the syntax and semantics of Crystality in the theorem-proving assistant Coq, enabling formal verification of correctness properties. As a case study, we verify a simplified token transfer function, demonstrating the applicability of our semantics in ensuring smart contract correctness. Our work lays the foundation for formally verified parallel smart contracts, contributing to the security and scalability of blockchain systems.
- [177] arXiv:2504.17337 [pdf, html, other]
-
Title: Error Exponents for DNA Storage Codes with a Variable Number of Reads
Subjects: Information Theory (cs.IT)
In this paper, we study error exponents for a concatenated coding based class of DNA storage codes in which the number of reads performed can be variable. That is, the decoder can sequentially perform reads and choose whether to output the final decision or take more reads, and we are interested in minimizing the average number of reads performed rather than a fixed pre-specified value. We show that this flexibility leads to a considerable reduction in the error probability compared to a fixed number of reads, not only in terms of constants in the error exponent but also in the scaling laws. This is shown via an achievability result for a suitably-designed protocol, and in certain parameter regimes we additionally establish a matching converse that holds for all protocols within a broader concatenated coding based class.
- [178] arXiv:2504.17338 [pdf, html, other]
-
Title: Dynamic Approximate Maximum Matching in the Distributed Vertex Partition Model
Comments: 22 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
We initiate the study of approximate maximum matching in the vertex partition model, for graphs subject to dynamic changes. We assume that the $n$ vertices of the graph are partitioned among $k$ players, who execute a distributed algorithm and communicate via message passing. An adaptive adversary may perform dynamic updates to the graph topology by inserting or removing edges between the nodes, and the algorithm needs to respond to these changes by adapting the output of the players, with the goal of maintaining an approximate maximum matching. The main performance metric in this setting is the algorithm's update time, which corresponds to the number of rounds required for updating the solution upon an adversarial change. For the standard setting of single-edge insertions and deletions, we obtain the following results:
We give a randomized Las Vegas algorithm with an expected update time of $O( \frac{\sqrt{m}}{\beta k} )$ rounds that maintains a $\frac{2}{3}$-approximate maximum matching that is also maximal, where $m$ is the number of edges of the graph. We also show that any algorithm has a worst case update time of $\Omega( \frac{n}{\beta k^2\log n} )$, assuming a link bandwidth of $O(\beta\log n)$ bits per round, if it maintains a matching that is maximal and does not have any 3-augmenting paths. For batch-dynamic updates, where the adversary may modify up to $\ell\ge 1$ edges at once, we prove the following: There is a randomized algorithm that succeeds with high probability in maintaining a $\frac{2}{3}$-approximate maximum matching and has a worst case update time of $O( \frac{\ell\log n}{\sqrt{\beta k}} )$ rounds. We show that $\Omega( \frac{\ell}{\beta k \log n} )$ poses a lower bound for maintaining a maximal matching without 3-augmenting paths.
- [179] arXiv:2504.17342 [pdf, html, other]
-
Title: Fréchet Distance in Unweighted Planar Graphs
Subjects: Computational Geometry (cs.CG)
The Fréchet distance is a distance measure between trajectories in the plane or walks in a graph G. Given constant-time shortest path queries in a graph G, the Discrete Fréchet distance $F_G(P, Q)$ between two walks P and Q can be computed in $O(|P| \cdot |Q|)$ time using a dynamic program. Driemel, van der Hoog, and Rotenberg [SoCG'22] show that for weighted planar graphs this approach is likely tight, as there can be no strongly subquadratic algorithm to compute a $1.01$-approximation of $F_G(P, Q)$ unless the Orthogonal Vector Hypothesis (OVH) fails.
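The quadratic-time dynamic program mentioned above can be sketched as follows. This is the standard textbook recurrence rather than code from the paper. The `dist` argument stands in for the constant-time shortest-path oracle, and the example uses Manhattan distance, which equals the shortest-path distance in an unbounded unit grid graph (an assumption made only for the example).

```python
def discrete_frechet(P, Q, dist):
    """Discrete Fréchet distance between non-empty walks P and Q.

    dist(p, q) is assumed to answer in constant time, so the total cost
    is O(|P| * |Q|).
    """
    n, m = len(P), len(Q)
    # D[i][j] = discrete Fréchet distance between P[:i+1] and Q[:j+1].
    D = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                D[i][j] = d
            elif i == 0:
                D[i][j] = max(D[0][j - 1], d)
            elif j == 0:
                D[i][j] = max(D[i - 1][0], d)
            else:
                # Advance on P, on Q, or on both; take the cheapest predecessor.
                best_prev = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
                D[i][j] = max(best_prev, d)
    return D[n - 1][m - 1]

def manhattan(p, q):
    """Shortest-path distance between two vertices of an unbounded unit grid."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

P = [(0, 0), (1, 0), (2, 0)]
Q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(P, Q, manhattan))
```

Every one of the |P| * |Q| table cells is filled once, which is the cost the conditional lower bounds say cannot be improved substantially in general.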
Such quadratic-time conditional lower bounds are common to many Fréchet distance variants. However, they can be circumvented by assuming that the input comes from some well-behaved class: There exist $(1+\varepsilon)$-approximations, both in weighted graphs and in $\mathbb{R}^d$, that take near-linear time for $c$-packed or $\kappa$-straight walks in the graph. In $\mathbb{R}^d$, there also exists a near-linear time algorithm to compute the Fréchet distance whenever all input edges are long compared to the distance.
We consider computing the Fréchet distance in unweighted planar graphs. We show that there exist no 1.25-approximations of the discrete Fréchet distance between two disjoint simple paths in an unweighted planar graph in strongly subquadratic time, unless OVH fails. This improves the previous lower bound, both in terms of generality and approximation factor. We subsequently show that adding graph structure circumvents this lower bound: If the graph is a regular tiling with unit-weighted edges, then there exists an $\tilde{O}( (|P| + |Q|)^{1.5})$-time algorithm to compute $D_F(P, Q)$. Our result has natural implications in the plane, as it allows us to define a new class of well-behaved curves that facilitate $(1+\varepsilon)$-approximations of their discrete Fréchet distance in subquadratic time.
- [180] arXiv:2504.17343 [pdf, html, other]
-
Title: TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) and maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
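The idea of dropping tokens that do not change between frames can be illustrated with a toy filter. The abstract does not specify the actual DTD criterion, so everything here is an assumption: patch features are plain lists of floats, change is measured by L1 distance to the patch at the same position in the previous frame, and `threshold` is a hypothetical tuning parameter.

```python
def drop_static_tokens(frames, threshold):
    """Keep only patch tokens that changed relative to the previous frame.

    frames: list of frames, each a list of patch feature vectors (lists of
    floats), with the same number of patches per frame (assumed layout).
    Returns, per frame, a list of (patch_index, feature) pairs that were kept.
    """
    kept = []
    prev = None
    for frame in frames:
        if prev is None:
            # The first frame has no reference, so all of its tokens are kept.
            kept.append(list(enumerate(frame)))
        else:
            kept.append([
                (i, feat)
                for i, (feat, old) in enumerate(zip(frame, prev))
                if sum(abs(a - b) for a, b in zip(feat, old)) > threshold
            ])
        prev = frame
    return kept

# Three tiny frames with two patches each; only patch 1 changes, once.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 3.0]],
    [[0.0, 0.0], [1.0, 3.0]],
]
kept = drop_static_tokens(frames, threshold=0.5)
total = sum(len(f) for f in frames)
remaining = sum(len(k) for k in kept)
print(f"kept {remaining} of {total} tokens")
```

Because the filter looks only at visual change, it needs no user query, which matches the claim that the redundancy can be removed without language guidance.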
- [181] arXiv:2504.17346 [pdf, other]
-
Title: Dual-Individual Genetic Algorithm: A Dual-Individual Approach for Efficient Training of Multi-Layer Neural Networks
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
This paper introduces an enhanced Genetic Algorithm technique called Dual-Individual Genetic Algorithm (Dual-Individual GA), which optimizes neural networks for binary image classification tasks, such as cat vs. non-cat classification. The proposed method employs only two individuals for crossover, represented by two parameter sets: Leader and Follower. The Leader focuses on exploitation, representing the primary optimal solution at even-indexed positions (0, 2, 4, ...), while the Follower promotes exploration by preserving diversity and avoiding premature convergence, operating at odd-indexed positions (1, 3, 5, ...). Leader and Follower are modeled as two phases or roles. The key contributions of this work are threefold: (1) a self-adaptive layer dimension mechanism that eliminates the need for manual tuning of layer architectures; (2) the generation of two parameter sets, the Leader and Follower parameter sets, with 10 layer architecture configurations (5 for each set), ranked by Pareto dominance and cost after optimization; and (3) a demonstration of superior performance compared to traditional gradient-based methods. Experimental results show that the Dual-Individual GA achieves 99.04% training accuracy and 80% testing accuracy (cost = 0.034) on a three-layer network with architecture [12288, 17, 4, 1], outperforming a gradient-based approach that achieves 98% training accuracy and 80% testing accuracy (cost = 0.092) on a four-layer network with architecture [12288, 20, 7, 5, 1]. These findings highlight the efficiency and effectiveness of the proposed method in optimizing neural networks.
- [182] arXiv:2504.17347 [pdf, html, other]
-
Title: Analysis and Mitigation of Data injection Attacks against Data-Driven Control
Comments: Under review for publication
Subjects: Systems and Control (eess.SY)
This paper investigates the impact of false data injection attacks on data-driven control systems. Specifically, we consider an adversary injecting false data into the sensor channels during the learning phase. When the operator seeks to learn a stable state-feedback controller, we propose an attack strategy capable of misleading the operator into learning an unstable feedback gain. We also investigate the effects of constant-bias injection attacks on data-driven linear quadratic regulation (LQR). Finally, we explore potential mitigation strategies and support our findings with numerical examples.
- [183] arXiv:2504.17349 [pdf, html, other]
-
Title: DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics.
To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
- [184] arXiv:2504.17350 [pdf, other]
-
Title: First-order store and visibility in name-passing calculi
Daniel Hirschkoff (PLUME, ENS de Lyon, LIP), Iwan Quémerais (PLUME, ENS de Lyon), Davide Sangiorgi (FOCUS, UNIBO)
Subjects: Logic in Computer Science (cs.LO)
The $\pi$-calculus is the paradigmatical name-passing calculus. While being purely name-passing, it allows the representation of higher-order functions and store. We study how $\pi$-calculus processes can be controlled so that computations can only involve storage of first-order values. The discipline is enforced by a type system that is based on the notion of visibility, coming from game semantics. We discuss the impact of visibility on the behavioural theory. We propose characterisations of may-testing and barbed equivalence, based on (variants of) trace equivalence and labelled bisimilarity, in the case where computation is sequential, and in the case where computation is well-bracketed.
- [185] arXiv:2504.17352 [pdf, other]
-
Title: The Riemannian Means Field Classifier for EEG-Based BCI Data
Journal-ref: Sensors, 2025, 25 (7), pp.2305
Subjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
A substantial amount of research has demonstrated the robustness and accuracy of the Riemannian minimum distance to mean (MDM) classifier for all kinds of EEG-based brain-computer interfaces (BCIs). This classifier is simple, fully deterministic, robust to noise, computationally efficient, and well suited to transfer learning. Its training is very simple, requiring just the computation of a geometric mean of symmetric positive-definite (SPD) matrices per class. We propose an improvement of the MDM involving a number of power means of SPD matrices instead of the sole geometric mean. By the analysis of 20 public databases, 10 for the motor-imagery BCI paradigm and 10 for the P300 BCI paradigm, comprising 587 individuals in total, we show that the proposed classifier clearly outperforms the MDM, approaching the state of the art in terms of performance while retaining the simplicity and the deterministic behavior. In order to promote reproducible research, our code will be released as open source.
- [186] arXiv:2504.17353 [pdf, html, other]
-
Title: M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.
- [187] arXiv:2504.17354 [pdf, html, other]
-
Title: Data-Driven Surrogate Modeling Techniques to Predict the Effective Contact Area of Rough Surface Contact Problems
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
The effective contact area in rough surface contact plays a critical role in multi-physics phenomena such as wear, sealing, and thermal or electrical conduction. Although accurate numerical methods, like the Boundary Element Method (BEM), are available to compute this quantity, their high computational cost limits their applicability in multi-query contexts, such as uncertainty quantification, parameter identification, and multi-scale algorithms, where many repeated evaluations are required. This study proposes a surrogate modeling framework for predicting the effective contact area using fast-to-evaluate data-driven techniques. Various machine learning algorithms are trained on a precomputed dataset, where the inputs are the imposed load and statistical roughness parameters, and the output is the corresponding effective contact area. All models undergo hyperparameter optimization to enable fair comparisons in terms of predictive accuracy and computational efficiency, evaluated using established quantitative metrics. Among the models, the Kernel Ridge Regressor demonstrates the best trade-off between accuracy and efficiency, achieving high predictive accuracy, low prediction time, and minimal training overhead, making it a strong candidate for general-purpose surrogate modeling. The Gaussian Process Regressor provides an attractive alternative when uncertainty quantification is required, although it incurs additional computational cost due to variance estimation. The generalization capability of the Kernel Ridge model is validated on an unseen simulation scenario, confirming its ability to transfer to new configurations. Database generation constitutes the dominant cost in the surrogate modeling process. Nevertheless, the approach proves practical and efficient for multi-query tasks, even when accounting for this initial expense.
- [188] arXiv:2504.17355 [pdf, html, other]
-
Title: Collaborative Multi-Agent Reinforcement Learning for Automated Feature Transformation with Graph-Driven Path Optimization
Xiaohan Huang, Dongjie Wang, Zhiyuan Ning, Ziyue Qiao, Qingqing Long, Haowei Zhu, Yi Du, Min Wu, Yuanchun Zhou, Meng Xiao
Comments: 13 pages, Keywords: Automated Feature Transformation, Tabular Dataset, Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Feature transformation methods aim to find an optimal mathematical feature-feature crossing process that generates high-value features and improves the performance of downstream machine learning tasks. Existing frameworks, though designed to mitigate manual costs, often treat feature transformations as isolated operations, ignoring dynamic dependencies between transformation steps. To address the limitations, we propose TCTO, a collaborative multi-agent reinforcement learning framework that automates feature engineering through graph-driven path optimization. The framework's core innovation lies in an evolving interaction graph that models features as nodes and transformations as edges. Through graph pruning and backtracking, it dynamically eliminates low-impact edges, reduces redundant operations, and enhances exploration stability. This graph also provides full traceability to empower TCTO to reuse high-utility subgraphs from historical transformations. To demonstrate the efficacy and adaptability of our approach, we conduct comprehensive experiments and case studies, which show superior performance across a range of datasets.
- [189] arXiv:2504.17356 [pdf, html, other]
-
Title: Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning
Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao
Comments: 20 pages, keywords: Automated Feature Engineering, Tabular Dataset, Multi-Agent Reinforcement Learning, Feature Selection
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Feature selection aims to preprocess the target dataset, find an optimal and most streamlined feature subset, and enhance the downstream machine learning task. Among filter, wrapper, and embedded-based approaches, the reinforcement learning (RL)-based subspace exploration strategy provides a novel objective optimization-directed perspective and promising performance. Nevertheless, even with improved performance, current reinforcement learning approaches face challenges similar to conventional methods when dealing with complex datasets. These challenges stem from the inefficient paradigm of using one agent per feature and the inherent complexities present in the datasets. This observation motivates us to investigate and address the above issue and propose a novel approach, namely HRLFS. Our methodology initially employs a Large Language Model (LLM)-based hybrid state extractor to capture each feature's mathematical and semantic characteristics. Based on this information, features are clustered, facilitating the construction of hierarchical agents for each cluster and sub-cluster. Extensive experiments demonstrate the efficiency, scalability, and robustness of our approach. Compared to contemporary or the one-feature-one-agent RL-based approaches, HRLFS improves the downstream ML performance with iterative feature subspace exploration while accelerating total run time by reducing the number of agents involved.
- [190] arXiv:2504.17360 [pdf, other]
-
Title: PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
Jose G. Moreno (IRIT-IRIS), Jesus Lovon (IRIT-IRIS), M'Rick Robin-Charlet (UT3), Christine Damase-Michel, Lynda Tamine (IRIT-IRIS)
Journal-ref: Workshop CL4Health @ NAACL 2025, May 2025, Albuquerque, New Mexico, United States
Subjects: Computation and Language (cs.CL)
Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. However, performance improvement comes at the cost of training on vast amounts of annotated data, which could be sensitive, leading to significant data privacy concerns. In particular, the healthcare domain is one of the most sensitive domains exposed to data privacy issues. In this paper, we present PatientDx, a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning or adaptation on patient data. Our proposal is based on recently proposed techniques known as merging of LLMs and aims to optimize a building block merging strategy. PatientDx uses a pivotal model adapted to numerical reasoning and tunes hyperparameters on examples based on a performance metric but without training of the LLM on these data. Experiments using the mortality tasks of the MIMIC-IV dataset show improvements of up to 7% in terms of AUROC when compared to initial models. Additionally, we confirm that when compared to fine-tuned models, our proposal is less prone to data leak problems without hurting performance. Finally, we qualitatively show the capabilities of our proposal through a case study. Our best model is publicly available at this https URL Jgmorenof/mistral_merged_0_4.
- [191] arXiv:2504.17364 [pdf, html, other]
-
Title: I-INR: Iterative Implicit Neural Representations
Ali Haider, Muhammad Salman Ali, Maryam Qamar, Tahir Khalil, Soo Ye Kim, Jihyong Oh, Enzo Tartaglione, Sung-Ho Bae
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Implicit Neural Representations (INRs) have revolutionized signal processing and computer vision by modeling signals as continuous, differentiable functions parameterized by neural networks. However, their inherent formulation as a regression problem makes them prone to regression to the mean, limiting their ability to capture fine details, retain high-frequency information, and handle noise effectively. To address these challenges, we propose Iterative Implicit Neural Representations (I-INRs), a novel plug-and-play framework that enhances signal reconstruction through an iterative refinement process. I-INRs effectively recover high-frequency details, improve robustness to noise, and achieve superior reconstruction quality. Our framework seamlessly integrates with existing INR architectures, delivering substantial performance gains across various tasks. Extensive experiments show that I-INRs outperform baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision applications such as image restoration, image denoising, and object occupancy prediction.
- [192] arXiv:2504.17365 [pdf, html, other]
-
Title: TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Although recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on temporal priors for caption generation, so they cannot process the soccer video end-to-end. Some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context, resulting in suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
- [193] arXiv:2504.17366 [pdf, html, other]
-
Title: LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at this https URL.
- [194] arXiv:2504.17370 [pdf, html, other]
-
Title: Doubly Adaptive Social Learning
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Machine Learning (cs.LG)
In social learning, a network of agents assigns probability scores (beliefs) to some hypotheses of interest, which rule the generation of local streaming data observed by each agent. Belief formation takes place by means of an iterative two-step procedure where: i) the agents locally update their beliefs by using some likelihood model; and ii) the updated beliefs are combined with the beliefs of the neighboring agents, using a pooling rule. This procedure can fail to perform well in the presence of dynamic drifts, leading the agents to incorrect decision making. Here, we focus on the fully online setting where both the true hypothesis and the likelihood models can change over time. We propose the doubly adaptive social learning ($\text{A}^2\text{SL}$) strategy, which infuses social learning with the necessary adaptation capabilities. This goal is achieved by exploiting two adaptation stages: i) a stochastic gradient descent update to learn and track the drifts in the decision model; and ii) an adaptive belief update to track the true hypothesis changing over time. These stages are controlled by two adaptation parameters that govern the evolution of the error probability for each agent. We show that all agents learn consistently for sufficiently small adaptation parameters, in the sense that they ultimately place all their belief mass on the true hypothesis. In particular, the probability of choosing the wrong hypothesis converges to values on the order of the adaptation parameters. The theoretical analysis is illustrated both on synthetic data and by applying the $\text{A}^2\text{SL}$ strategy to a social learning problem in the online setting using real data.
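The two-step belief-formation procedure described in this abstract can be illustrated with a minimal sketch. It covers only the classical (non-adaptive) iteration, not the $\text{A}^2\text{SL}$ strategy itself. The abstract says only "a pooling rule"; the geometric-average pooling used here, the combination matrix `A`, and all function names are assumptions made for illustration.

```python
import math

def local_update(belief, likelihoods):
    """Step i): Bayesian update of one agent's belief.

    belief[h] is the current belief in hypothesis h (assumed strictly positive);
    likelihoods[h] is the likelihood of the agent's new observation under h.
    """
    unnorm = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def pool(intermediate, weights):
    """Step ii): weighted geometric average of the agents' updated beliefs.

    Geometric pooling is an assumption; the abstract does not name the rule.
    """
    n_hyp = len(intermediate[0])
    log_pooled = [
        sum(w * math.log(psi[h]) for w, psi in zip(weights, intermediate))
        for h in range(n_hyp)
    ]
    shift = max(log_pooled)  # subtract the max for numerical stability
    unnorm = [math.exp(v - shift) for v in log_pooled]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def social_learning_step(beliefs, likelihoods, A):
    """One iteration for all agents; A[k][j] is the weight agent k gives agent j."""
    psi = [local_update(b, l) for b, l in zip(beliefs, likelihoods)]
    return [pool(psi, A[k]) for k in range(len(beliefs))]

# Hypothetical toy setup: two agents, two hypotheses, equal combination weights.
beliefs = [[0.5, 0.5], [0.5, 0.5]]
likelihoods = [[0.8, 0.2], [0.8, 0.2]]
A = [[0.5, 0.5], [0.5, 0.5]]
updated = social_learning_step(beliefs, likelihoods, A)
```

In this toy setup both agents see data favoring the first hypothesis, so each agent's belief mass shifts toward it after one step.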
- [195] arXiv:2504.17371 [pdf, html, other]
-
Title: Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset
Oussema Dhaouadi, Johannes Meier, Luca Wahl, Jacques Kaiser, Luca Scalerandi, Nick Wandelburg, Zhuolun Zhou, Nijanthan Berinpanathan, Holger Banzhaf, Daniel Cremers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment only in the close vicinity of the measurement vehicle, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured in five locations in Europe and the United States: a parking lot, a crowded inner city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at this http URL, facilitating research in motion prediction, behavior modeling, and safety validation.
- [196] arXiv:2504.17376 [pdf, html, other]
-
Title: On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Transformer-based Large Language Models (LLMs) have significantly advanced AI capabilities but pose considerable challenges for deployment on edge devices due to high computational demands, memory bandwidth constraints, and energy consumption. This paper addresses these challenges by presenting an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform, a heterogeneous system integrating an ARM Cortex-A53 CPU with reconfigurable FPGA logic. Leveraging Activation-aware Weight Quantization (AWQ) with FPGA-accelerated execution pipelines, the proposed approach enhances both model compression rate and system throughput. Additionally, we propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks, effectively balancing the computational workload and maximizing overall performance. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.
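As a quick check on the throughput figures quoted in this abstract, the following sketch computes the implied speedup and the time to generate a response. Only the two token rates come from the abstract; the 256-token response length and the function names are hypothetical.

```python
# Throughput figures quoted in the abstract (tokens per second).
BASELINE_TPS = 2.8
ACCELERATED_TPS = 5.1

def speedup(new_rate, old_rate):
    """Ratio of the new generation rate to the old one."""
    return new_rate / old_rate

def seconds_for(tokens, rate):
    """Time in seconds to emit `tokens` tokens at `rate` tokens per second."""
    return tokens / rate

ratio = speedup(ACCELERATED_TPS, BASELINE_TPS)

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 256
baseline_seconds = seconds_for(RESPONSE_TOKENS, BASELINE_TPS)
accelerated_seconds = seconds_for(RESPONSE_TOKENS, ACCELERATED_TPS)
```

The ratio works out to roughly 1.8x, so a response that takes about a minute and a half at the baseline rate would take a little under a minute at the reported rate.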
- [197] arXiv:2504.17381 [pdf, html, other]
-
Title: Subtrajectory Clustering and Coverage Maximization in Cubic Time, or Better
Subjects: Computational Geometry (cs.CG)
Many application areas collect unstructured trajectory data. In subtrajectory clustering, one is interested in finding patterns in this data using a hybrid combination of segmentation and clustering. We analyze two variants of this problem based on the well-known \textsc{SetCover} and \textsc{CoverageMaximization} problems. In both variants the set system is induced by metric balls under the Fréchet distance centered at polygonal curves. Our algorithms focus on improving the running time of the update step of the generic greedy algorithm by means of a careful combination of sweeps through a candidate space. In the first variant, we are given a polygonal curve $P$ of complexity $n$, a distance threshold $\Delta$, and a complexity bound $\ell$, and the goal is to identify a minimum-size set of center curves $\mathcal{C}$, where each center curve is of complexity at most $\ell$ and every point $p$ on $P$ is covered. A point $p$ on $P$ is covered if it is part of a subtrajectory $\pi_p$ of $P$ such that there is a center $c\in\mathcal{C}$ whose Fréchet distance to $\pi_p$ is at most $\Delta$. We present an approximation algorithm for this problem with a running time of $O((n^2\ell + \sqrt{k_\Delta}n^{5/2})\log^2n)$, where $k_\Delta$ is the size of an optimal solution. The algorithm gives a bicriteria approximation guarantee that relaxes the Fréchet distance threshold by a constant factor and the size of the solution by a factor of $O(\log n)$. The second problem variant asks for the maximum fraction of the input curve $P$ that can be covered using $k$ center curves, where $k\leq n$ is a parameter to the algorithm. Here, we show that our techniques lead to an algorithm with a running time of $O((k+\ell)n^2\log^2 n)$ and similar approximation guarantees. Note that in both algorithms $k,k_\Delta\in O(n)$ and hence the running time is cubic, or better if $k\ll n$.
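For readers unfamiliar with the generic greedy algorithm this abstract refers to, here is a minimal sketch on a toy finite set system. It is the textbook greedy for set cover, not the paper's algorithm: in the paper the elements are points on the curve $P$ and the sets are induced by Fréchet balls, and the contribution is a faster update step. The naive recomputation of gains below, the function name, and the toy instance are all assumptions for illustration.

```python
def greedy_set_cover(universe, candidates):
    """Textbook greedy: repeatedly pick the candidate covering the most uncovered elements.

    universe: iterable of elements to cover.
    candidates: dict mapping a candidate's name to the set of elements it covers.
    Returns the list of chosen candidate names, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Naive update step: recompute every candidate's gain from scratch.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            raise ValueError("remaining elements cannot be covered")
        chosen.append(best)
        uncovered -= gain
    return chosen

# Hypothetical toy instance.
universe = {1, 2, 3, 4, 5, 6}
candidates = {
    "a": {1, 2, 3, 4},
    "b": {4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
cover = greedy_set_cover(universe, candidates)
```

On this instance the greedy first picks `a` (four new elements) and then `c` (the two remaining ones). The classical analysis gives this scheme a logarithmic approximation factor, which matches the $O(\log n)$ relaxation of the solution size mentioned in the abstract.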
- [198] arXiv:2504.17390 [pdf, html, other]
-
Title: PicPersona-TOD: A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Comments: Accepted in NAACL 2025 main
Subjects: Computation and Language (cs.CL)
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonous responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses but also demonstrates robust performance across unseen domains. See this https URL.
- [199] arXiv:2504.17392 [pdf, html, other]
-
Title: Edge-weighted Online Stochastic Matching Under Jaillet-Lu LP
Subjects: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
The online stochastic matching problem was introduced by [FMMM09], together with the $(1-\frac1e)$-competitive Suggested Matching algorithm. In the most general edge-weighted setting, this ratio was not improved for more than a decade, until recently [Yan24] beat the $1-\frac1e$ bound and [QFZW23] further improved the ratio to $0.650$. Both of these works measure the online competitiveness against the offline LP relaxation introduced by [JL14]. This LP has also played an important role in other settings since it is a natural choice for two-choice online algorithms.
In this paper, we propose an upper bound of $0.663$ and a lower bound of $0.662$ for edge-weighted online stochastic matching under Jaillet-Lu LP. First, we propose a hard instance and prove that the optimal online algorithm for this instance only has a competitive ratio $<0.663$. Then, we show that a near-optimal algorithm for this instance can be generalized to work on all instances and achieve a competitive ratio $>0.662$. This indicates that more powerful LPs are necessary if we want to further improve the ratio by $0.001$.
- [200] arXiv:2504.17393 [pdf, html, other]
-
Title: Towards User-Centred Design of AI-Assisted Decision-Making in Law Enforcement
Vesna Nowack, Dalal Alrajeh, Carolina Gutierrez Muñoz, Katie Thomas, William Hobson, Catherine Hamilton-Giachritsis, Patrick Benjamin, Tim Grant, Juliane A. Kloess, Jessica Woodhams
Comments: 10 pages, 1 figure
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Artificial Intelligence (AI) has become an important part of our everyday lives, yet user requirements for designing AI-assisted systems in law enforcement remain unclear. To address this gap, we conducted qualitative research on decision-making within a law enforcement agency. Our study aimed to identify limitations of existing practices, explore user requirements and understand the responsibilities that humans expect to undertake in these systems.
Participants in our study highlighted the need for a system capable of processing and analysing large volumes of data efficiently to help in crime detection and prevention. Additionally, the system should satisfy requirements for scalability, accuracy, justification, trustworthiness and adaptability to be adopted in this domain. Participants also emphasised the importance of having end users review the input data that might be challenging for AI to interpret, and validate the generated output to ensure the system's accuracy. To keep up with the evolving nature of the law enforcement domain, end users need to help the system adapt to the changes in criminal behaviour and government guidance, and technical experts need to regularly oversee and monitor the system. Furthermore, user-friendly human interaction with the system is essential for its adoption and some of the participants confirmed they would be happy to be in the loop and provide necessary feedback that the system can learn from. Finally, we argue that it is very unlikely that the system will ever achieve full automation due to the dynamic and complex nature of the law enforcement domain.
- [201] arXiv:2504.17395 [pdf, html, other]
-
Title: SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
- [202] arXiv:2504.17397 [pdf, html, other]
-
Title: Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models
Comments: Code available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.
- [203] arXiv:2504.17399 [pdf, html, other]
-
Title: S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Collective Perception (CP) has emerged as a promising approach to overcome the limitations of individual perception in the context of autonomous driving. Various approaches have been proposed to realize collective perception; however, the Sensor2Sensor domain gap that arises from the utilization of different sensor systems in Connected and Automated Vehicles (CAVs) remains mostly unaddressed. This is primarily due to the paucity of datasets containing heterogeneous sensor setups among the CAVs. The recently released SCOPE dataset addresses this issue by providing data from three different LiDAR sensors for each CAV. This study is the first to tackle the Sensor2Sensor domain gap in vehicle-to-vehicle (V2V) collective perception. First, we present our sensor-domain robust architecture S2S-Net. Then, we conduct an in-depth analysis of the Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset. S2S-Net maintains very high performance in unseen sensor domains and achieves state-of-the-art results on the SCOPE dataset.
- [204] arXiv:2504.17401 [pdf, html, other]
-
Title: StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280*1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
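The EPE and Bad2/Bad3 figures quoted in this abstract are standard disparity-error metrics. The sketch below uses the definitions common in stereo benchmarks: EPE is the mean absolute disparity error in pixels, and Bad-N is the percentage of pixels whose error exceeds N pixels. The paper's exact evaluation protocol (for example, masking of invalid pixels) may differ, and the function names and sample values here are hypothetical.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)

def bad_n(pred, gt, threshold):
    """Percentage of pixels whose absolute disparity error exceeds `threshold` pixels."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return 100.0 * sum(e > threshold for e in errors) / len(errors)

# Hypothetical flattened disparity maps (predicted vs. ground truth), in pixels.
pred = [10.0, 12.5, 20.0, 31.0]
gt = [10.5, 12.0, 24.0, 30.0]

mean_error = epe(pred, gt)      # errors are 0.5, 0.5, 4.0, 1.0
bad2 = bad_n(pred, gt, 2.0)     # only the 4.0 px error exceeds 2 px
bad3 = bad_n(pred, gt, 3.0)     # only the 4.0 px error exceeds 3 px
```

Lower is better for all three quantities. EPE averages over all pixels, while Bad-N is sensitive only to how many pixels miss by more than the threshold, which is why a method can rank first on one and second on the other.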
- [205] arXiv:2504.17402 [pdf, html, other]
-
Title: Assessing the Capability of Large Language Models for Domain-Specific Ontology Generation
Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisarkka, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese
Subjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have shown significant potential for ontology engineering. However, it is still unclear to what extent they are applicable to the task of domain-specific ontology generation. In this study, we explore the application of LLMs for automated ontology generation and evaluate their performance across different domains. Specifically, we investigate the generalizability of two state-of-the-art LLMs, DeepSeek and o1-preview, both equipped with reasoning capabilities, by generating ontologies from a set of competency questions (CQs) and related user stories. Our experimental setup comprises six distinct domains carried out in existing ontology engineering projects and a total of 95 curated CQs designed to test the models' reasoning for ontology engineering. Our findings show that with both LLMs, the performance of the experiments is remarkably consistent across all domains, indicating that these methods are capable of generalizing ontology generation tasks irrespective of the domain. These results highlight the potential of LLM-based approaches in achieving scalable and domain-agnostic ontology construction and lay the groundwork for further research into enhancing automated reasoning and knowledge representation techniques.
- [206] arXiv:2504.17403 [pdf, html, other]
-
Title: Coding for Computation: Efficient Compression of Neural Networks for Reconfigurable Hardware
Comments: Accepted at the 2025 IEEE Statistical Signal Processing (SSP) Workshop, Edinburgh
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
As state-of-the-art neural networks (NNs) continue to grow in size, their resource-efficient implementation becomes ever more important. In this paper, we introduce a compression scheme that reduces the number of computations required for NN inference on reconfigurable hardware such as FPGAs. This is achieved by combining pruning via regularized training, weight sharing and linear computation coding (LCC). Contrary to common NN compression techniques, where the objective is to reduce the memory used for storing the weights of the NNs, our approach is optimized to reduce the number of additions required for inference in a hardware-friendly manner. The proposed scheme achieves competitive performance for simple multilayer perceptrons, as well as for large-scale deep NNs such as ResNet-34.
- [207] arXiv:2504.17404 [pdf, html, other]
-
Title: Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao, Yuwei Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, Haibo Tong, Yao Liang, Dongqi Liang, Kang Sun, Lei Wang, Yitao Liang, Chao Liu, Yaodong Yang, Yi Zeng
Subjects: Artificial Intelligence (cs.AI)
Artificial Intelligence (AI) systems are becoming increasingly powerful and autonomous, and may progress to Artificial Superintelligence (ASI), surpassing human intelligence levels. During the progression from AI to ASI, AI may exceed human control, violate human values, and even lead to irreversible catastrophic consequences in extreme cases. This gives rise to a pressing issue that needs to be addressed: superalignment, that is, ensuring that AI systems much smarter than humans remain aligned with human (compatible) intentions and values. Existing scalable oversight and weak-to-strong generalization methods may prove substantially infeasible and inadequate when facing ASI. We must explore safer and more pluralistic frameworks and approaches for superalignment. In this paper, we redefine superalignment as the human-AI co-alignment towards a sustainable symbiotic society, and highlight a framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity's evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguishing good from evil and proactively considering human well-being, ultimately attaining human-AI co-alignment through iterative interaction. The integration of externally-driven oversight with intrinsically-driven proactive alignment empowers sustainable symbiotic societies through human-AI co-alignment, paving the way for achieving safe and beneficial AGI and ASI for good, for humans, and for a symbiotic ecology.
- [208] arXiv:2504.17406 [pdf, html, other]
-
Title: Finding Conditions for Target Controllability under Christmas Trees
Comments: Submitted at the Conference on Decision and Control 2025
Subjects: Systems and Control (eess.SY)
This paper presents new graph-theoretic conditions for structural target controllability of directed networks. After reviewing existing conditions and highlighting some gaps in the literature, we introduce a new class of network systems, named Christmas trees, which generalizes trees and cacti. We then establish a graph-theoretic characterization of sets of nodes that are structurally target controllable for a simple subclass of Christmas trees. Our characterization applies to general network systems by considering spanning subgraphs of the Christmas tree class and allows us to uncover target controllable sets that existing criteria fail to identify.
- [209] arXiv:2504.17409 [pdf, html, other]
-
Title: AGCo-MATA: Air-Ground Collaborative Multi-Agent Task Allocation in Mobile Crowdsensing
Subjects: Multiagent Systems (cs.MA)
Rapid progress in intelligent unmanned systems has presented new opportunities for mobile crowd sensing (MCS). Today, heterogeneous air-ground collaborative multi-agent frameworks, which comprise unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), offer superior flexibility and efficiency compared to traditional homogeneous frameworks in complex sensing tasks. Within this context, task allocation among different agents plays an important role in improving overall MCS quality. To better allocate tasks among heterogeneous collaborative agents, in this paper we investigated two representative complex multi-agent task allocation scenarios with dual optimization objectives: (1) in the AG-FAMT (Air-Ground Few Agents More Tasks) scenario, the objectives are to maximize task completion while minimizing the total travel distance; (2) in the AG-MAFT (Air-Ground More Agents Few Tasks) scenario, where the agents are allocated based on their locations, the objectives are to minimize the total travel distance while reducing travel time cost. To achieve this, we proposed a Multi-Task Minimum Cost Maximum Flow (MT-MCMF) optimization algorithm tailored for AG-FAMT, along with a multi-objective optimization algorithm called W-ILP designed for AG-MAFT, with a particular focus on optimizing the charging path planning of UAVs. Our experiments based on a large-scale real-world dataset demonstrated that both proposed algorithms outperform baseline approaches under varying experimental settings, including task quantity, task difficulty, and task distribution, providing a novel way to improve the overall quality of mobile crowdsensing tasks.
- [210] arXiv:2504.17410 [pdf, html, other]
-
Title: Bias-Eliminated PnP for Stereo Visual Odometry: Provably Consistent and Large-Scale Localization
Comments: 10 pages, 7 figures
Subjects: Robotics (cs.RO)
In this paper, we first present a bias-eliminated weighted (Bias-Eli-W) perspective-n-point (PnP) estimator for stereo visual odometry (VO) with provable consistency. Specifically, leveraging statistical theory, we develop an asymptotically unbiased and $\sqrt {n}$-consistent PnP estimator that accounts for varying 3D triangulation uncertainties, ensuring that the relative pose estimate converges to the ground truth as the number of features increases. Next, on the stereo VO pipeline side, we propose a framework that continuously triangulates contemporary features for tracking new frames, effectively decoupling temporal dependencies between pose and 3D point errors. We integrate the Bias-Eli-W PnP estimator into the proposed stereo VO pipeline, creating a synergistic effect that enhances the suppression of pose estimation errors. We validate the performance of our method on the KITTI and Oxford RobotCar datasets. Experimental results demonstrate that our method: 1) achieves significant improvements in both relative pose error and absolute trajectory error in large-scale environments; 2) provides reliable localization under erratic and unpredictable robot motions. The successful implementation of the Bias-Eli-W PnP in stereo VO indicates the importance of information screening in robotic estimation tasks with high-uncertainty measurements, shedding light on diverse applications where PnP is a key ingredient.
- [211] arXiv:2504.17412 [pdf, html, other]
-
Title: Catalytic Computing and Register Programs Beyond Log-Depth
Subjects: Computational Complexity (cs.CC)
In a seminal work, Buhrman et al. (STOC 2014) defined the class $CSPACE(s,c)$ of problems solvable in space $s$ with an additional catalytic tape of size $c$, which is a tape whose initial content must be restored at the end of the computation. They showed that uniform $TC^1$ circuits are computable in catalytic logspace, i.e., $CL=CSPACE(O(\log{n}), 2^{O(\log{n})})$, thus giving strong evidence that catalytic space gives $L$ strict additional power. Their study focuses on an arithmetic model called register programs, which has been a focal point in development since then.
Understanding $CL$ remains a major open problem, as $TC^1$ remains the most powerful containment to date. In this work, we study the power of catalytic space and register programs to compute circuits of larger depth. Using register programs, we show that for every $\epsilon > 0$,
$SAC^2 \subseteq CSPACE\left(O\left(\frac{\log^2{n}}{\log\log{n}}\right), 2^{O(\log^{1+\epsilon} n)}\right)$
This is an $O(\log \log n)$ factor improvement on the free space needed to compute $SAC^2$, which can be accomplished with near-polynomial catalytic space.
We also exhibit non-trivial register programs for matrix powering, which is a further step towards showing $NC^2 \subseteq CL$.
- [212] arXiv:2504.17414 [pdf, html, other]
-
Title: 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at this https URL.
- [213] arXiv:2504.17418 [pdf, html, other]
-
Title: Longitudinal Control for Autonomous Racing with Combustion Engine VehiclesComments: 8 pages, 9 FiguresSubjects: Systems and Control (eess.SY); Robotics (cs.RO)
Usually, a controller for path or trajectory tracking is employed in autonomous driving. Typically, these controllers generate high-level commands like longitudinal acceleration or force. However, vehicles with combustion engines expect different actuation inputs. This paper proposes a longitudinal control concept that translates high-level trajectory-tracking commands to the required low-level vehicle commands such as throttle, brake pressure and a desired gear. We chose a modular structure to easily integrate different trajectory-tracking control algorithms and vehicles. The proposed control concept enables close tracking of the high-level control command. An anti-lock braking system, traction control, and brake warmup control also ensure safe operation during real-world tests. We provide experimental validation of our concept using real-world data with longitudinal accelerations reaching up to $25 \, \frac{\mathrm{m}}{\mathrm{s}^2}$. The experiments were conducted using the EAV24 racecar during the first event of the Abu Dhabi Autonomous Racing League on the Yas Marina Formula 1 Circuit.
- [214] arXiv:2504.17419 [pdf, other]
-
Title: How Do Communities of ML-Enabled Systems Smell? A Cross-Sectional Study on the Prevalence of Community SmellsSubjects: Software Engineering (cs.SE)
Effective software development relies on managing both collaboration and technology, but socio-technical challenges can harm team dynamics and increase technical debt. Although teams working on ML-enabled systems are interdisciplinary, research has largely focused on technical issues, leaving their socio-technical dynamics underexplored. This study aims to address this gap by examining the prevalence, evolution, and interrelations of community smells in open-source ML projects. We conducted an empirical study on 188 repositories from the NICHE dataset using the CADOCS tool to identify and analyze community smells. Our analysis focused on their prevalence, interrelations, and temporal variations. We found that certain smells, such as Prima Donna Effects and Sharing Villainy, are more prevalent and fluctuate more over time than others, such as Radio Silence or Organizational Skirmish. These insights might provide valuable support for ML project managers in addressing socio-technical issues and improving team coordination.
- [215] arXiv:2504.17421 [pdf, html, other]
-
Title: Towards Harnessing the Collaborative Power of Large and Small Models for Domain TasksYang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin ZhangSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated remarkable capabilities, but they require vast amounts of data and computational resources. In contrast, smaller models (SMs), while less powerful, can be more efficient and tailored to specific domains. In this position paper, we argue that taking a collaborative approach, where large and small models work synergistically, can accelerate the adaptation of LLMs to private domains and unlock new potential in AI. We explore various strategies for model collaboration and identify potential challenges and opportunities. Building upon this, we advocate for industry-driven research that prioritizes multi-objective benchmarks on real-world private datasets and applications.
- [216] arXiv:2504.17424 [pdf, html, other]
-
Title: Object Pose Estimation by Camera Arm Control Based on the Next Viewpoint EstimationJournal-ref: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
We have developed a new method to estimate a Next Viewpoint (NV) that is effective for pose estimation of simple-shaped products for product display robots in retail stores. Pose estimation methods using Neural Networks (NN) based on an RGBD camera are highly accurate, but their accuracy significantly decreases when the camera acquires few texture and shape features at the current viewpoint. However, it is difficult for previous mathematical model-based methods to estimate an effective NV because simple-shaped objects have few shape features. Therefore, we focus on the relationship between pose estimation and NV estimation: when the pose estimation is more accurate, the NV estimation is more accurate. We therefore develop a new pose estimation NN that estimates the NV simultaneously. Experimental results showed that our NV estimation achieved a pose estimation success rate of 77.3\%, which was 7.4 percentage points higher than that of the mathematical model-based NV calculation. Moreover, we verified that the robot using our method displayed 84.2\% of products.
- [217] arXiv:2504.17426 [pdf, html, other]
-
Title: Towards Leveraging Large Language Model Summaries for Topic Modeling in Source CodeSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Understanding source code is a topic of great interest in the software engineering community, since it can help programmers in various tasks such as software maintenance and reuse. Recent advances in large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs. Our method consists of applying topic modeling to the descriptions obtained by asking an LLM to summarize the code. To assess the internal consistency of the extracted topics, we compare them against topics inferred from function names alone, and those derived from existing docstrings. Experimental results suggest that leveraging LLM-generated summaries provides an interpretable and semantically rich representation of code structure. The promising results suggest that our approach can be fruitfully applied in various software engineering tasks such as automatic documentation and tagging, code search, software reorganization and knowledge discovery in large repositories.
- [218] arXiv:2504.17427 [pdf, html, other]
-
Title: Beyond Whole Dialogue Modeling: Contextual Disentanglement for Conversational RecommendationSubjects: Information Retrieval (cs.IR)
Conversational recommender systems aim to provide personalized recommendations by analyzing and utilizing contextual information related to dialogue. However, existing methods typically model the dialogue context as a whole, neglecting the inherent complexity and entanglement within the dialogue. Specifically, a dialogue comprises both focus information and background information, which mutually influence each other. Current methods tend to model these two types of information together without distinguishing them, leading to misinterpretation of users' actual needs and thereby lowering the accuracy of recommendations. To address this issue, this paper proposes a novel model, named DisenCRS, that introduces contextual disentanglement to improve conversational recommender systems. DisenCRS employs a dual disentanglement framework, including self-supervised contrastive disentanglement and counterfactual inference disentanglement, to effectively distinguish focus information and background information from the dialogue context under unsupervised conditions. Moreover, we design an adaptive prompt learning module to automatically select the most suitable prompt based on the specific dialogue context, fully leveraging the power of large language models. Experimental results on two widely used public datasets demonstrate that DisenCRS significantly outperforms existing conversational recommendation models, achieving superior performance on both item recommendation and response generation tasks.
- [219] arXiv:2504.17428 [pdf, html, other]
-
Title: Detection, Classification and Prevalence of Self-Admitted Aging DebtComments: DraftSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Literature (cs.GL)
Context: Previous research on software aging is limited, focusing on dynamic runtime indicators like memory and performance, often neglecting evolutionary indicators like source code comments and narrowly examining legacy issues within the technical debt (TD) context. Objective: We introduce the concept of Aging Debt (AD), representing the increased maintenance efforts and costs needed to keep software updated. We study AD through Self-Admitted Aging Debt (SAAD) observed in source code comments left by software developers. Method: We employ a mixed-methods approach, combining qualitative and quantitative analyses to detect and measure AD in software. This includes framing SAAD patterns from the source code comments after analysing the source code context, then utilizing the SAAD patterns to detect SAAD comments. In the process, we develop a taxonomy for SAAD that reflects the temporal aging of software and its associated debt. Then we utilize the taxonomy to quantify the different types of AD prevalent in OSS repositories. Results: Our proposed taxonomy categorizes temporal software aging into Active and Dormant types. Our extensive analysis of over 9,000 Open Source Software (OSS) repositories reveals that more than 21% of repositories exhibit signs of SAAD as observed from our gold standard SAAD dataset. Notably, Dormant AD emerges as the predominant category, highlighting a critical but often overlooked aspect of software maintenance. Conclusion: As software volume grows annually, so do evolutionary aging and maintenance challenges; our proposed taxonomy can aid researchers in detailed software aging studies and help practitioners develop improved and proactive maintenance strategies.
- [220] arXiv:2504.17432 [pdf, html, other]
-
Title: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMsTiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang DengComments: 13 pages, 8 figures, Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
- [221] arXiv:2504.17438 [pdf, other]
-
Title: Storing and Querying Evolving Graphs in NoSQL Storage ModelsSubjects: Databases (cs.DB)
This paper investigates advanced storage models for evolving graphs, focusing on the efficient management of historical data and the optimization of global query performance. Evolving graphs, which represent dynamic relationships between entities over time, present unique challenges in preserving their complete history while supporting complex analytical queries. We first briefly review the current state of the art, focusing mainly on distributed historical graph databases, to provide the context of our proposals. We investigate the implementation of an enhanced vertex-centric storage model in MongoDB that prioritizes space efficiency by leveraging in-database query mechanisms to minimize redundant data and reduce storage costs. To ensure broad applicability, we employ datasets, some of which are generated with the LDBC SNB generator, appropriately post-processed to utilize both snapshot- and interval-based representations. Our experimental results, in both centralized and distributed infrastructures, demonstrate significant improvements in query performance, particularly for resource-intensive global queries that traditionally suffer from inefficiencies in entity-centric frameworks. The proposed model achieves these gains by optimizing memory usage, reducing client involvement, and exploiting the computational capabilities of MongoDB. By addressing key bottlenecks in the storage and processing of evolving graphs, this study demonstrates a step toward a robust and scalable framework for managing dynamic graph data. This work contributes to the growing field of temporal graph analytics by enabling more efficient exploration of historical data and facilitating real-time insights into the evolution of complex networks.
- [222] arXiv:2504.17441 [pdf, html, other]
-
Title: Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object UnderstandingComments: See our website at: this https URL First two authors contributed equallySubjects: Computer Vision and Pattern Recognition (cs.CV)
Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.
- [223] arXiv:2504.17443 [pdf, html, other]
-
Title: Morphisms and BWT-run SensitivityComments: SubmittedSubjects: Formal Languages and Automata Theory (cs.FL); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)
We study how the application of injective morphisms affects the number $r$ of equal-letter runs in the Burrows-Wheeler Transform (BWT). This parameter has emerged as a key repetitiveness measure in compressed indexing. We focus on the notion of BWT-run sensitivity after application of an injective morphism. For binary alphabets, we characterize the class of morphisms that preserve the number of BWT-runs up to a bounded additive increase, by showing that it coincides with the known class of primitivity-preserving morphisms, which are those that map primitive words to primitive words. We further prove that deciding whether a given binary morphism has bounded BWT-run sensitivity is possible in polynomial time with respect to the total length of the images of the two letters. Additionally, we explore new structural and combinatorial properties of synchronizing and recognizable morphisms. These results establish new connections between BWT-based compressibility, code theory, and symbolic dynamics.
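As an illustration of the quantity studied above, the following minimal Python sketch computes the number $r$ of equal-letter runs in the BWT of a word and of its image under a morphism. It assumes the sentinel-free definition of the BWT via sorted cyclic rotations; the function names and the sample morphism are hypothetical choices for illustration and are not taken from the paper.

```python
from itertools import groupby


def bwt(word: str) -> str:
    """BWT of a word: last column of its sorted cyclic rotations (no sentinel)."""
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_runs(word: str) -> int:
    """Number r of maximal equal-letter runs in the BWT of the word."""
    return sum(1 for _ in groupby(bwt(word)))


def apply_morphism(morphism: dict[str, str], word: str) -> str:
    """Replace each letter by its image and concatenate the images."""
    return "".join(morphism[letter] for letter in word)


# Hypothetical binary morphism chosen only for illustration.
mu = {"a": "ab", "b": "ba"}
word = "aab"
image = apply_morphism(mu, word)
print(bwt_runs(word), bwt_runs(image))  # r before and after applying mu
```

Sorting all rotations takes quadratic space and is only suitable for short words; it is used here because it mirrors the definition directly.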
- [224] arXiv:2504.17444 [pdf, other]
-
Title: Encode the $\forall\exists$ Relational Hoare Logic into Standard Hoare LogicSubjects: Programming Languages (cs.PL)
Verifying a real-world program's functional correctness can be decomposed into (1) a refinement proof showing that the program implements a more abstract high-level program and (2) an algorithm correctness proof at the high level. Relational Hoare logic serves as a powerful tool to establish refinement but often necessitates formalization beyond standard Hoare logic. Particularly in the nondeterministic setting, the $\forall\exists$ relational Hoare logic is required. Existing approaches encode this logic into a Hoare logic with ghost states and invariants, yet these extensions significantly increase formalization complexity and soundness proof overhead. This paper proposes a generic encoding theory that reduces the $\forall\exists$ relational Hoare logic to standard (unary) Hoare logic. Precisely, we propose to redefine the validity of relational Hoare triples while preserving the original proof rules and then encapsulate the $\forall\exists$ pattern within assertions. We have proved that the validity of encoded standard Hoare triples is equivalent to the validity of the desired relational Hoare triples. Moreover, the encoding theory demonstrates how common relational Hoare logic proof rules are indeed special cases of standard Hoare logic proof rules, and relational proof steps correspond to standard proof steps. Our theory enables standard Hoare logic to prove $\forall\exists$ relational properties by defining a predicate Exec, without requiring modifications to the logic framework or re-verification of soundness.
- [225] arXiv:2504.17445 [pdf, html, other]
-
Title: Creating Targeted, Interpretable Topic Models with LLM-Generated Text AugmentationComments: Presented at IC2S2 2024 in Philadelphia, USASubjects: Computation and Language (cs.CL)
Unsupervised machine learning techniques, such as topic modeling and clustering, are often used to identify latent patterns in unstructured text data in fields such as political science and sociology. These methods overcome common concerns about reproducibility and costliness involved in the labor-intensive process of human qualitative analysis. However, two major limitations of topic models are their interpretability and their practicality for answering targeted, domain-specific social science research questions. In this work, we investigate opportunities for using LLM-generated text augmentation to improve the usefulness of topic modeling output. We use a political science case study to evaluate our results in a domain-specific application, and find that topic modeling using GPT-4 augmentations creates highly interpretable categories that can be used to investigate domain-specific research questions with minimal human guidance.
- [226] arXiv:2504.17447 [pdf, html, other]
-
Title: FRAG: Frame Selection Augmented Generation for Long Video and Long Document UnderstandingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: this https URL
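The selection step described above (score each frame independently, then keep the Top-K) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `score` callable stands in for whatever relevance score an LMM would assign to a frame, and returning the selected frames in their original order is an assumption made here for readability.

```python
import heapq
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_top_k(
    frames: Sequence[Frame],
    score: Callable[[Frame], float],  # hypothetical per-frame relevance scorer
    k: int,
) -> list[Frame]:
    """Score every frame independently and keep the k highest-scoring ones."""
    # Each frame is scored on its own, so no long-context processing is needed.
    scored = [(score(frame), index) for index, frame in enumerate(frames)]
    top = heapq.nlargest(k, scored)
    # Assumption: restore the original (temporal or page) order of the kept frames.
    keep = sorted(index for _, index in top)
    return [frames[index] for index in keep]


# Toy usage with made-up scores in place of a real model.
toy_scores = {"f0": 0.1, "f1": 0.9, "f2": 0.3, "f3": 0.8}
selected = select_top_k(list(toy_scores), toy_scores.__getitem__, k=2)
print(selected)
```

The selected frames would then be passed to the model to generate the final answer; that generation step is outside the scope of this sketch.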
- [227] arXiv:2504.17448 [pdf, html, other]
-
Title: CHASe: Client Heterogeneity-Aware Data Selection for Effective Federated Active LearningComments: Accepted by TKDE 2025Subjects: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for 1) tracking EVs by analyzing inference inconsistencies across training epochs, 2) calibrating decision boundaries of inaccurate models with a new alignment loss, and 3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in terms of effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.
- [228] arXiv:2504.17449 [pdf, html, other]
-
Title: HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language ModelsComments: Accepted by VLDBJ 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.
- [229] arXiv:2504.17454 [pdf, html, other]
-
Title: Adaptive Orchestration of Modular Generative Information Access SystemsComments: Accepted at SIGIR 2025 Perspective Paper TrackSubjects: Information Retrieval (cs.IR)
Advancements in large language models (LLMs) have driven the emergence of complex new systems to provide access to information, which we collectively refer to as modular generative information access (GenIA) systems. They integrate a broad and evolving range of specialized components, including LLMs, retrieval models, and a heterogeneous set of sources and tools. While modularity offers flexibility, it also raises critical challenges: How can we systematically characterize the space of possible modules and their interactions? How can we automate and optimize interactions among these heterogeneous components? And how do we enable this modular system to dynamically adapt to varying user query requirements and evolving module capabilities? In this perspective paper, we argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system through real-time adaptive orchestration -- where components' interactions are dynamically configured for each user input, maximizing information relevance while minimizing computational overhead. We give provisional answers to the questions raised above with a roadmap that depicts the key principles and methods for designing such an adaptive modular system. We identify pressing challenges and propose avenues for addressing them in the years ahead. This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures that evolve alongside their rapidly advancing underlying technologies.
- [230] arXiv:2504.17455 [pdf, html, other]
-
Title: An approach based on metaheuristic algorithms to the timetabling problem in deregulated railway marketsDavid Muñoz-Valero, Juan Moreno-Garcia, Julio Alberto López-Gómez, Enrique Adrian Villarrubia-Martin, Luis Rodriguez-BenitezComments: 20 pages, 16 figuresSubjects: Neural and Evolutionary Computing (cs.NE); Computational Engineering, Finance, and Science (cs.CE)
The train timetabling problem in liberalized railway markets represents a challenge to the coordination between infrastructure managers and railway undertakings. Efficient scheduling is critical in maximizing infrastructure capacity and utilization while adhering as closely as possible to the requests of railway undertakings. These objectives ultimately contribute to maximizing the infrastructure manager's revenues. This paper sets out a modular simulation framework to reproduce the dynamics of deregulated railway systems. Ten metaheuristic algorithms using the MEALPY Python library are then evaluated in order to optimize train schedules in the liberalized Spanish railway market. The results show that the Genetic Algorithm outperforms others in revenue optimization, convergence speed, and schedule adherence. Alternatives, such as Particle Swarm Optimization and Ant Colony Optimization Continuous, show slower convergence and higher variability. The results emphasize the trade-off between scheduling more trains and adhering to requested times, providing insights into solving complex scheduling problems in deregulated railway systems.
- [231] arXiv:2504.17457 [pdf, html, other]
-
Title: Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial AttacksZhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin DongComments: 14 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack (TBA), a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a Dual Heterogeneous Noise Generator (DHNG), which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom adversarial loss function to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0\% increase in estimation error, with an average improvement of approximately 17.0\%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.
- [232] arXiv:2504.17460 [pdf, html, other]
-
Title: A Lightweight Method for Generating Multi-Tier JIT Compilation Virtual Machine in a Meta-Tracing Compiler FrameworkComments: ECOOP 2025Subjects: Programming Languages (cs.PL)
Meta-compiler frameworks, such as RPython and Graal/Truffle, generate high-performance virtual machines (VMs) from interpreter definitions. Although they generate VMs with high-quality just-in-time (JIT) compilers, they still lack an important feature that dedicated VMs (i.e., VMs that are developed for specific languages) have, namely multi-tier compilation. Multi-tier compilation uses light-weight compilers at early stages and highly-optimizing compilers at later stages in order to balance between compilation overheads and code quality.
We propose a novel approach to enabling multi-tier compilation in the VMs generated by a meta-compiler framework. Instead of extending the JIT compiler backend of the framework, our approach drives an existing (heavyweight) compiler backend in the framework to quickly generate unoptimized native code by merely embedding directives and compile-time operations into interpreter definitions.
As a validation of the approach, we developed 2SOM, a Simple Object Machine with a two-tier JIT compiler based on RPython. 2SOM first applies the tier-1 threaded code generator that is generated by our proposed technique, then, to the loops that exceed a threshold, applies the tier-2 tracing JIT compiler that is generated by the original RPython framework. Our performance evaluation, which runs a program with a realistic workload, showed that 2SOM improved warm-up performance by 15\% compared with an RPython-based VM, with merely a 5\% reduction in peak performance.
- [233] arXiv:2504.17461 [pdf, html, other]
-
Title: Evaluating Time Series Models for Urban Wastewater Management: Predictive Performance, Model Complexity and ResilienceComments: 6 pages, 6 figures, accepted at 10th International Conference on Smart and Sustainable Technologies (SpliTech) 2025, GitHub: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Climate change increases the frequency of extreme rainfall, placing a significant strain on urban infrastructures, especially Combined Sewer Systems (CSS). Overflows from overburdened CSS release untreated wastewater into surface waters, posing environmental and public health risks. Although traditional physics-based models are effective, they are costly to maintain and difficult to adapt to evolving system dynamics. Machine Learning (ML) approaches offer cost-efficient alternatives with greater adaptability. To systematically assess the potential of ML for modeling urban infrastructure systems, we propose a protocol for evaluating Neural Network architectures for CSS time series forecasting with respect to predictive performance, model complexity, and robustness to perturbations. In addition, we assess model performance on peak events and critical fluctuations, as these are the key regimes for urban wastewater management. To investigate the feasibility of lightweight models suitable for IoT deployment, we compare global models, which have access to all information, with local models, which rely solely on nearby sensor readings. Additionally, to explore the security risks posed by network outages or adversarial attacks on urban infrastructure, we introduce error models that assess the resilience of models. Our results demonstrate that while global models achieve higher predictive performance, local models provide sufficient resilience in decentralized scenarios, ensuring robust modeling of urban infrastructure. Furthermore, models with longer native forecast horizons exhibit greater robustness to data perturbations. These findings contribute to the development of interpretable and reliable ML solutions for sustainable urban wastewater management. The implementation is available in our GitHub repository.
- [234] arXiv:2504.17471 [pdf, html, other]
-
Title: GRANITE : a Byzantine-Resilient Dynamic Gossip Learning FrameworkSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL over dynamic graphs to Byzantine (model poisoning) attacks remains unaddressed, especially when Byzantine nodes attack the RPS protocol to scale up model poisoning. We address this issue by introducing GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of a fraction of Byzantine nodes. GRANITE relies on two key components: (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which leverages an estimate of Byzantine presence to set aggregation thresholds with formal guarantees. Empirical results confirm that GRANITE maintains convergence with up to 30% Byzantine nodes, improves learning speed via adaptive filtering of poisoned models, and obtains these results in graphs up to 9 times sparser than dictated by current theory.
- [235] arXiv:2504.17473 [pdf, html, other]
-
Title: Wolves in the Repository: A Software Engineering Analysis of the XZ Utils Supply Chain AttackPiotr Przymus (1), Thomas Durieux (2) ((1) Nicolaus Copernicus University in Torun, Poland, (2) TU Delft & Endor Labs, The Netherlands)Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
The digital economy runs on Open Source Software (OSS), with an estimated 90\% of modern applications containing open-source components. While this widespread adoption has revolutionized software development, it has also created critical security vulnerabilities, particularly in essential but under-resourced projects. This paper examines a sophisticated attack on the XZ Utils project (CVE-2024-3094), where attackers exploited not just code, but the entire open-source development process to inject a backdoor into a fundamental Linux compression library. Our analysis reveals a new breed of supply chain attack that manipulates software engineering practices themselves -- from community management to CI/CD configurations -- to establish legitimacy and maintain long-term control. Through a comprehensive examination of GitHub events and development artifacts, we reconstruct the attack timeline and analyze the evolution of attacker tactics. Our findings demonstrate how attackers leveraged seemingly beneficial contributions to project infrastructure and maintenance to bypass traditional security measures. This work extends beyond traditional security analysis by examining how software engineering practices themselves can be weaponized, offering insights for protecting the open-source ecosystem.
- [236] arXiv:2504.17474 [pdf, html, other]
-
Title: Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy DataSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
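The trend check described in the abstract above can be illustrated with a minimal sketch of the Mann-Kendall test applied to confidence-gap series. The function names, the data layout (one gap series per non-annotated class), the variance formula without tie correction, and the one-sided critical value of 1.645 are assumptions made for illustration; the abstract does not specify the authors' implementation.

```python
import math


def mann_kendall_z(series):
    """Return the Mann-Kendall Z statistic for a sequence (no tie correction)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # S = number of increasing pairs minus number of decreasing pairs (i < j).
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend null hypothesis, assuming no tied values.
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity correction: shrink |S| by one before standardising.
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def is_increasing(series, z_crit=1.645):
    """One-sided test: True when the upward trend is significant.

    z_crit=1.645 corresponds to a significance level of about 0.05
    (an assumed threshold, not taken from the paper).
    """
    return mann_kendall_z(series) > z_crit


def looks_correctly_labeled(confidence_gaps):
    """Hypothetical selection rule: every gap series must trend upward."""
    return all(is_increasing(gaps) for gaps in confidence_gaps)
```

A sample whose gap series all rise over training epochs passes this check; a single flat or falling series is enough to reject it.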
- [237] arXiv:2504.17480 [pdf, html, other]
-
Title: Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge DistillationSubjects: Computation and Language (cs.CL)
Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable.
- [238] arXiv:2504.17489 [pdf, html, other]
-
Title: Towards Equitable Rail Service Allocation Through Fairness-Oriented Timetabling in Liberalized MarketsDavid Muñoz-Valero, Juan Moreno-Garcia, Julio Alberto López-Gómez, Enrique Adrian Villarrubia-MartinComments: 30 pages, 7 figuresSubjects: Neural and Evolutionary Computing (cs.NE); Computational Engineering, Finance, and Science (cs.CE)
Over the last few decades, European rail transport has undergone major changes as part of the process of liberalization set out in European regulations. In this context of liberalization, railway undertakings compete with each other for the limited infrastructure capacity available to offer their rail services. The infrastructure manager is responsible for the equitable allocation of infrastructure between all companies in the market, which is essential to ensure the efficiency and sustainability of this competitive ecosystem. In this paper, a methodology based on Jain, Gini and Atkinson equity metrics is used to solve the rail service allocation problem in a liberalized railway market, analyzing the solutions obtained. The results show that the proposed methodology and the equity metrics used allow for equitable planning in different competitiveness scenarios. These results contrast with solutions where the objective of the infrastructure manager is to maximize its own profit, without regard for the equitable allocation of infrastructure. Therefore, the computational tests support the methodology and metrics used as a planning and decision support tool in a liberalized railway market.
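The three equity metrics named in the abstract above have standard textbook definitions, sketched below for a list of per-company allocations. The function names, the choice of what is being allocated, and the default inequality-aversion parameter `epsilon=0.5` are assumptions for illustration; the abstract does not state how the authors parameterize or combine the metrics.

```python
import math


def jain_index(x):
    """Jain's fairness index: 1.0 for equal shares, 1/n when one party gets all."""
    total = sum(x)
    squares = sum(v * v for v in x)
    return total * total / (len(x) * squares)


def gini_coefficient(x):
    """Gini coefficient via mean absolute difference: 0.0 means perfect equality."""
    n = len(x)
    mean = sum(x) / n
    abs_diff = sum(abs(a - b) for a in x for b in x)
    return abs_diff / (2 * n * n * mean)


def atkinson_index(x, epsilon=0.5):
    """Atkinson index: 0.0 means perfect equality.

    epsilon is the inequality-aversion parameter (assumed default).
    For epsilon >= 1 all values must be strictly positive.
    """
    n = len(x)
    mean = sum(x) / n
    if epsilon == 1:
        geometric_mean = math.exp(sum(math.log(v) for v in x) / n)
        return 1 - geometric_mean / mean
    # Equally-distributed-equivalent allocation for epsilon != 1.
    ede = (sum(v ** (1 - epsilon) for v in x) / n) ** (1 / (1 - epsilon))
    return 1 - ede / mean
```

Jain's index rises with fairness while the Gini and Atkinson indices fall, so an optimizer would maximize the first and minimize the other two.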
- [239] arXiv:2504.17490 [pdf, html, other]
-
Title: Plasticine: Accelerating Research in Plasticity-Motivated Deep Reinforcement LearningMingqi Yuan, Qi Wang, Guozheng Ma, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng TaoComments: 23 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Developing lifelong learning agents is crucial for artificial general intelligence. However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 10 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to open-ended environments. This framework enables researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across different contexts. Our documentation, examples, and source code are available at this https URL.
- [240] arXiv:2504.17492 [pdf, html, other]
-
Title: Prototype-enhanced prediction in graph neural networks for climate applicationsSubjects: Machine Learning (cs.LG)
Data-driven emulators are increasingly being used to learn and emulate physics-based simulations, reducing computational expense and run time. Here, we present a structured way to improve the quality of these high-dimensional emulated outputs, through the use of prototypes: an approximation of the emulator's output passed as an input, which informs the model and leads to better predictions. We demonstrate our approach to emulate atmospheric dispersion, key for greenhouse gas emissions monitoring, by comparing a baseline model to models trained using prototypes as an additional input. The prototype models achieve better performance, even with few prototypes and even if they are chosen at random, but we show that choosing the prototypes through data-driven methods (k-means) can lead to almost 10\% increased performance in some metrics.
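As a rough illustration of the data-driven prototype selection mentioned in the abstract above, the sketch below runs plain Lloyd's k-means over flattened outputs and returns the centroids as candidate prototypes. Everything here is an assumption: the function names, the deterministic first-k initialization, and the use of centroids themselves as prototypes. How a prototype is matched to a given input and fed to the graph neural network is not described in the abstract and is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_prototypes(samples, k, iterations=20):
    """Lloyd's algorithm; returns k centroids to use as prototypes.

    Initialisation takes the first k samples, chosen here for determinism
    (random or k-means++ seeding is more usual in practice).
    """
    if not 0 < k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    centroids = [list(s) for s in samples[:k]]
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            clusters[nearest].append(s)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids
```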
- [241] arXiv:2504.17493 [pdf, html, other]
-
Title: Goal-Oriented Time-Series Forecasting: Foundation Framework DesignLuca-Andrei Fechete, Mohamed Sana, Fadhel Ayed, Nicola Piovesan, Wenjie Li, Antonio De Domenico, Tareq Si SalemSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Traditional time-series forecasting often focuses only on minimizing prediction errors, ignoring the specific requirements of the real-world applications that employ the forecasts. This paper presents a new training methodology, which allows a forecasting model to dynamically adjust its focus based on the importance of forecast ranges specified by the end application. Unlike previous methods that fix these ranges beforehand, our training approach breaks down predictions over the entire signal range into smaller segments, which are then dynamically weighted and combined to produce accurate forecasts. We tested our method on standard datasets, including a new dataset from wireless communication, and found that it not only improves prediction accuracy but also improves the performance of the end application employing the forecasting model. This research provides a basis for creating forecasting systems that better connect prediction and decision-making in various practical applications.
- [242] arXiv:2504.17497 [pdf, html, other]
-
Title: Combining GCN Structural Learning with LLM Chemical Knowledge for Enhanced Virtual ScreeningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Virtual screening plays a critical role in modern drug discovery by enabling the identification of promising candidate molecules for experimental validation. Traditional machine learning methods such as support vector machines (SVM) and XGBoost rely on predefined molecular representations, often leading to information loss and potential bias. In contrast, deep learning approaches, particularly Graph Convolutional Networks (GCNs), offer a more expressive and unbiased alternative by operating directly on molecular graphs. Meanwhile, Large Language Models (LLMs) have recently demonstrated state-of-the-art performance in drug design, thanks to their capacity to capture complex chemical patterns from large-scale data via attention mechanisms.
In this paper, we propose a hybrid architecture that integrates GCNs with LLM-derived embeddings to combine localized structural learning with global chemical knowledge. The LLM embeddings can be precomputed and stored in a molecular feature library, removing the need to rerun the LLM during training or inference and thus maintaining computational efficiency. We found that concatenating the LLM embeddings after each GCN layer, rather than only at the final layer, significantly improves performance, enabling deeper integration of global context throughout the network. The resulting model achieves superior results, with an F1-score of 88.8%, outperforming standalone GCN (87.9%), XGBoost (85.5%), and SVM (85.4%) baselines.
- [243] arXiv:2504.17502 [pdf, html, other]
-
Title: RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image GenerationAviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan SzpektorSubjects: Computer Vision and Pattern Recognition (cs.CV)
Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., \emph{Animal}, \emph{Object}), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87\% accuracy.
- [244] arXiv:2504.17503 [pdf, html, other]
-
Title: Tailored minimal reservoir computing: on the bidirectional connection between nonlinearities in the reservoir and in dataComments: 13 pages, 11 figuresSubjects: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
We study how the degree of nonlinearity in the input data affects the optimal design of reservoir computers, focusing on how closely the model's nonlinearity should align with that of the data. By reducing minimal RCs to a single tunable nonlinearity parameter, we explore how the predictive performance varies with the degree of nonlinearity in the reservoir. To provide controlled testbeds, we generalize to the fractional Halvorsen system, a novel chaotic system with fractional exponents. Our experiments reveal that the prediction performance is maximized when the reservoir's nonlinearity matches the nonlinearity present in the data. In cases where multiple nonlinearities are present in the data, we find that the correlation dimension of the predicted signal is reconstructed correctly when the smallest nonlinearity is matched. We use this observation to propose a method for estimating the minimal nonlinearity in unknown time series by sweeping the reservoir exponent and identifying the transition to a successful reconstruction. Applying this method to both synthetic and real-world datasets, including financial time series, we demonstrate its practical viability. Finally, we transfer these insights to classical RC by augmenting traditional architectures with fractional, generalized reservoir states. This yields performance gains, particularly in resource-constrained scenarios such as physical reservoirs, where increasing reservoir size is impractical or economically unviable. Our work provides a principled route toward tailoring RCs to the intrinsic complexity of the systems they aim to model.
- [245] arXiv:2504.17510 [pdf, html, other]
-
Title: Safe to Stay: Psychological Safety Sustains Participation in Pull-based Open Source ProjectsComments: This work has been submitted to the IEEE for possible publicationSubjects: Software Engineering (cs.SE)
Psychological safety is the belief that team members can speak up or make mistakes without fear of negative consequences. While it is recognized as important in traditional software teams, its role in open-source development remains understudied. Yet, open-source contributors often collaborate without formal roles or structures, where interpersonal relationships can make or break participation. In this study, we examine whether team-level psychological safety, inferred from code review activities, is associated with contributors' continued participation in open-source projects. Code review is a central and collaborative activity in modern software development, which offers a rich context for observing team interactions. Based on 60,684 pull requests, we construct a psychological safety index using cues such as merge decisions, comment activity, interaction diversity, and mentions. We analyze its relationship with contributors' short-term (after 1 year) and long-term (after 4-5 years) sustained participation using three logistic regression models. Our findings show that contributors are more likely to remain active in repositories with higher levels of psychological safety. Psychological safety is positively associated with both short-term and future sustained participation. However, when prior participation is included, it becomes the stronger predictor of future sustained participation, while the effect of psychological safety becomes smaller. This study introduces a scalable approach to studying psychological safety through pull request data and provides new evidence that it matters in open-source development.
- [246] arXiv:2504.17511 [pdf, html, other]
-
Title: Subcode Ensemble Decoding of Polar CodesHenning Lulei, Jonathan Mandelbaum, Marvin Rübenacke, Holger Jäkel, Stephan ten Brink, Laurent SchmalenComments: Submitted to IEEESubjects: Information Theory (cs.IT)
In the short block length regime, pre-transformed polar codes together with successive cancellation list (SCL) decoding possess excellent error correction capabilities. However, in practice, the list size is limited due to the suboptimal scaling of the required area in hardware implementations. Automorphism ensemble decoding (AED) can improve performance for a fixed list size by running multiple parallel SCL decodings on permuted received words, yielding a list of estimates from which the final estimate is selected. Yet, AED is limited to appropriately designed polar codes. Subcode ensemble decoding (ScED) was recently proposed for low-density parity-check codes and does not impose such design constraints. It uses multiple decodings in different subcodes, ensuring that the selected subcodes jointly cover the original code. We extend ScED to polar codes by expressing polar subcodes through suitable pre-transformations (PTs). To this end, we describe a framework classifying pre-transformations for pre-transformed polar codes based on their role in encoding and decoding. Within this framework, we propose a new type of PT enabling ScED for polar codes, analyze its properties, and discuss how to construct an efficient ensemble.
- [247] arXiv:2504.17512 [pdf, html, other]
-
Title: Admittance Identification of Grid-Forming Inverters Using Time and Frequency-Domain TechniquesComments: 2025 IEEE Kiel PowerTechSubjects: Systems and Control (eess.SY)
The increasing integration of inverter-based resources (IBRs) into the power grid introduces new challenges, requiring detailed electromagnetic transient (EMT) studies to analyze system interactions. Despite these needs, access to the internal firmware of power electronic devices remains restricted due to stringent nondisclosure agreements enforced by manufacturers. To address this, we explore three system identification techniques: sweep frequency response analysis (SFRA), step excitation method (SEM), and eigensystem realization algorithm (ERA). SFRA employs sinusoidal signals of varying frequencies to measure the system's frequency response, while SEM and ERA utilize step functions to derive time-domain responses and transform them into Laplace-domain transfer functions. All three approaches are shown to provide consistent results in identifying the dq admittance of grid-forming inverters (GFM) over a frequency range of 1 Hz to 100 Hz.
- [248] arXiv:2504.17514 [pdf, html, other]
-
Title: Secure Network Function Computation for Linear Functions, Part II: Target-Function SecurityComments: 44 pagesSubjects: Information Theory (cs.IT)
In this Part II of a two-part paper, we put forward secure network function computation, where in a directed acyclic network, a sink node is required to compute a target function of which the inputs are generated as source messages at multiple source nodes, while a wiretapper, who can access any one but not more than one wiretap set in a given collection of wiretap sets, is not allowed to obtain any information about a security function of the source messages. In Part I of the two-part paper, we have investigated securely computing linear functions with the wiretapper who can eavesdrop any edge subset up to a certain size r, referred to as the security level, where the security function is the identity function. The notion of this security is called source security. In the current paper, we consider another interesting model which is the same as the above one except that the security function is identical to the target function, i.e., we need to protect the information on the target function from being leaked to the wiretapper. The notion of this security is called target-function security. We first prove a non-trivial upper bound on the secure computing capacity, which is applicable to arbitrary network topologies and arbitrary security levels. In particular, when the security level r is equal to 0, the upper bound reduces to the computing capacity without security consideration. Further, from an algebraic point of view, we prove two equivalent conditions for target-function security and source security for the existence of the corresponding linear function-computing secure network codes. With them, for any linear function over a given finite field, we develop a code construction of linear secure network codes for target-function security and thus obtain a lower bound on the secure computing capacity; and also generalize the code construction developed in Part I for source security.
- [249] arXiv:2504.17515 [pdf, html, other]
-
Title: Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image SegmentationComments: Accepted by IEEE TMI 2025. The code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation tasks. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by this success, in this paper we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, exceeding the previous SOTA of 88.61%. The code is available at this https URL.
- [250] arXiv:2504.17519 [pdf, html, other]
-
Title: Replication and Exploration of Generative Retrieval over Dynamic CorporaZhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun RenComments: Accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)Subjects: Information Retrieval (cs.IR)
Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated using a static document collection, and their performance in dynamic corpora where document collections evolve continuously is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with \textit{text-based} docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with \textit{numeric-based} docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to their excessive tendency toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) semantic alignment with language models' pretrained knowledge, (ii) fine-grained docid design, and (iii) high lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance on dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.
- [251] arXiv:2504.17520 [pdf, html, other]
-
Title: Communication-Efficient Personalized Distributed Learning with Data and Node HeterogeneityComments: Accepted by TCCNSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
To jointly tackle the challenges of data and node heterogeneity in decentralized learning, we propose a distributed strong lottery ticket hypothesis (DSLTH), based on which a communication-efficient personalized learning algorithm is developed. In the proposed method, each local model is represented as the Hadamard product of global real-valued parameters and a personalized binary mask for pruning. The local model is learned by updating and fusing the personalized binary masks while the real-valued parameters are fixed among different agents. To further reduce the complexity of hardware implementation, we incorporate a group sparse regularization term in the loss function, enabling the learned local model to achieve structured sparsity. Then, a binary mask aggregation algorithm is designed by introducing an intermediate aggregation tensor and adding a personalized fine-tuning step in each iteration, which constrains model updates towards the local data distribution. The proposed method effectively leverages the relativity among agents while meeting personalized requirements in heterogeneous node conditions. We also provide a theoretical proof for the DSLTH, establishing it as the foundation of the proposed method. Numerical simulations confirm the validity of the DSLTH and demonstrate the effectiveness of the proposed algorithm.
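The model representation described in the abstract above, a local model formed as the Hadamard product of shared real-valued parameters and a personalized binary mask, can be sketched for a single linear layer as follows. The function names, the nested-list weight layout, and the example values are assumptions for illustration; mask learning, aggregation, and the group sparse regularizer are not shown.

```python
def apply_mask(global_weights, binary_mask):
    """Element-wise (Hadamard) product of a weight matrix and a 0/1 mask."""
    if len(global_weights) != len(binary_mask):
        raise ValueError("weights and mask must have the same number of rows")
    masked = []
    for w_row, m_row in zip(global_weights, binary_mask):
        if len(w_row) != len(m_row):
            raise ValueError("weights and mask must have the same row lengths")
        masked.append([w * m for w, m in zip(w_row, m_row)])
    return masked


def linear_forward(weights, x):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Shared, fixed real-valued parameters (hypothetical values).
GLOBAL_WEIGHTS = [[1.0, -2.0], [3.0, 4.0]]

# Each agent keeps only its own binary mask (hypothetical values).
MASK_AGENT_A = [[1, 0], [0, 1]]
MASK_AGENT_B = [[0, 1], [1, 0]]

output_a = linear_forward(apply_mask(GLOBAL_WEIGHTS, MASK_AGENT_A), [1.0, 1.0])
output_b = linear_forward(apply_mask(GLOBAL_WEIGHTS, MASK_AGENT_B), [1.0, 1.0])
```

Because the real-valued parameters are fixed and shared, agents in such a scheme would only need to exchange binary masks, which is one plausible source of the communication savings the abstract refers to.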
- [252] arXiv:2504.17522 [pdf, html, other]
-
Title: Towards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse ScenariosSubjects: Computer Vision and Pattern Recognition (cs.CV)
Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at this https URL.
- [253] arXiv:2504.17523 [pdf, html, other]
-
Title: From Randomized Response to Randomized Index: Answering Subset Counting Queries with Local Differential PrivacyComments: This paper is accepted by IEEE S&P 2025Subjects: Databases (cs.DB); Cryptography and Security (cs.CR)
Local Differential Privacy (LDP) is the predominant privacy model for safeguarding individual data privacy. Existing perturbation mechanisms typically require perturbing the original values to ensure acceptable privacy, which inevitably results in value distortion and utility deterioration. In this work, we propose an alternative approach -- instead of perturbing values, we apply randomization to indexes of values while ensuring rigorous LDP guarantees. Inspired by the deniability of randomized indexes, we present CRIAD for answering subset counting queries on set-value data. By integrating a multi-dummy, multi-sample, and multi-group strategy, CRIAD serves as a fully scalable solution that offers flexibility across various privacy requirements and domain sizes, and achieves more accurate query results than any existing methods. Through comprehensive theoretical analysis and extensive experimental evaluations, we validate the effectiveness of CRIAD and demonstrate its superiority over traditional value-perturbation mechanisms.
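For context, the value-perturbation baseline that this abstract contrasts against can be sketched with classic binary randomized response. This is the textbook mechanism, not CRIAD: each user reports their true bit with probability e^ε / (1 + e^ε) and the collector debiases the sum. All function names below are assumptions chosen for illustration.

```python
import math
import random
from typing import List


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def perturb(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability p, otherwise report the flipped bit."""
    p = keep_probability(epsilon)
    return bit if rng.random() < p else 1 - bit


def estimate_count(reports: List[int], epsilon: float) -> float:
    """Unbiased estimate of how many users hold a 1, given perturbed reports.

    Requires epsilon > 0, since epsilon = 0 makes the reports uninformative.
    """
    p = keep_probability(epsilon)
    n = len(reports)
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)
```

The estimator follows from E[sum of reports] = c·p + (n − c)·(1 − p), where c is the true count; solving for c gives the expression in `estimate_count`. The distortion introduced by `perturb` is the utility cost that the abstract attributes to value perturbation.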
- [254] arXiv:2504.17524 [pdf, other]
-
Title: ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image InpaintingComments: 11 pages, 10 figures, submitted to TCSVTSubjects: Computer Vision and Pattern Recognition (cs.CV)
Image inpainting is a technique used to restore missing or damaged regions of an image. Traditional methods primarily utilize information from adjacent pixels for reconstructing missing areas, but they struggle to preserve complex details and structures. Meanwhile, models based on deep learning require substantial amounts of training data. To address this challenge, an encoding strategy-inspired diffusion model with few-shot learning for color image inpainting is proposed in this paper. The main idea of this novel encoding strategy is the deployment of a "virtual mask" to construct high-dimensional objects through mutual perturbations between channels. This approach enables the diffusion model to capture diverse image representations and detailed features from limited training samples. Moreover, the encoding strategy leverages redundancy between channels, integrates with low-rank methods during iterative inpainting, and incorporates the diffusion model to achieve accurate information output. Experimental results indicate that our method exceeds current techniques in quantitative metrics, and that the quality of the reconstructed images improves in terms of texture and structural integrity, leading to more precise and coherent results.
- [255] arXiv:2504.17525 [pdf, html, other]
-
Title: Text-to-Image Alignment in Denoising-Based Models through Step SelectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on Diffusion and Flow Matching models, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.
- [256] arXiv:2504.17526 [pdf, html, other]
-
Title: Cooperative Task Offloading through Asynchronous Deep Reinforcement Learning in Mobile Edge Computing for Future NetworksSubjects: Machine Learning (cs.LG)
Future networks (including 6G) are poised to accelerate the realisation of the Internet of Everything. However, this will create a high demand for computing resources to support new services. Mobile Edge Computing (MEC) is a promising solution, enabling computation-intensive tasks to be offloaded from end-user devices to nearby edge servers, thereby reducing latency and energy consumption. However, relying solely on a single MEC server for task offloading can lead to uneven resource utilisation and suboptimal performance in complex scenarios. Additionally, traditional task offloading strategies specialise in centralised policy decisions, which unavoidably entail extreme transmission latency and computational bottlenecks. To fill these gaps, we propose a latency- and energy-efficient Cooperative Task Offloading framework with Transformer-driven Prediction (CTO-TP), leveraging asynchronous multi-agent deep reinforcement learning to address these challenges. This approach fosters edge-edge cooperation and decreases the synchronous waiting time by performing asynchronous training, optimising task offloading and resource allocation across distributed networks. The performance evaluation demonstrates that the proposed CTO-TP algorithm reduces overall system latency by up to 80% and energy consumption by up to 87% compared to the baseline schemes.
- [257] arXiv:2504.17528 [pdf, html, other]
-
Title: TACO: Tackling Over-correction in Federated Learning with Tailored Adaptive CorrectionComments: 11 pages, 7 figures, accepted by ICDCS 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Non-independent and identically distributed (Non-IID) data across edge clients have long posed significant challenges to federated learning (FL) training in edge computing environments. Prior works have proposed various methods to mitigate this statistical heterogeneity. While these works can achieve good theoretical performance, in this work we provide the first investigation into a hidden over-correction phenomenon brought by the uniform model correction coefficients across clients adopted by existing methods. Such over-correction could degrade model performance and even cause failures in model convergence. To address this, we propose TACO, a novel algorithm that addresses the non-IID nature of clients' data by implementing fine-grained, client-specific gradient correction and model aggregation, steering local models towards a more accurate global optimum. Moreover, we verify that leading FL algorithms generally have better model accuracy in terms of communication rounds rather than wall-clock time, resulting from their extra computation overhead imposed on clients. To enhance the training efficiency, TACO deploys a lightweight model correction and tailored aggregation approach that requires minimum computation overhead and no extra information beyond the synchronized model parameters. To validate TACO's effectiveness, we present the first FL convergence analysis that reveals the root cause of over-correction. Extensive experiments across various datasets confirm TACO's superior and stable performance in practice.
- [258] arXiv:2504.17529 [pdf, html, other]
-
Title: IRA: Adaptive Interest-aware Representation and Alignment for Personalized Multi-interest RetrievalYoungjune Lee, Haeyu Jeong, Changgeon Lim, Jeong Choi, Hongjun Lim, Hangon Kim, Jiyoon Kwon, Saehun KimComments: Accepted to SIGIR 2025 Industry Track. First two authors contributed equallySubjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Online community platforms require dynamic personalized retrieval and recommendation that can continuously adapt to evolving user interests and new documents. However, optimizing models to handle such changes in real-time remains a major challenge in large-scale industrial settings. To address this, we propose the Interest-aware Representation and Alignment (IRA) framework, an efficient and scalable approach that dynamically adapts to new interactions through a cumulative structure. IRA leverages two key mechanisms: (1) Interest Units that capture diverse user interests as contextual texts, while reinforcing or fading over time through cumulative updates, and (2) a retrieval process that measures the relevance between Interest Units and documents based solely on semantic relationships, eliminating dependence on click signals to mitigate temporal biases. By integrating cumulative Interest Unit updates with the retrieval process, IRA continuously adapts to evolving user preferences, ensuring robust and fine-grained personalization without being constrained by past training distributions. We validate the effectiveness of IRA through extensive experiments on real-world datasets, including its deployment in the Home Section of NAVER's CAFE, South Korea's leading community platform.
- [259] arXiv:2504.17531 [pdf, html, other]
-
Title: Towards Machine-Generated Code for the Resolution of User IntentionsSubjects: Artificial Intelligence (cs.AI)
The growing capabilities of Artificial Intelligence (AI), particularly Large Language Models (LLMs), prompt a reassessment of the interaction mechanisms between users and their devices. Currently, users are required to use a set of high-level applications to achieve their desired results. However, the advent of AI may signal a shift in this regard, as its capabilities have generated novel prospects for user-provided intent resolution through the deployment of model-generated code, which is tantamount to the generation of workflows comprising a multitude of interdependent steps. This development represents a significant progression in the realm of hybrid workflows, where human and artificial intelligence collaborate to address user intentions, with the former responsible for defining these intentions and the latter for implementing the solutions to address them. In this paper, we investigate the feasibility of generating and executing workflows through code generation that results from prompting an LLM with a concrete user intention, such as \emph{Please send my car title to my insurance company}, and a simplified application programming interface for a GUI-less operating system. We provide in-depth analysis and comparison of various user intentions, the resulting code, and its execution. The findings demonstrate a general feasibility of our approach and that the employed LLM, GPT-4o-mini, exhibits remarkable proficiency in the generation of code-oriented workflows in accordance with provided user intentions.
- [260] arXiv:2504.17534 [pdf, other]
-
Title: Learning Isometric Embeddings of Road Networks using Multidimensional ScalingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Symbolic Computation (cs.SC)
The lack of generalization in learning-based autonomous driving applications is shown by the narrow range of road scenarios that vehicles can currently cover. A generalizable approach should capture many distinct road structures and topologies, as well as consider traffic participants and dynamic changes in the environment, so that vehicles can navigate and perform motion planning tasks even in the most difficult situations. Designing suitable feature spaces for neural network-based motion planners that encapsulate all kinds of road scenarios is still an open research challenge. This paper tackles this learning-based generalization challenge and shows how graph representations of road networks can be leveraged by using multidimensional scaling (MDS) techniques in order to obtain such feature spaces. State-of-the-art graph representations and MDS approaches are analyzed for the autonomous driving use case. Finally, the option of embedding graph nodes is discussed in order to perform easier learning procedures and obtain dimensionality reduction.
- [261] arXiv:2504.17536 [pdf, html, other]
-
Title: Dynamic Membership for Regular Tree LanguagesComments: 40 pages including 16 pages of main text. Complete proofs in appendixSubjects: Formal Languages and Automata Theory (cs.FL); Data Structures and Algorithms (cs.DS)
We study the dynamic membership problem for regular tree languages under relabeling updates: we fix an alphabet ${\Sigma}$ and a regular tree language $L$ over ${\Sigma}$ (expressed, e.g., as a tree automaton), we are given a tree $T$ with labels in ${\Sigma}$, and we must maintain the information of whether the tree $T$ belongs to $L$ while handling relabeling updates that change the labels of individual nodes in $T$. (The shape and size of the tree remain the same throughout.)
Our first contribution is to show that this problem admits an $O(\log n / \log \log n)$ algorithm for any fixed regular tree language, improving over known algorithms that achieve $O(\log n)$. This generalizes the known $O(\log n / \log \log n)$ upper bound over words, and it matches the lower bound of ${\Omega}(\log n / \log \log n)$ from dynamic membership to some word languages and from the existential marked ancestor problem.
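To make the problem setting concrete, here is a minimal sketch of the simpler word case with the standard O(log n)-per-update approach: a segment tree whose nodes store the DFA transition function of their segment. It is not the paper's tree algorithm and does not achieve the O(log n / log log n) bound; the class name, the DFA encoding, and the example language ("contains the factor ab") are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


class DynamicWordMembership:
    """Tracks whether a word is accepted by a fixed DFA under relabeling updates."""

    def __init__(self, word: str, delta: Dict[Tuple[int, str], int],
                 num_states: int, start: int, accepting: Set[int]) -> None:
        self.delta = delta
        self.k = num_states
        self.start = start
        self.accepting = accepting
        identity = tuple(range(num_states))
        # Iterative segment tree: leaves live at indices size .. 2*size - 1.
        self.size = 1
        while self.size < max(len(word), 1):
            self.size *= 2
        # Unused leaves hold the identity function, which is neutral for composition.
        self.tree = [identity] * (2 * self.size)
        for i, ch in enumerate(word):
            self.tree[self.size + i] = self._letter(ch)
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self._compose(self.tree[2 * i], self.tree[2 * i + 1])

    def _letter(self, ch: str) -> Tuple[int, ...]:
        """Transition function of a single letter, as a tuple indexed by state."""
        return tuple(self.delta[(q, ch)] for q in range(self.k))

    def _compose(self, left: Tuple[int, ...], right: Tuple[int, ...]) -> Tuple[int, ...]:
        """Read the left segment first, then the right segment."""
        return tuple(right[left[q]] for q in range(self.k))

    def relabel(self, pos: int, ch: str) -> None:
        """Change the letter at position pos and recompute the O(log n) ancestors."""
        i = self.size + pos
        self.tree[i] = self._letter(ch)
        i //= 2
        while i >= 1:
            self.tree[i] = self._compose(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def accepted(self) -> bool:
        """The root stores the transition function of the whole word."""
        return self.tree[1][self.start] in self.accepting


# Hypothetical DFA over {a, b} accepting words that contain the factor "ab".
# State 0: no progress, state 1: last letter was 'a', state 2: "ab" seen (sink).
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
dm = DynamicWordMembership("bbaa", delta, num_states=3, start=0, accepting={2})
```

Calling `dm.relabel(3, "b")` turns the word into `bbab`, after which `dm.accepted()` reports membership without rescanning the word.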
Our second contribution is to introduce a class of regular languages, dubbed almost-commutative tree languages, and show that dynamic membership to such languages under relabeling updates can be done in constant time per update. Almost-commutative languages generalize both commutative languages and finite languages, and they are the analogue for trees of the ZG languages enjoying constant-time dynamic membership over words. Our main technical contribution is to show that this class is conditionally optimal when we assume that the alphabet features a neutral letter, i.e., a letter that has no effect on membership to the language. More precisely, we show that any regular tree language with a neutral letter which is not almost-commutative cannot be maintained in constant time under the assumption that the prefix-U1 problem from (Amarilli, Jachiet, Paperman, ICALP'21) also does not admit a constant-time algorithm.
- [262] arXiv:2504.17539 [pdf, html, other]
-
Title: Proof of Useful Intelligence (PoUI): Blockchain Consensus Beyond Energy WasteSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Blockchain technology enables secure, transparent data management in decentralized systems, supporting applications from cryptocurrencies like Bitcoin to tokenizing real-world assets like property. Its scalability and sustainability hinge on consensus mechanisms balancing security and efficiency. Proof of Work (PoW), used by Bitcoin, ensures security through energy-intensive computations but demands significant resources. Proof of Stake (PoS), as in Ethereum post-Merge, selects validators based on staked cryptocurrency, offering energy efficiency but risking centralization from wealth concentration. With AI models straining computational resources, we propose Proof of Useful Intelligence (PoUI), a hybrid consensus mechanism. In PoUI, workers perform AI tasks like language processing or image analysis to earn coins, which are staked to secure the network, blending security with practical utility. Decentralized nodes -- job posters, market coordinators, workers, and validators -- collaborate via smart contracts to manage tasks and rewards.
- [263] arXiv:2504.17540 [pdf, html, other]
-
Title: An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization AlgorithmSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
The recent global spread of monkeypox, particularly in regions where it has not historically been prevalent, has raised significant public health concerns. Early and accurate diagnosis is critical for effective disease management and control. In response, this study proposes a novel deep learning-based framework for the automated detection of monkeypox from skin lesion images, leveraging the power of transfer learning, dimensionality reduction, and advanced machine learning techniques. We utilize the newly developed Monkeypox Skin Lesion Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to train and evaluate our models. The proposed framework employs the Xception architecture for deep feature extraction, followed by Principal Component Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting (NGBoost) algorithm for classification. To optimize the model's performance and generalization, we introduce the African Vultures Optimization Algorithm (AVOA) for hyperparameter tuning, ensuring efficient exploration of the parameter space. Our results demonstrate that the proposed AVOA-NGBoost model achieves state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72% and an AUC of 97.47%. Additionally, we enhance model interpretability using Grad-CAM and LIME techniques, providing insights into the decision-making process and highlighting key features influencing classification. This framework offers a highly precise and efficient diagnostic tool, potentially aiding healthcare providers in early detection and diagnosis, particularly in resource-constrained environments.
- [264] arXiv:2504.17542 [pdf, html, other]
-
Title: Large Language Model-Driven Concolic Execution for Highly Structured Test Input GenerationComments: 18 pages (including Appendix)Subjects: Software Engineering (cs.SE)
How can we perform concolic execution to generate highly structured test inputs for systematically testing parsing programs? Existing concolic execution engines are significantly restricted by (1) input structure-agnostic path constraint selection, leading to the waste of testing effort or missing coverage; (2) limited constraint-solving capability, yielding many syntactically invalid test inputs; (3) reliance on manual acquisition of highly structured seed inputs, resulting in non-continuous testing.
This paper proposes Cottontail, a new Large Language Model (LLM)-driven concolic execution engine, to mitigate the above limitations. A more complete program path representation, named Expressive Structural Coverage Tree (ESCT), is first constructed to select structure-aware path constraints. Later, an LLM-driven constraint solver based on a Solve-Complete paradigm is designed to solve the path constraints smartly to get test inputs that are not only satisfiable to the constraints but also valid to the input syntax. Finally, a history-guided seed acquisition is employed to obtain new highly structured test inputs either before testing starts or after testing is saturated.
We implemented Cottontail on top of SymCC and evaluated eight extensively tested open-source libraries across four different formats (XML, SQL, JavaScript, and JSON). The experimental result is promising: it shows that Cottontail outperforms state-of-the-art approaches (SymCC and Marco) by 14.15% and 14.31% in terms of line coverage. Besides, Cottontail found 6 previously unknown vulnerabilities (six new CVEs have been assigned). We have reported these issues to developers, and 4 of them have been fixed so far.
- [265] arXiv:2504.17544 [pdf, other]
-
Title: Auditing the Ethical Logic of Generative AI ModelsSubjects: Artificial Intelligence (cs.AI)
As generative AI models become increasingly integrated into high-stakes domains, the need for robust methods to evaluate their ethical reasoning becomes increasingly important. This paper introduces a five-dimensional audit model -- assessing Analytic Quality, Breadth of Ethical Considerations, Depth of Explanation, Consistency, and Decisiveness -- to evaluate the ethical logic of leading large language models (LLMs). Drawing on traditions from applied ethics and higher-order thinking, we present a multi-battery prompt approach, including novel ethical dilemmas, to probe the models' reasoning across diverse contexts. We benchmark seven major LLMs finding that while models generally converge on ethical decisions, they vary in explanatory rigor and moral prioritization. Chain-of-Thought prompting and reasoning-optimized models significantly enhance performance on our audit metrics. This study introduces a scalable methodology for ethical benchmarking of AI systems and highlights the potential for AI to complement human moral reasoning in complex decision-making contexts.
- [266] arXiv:2504.17545 [pdf, html, other]
-
Title: When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field RenderingSubjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes -- surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation on each pixel, order-independently. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state of the art as a compelling representation for ultra-fast high-fidelity radiance field rendering.
- [267] arXiv:2504.17547 [pdf, html, other]
-
Title: A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning TaskComments: 20 pages, 5 figures, 4 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by requiring not only the understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.
- [268] arXiv:2504.17550 [pdf, html, other]
-
Title: HalluLens: LLM Hallucination BenchmarkYejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale FungComments: 42 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.
- [269] arXiv:2504.17551 [pdf, html, other]
-
Title: Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical PriorComments: 11 pages, 7 figures, preprint versionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at this https URL.
- [270] arXiv:2504.17554 [pdf, html, other]
-
Title: Rethinking PM Crash Consistency in the CXL EraComments: 5 pages (2 extra pages for references), 1 figure, 2 algorithmsSubjects: Emerging Technologies (cs.ET)
Persistent Memory (PM) introduces new opportunities for designing crash-consistent applications without the traditional storage overheads. However, ensuring crash consistency in PM demands intricate knowledge of CPU, cache, and memory interactions. Hardware and software mechanisms have been proposed to ease this burden, but neither proved sufficient, prompting a variety of bug detection tools.
With the sunset of Intel Optane comes the rise of Compute Express Link (CXL) for PM. In this position paper, we discuss the impact of CXL's disaggregated and heterogeneous nature in the development of crash-consistent PM applications, and outline three research directions: hardware primitives, persistency frameworks, and bug detection tools.
- [271] arXiv:2504.17562 [pdf, html, other]
-
Title: When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free GrammarsRei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji SuzukiSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve the model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.
- [272] arXiv:2504.17563 [pdf, html, other]
-
Title: The Case for External Graph SketchingComments: Full version for paper to appear in ACDA proceedingsSubjects: Data Structures and Algorithms (cs.DS)
Algorithms in the data stream model use $O(polylog(N))$ space to compute some property of an input of size $N$, and many of these algorithms are implemented and used in practice. However, sketching algorithms in the graph semi-streaming model use $O(V polylog(V))$ space for a $V$-vertex graph, and the fact that implementations of these algorithms are not used in the academic literature or in industrial applications may be because this space requirement is too large for RAM on today's hardware.
In this paper we introduce the external semi-streaming model, which addresses the aspects of the semi-streaming model that limit its practical impact. In this model, the input is in the form of a stream and $O(V polylog(V))$ space is available, but most of that space is accessible only via block I/O operations as in the external memory model. The goal in the external semi-streaming model is to simultaneously achieve small space and low I/O cost.
We present a general transformation from any vertex-based sketch algorithm to one which has a low sketching cost in the new model. We prove that this automatic transformation is tight or nearly (up to an $O(\log(V))$ factor) tight via an I/O lower bound for the task of sketching the input stream.
Using this transformation and other techniques, we present external semi-streaming algorithms for connectivity, bipartiteness testing, $(1+\epsilon)$-approximating MST weight, testing $k$-edge connectivity, $(1+\epsilon)$-approximating the minimum cut of a graph, computing $\epsilon$-cut sparsifiers, and approximating the density of the densest subgraph. These algorithms all use $O(V poly(\log(V), \epsilon^{-1}, k))$ space. For many of these problems, our external semi-streaming algorithms outperform the state-of-the-art algorithms in both the sketching and external-memory models.
- [273] arXiv:2504.17565 [pdf, html, other]
-
Title: DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training
Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Subjects: Computation and Language (cs.CL)
Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: this https URL
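As a rough illustration of selecting data by pass rate and coefficient of variation, the sketch below grades each query by the pass rates that several models achieve on it. The abstract does not define how the two statistics are combined, so the thresholds (`low`, `high`, `min_cv`), the data layout, and the choice to compute the CV across per-model pass rates are all assumptions made for this example.

```python
from statistics import mean, pstdev

def pass_rate(outcomes):
    """Fraction of sampled responses judged correct (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def select_queries(per_model_outcomes, low=0.2, high=0.8, min_cv=0.1):
    """Keep queries of intermediate difficulty on which models disagree.

    per_model_outcomes maps query -> {model name -> list of 0/1 outcomes}.
    The thresholds are hypothetical, not values from the paper.
    """
    selected = []
    for query, by_model in per_model_outcomes.items():
        rates = [pass_rate(o) for o in by_model.values()]
        overall = mean(rates)
        cv = coefficient_of_variation(rates)
        if low <= overall <= high and cv >= min_cv:
            selected.append((query, overall, cv))
    return selected
```

Under these assumptions, a query that every model always solves (or never solves) is dropped, while one with a mid-range pass rate and visible disagreement between models is kept.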
- [274] arXiv:2504.17568 [pdf, html, other]
-
Title: Beyond Cox Models: Assessing the Performance of Machine-Learning Methods in Non-Proportional Hazards and Non-Linear Survival Analysis
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Survival analysis often relies on Cox models, assuming both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight different models were tested, including six non-linear models of which four were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated due to the improper use of Harrell's concordance index (C-index) instead of more appropriate scores such as Antolini's concordance index, which generalizes the C-index to cases where the PH assumption does not hold. In addition, since models with a high C-index are occasionally badly calibrated, combining Antolini's C-index with the Brier score is useful to assess the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods to select the most appropriate one according to sample size, non-linearity and non-PH conditions. To make these tests easy to reproduce on our benchmark data, code and documentation are freely available at this https URL.
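For readers unfamiliar with the metric under discussion, here is a minimal sketch of Harrell's C-index using the usual pairwise definition: a pair is comparable when the subject with the shorter time experienced the event, and it is concordant when that subject also has the higher predicted risk. This is not the authors' code, and it does not implement Antolini's time-dependent variant, which compares predicted survival functions rather than a single risk score.

```python
def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored data.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Because a single risk score per subject fixes the ranking for all time points, this index implicitly assumes that risk ordering does not change over time, which is the limitation the abstract points to for non-PH models.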
- [275] arXiv:2504.17569 [pdf, html, other]
-
Title: Flying through cluttered and dynamic environments with LiDAR
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Navigating unmanned aerial vehicles (UAVs) through cluttered and dynamic environments remains a significant challenge, particularly when dealing with fast-moving or suddenly appearing obstacles. This paper introduces a complete LiDAR-based system designed to enable UAVs to avoid various moving obstacles in complex environments. Benefiting from the high computational efficiency of perception and planning, the system can operate in real time using onboard computing resources with low latency. For dynamic environment perception, we have integrated our previous work, M-detector, into the system. M-detector ensures that moving objects of different sizes, colors, and types are reliably detected. For dynamic environment planning, we incorporate dynamic object predictions into the integrated planning and control (IPC) framework, namely DynIPC. This integration allows the UAV to utilize predictions about dynamic obstacles to effectively evade them. We validate our proposed system through both simulations and real-world experiments. In simulation tests, our system outperforms state-of-the-art baselines across several metrics, including success rate, time consumption, average flight time, and maximum velocity. In real-world trials, our system successfully navigates through forests, avoiding moving obstacles along its path.
- [276] arXiv:2504.17571 [pdf, html, other]
-
Title: On the Eigenvalue Tracking of Large-Scale Systems
Subjects: Systems and Control (eess.SY)
The paper focuses on the problem of tracking eigenvalue trajectories in large-scale power system models as system parameters vary. A continuation-based formulation is presented for tracing any single eigenvalue of interest, which supports sparse matrix representations and accommodates both explicit and semi-implicit differential-algebraic models. Key implementation aspects, such as numerical integration, matrix updates, derivative approximations, and handling defective eigenvalues, are discussed in detail and practical recommendations are duly provided. The tracking approach is demonstrated through a comprehensive case study on the IEEE 39-bus system, as well as on a realistic dynamic model of the Irish transmission system.
- [277] arXiv:2504.17574 [pdf, html, other]
-
Title: RAGAT-Mind: A Multi-Granular Modeling Approach for Rumor Detection Based on MindSpore
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.
- [278] arXiv:2504.17575 [pdf, other]
-
Title: A Multi-Agent, Laxity-Based Aggregation Strategy for Cost-Effective Electric Vehicle Charging and Local Transformer Overload Prevention
Journal-ref: Sustainability, 17(9), (2025), 3847
Subjects: Multiagent Systems (cs.MA)
The rapid electrification of transportation, driven by stringent decarbonization targets and supportive policies, poses significant challenges for distribution system operators (DSOs). When numerous electric vehicles (EVs) charge concurrently, local transformers risk overloading, a problem that current tariff-based strategies do not adequately address. This paper introduces an aggregator-based coordination mechanism that shifts EV charging from congested to underutilized periods using a rule-based scheduling algorithm. Unlike conventional methods that depend on complex real-time pricing signals or optimization-heavy solutions, the aggregator approach uses a simple yet effective "laxity" measure to prioritize charging flexibility. To assess technical and economic viability, a multi-agent simulation was developed to replicate residential user behavior and DSO constraints with a 400 kVA low-voltage transformer. The results indicate that overloads are completely eliminated with minimal inconvenience to users, whose increased charging costs are offset by the aggregator at an annual total of under DKK 6000, significantly lower than the cost of infrastructure reinforcement. This study contributes by (i) quantifying the compensation needed to prevent large-scale overloads, (ii) presenting a replicable, computationally feasible, rule-based aggregator model for DSOs, and (iii) comparing aggregator solutions to costly transformer upgrades, underscoring the aggregator's role as a viable tool for future distribution systems.
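The abstract does not spell out its laxity rule, so the following is only a sketch under a common textbook definition: laxity is the time left before departure minus the time still needed to charge at full charger power, and power is handed out least-laxity-first until the transformer budget for the time step is used up. The `EV` fields, the `schedule_step` function, and the numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EV:
    name: str
    energy_needed_kwh: float   # energy still to be delivered
    max_power_kw: float        # charger power limit
    hours_to_departure: float  # time left before the car leaves

def laxity(ev):
    """Slack in hours: time available minus time needed at full charger power."""
    return ev.hours_to_departure - ev.energy_needed_kwh / ev.max_power_kw

def schedule_step(evs, available_kw):
    """Allocate power for one time step, least laxity first, within the capacity limit."""
    allocation = {}
    remaining = available_kw
    for ev in sorted(evs, key=laxity):
        power = min(ev.max_power_kw, remaining) if ev.energy_needed_kwh > 0 else 0.0
        allocation[ev.name] = power
        remaining -= power
    return allocation
```

For example, with 15 kW of headroom and three 11 kW chargers, the vehicle with the least slack is served at full power, the next one receives the remaining 4 kW, and the most flexible vehicle waits for a later step.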
- [279] arXiv:2504.17577 [pdf, other]
-
Title: TileLang: A Composable Tiled Programming Model for AI Systems
Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang
Subjects: Machine Learning (cs.LG)
Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorize and pipeline) from dataflow, and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's data-flow itself, while leaving most other optimizations to compilers. We conduct comprehensive experiments on commonly used devices. Our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.
- [280] arXiv:2504.17578 [pdf, html, other]
-
Title: Advancing CMA-ES with Learning-Based Cooperative Coevolution for Scalable Optimization
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Recent research in Cooperative Coevolution (CC) has achieved promising progress in solving large-scale global optimization problems. However, existing CC paradigms have a primary limitation in that they require deep expertise for selecting or designing effective variable decomposition strategies. Inspired by advancements in Meta-Black-Box Optimization, this paper introduces LCC, a pioneering learning-based cooperative coevolution framework that dynamically schedules decomposition strategies during optimization processes. The decomposition strategy selector is parameterized through a neural network, which processes a meticulously crafted set of optimization status features to determine the optimal strategy for each optimization step. The network is trained via the Proximal Policy Optimization method in a reinforcement learning manner across a collection of representative problems, aiming to maximize the expected optimization performance. Extensive experimental results demonstrate that LCC not only offers certain advantages over state-of-the-art baselines in terms of optimization effectiveness and resource consumption, but also exhibits promising transferability towards unseen problems.
- [281] arXiv:2504.17582 [pdf, other]
-
Title: Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self-supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion-aware self-supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo-labels by simulating viewpoint-dependent occlusion scenarios. This enhances the model's ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non-negative matrix factorization, clustering convolutional activations to generate pseudo-labels in texture-deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state-of-the-art performance in self-supervised depth estimation. Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate strong generalization across diverse endoscopic environments.
- [282] arXiv:2504.17583 [pdf, html, other]
-
Title: Shared Randomness in Locally Checkable Problems: The Role of Computational Assumptions
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Shared randomness is a valuable resource in distributed computing, but what happens when the shared random string can affect the inputs to the system?
Consider the class of distributed graph problems where the correctness of solutions can be checked locally, known as Locally Checkable Labelings (LCL). LCL problems have been extensively studied in the LOCAL model, where nodes operate in synchronous rounds and have access only to local information. This has led to intriguing insights regarding the power of private randomness. E.g., for certain round complexity classes, derandomization does not incur an overhead (asymptotically).
This work considers a setting where the randomness is public. Recently, an LCL problem for which shared randomness can reduce the round complexity was discovered by Balliu et al. (2024). This result applies to inputs set obliviously of the shared randomness, which may not always be a plausible assumption.
We define a model where the inputs can be adversarially chosen, even based on the shared randomness, which we now call preset public coins. We study LCL problems in the preset public coins model, under assumptions regarding the computational power of the adversary that selects the input. We show connections to hardness in the class TFNP. Our results are:
1. Assuming the existence of a hard-on-average problem in TFNP (which follows from fairly benign cryptographic assumptions), we show an LCL problem that, in the preset public coins model, demonstrates a gap in the round complexity between polynomial-time adversaries and unbounded ones.
2. If there exists an LCL problem for which the error probability is significantly higher when facing unbounded adversaries, then a hard-on-average problem in TFNP/poly must exist.
- [283] arXiv:2504.17584 [pdf, html, other]
-
Title: L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen
Comments: 16 pages, 11 figures
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data swapping overhead. We identify that the critical memory bottleneck lies exclusively in the decoding phase of multi-head attention (MHA), which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight is that this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offer scalability of both capacity and bandwidth.
Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, enhancing LLM inference utilization. Second, communication optimization enables hiding the data transfer overhead with the computation. Third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1$\times$ speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.
- [284] arXiv:2504.17586 [pdf, html, other]
-
Title: A Machine Learning Approach for Denoising and Upsampling HRTFs
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
The demand for realistic virtual immersive audio continues to grow, with Head-Related Transfer Functions (HRTFs) playing a key role. HRTFs capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception. It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment. Although machine learning has been shown to reduce the required measurement points and, thus, the measurement time, a controlled environment is still necessary. This paper proposes a method to address this constraint by presenting a novel technique that can upsample sparse, noisy HRTF measurements. The proposed approach combines an HRTF Denoisy U-Net for denoising and an Autoencoding Generative Adversarial Network (AE-GAN) for upsampling from three measurement points. The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating the method's effectiveness in HRTF upsampling.
- [285] arXiv:2504.17589 [pdf, html, other]
-
Title: MacWilliams Theory over Zk and nu-functions over Lattices
Subjects: Information Theory (cs.IT)
Continuing previous works on MacWilliams theory over codes and lattices, a generalization of the MacWilliams theory over $\mathbb{Z}_k$ for $m$ codes is established, and the complete weight enumerator MacWilliams identity also holds for codes over the finitely generated rings $\mathbb{Z}_k[\xi]$. In the context of lattices, the analogy of the MacWilliams identity associated with nu-function was conjectured by Solé in 1995, and we present a new formula for nu-function over the lattices associated with a ternary code, which is rather different from the original conjecture. Furthermore, we provide many counterexamples to show that the Solé conjecture never holds in the general case, except for the lattices associated with a binary code.
- [286] arXiv:2504.17590 [pdf, html, other]
-
Title: Mitigating xApp conflicts for efficient network slicing in 6G O-RAN: a graph convolutional-based attention network approach
Subjects: Networking and Internet Architecture (cs.NI)
O-RAN (Open-Radio Access Network) offers a flexible, open architecture for next-generation wireless networks. Network slicing within O-RAN allows network operators to create customized virtual networks, each tailored to meet the specific needs of a particular application or service. Efficiently managing these slices is crucial for future 6G networks. O-RAN introduces specialized software applications called xApps that manage different network functions. In network slicing, an xApp can be responsible for managing a separate network slice. To optimize resource allocation across numerous network slices, these xApps must coordinate. Traditional methods where all xApps communicate freely can lead to excessive overhead, hindering network performance. In this paper, we address the issue of xApp conflict mitigation by proposing an innovative Zero-Touch Management (ZTM) solution for radio resource management in O-RAN. Our approach leverages Multi-Agent Reinforcement Learning (MARL) to enable xApps to learn and optimize resource allocation without the need for constant manual intervention. We introduce a Graph Convolutional Network (GCN)-based attention mechanism to streamline communication among xApps, reducing overhead and improving overall system efficiency. Our results compare traditional MARL, where all xApps communicate, against our MARL GCN-based attention method. The findings demonstrate the superiority of our approach, especially as the number of xApps increases, ultimately providing a scalable and efficient solution for optimal network slicing management in O-RAN.
- [287] arXiv:2504.17594 [pdf, html, other]
-
Title: Tamper-evident Image using JPEG Fixed Points
Comments: 6 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
An intriguing phenomenon of JPEG compression has been observed for two decades: repeatedly applying JPEG compression and decompression leads to a stable image that no longer changes, which is a fixed point. In this work, we prove the existence of fixed points in the essential JPEG procedures. We analyze JPEG compression and decompression processes, revealing the existence of fixed points that can be reached within a few iterations. These fixed points are diverse and preserve the image's visual quality, ensuring minimal distortion. This result is used to develop a method to create a tamper-evident image from the original authentic image, which can expose tampering operations by showing deviations from the fixed point image.
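The repeated compress/decompress loop can be sketched on a single 8x8 block as follows. This is a simplified stand-in for the "essential JPEG procedures", not the authors' construction: it uses one uniform quantisation step `q` instead of a JPEG quantisation table, skips colour conversion and entropy coding, and all function names are chosen for this example. Whether a given block converges within `max_iters` is checked at run time rather than assumed.

```python
import math

N = 8
# Orthonormal DCT-II basis matrix: C[u][x]
C = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]

def matmul(a, b):
    """Product of two N x N matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def jpeg_round_trip(block, q):
    """One simplified compress/decompress cycle on an 8x8 block of ints in [0, 255]."""
    shifted = [[p - 128 for p in row] for row in block]         # level shift
    coeffs = matmul(matmul(C, shifted), transpose(C))           # forward 2-D DCT
    quantised = [[round(c / q) for c in row] for row in coeffs] # lossy step
    dequantised = [[k * q for k in row] for row in quantised]
    pixels = matmul(matmul(transpose(C), dequantised), C)       # inverse 2-D DCT
    # Round to integers and clip to the valid pixel range.
    return [[min(255, max(0, round(p + 128))) for p in row] for row in pixels]

def iterate_to_fixed_point(block, q=16, max_iters=100):
    """Repeat the round trip until the block stops changing.

    Returns (block, iterations), with iterations set to None if no fixed
    point was reached within max_iters.
    """
    for i in range(max_iters):
        nxt = jpeg_round_trip(block, q)
        if nxt == block:
            return block, i
        block = nxt
    return block, None
```

In this simplified setting, a block returned with a non-`None` iteration count survives a further round trip unchanged, so any later pixel edit shows up as a deviation from the fixed point.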
- [288] arXiv:2504.17595 [pdf, html, other]
-
Title: RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD's capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.
- [289] arXiv:2504.17598 [pdf, html, other]
-
Title: TSUE: A Two-Stage Data Update Method for an Erasure Coded Cluster File System
Comments: 14 pages, 8 figures, accepted by ACM HPDC 2025
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Compared to replication-based storage systems, erasure-coded storage incurs significantly higher overhead during data updates. To address this issue, various parity logging methods have been proposed. Nevertheless, due to the long update path and substantial amount of random I/O involved in erasure code update processes, the resulting long latency and low throughput often fail to meet the requirements of high-performance applications. To this end, we propose a two-stage data update method called TSUE. TSUE divides the update process into a synchronous stage that records updates in a data log, and an asynchronous stage that recycles the log in real time. TSUE effectively reduces update latency by transforming random I/O into sequential I/O, and it significantly reduces recycle overhead by utilizing a three-layer log and the spatio-temporal locality of access patterns. In an SSD cluster, TSUE significantly improves update performance, achieving improvements of 7.6X under the Ali-Cloud trace and 5X under the Ten-Cloud trace, while also extending the SSD's lifespan by up to 13X by reducing the frequency of reads/writes and of erase operations.
- [290] arXiv:2504.17601 [pdf, html, other]
-
Title: Interpretable non-linear dimensionality reduction using gaussian weighted linear transformation
Comments: 11 pages, 5 figures
Journal-ref: Erik Bergh (2025). Interpretable dimensionality reduction using weighted linear transformation. Adv. Artif. Intell. Mach. Learn., 5(1):3465-3475
Subjects: Machine Learning (cs.LG)
Dimensionality reduction techniques are fundamental for analyzing and visualizing high-dimensional data, with established methods like t-SNE and PCA presenting a trade-off between representational power and interpretability. This paper introduces a novel approach that bridges this gap by combining the interpretability of linear methods with the expressiveness of non-linear transformations. The proposed algorithm constructs a non-linear mapping between high-dimensional and low-dimensional spaces through a combination of linear transformations, each weighted by Gaussian functions. This architecture enables complex non-linear transformations while preserving the interpretability advantages of linear methods, as each transformation can be analyzed independently. The resulting model provides both powerful dimensionality reduction and transparent insights into the transformed space. Techniques for interpreting the learned transformations are presented, including methods for identifying suppressed dimensions and how space is expanded and contracted. These tools enable practitioners to understand how the algorithm preserves and modifies geometric relationships during dimensionality reduction. To ensure the practical utility of this algorithm, the creation of user-friendly software packages is emphasized, facilitating its adoption in both academia and industry.
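One way to read "a combination of linear transformations, each weighted by Gaussian functions" is the mapping y(x) = sum_i w_i(x) A_i x, where w_i(x) is a Gaussian of the distance from x to a centre c_i. The sketch below implements that reading. The normalisation of the weights to sum to one, the shared width `sigma`, and the absence of bias terms are assumptions of this example, not details taken from the abstract.

```python
import math

def gaussian_weight(x, centre, sigma):
    """Unnormalised Gaussian weight of point x relative to a centre."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2 * sigma ** 2))

def linear_map(matrix, x):
    """Apply a (low_dim x high_dim) matrix, stored as nested lists, to x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def transform(x, centres, matrices, sigma=1.0):
    """Blend several linear maps with normalised Gaussian weights.

    Assumes x is close enough to at least one centre that the weights
    do not all underflow to zero.
    """
    weights = [gaussian_weight(x, c, sigma) for c in centres]
    total = sum(weights)
    y = [0.0] * len(matrices[0])
    for w, m in zip(weights, matrices):
        for j, v in enumerate(linear_map(m, x)):
            y[j] += (w / total) * v
    return y
```

Near a single centre the output is dominated by that centre's matrix, which is what makes each local transformation inspectable on its own.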
- [291] arXiv:2504.17603 [pdf, other]
-
Title: SAPO-RL: Sequential Actuator Placement Optimization for Fuselage Assembly via Reinforcement Learning
Comments: 27 pages, 14 figures
Subjects: Systems and Control (eess.SY)
Precise assembly of composite fuselages is critical for aircraft assembly to meet the ultra-high precision requirements. Due to dimensional variations, there is a gap when two fuselages are assembled. In practice, actuators are required to adjust fuselage dimensions by applying pulling or pushing forces to specific points on the fuselage edge. The positioning and force settings of these actuators significantly influence the efficiency of the shape adjustments. The current literature usually predetermines a fixed number of actuators, which is not optimal in terms of overall quality and corresponding actuator costs. However, optimal placement of actuators in terms of both locations and number is challenging due to compliant structures, complex material properties, and dimensional variabilities of incoming fuselages. To address these challenges, this paper introduces a reinforcement learning (RL) framework that enables sequential decision-making for actuator placement selection and optimal force computation. Specifically, our methodology employs the Dueling Double Deep Q-Learning (D3QN) algorithm to refine the decision-making capabilities of sequential actuator placements. The environment is meticulously crafted to enable sequential and incremental selection of an actuator based on system states. We formulate the actuator selection problem as a submodular function optimization problem, where the submodularity properties can be adopted to efficiently achieve near-optimal solutions. The proposed methodology has been comprehensively evaluated through numerical studies and comparison studies, demonstrating its effectiveness and outstanding performance in enhancing assembly precision with limited actuator numbers.
- [292] arXiv:2504.17605 [pdf, html, other]
-
Title: A Constraint Opinion Model
Subjects: Logic in Computer Science (cs.LO)
This paper introduces a generalised opinion model that extends the standard DeGroot model by representing agents' opinions and influences as soft constraints rather than single real values. This allows for modelling scenarios beyond the scope of the DeGroot model, such as agents sharing partial information and preferences, engaging in discussions on multiple topics simultaneously, and representing opinions with different degrees of uncertainty. By considering soft constraints as influences, the proposed model also captures situations where agents impose conditions on how others' opinions are integrated during belief revision. Finally, the flexibility offered by soft constraints allows us to introduce a novel polarisation measure that takes advantage of this generalised framework.
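For context, the standard DeGroot model that the paper generalises updates each agent's real-valued opinion to a weighted average of all agents' opinions, x(t+1) = W x(t), where row i of W holds the non-negative influence weights on agent i and sums to one. The sketch below covers only this baseline; the soft-constraint extension described in the abstract is not modelled, and the function names are chosen for this example.

```python
def degroot_step(influence, opinions):
    """One synchronous update: each opinion becomes a weighted average of all opinions."""
    return [sum(w * o for w, o in zip(row, opinions)) for row in influence]

def degroot_run(influence, opinions, steps):
    """Apply the DeGroot update a fixed number of times."""
    for _ in range(steps):
        opinions = degroot_step(influence, opinions)
    return opinions
```

With two agents who weight each other equally, `degroot_run([[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0], 1)` moves both opinions to the midpoint, and further steps leave them there.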
- [293] arXiv:2504.17609 [pdf, html, other]
-
Title: STCL: Curriculum learning Strategies for deep learning image steganography models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Aiming at the problems of poor steganographic image quality and slow network convergence in image steganography models based on deep learning, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for such models. The strategy selects only easy images for training while the model's fitting ability is still poor at the initial stage, and gradually expands to more difficult images. It includes a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. First, multiple teacher models are trained, and the consistency of the quality of steganographic images under multiple teacher models is used as the difficulty score to construct the training subsets from easy to difficult. Second, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme improves model performance under multiple algorithmic frameworks: models trained with the STCL strategy achieve high PSNR, SSIM, and decoding accuracy, and the steganographic images they generate have low steganalysis scores. Our code is available at this https URL.
- [294] arXiv:2504.17610 [pdf, html, other]
-
Title: Modeling Communication Perception in Development Teams Using Monte Carlo Methods
Comments: Accepted for publication at the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)
Subjects: Software Engineering (cs.SE)
Software development is a collaborative task involving diverse development teams, where toxic communication can negatively impact team mood and project success. Mood surveys enable the early detection of underlying tensions or dissatisfaction within development teams, allowing communication issues to be addressed before they escalate, fostering a positive and productive work environment. The mood can be surveyed indirectly by analyzing the text-based communication of the team. However, emotional subjectivity leads to varying sentiment interpretations across team members; a statement perceived neutrally by one developer might be seen as problematic by another developer with a different conversational culture. Early identification of perception volatility can help prevent misunderstandings and enhance team morale while safeguarding the project. This paper analyzes the diversity of perceptions within arbitrary development teams and determines how many team members should report their sentiment to accurately reflect the team's mood. Through a Monte Carlo experiment involving 45 developers, we present a preliminary mathematical model to calculate the minimum agreement among a subset of developers based on the whole team's agreement. This model can guide leadership in mood assessment, demonstrating that omitting even a single member in an average-sized 7-member team can misrepresent the overall mood. Therefore, including all developers in mood surveying is recommended to ensure a reliable evaluation of the team's mood.
- [295] arXiv:2504.17613 [pdf, html, other]
-
Title: TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation
Subjects: Machine Learning (cs.LG)
Synthetic Electronic Health Record (EHR) time-series generation is crucial for advancing clinical machine learning models, as it helps address data scarcity by providing more training data. However, most existing approaches focus primarily on replicating statistical distributions and temporal dependencies of real-world data. We argue that fidelity to observed data alone does not guarantee better model performance, as common patterns may dominate, limiting the representation of rare but important conditions. This highlights the need to generate synthetic samples that improve the performance of specific clinical models on their target outcomes. To address this, we propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance into the synthetic data generation process. Unlike conventional approaches that mimic training data distributions, TarDiff optimizes synthetic samples by quantifying their expected contribution to improving downstream model performance through influence functions. Specifically, we measure the reduction in task-specific loss induced by synthetic samples and embed this influence gradient into the reverse diffusion process, thereby steering the generation towards utility-optimized data. Evaluated on six publicly available EHR datasets, TarDiff achieves state-of-the-art performance, outperforming existing methods by up to 20.4% in AUPRC and 18.4% in AUROC. Our results demonstrate that TarDiff not only preserves temporal fidelity but also enhances downstream model performance, offering a robust solution to data scarcity and class imbalance in healthcare analytics.
- [296] arXiv:2504.17614 [pdf, other]
-
Title: Bolt: Clothing Virtual Characters at Scale
Subjects: Graphics (cs.GR)
Clothing virtual characters is a time-consuming and often manual process. Outfits can be composed of multiple garments, and each garment must be fitted to the unique shape of a character. Since characters can vary widely in size and shape, fitting outfits to many characters is a combinatorially large problem. We present Bolt, a system designed to take outfits originally authored on a source body and fit them to new body shapes via a three-stage transfer, drape, and rig process. First, our new garment transfer method transforms each garment's 3D mesh positions to the new character, then optimizes the garment's 2D sewing pattern while maintaining key features of the original seams and boundaries. Second, our system simulates the transferred garments to progressively drape and untangle each garment in the outfit. Finally, the garments are rigged to the new character. This entire process is automatic, making it feasible to clothe characters at scale with no human intervention. Clothed characters are then ready for immediate use in applications such as gaming, animation, synthetic generation, and more.
- [297] arXiv:2504.17615 [pdf, html, other]
-
Title: Linear-Time Multilevel Graph Partitioning via Edge Sparsification
Subjects: Data Structures and Algorithms (cs.DS)
The current landscape of balanced graph partitioning is divided into high-quality but expensive multilevel algorithms and cheaper approaches with linear running time, such as single-level algorithms and streaming algorithms. We demonstrate how to achieve the best of both worlds with a linear-time multilevel algorithm. Multilevel algorithms construct a hierarchy of increasingly smaller graphs by repeatedly contracting clusters of nodes. Our approach preserves their distinct advantage, allowing refinement of the partition over multiple levels with increasing detail. At the same time, we use edge sparsification to guarantee geometric size reduction between the levels and thus linear running time.
We provide a proof of the linear running time as well as additional insights into the behavior of multilevel algorithms, showing that graphs with low modularity are most likely to trigger worst-case running time. We evaluate multiple approaches for edge sparsification and integrate our algorithm into the state-of-the-art multilevel partitioner KaMinPar, maintaining its excellent parallel scalability. As demonstrated in detailed experiments, this results in a $1.49\times$ average speedup (up to $4\times$ for some instances) with only 1% loss in solution quality. Moreover, our algorithm clearly outperforms state-of-the-art single-level and streaming approaches.
- [298] arXiv:2504.17617 [pdf, html, other]
-
Title: Decentralized Time Series Classification with ROCKET Features
Comments: Submitted to Workshop on Federated Learning Advancements 2025, in conjunction with ECML-PKDD, WAFL25
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Time series classification (TSC) is a critical task with applications in various domains, including healthcare, finance, and industrial monitoring. Due to privacy concerns and data regulations, Federated Learning (FL) has emerged as a promising approach for learning from distributed time series data without centralizing raw information. However, most FL solutions rely on a client-server architecture, which introduces robustness and confidentiality risks related to the distinguished role of the server, which is a single point of failure and can observe knowledge extracted from clients. To address these challenges, we propose DROCKS, a fully decentralized FL framework for TSC that leverages ROCKET (RandOm Convolutional KErnel Transform) features. In DROCKS, the global model is trained by sequentially traversing a structured path across federation nodes, where each node refines the model and selects the most effective local kernels before passing them to the successor. Extensive experiments on the UCR archive demonstrate that DROCKS outperforms state-of-the-art client-server FL approaches while being more resilient to node failures and malicious attacks. Our code is available at this https URL.
- [299] arXiv:2504.17618 [pdf, html, other]
-
Title: The effects of Hessian eigenvalue spectral density type on the applicability of Hessian analysis to generalization capability assessment of neural networks
Comments: 11 pages, 10 figures, 4 tables, 4 equations
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Hessians of neural networks (NNs) contain essential information about the curvature of NN loss landscapes, which can be used to estimate NN generalization capabilities. We have previously proposed generalization criteria that rely on the observation that Hessian eigenvalue spectral density (HESD) behaves similarly for a wide class of NNs. This paper further studies their applicability by investigating factors that can result in different types of HESD. We conduct a wide range of experiments showing that HESD mainly has positive eigenvalues (MP-HESD) for NN training and fine-tuning with various optimizers on different datasets with different preprocessing and augmentation procedures. We also show that mainly negative HESD (MN-HESD) is a consequence of external gradient manipulation, indicating that the previously proposed Hessian analysis methodology cannot be applied in such cases. We also propose criteria and corresponding conditions to determine HESD type and estimate NN generalization potential. These HESD types and previously proposed generalization criteria are combined into a unified HESD analysis methodology. Finally, we discuss how HESD changes during training, and show the occurrence of quasi-singular (QS) HESD and its influence on the proposed methodology and on the conventional assumptions about the relation between Hessian eigenvalues and NN loss landscape curvature.
- [300] arXiv:2504.17619 [pdf, html, other]
-
Title: Enhancing CNNs robustness to occlusions with bioinspired filters for border completion
Comments: Submitted to the 7th International Conference on Geometric Science of Information
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We exploit the mathematical modeling of the visual cortex mechanism for border completion to define custom filters for CNNs. We see a consistent improvement in performance, particularly in accuracy, when our modified LeNet 5 is tested with occluded MNIST images.
- [301] arXiv:2504.17626 [pdf, html, other]
-
Title: Improving Open-World Object Localization by Discovering Background
Authors: Ashish Singh, Michael J. Jones, Kuan-Chuan Peng, Anoop Cherian, Moitreya Chatterjee, Erik Learned-Miller
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary-information, such as depth information, pixel/region affinity map etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.
- [302] arXiv:2504.17629 [pdf, html, other]
-
Title: Integrated Sensing and Communications for Unsourced Random Access: A Spectrum Sharing Compressive Sensing Approach
Subjects: Information Theory (cs.IT)
This paper addresses the unsourced/uncoordinated random access problem in an integrated sensing and communications (ISAC) system, with a focus on uplink multiple access code design. Recent theoretical advancements highlight that an ISAC system will be overwhelmed by the increasing number of active devices, driven by the growth of massive machine-type communication (mMTC). To meet the demands of future mMTC networks, fundamental solutions are required that ensure robust capacity while maintaining favorable energy and spectral efficiency. One promising approach to support emerging massive connectivity is the development of systems based on the unsourced ISAC (UNISAC) framework. This paper proposes a spectrum-sharing compressive sensing-based UNISAC (SSCS-UNISAC) and offers insights into the practical design of UNISAC multiple access codes. In this framework, both communication signals (data transmission) and sensing signals (e.g., radar echoes) overlap within finite channel uses and are transmitted via the proposed UNISAC protocol. The proposed decoder exhibits robust performance, providing 20-30 dB capacity gains compared to conventional protocols such as TDMA and ALOHA. Numerical results validate the promising performance of the proposed scheme.
- [303] arXiv:2504.17632 [pdf, html, other]
-
Title: Are EVs Cleaner Than We Think? Evaluating Consequential Greenhouse Gas Emissions from EV Charging
Subjects: Systems and Control (eess.SY)
While electrifying transportation eliminates tailpipe greenhouse gas (GHG) emissions, electric vehicle (EV) adoption can create additional electricity sector emissions. To quantify this emissions impact, prior work typically employs short-run marginal emissions or average emissions rates calculated from historical data or power systems models that do not consider changes in installed capacity. In this work, we use an electricity system capacity expansion model to consider the full consequential GHG emissions impact from large-scale EV adoption in the western United States, accounting for induced changes in generation and storage capacity. We find that the metrics described above do not accurately reflect the true emissions impact of EV adoption: average emissions rates can either under- or over-estimate emission impacts, and short-run marginal emissions rates can significantly underestimate emission reductions, especially when charging timing is flexible. Our results also show that using short-run marginal emission rates as signals to coordinate EV charging could increase emissions relative to price-based charging signals, indicating the need for alternative control strategies to minimize consequential emissions.
- [304] arXiv:2504.17633 [pdf, html, other]
-
Title: A general framework for finding diverse solutions via network flow and its applications
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)
In this paper, we present a general framework for efficiently computing diverse solutions to combinatorial optimization problems. Given a problem instance, the goal is to find $k$ solutions that maximize a specified diversity measure: either the sum of pairwise Hamming distances or the size of the union of the $k$ solutions. Our framework applies to problems satisfying two structural properties: (i) all solutions are of equal size and (ii) the family of all solutions can be represented by a surjection from the family of ideals of some finite poset. Under these conditions, we show that the problem of computing $k$ diverse solutions can be reduced to the minimum cost flow problem and the maximum $s$-$t$ flow problem. As applications, we demonstrate that both the unweighted minimum $s$-$t$ cut problem and the stable matching problem satisfy the requirements of our framework. By utilizing recent advances in network flow algorithms, we improve the previously known time complexities of these diverse-solution problems, which were based on submodular function minimization.
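The two diversity measures named in this abstract can be illustrated with a minimal sketch. It assumes each solution is represented as a set of elements, so that the Hamming distance between two solutions is the size of their symmetric difference; the function names and the toy solutions are hypothetical, and the sketch only evaluates the measures, it does not implement the flow-based framework.

```python
from itertools import combinations

def sum_pairwise_hamming(solutions):
    """Sum of Hamming distances over all unordered pairs of solutions.

    Assumption: each solution is a set, so the Hamming distance between
    two solutions is the size of their symmetric difference.
    """
    return sum(len(a ^ b) for a, b in combinations(solutions, 2))

def union_size(solutions):
    """Number of distinct elements covered by the solutions together."""
    return len(set().union(*solutions))

# Three hypothetical solutions of equal size (k = 3).
solutions = [{1, 2, 3}, {1, 4, 5}, {2, 4, 6}]

print(sum_pairwise_hamming(solutions))  # 12
print(union_size(solutions))            # 6
```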
- [305] arXiv:2504.17634 [pdf, html, other]
-
Title: Sparsity-Exploiting Channel Estimation For Unsourced Random Access With Fluid Antenna
Subjects: Information Theory (cs.IT)
This work explores the channel estimation (CE) problem in uplink transmission for unsourced random access (URA) with a fluid antenna receiver. The additional spatial diversity in a fluid antenna system (FAS) addresses the needs of URA design in multiple-input and multiple-output (MIMO) systems. We present two CE strategies based on the activation of different FAS ports, namely alternate ports and partial ports CE. Both strategies facilitate the estimation of channel coefficients and angles of arrival (AoAs). Additionally, we discuss how to refine channel estimation by leveraging the sparsity of finite scatterers. Specifically, the proposed partial ports CE strategy is implemented using a regularized estimator, and we optimize the estimator's parameter to achieve the desired AoA precision and refinement. Extensive numerical results demonstrate the feasibility of the proposed strategies, and a comparison with a conventional receiver using half-wavelength antennas highlights the promising future of integrating URA and FAS.
- [306] arXiv:2504.17636 [pdf, html, other]
-
Title: A Guide to Structureless Visual Localization
Authors: Vojtech Panek, Qunjie Zhou, Yaqing Ding, Sérgio Agostinho, Zuzana Kukelova, Torsten Sattler, Laura Leal-Taixé
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing what is, to the best of our knowledge, the first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.
- [307] arXiv:2504.17641 [pdf, html, other]
-
Title: PTCL: Pseudo-Label Temporal Curriculum Learning for Label-Limited Dynamic Graph
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL (Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generates pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graphs. Experiments across real-world scenarios demonstrate PTCL's consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at this https URL.
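The weighting idea in this abstract, higher weights for pseudo-labels closer to the final timestamp via an exponentially decaying function, can be sketched as follows. The exact functional form, the `decay` parameter, and the normalization are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def temporal_weights(timestamps, final_time, decay=0.5):
    """Weight exp(-decay * (final_time - t)) for a pseudo-label at time t.

    Assumption: `decay` is a hypothetical rate; the weight equals 1 at the
    final timestamp and shrinks exponentially for earlier timestamps.
    """
    return [math.exp(-decay * (final_time - t)) for t in timestamps]

def weighted_loss(losses, weights):
    """Weighted average of per-timestamp losses (normalization is an assumption)."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

timestamps = [0, 1, 2, 3]
weights = temporal_weights(timestamps, final_time=3)

# Later timestamps receive strictly larger weights.
print(weights)
print(weighted_loss([1.0, 0.8, 0.6, 0.4], weights))
```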
- [308] arXiv:2504.17643 [pdf, html, other]
-
Title: CLIPSE -- a minimalistic CLIP-based image search engine for research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
A brief overview of CLIPSE, a self-hosted image search engine intended primarily for research use, is provided. In general, CLIPSE uses CLIP embeddings to process both the images and the text queries. The overall framework is designed with simplicity in mind to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
- [309] arXiv:2504.17646 [pdf, html, other]
-
Title: Portability of Optimizations from SC to TSO
Comments: Submitted Manuscript. This pre-print has not undergone any post-review modifications/improvements
Subjects: Programming Languages (cs.PL)
It is well recognized that the safety of compiler optimizations is at risk in a concurrent context. Existing approaches primarily rely on context-free thread-local guarantees, and prohibit optimizations that introduce a data race. However, compilers utilize global context-specific information, exposing safe optimizations that may violate such guarantees as well as introduce a race. Such optimizations need to be proven safe individually for each language model. An alternative approach would be to prove them safe for an intuitive model (like interleaving semantics), and then determine their portability across other concurrency models. In this paper, we address this problem of porting across models of concurrency. We first identify a global guarantee on optimizations portable from Sequential Consistency (SC) to Total Store Order (TSO). Our guarantee is in the form of constraints specifying the syntactic changes an optimization must not incur. We then show these constraints correlate to prohibiting the introduction of triangular races, a subset of data races relevant to TSO. We conclude by showing how such race-inducing optimizations relate to porting across Strong Release Acquire (SRA), a known causally consistent memory model.
- [310] arXiv:2504.17647 [pdf, html, other]
-
Title: Unifying Complementarity Constraints and Control Barrier Functions for Safe Whole-Body Robot Control
Authors: Rafael I. Cabral Muchacho, Riddhiman Laha, Florian T. Pokorny, Luis F.C. Figueredo, Nilanjan Chakraborty
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Safety-critical whole-body robot control demands reactive methods that ensure collision avoidance in real-time. Complementarity constraints and control barrier functions (CBF) have emerged as core tools for ensuring such safety constraints, and each represents a well-developed field. Despite addressing similar problems, their connection remains largely unexplored. This paper bridges this gap by formally proving the equivalence between these two methodologies for sampled-data, first-order systems, considering both single and multiple constraint scenarios. By demonstrating this equivalence, we provide a unified perspective on these techniques. This unification has theoretical and practical implications, facilitating the cross-application of robustness guarantees and algorithmic improvements between complementarity and CBF frameworks. We discuss these synergistic benefits and motivate future work in the comparison of the methods in more general cases.
- [311] arXiv:2504.17649 [pdf, html, other]
-
Title: On Josephy-Halley method for generalized equations
Comments: 17 pages, 3 figures
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
We extend the classical third-order Halley iteration to the setting of generalized equations of the form \[ 0 \in f(x) + F(x), \] where \(f\colon X\longrightarrow Y\) is twice continuously Fréchet-differentiable on Banach spaces and \(F\colon X\rightrightarrows Y\) is a set-valued mapping with closed graph. Building on a predictor-corrector framework, our scheme first solves a partially linearized inclusion to produce a predictor \(u_{k+1}\), then incorporates second-order information in a Halley-type corrector step to obtain \(x_{k+1}\). Under metric regularity of the linearization at a reference solution and Hölder continuity of \(f''\), we prove that the iterates converge locally with order \(2+p\) (cubically when \(p=1\)). Moreover, by constructing a suitable scalar majorant function we derive semilocal Kantorovich-type conditions guaranteeing well-definedness and R-cubic convergence from an explicit neighbourhood of the initial guess. Numerical experiments, including one- and two-dimensional test problems, confirm the theoretical convergence rates and illustrate the efficiency of the Josephy-Halley method compared to its Josephy-Newton counterpart.
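For orientation, the classical Halley iteration that this abstract generalizes can be sketched for a scalar equation \(f(x)=0\), i.e. the special case \(F(x)=\{0\}\). This is only the textbook scalar method, not the predictor-corrector scheme of the paper; the function name `halley`, the tolerance, and the test problem \(x^2-2=0\) are choices made for illustration.

```python
import math

def halley(f, df, d2f, x0, tol=1e-12, max_iter=50):
    """Classical scalar Halley iteration:
    x_{k+1} = x_k - 2 f f' / (2 f'^2 - f f'').
    """
    x = x0
    for _ in range(max_iter):
        fx, dfx, d2fx = f(x), df(x), d2f(x)
        denom = 2.0 * dfx * dfx - fx * d2fx
        if denom == 0.0:
            raise ZeroDivisionError("Halley denominator vanished")
        step = 2.0 * fx * dfx / denom
        x -= step
        # Stop once the update is below the tolerance.
        if abs(step) < tol:
            return x
    raise RuntimeError("no convergence within max_iter")

# Illustrative problem: solve x^2 - 2 = 0 starting from x0 = 1.
root = halley(lambda x: x * x - 2.0, lambda x: 2.0 * x, lambda x: 2.0, x0=1.0)
print(root, math.sqrt(2.0))
```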
- [312] arXiv:2504.17653 [pdf, other]
-
Title: Towards a comprehensive taxonomy of online abusive language informed by machine leaning
Subjects: Computation and Language (cs.CL)
The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
- [313] arXiv:2504.17655 [pdf, html, other]
-
Title: Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction
Comments: 17 pages, 5 figures, and 2 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
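The pipeline this abstract describes (classifier outputs turned into prediction sets, then scored by empirical coverage and average set size) can be sketched with standard split conformal prediction. The nonconformity score used here, one minus the probability of the true class, is an assumption chosen as a simple example; the function names and the toy probabilities are hypothetical and do not come from the paper.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from calibration data.

    Score (assumed): 1 - probability assigned to the true label.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

def coverage_and_size(test_probs, test_labels, qhat):
    """Empirical coverage and average prediction set size."""
    sets = [prediction_set(p, qhat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size

# Toy calibration data: softmax outputs for 3 classes and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

test_probs = [[0.5, 0.45, 0.05], [0.05, 0.05, 0.9]]
test_labels = [1, 0]
print(prediction_set(test_probs[0], qhat))
print(coverage_and_size(test_probs, test_labels, qhat))
```

To mirror the two pipelines in the abstract, the same functions could be run once on raw softmax outputs and once on temperature-scaled outputs, comparing the resulting coverage and set sizes.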
- [314] arXiv:2504.17656 [pdf, html, other]
-
Title: polyGen: A Learning Framework for Atomic-level Polymer Structure Generation
Subjects: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Synthetic polymeric materials underpin fundamental technologies in the energy, electronics, consumer goods, and medical sectors, yet their development still suffers from prolonged design timelines. Although polymer informatics tools have helped speed up development, polymer simulation protocols continue to face a significant challenge: on-demand generation of realistic 3D atomic structures that respect the conformational diversity of polymer structures. Generative algorithms for 3D structures of inorganic crystals, bio-polymers, and small molecules exist, but have not addressed synthetic polymers. In this work, we introduce polyGen, the first latent diffusion model designed specifically to generate realistic polymer structures from minimal inputs such as the repeat unit chemistry alone, leveraging a molecular encoding that captures polymer connectivity throughout the architecture. Because the available dataset is scarce, with only 3855 DFT-optimized polymer structures, we augment our training with DFT-optimized molecular structures, showing improvement in joint learning between similar chemical structures. We also establish structure matching criteria to benchmark our approach on this novel problem. polyGen effectively generates diverse conformations of both linear chains and complex branched structures, though its performance decreases when handling repeat units with a high atom count. Given these initial results, polyGen represents a paradigm shift in atomic-level structure generation for polymer science: the first proof-of-concept for predicting realistic atomic-level polymer conformations while accounting for their intrinsic structural flexibility.
- [315] arXiv:2504.17660 [pdf, html, other]
-
Title: Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
Subjects: Machine Learning (cs.LG)
Simulation-based inference (SBI) offers a flexible and general approach to performing Bayesian inference: In SBI, a neural network is trained on synthetic data simulated from a model and used to rapidly infer posterior distributions for observed data. A key goal for SBI is to achieve accurate inference with as few simulations as possible, especially for expensive simulators. In this work, we address this challenge by repurposing recent probabilistic foundation models for tabular data: We show how tabular foundation models -- specifically TabPFN -- can be used as pre-trained autoregressive conditional density estimators for SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks (NPE-PF) and show that it is competitive with current SBI approaches in terms of accuracy for both benchmark tasks and two complex scientific inverse problems. Crucially, it often substantially outperforms them in terms of simulation efficiency, sometimes requiring orders of magnitude fewer simulations. NPE-PF eliminates the need for inference network selection, training, and hyperparameter tuning. We also show that it exhibits superior robustness to model misspecification and can be scaled to simulation budgets that exceed the context size limit of TabPFN. NPE-PF provides a new direction for SBI, where training-free, general-purpose inference models offer efficient, easy-to-use, and flexible solutions for a wide range of stochastic inverse problems.
- [316] arXiv:2504.17662 [pdf, other]
-
Title: Seamless Data Migration between Database Schemas with DAMI-Framework: An Empirical Study on Developer ExperienceDelfina Ramos-Vidal, Alejandro Cortiñas, Miguel R. Luaces, Oscar Pedreira, Ángeles Saavedra Places, Wesley K. G. AssunçãoSubjects: Software Engineering (cs.SE); Databases (cs.DB)
Many businesses depend on legacy systems, which often use outdated technology that complicates maintenance and updates. Therefore, software modernization is essential, particularly data migration between different database schemas. Established methodologies, like model transformation and ETL tools, facilitate this migration; however, they require deep knowledge of database languages and of both the source and target schemas. This necessity renders data migration an error-prone and cognitively demanding task. Our objective is to alleviate developers' workloads during schema evolution through our DAMI-Framework. This framework incorporates a domain-specific language (DSL) and a parser to facilitate data migration between database schemas. DAMI-DSL simplifies schema mapping while the parser automates SQL script generation. We conduct an empirical evaluation with 21 developers to assess their experience of data migration using our DSL versus traditional SQL. The study allows us to measure their perceptions of the DSL properties and user experience. The participants praised DAMI-DSL for its readability and ease of use. The findings indicate that our DSL reduces data migration efforts compared to SQL scripts.
- [317] arXiv:2504.17663 [pdf, html, other]
-
Title: The Malicious Technical Ecosystem: Exposing Limitations in Technical Governance of AI-Generated Non-Consensual Intimate Images of AdultsJournal-ref: In the 2025 Conference on Human Factors in Computing Systems Sociotechnical AI Governance Workshop (CHI-STAIG'25), April 2025, Yokohama, JapanSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
In this paper, we adopt a survivor-centered approach to locate and dissect the role of sociotechnical AI governance in preventing AI-Generated Non-Consensual Intimate Images (AIG-NCII) of adults, colloquially known as "deep fake pornography." We identify a "malicious technical ecosystem" or "MTE," comprising open-source face-swapping models and nearly 200 "nudifying" software programs that allow non-technical users to create AIG-NCII within minutes. Then, using the National Institute of Standards and Technology (NIST) AI 100-4 report as a reflection of current synthetic content governance methods, we show how the current landscape of practices fails to effectively regulate the MTE for adult AIG-NCII, and we identify the flawed assumptions that explain these gaps.
- [318] arXiv:2504.17664 [pdf, html, other]
-
Title: On Multivariate Financial Time Series ClassificationSubjects: Machine Learning (cs.LG)
This article investigates the use of Machine Learning and Deep Learning models in multivariate time series analysis within financial markets. It compares small and big data approaches, focusing on their distinct challenges and the benefits of scaling. Traditional methods such as SVMs are contrasted with modern architectures like ConvTimeNet. The results show the importance of using and understanding Big Data in depth in the analysis and prediction of financial time series.
- [319] arXiv:2504.17665 [pdf, html, other]
-
Title: Evaluating Grounded Reasoning by Code-Assisted Large Language Models for MathematicsSubjects: Computation and Language (cs.CL)
Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks. Our evaluation focuses on the extent to which LLMs ground their programs to math rules, and how that affects their end performance. For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. Our results reveal that the distribution of grounding depends on LLMs' capabilities and the difficulty of math problems. Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs decreased to half, while the ungrounded generations doubled in comparison to ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs' capabilities and limits in the math domain.
- [320] arXiv:2504.17666 [pdf, html, other]
-
Title: A Systematic Study on the Design of Odd-Sized Highly Nonlinear Boolean Functions via Evolutionary AlgorithmsComments: 28 pages, 10 figures, extended version of the conference paper "A Systematic Evaluation of Evolving Highly Nonlinear Boolean Functions in Odd Sizes" published in EuroGP 2025Subjects: Neural and Evolutionary Computing (cs.NE); Cryptography and Security (cs.CR)
This paper focuses on the problem of evolving Boolean functions of odd sizes with high nonlinearity, a property of cryptographic relevance. Despite its simple formulation, this problem turns out to be remarkably difficult. We perform a systematic evaluation by considering three solution encodings and four problem instances, analyzing how well different types of evolutionary algorithms behave in finding a maximally nonlinear Boolean function. Our results show that genetic programming generally outperforms other evolutionary algorithms, although it falls short of the best-known results achieved by ad-hoc heuristics. Interestingly, by adding local search and restricting the space to rotation symmetric Boolean functions, we show that a genetic algorithm with the bitstring encoding manages to evolve a $9$-variable Boolean function with nonlinearity 241.
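For readers unfamiliar with the fitness criterion, the nonlinearity of an n-variable Boolean function f is commonly defined as NL(f) = 2^(n-1) - max|W_f|/2, where W_f is the Walsh-Hadamard spectrum of f. The following is a minimal Python sketch of that standard definition, not the evolutionary algorithms evaluated in the paper; the function names `walsh_hadamard` and `nonlinearity` are illustrative assumptions.

```python
def walsh_hadamard(truth_table):
    """Walsh-Hadamard spectrum of a Boolean function given as a 0/1 truth table."""
    # Map output bits to signs: 0 -> +1, 1 -> -1.
    spectrum = [1 - 2 * bit for bit in truth_table]
    size = len(spectrum)
    step = 1
    # In-place fast Walsh-Hadamard transform (butterfly passes).
    while step < size:
        for start in range(0, size, 2 * step):
            for i in range(start, start + step):
                a, b = spectrum[i], spectrum[i + step]
                spectrum[i], spectrum[i + step] = a + b, a - b
        step *= 2
    return spectrum


def nonlinearity(truth_table):
    """Hamming distance from the function to the nearest affine function."""
    size = len(truth_table)
    if size == 0 or size & (size - 1):
        raise ValueError("truth table length must be a power of two")
    max_abs = max(abs(w) for w in walsh_hadamard(truth_table))
    return size // 2 - max_abs // 2


# 3-variable majority function, indexed by the integer value of the inputs.
majority3 = [0, 0, 0, 1, 0, 1, 1, 1]
print(nonlinearity(majority3))
```

An evolutionary search of the kind the abstract describes would use a quantity like this as its fitness value; for 9 variables the truth table has 512 entries, so the transform is cheap to evaluate many times.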
- [321] arXiv:2504.17669 [pdf, html, other]
-
Title: Towards a HIPAA Compliant Agentic AI System in HealthcareSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Agentic AI systems, powered by Large Language Models (LLMs) as their foundational reasoning engine, are transforming clinical workflows such as medical report generation and clinical summarization by autonomously analyzing sensitive healthcare data and executing decisions with minimal human oversight. However, their adoption demands strict compliance with regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA), particularly when handling Protected Health Information (PHI). This work-in-progress paper introduces a HIPAA-compliant Agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Our framework integrates three core mechanisms: (1) Attribute-Based Access Control (ABAC) for granular PHI governance, (2) a hybrid PHI sanitization pipeline combining regex patterns and a BERT-based model to minimize leakage, and (3) immutable audit trails for compliance verification.
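To make the regex half of such a sanitization pipeline concrete, here is a minimal Python sketch of pattern-based redaction. It is not the paper's implementation: the three patterns, the placeholder labels, and the function name `redact_phi` are illustrative assumptions, and real PHI detection would need far broader coverage (which is presumably why the framework pairs regexes with a learned model).

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_phi(text):
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Patient SSN 123-45-6789, phone 555-123-4567, email jane.doe@example.org."
print(redact_phi(note))
```

Replacing matches with typed placeholders rather than deleting them keeps the surrounding sentence readable for a downstream model while removing the identifying value.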
- [322] arXiv:2504.17670 [pdf, html, other]
-
Title: DiMeR: Disentangled Mesh Reconstruction ModelLutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yingcong ChenComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as exclusive input for the geometry branch to reduce the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. As for the texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets.
- [323] arXiv:2504.17671 [pdf, html, other]
-
Title: Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal PredictionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous control of marginal coverage to ensure empirical error rates remain strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $\alpha$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
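The calibration step of split conformal prediction can be sketched in a few lines of Python. This is the textbook procedure under stated assumptions, not the paper's code: the nonconformity score is assumed to be one minus the model's probability for a candidate answer, and the names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    cal_scores holds the nonconformity score of the true answer for each
    calibration example (assumed here to be 1 - p(true answer)).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]


cal = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
q_hat = conformal_threshold(cal, alpha=0.25)
print(q_hat, prediction_set({"cat": 0.7, "dog": 0.25, "bird": 0.05}, q_hat))
```

A smaller `alpha` raises the threshold and therefore enlarges the prediction sets, which matches the inverse relationship between set size and risk level described in the abstract.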
- [324] arXiv:2504.17672 [pdf, html, other]
-
Title: Cross-region Model Training with Communication-Computation Overlapping and Delay CompensationSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). However, the high communication latency in wide-area networks severely degrades the efficiency of traditional distributed training. While methods like DiLoCo reduce communication frequency, they suffer from blocking synchronization. Streaming DiLoCo alleviates this issue via communication-computation overlapping but introduces update staleness and model inconsistency due to delayed global updates and partial synchronization. These factors impair convergence, especially when aggressive overlap is needed to mask high latency. We propose CoCoDC, a novel distributed training framework with communication-computation overlapping and delay compensation, to explicitly tackle these challenges. Within the CoCoDC framework, we specifically develop a novel Delay Compensation strategy based on Taylor expansion to effectively mitigate the staleness and an Adaptive Transmission strategy that dynamically schedules model fragment synchronization to optimize bandwidth usage and accelerate convergence. Extensive experiments highlight the superior performance of CoCoDC over both DiLoCo and Streaming DiLoCo regarding final accuracy and training speed. Specifically, CoCoDC reduces the training steps needed to reach a comparable perplexity by up to 21.0% compared to Streaming DiLoCo. Our work provides an effective solution for scalable and efficient cross-region LLM training.
- [325] arXiv:2504.17673 [pdf, html, other]
-
Title: DTECM: Digital Twin Enabled Channel Measurement and Modeling in Terahertz Urban MacrocellComments: 14 pages, 17 figures, 1 tableSubjects: Information Theory (cs.IT)
In this work, extensive channel measurements are conducted in the THz UMa, and an accurate channel model is developed by combining ray-tracing, computer vision (CV), and statistical methods. Specifically, substantial channel measurement campaigns with distances up to 410 m are conducted at 220 GHz, with nanosecond-level absolute time synchronization. Based on the measurement results, the propagation phenomena are analyzed in detail and the channel characteristics are calculated and statistically modeled. Furthermore, a digital twin enabled channel model (DTECM) is proposed, which generates THz channel responses in a hybrid manner. Specifically, the dominant paths are generated deterministically by using the ray-tracing technique and CV methods. Apart from the path gains determined by ray-tracing, the additional foliage loss is accurately modeled based on foliage information extracted from panoramic pictures. To maintain a low computational complexity for the DTECM, non-dominant paths are then generated statistically. Numerical results reveal that compared to the traditional statistical channel models, the DTECM reduces the path loss modeling error from 14 dB to 4 dB, showing its great superiority. Furthermore, a preliminary link performance evaluation using the DTECM indicates that THz UMa is feasible, though requiring high antenna gains and coverage extension techniques to achieve high spectral efficiencies and wide coverage.
- [326] arXiv:2504.17674 [pdf, html, other]
-
Title: Energy Considerations of Large Language Model Inference and Efficiency OptimizationsComments: 16 pagesSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
- [327] arXiv:2504.17675 [pdf, other]
-
Title: Optimized Cloud Resource Allocation Using Genetic Algorithms for Energy Efficiency and QoS AssuranceComments: 7 pages, 5 figures, accepted for publication (not yet published)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cloud computing environments demand dynamic and efficient resource management to ensure optimal performance, reduced energy consumption, and adherence to Service Level Agreements (SLAs). This paper presents a Genetic Algorithm (GA)-based approach for Virtual Machine (VM) placement and consolidation, aiming to minimize power usage while maintaining QoS constraints. The proposed method dynamically adjusts VM allocation based on real-time workload variations, outperforming traditional heuristics such as First Fit Decreasing (FFD) and Best Fit Decreasing (BFD). Experimental results show notable reductions in energy consumption, VM migrations, SLA violation rates, and execution time. A correlation heatmap further illustrates strong relationships among these key performance indicators, confirming the effectiveness of our approach in optimizing cloud resource utilization.
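For context on the baselines named above, First Fit Decreasing can be sketched as follows. This is a minimal single-resource Python illustration of the classic heuristic, not the paper's genetic algorithm; the function name `first_fit_decreasing` and the assumption of identical hosts with one scalar capacity are simplifications introduced here.

```python
def first_fit_decreasing(vm_demands, host_capacity):
    """Place VMs on identical hosts: largest VM first, first host that fits.

    Returns a list of hosts, each a list of the VM demands placed on it.
    """
    hosts = []  # VM demands assigned to each opened host
    free = []   # remaining capacity of each opened host
    for demand in sorted(vm_demands, reverse=True):
        if demand > host_capacity:
            raise ValueError("a VM demand exceeds the host capacity")
        for idx, remaining in enumerate(free):
            if demand <= remaining:
                hosts[idx].append(demand)
                free[idx] -= demand
                break
        else:
            # No opened host has room, so power on a new one.
            hosts.append([demand])
            free.append(host_capacity - demand)
    return hosts


print(first_fit_decreasing([4, 8, 1, 4, 2, 1], host_capacity=10))
```

Fewer opened hosts is a rough proxy for lower power draw, which is the quantity a GA-based placement like the one described would try to improve on while also respecting QoS constraints that this sketch ignores.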
- [328] arXiv:2504.17677 [pdf, html, other]
-
Title: INSIGHT: Bridging the Student-Teacher Gap in Times of Large Language ModelsSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. This paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students' questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students' questions and provide new insights for the teaching staff to use for more personalized face-to-face support. Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.
- [329] arXiv:2504.17678 [pdf, html, other]
-
Title: MindFlow: A Network Traffic Anomaly Detection Model Based on MindSporeSubjects: Computers and Society (cs.CY)
With the wide application of IoT and industrial IoT technologies, network structures are becoming increasingly complex and traffic volumes are growing rapidly, so traditional security protection mechanisms face serious challenges in dealing with high-frequency, diversified, and stealthy cyber-attacks. To address this problem, this study proposes MindFlow, a multi-dimensional dynamic traffic prediction and anomaly detection system combining convolutional neural network (CNN) and bidirectional long short-term memory network (BiLSTM) architectures based on the MindSpore framework, and conducts systematic experiments on the NF-BoT-IoT dataset. The experimental results show that the proposed model achieves 99% in key metrics such as accuracy, precision, recall and F1 score, effectively verifying its accuracy and robustness in network intrusion detection.
- [330] arXiv:2504.17684 [pdf, other]
-
Title: Evaluating the Vulnerability of ML-Based Ethereum Phishing Detectors to Single-Feature Adversarial PerturbationsComments: 24 pages; an extension of a paper that appeared at WISA 2024Subjects: Cryptography and Security (cs.CR)
This paper explores the vulnerability of machine learning models to simple single-feature adversarial attacks in the context of Ethereum fraudulent transaction detection. Through comprehensive experimentation, we investigate the impact of various adversarial attack strategies on model performance metrics. Our findings, highlighting how prone those techniques are to simple attacks, are alarming, and the inconsistency in the attacks' effect on different algorithms promises ways for attack mitigation. We examine the effectiveness of different mitigation strategies, including adversarial training and enhanced feature selection, in enhancing model robustness and show their effectiveness.
- [331] arXiv:2504.17685 [pdf, html, other]
-
Title: Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching TasksComments: 13 pages, 2 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This study explores the potential of small language model (SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks (aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method's efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach.
- [332] arXiv:2504.17692 [pdf, html, other]
-
Title: User Profiles: The Achilles' Heel of Web BrowsersSubjects: Cryptography and Security (cs.CR)
Web browsers provide the security foundation for our online experiences. Significant research has been done into the security of browsers themselves, but relatively little investigation has been done into how they interact with the operating system or the file system. In this work, we provide the first systematic security study of browser profiles, the on-disk persistence layer of browsers, used for storing everything from users' authentication cookies and browser extensions to certificate trust decisions and device permissions. We show that, except for the Tor Browser, all modern browsers store sensitive data in home directories with little to no integrity or confidentiality controls. We show that security measures like password and cookie encryption can be easily bypassed. In addition, HTTPS, which rests on the Public Key Infrastructure (PKI), the backbone of the secure Web, can be fully bypassed by deploying custom, potentially malicious root certificates within users' browser profiles. More worryingly, we show how these powerful attacks can be fully mounted directly from web browsers themselves, through the File System Access API, a recent feature added by Chromium browsers that enables a website to directly manipulate a user's file system via JavaScript. In a series of case studies, we demonstrate how an attacker can install malicious browser extensions, inject additional root certificates, hijack HTTPS traffic, and enable websites to access hardware devices like the camera and GPS. Based on our findings, we argue that researchers and browser vendors need to develop and deploy more secure mechanisms for protecting users' browser data against file system attackers.
- [333] arXiv:2504.17693 [pdf, html, other]
-
Title: BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction MonitoringAsier Bikandi, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-LopezSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
- [334] arXiv:2504.17695 [pdf, html, other]
-
Title: PICO: Reconstructing 3D People In Contact with ObjectsAlpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios TzionasComments: Accepted in CVPR'25. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at this https URL.
- [335] arXiv:2504.17696 [pdf, html, other]
-
Title: Hierarchical and Multimodal Data for Daily Activity UnderstandingGhazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya PatilSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker.
To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allow counterfactual activities in the dataset.
Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup.
The code, documentation, and dataset are available at the dedicated DARai website: this https URL
- [336] arXiv:2504.17697 [pdf, other]
-
Title: 'The Boring and the Tedious': Invisible Labour in India's Gig-EconomyComments: 6 pages, 2 figuresSubjects: Human-Computer Interaction (cs.HC)
India's gig-based food delivery platforms, such as Swiggy and Zomato, provide crucial income to marginalised communities but also entrench workers in cycles of invisible labour. Through 14 semi-structured interviews, we analyse waiting time and repetitive UI interactions as key burdens that contribute to 'digital discomfort' for gig-based food delivery agents. We find that workers employ creative strategies to navigate algorithmic management, yet remain constrained by platform-side 'gamification' and system opacity. We propose worker-centered GUI automation as a potential intervention to reduce friction while preserving agency. In conclusion, this position paper argues for rethinking HCI approaches in the Global South to prioritise worker autonomy over efficiency-driven design optimisations.
- [337] arXiv:2504.17699 [pdf, html, other]
-
Title: Quadratic Interest Network for Multimodal Click-Through Rate PredictionSubjects: Information Retrieval (cs.IR)
Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore what multimodal recommendation model can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at this https URL.
- [338] arXiv:2504.17701 [pdf, html, other]
-
Title: Network Sampling: An Overview and Comparative Analysis
Comments: 10 pages, 6 figures, 2 tables
Subjects: Social and Information Networks (cs.SI); Statistical Mechanics (cond-mat.stat-mech); Data Analysis, Statistics and Probability (physics.data-an)
Network sampling is a crucial technique for analyzing large or partially observable networks. However, the effectiveness of different sampling methods can vary significantly depending on the context. In this study, we empirically compare representative methods from three main categories: node-based, edge-based, and exploration-based sampling. We used two real-world datasets for our analysis: a scientific collaboration network and a temporal message-sending network. Our results indicate that no single sampling method consistently outperforms the others in both datasets. Although advanced methods tend to provide better accuracy on static networks, they often perform poorly on temporal networks, where simpler techniques can be more effective. These findings suggest that the best sampling strategy depends not only on the structural characteristics of the network but also on the specific metrics that need to be preserved or analyzed. Our work offers practical insights for researchers in choosing sampling approaches that are tailored to different types of networks and analytical objectives.
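The three sampling categories named in this abstract can be illustrated with a minimal sketch. The function names, the toy graph, and the choice of a plain random walk as the exploration-based method are assumptions made for illustration; they are not taken from the paper.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (smaller id, larger id)


def node_sample(edges: List[Edge], k: int, rng: random.Random) -> List[Edge]:
    """Node-based: pick k nodes uniformly and keep the edges they induce."""
    nodes = sorted({v for e in edges for v in e})
    kept = set(rng.sample(nodes, min(k, len(nodes))))
    return [(u, v) for u, v in edges if u in kept and v in kept]


def edge_sample(edges: List[Edge], k: int, rng: random.Random) -> List[Edge]:
    """Edge-based: pick k edges uniformly at random."""
    return rng.sample(edges, min(k, len(edges)))


def random_walk_sample(edges: List[Edge], steps: int, rng: random.Random) -> List[Edge]:
    """Exploration-based: collect the edges traversed by a simple random walk."""
    adj: Dict[int, List[int]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    if not adj:
        return []
    current = rng.choice(sorted(adj))
    visited: Set[Edge] = set()
    for _ in range(steps):
        nxt = rng.choice(adj[current])
        visited.add((min(current, nxt), max(current, nxt)))
        current = nxt
    return sorted(visited)


if __name__ == "__main__":
    toy = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
    rng = random.Random(0)
    print(node_sample(toy, 3, rng))
    print(edge_sample(toy, 3, rng))
    print(random_walk_sample(toy, 10, rng))
```

Comparing a metric such as the degree distribution on each sampled subgraph against the full graph is one way to see how the three strategies differ.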
- [339] arXiv:2504.17703 [pdf, html, other]
-
Title: Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated Learning (FL) has emerged as a transformative paradigm in the field of distributed machine learning, enabling multiple clients such as mobile devices, edge nodes, or organizations to collaboratively train a shared global model without the need to centralize sensitive data. This decentralized approach addresses growing concerns around data privacy, security, and regulatory compliance, making it particularly attractive in domains such as healthcare, finance, and smart IoT systems. This survey provides a concise yet comprehensive overview of Federated Learning, beginning with its core architecture and communication protocol. We discuss the standard FL lifecycle, including local training, model aggregation, and global updates. A particular emphasis is placed on key technical challenges such as handling non-IID (non-independent and identically distributed) data, mitigating system and hardware heterogeneity, reducing communication overhead, and ensuring privacy through mechanisms like differential privacy and secure aggregation. Furthermore, we examine emerging trends in FL research, including personalized FL, cross-device versus cross-silo settings, and integration with other paradigms such as reinforcement learning and quantum computing. We also highlight real-world applications and summarize benchmark datasets and evaluation metrics commonly used in FL research. Finally, we outline open research problems and future directions to guide the development of scalable, efficient, and trustworthy FL systems.
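The lifecycle described in this abstract (local training, model aggregation, global update) can be sketched with weighted federated averaging on a one-parameter model. The model `y = w * x`, the learning rate, and all names below are hypothetical choices for illustration, not details from the survey.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately by one client


def local_update(w: float, data: List[Sample], lr: float = 0.1, epochs: int = 5) -> float:
    """Local training: a few gradient steps on the client's squared error."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fed_avg(global_w: float, clients: List[List[Sample]], rounds: int = 20) -> float:
    """Each round: broadcast the model, train locally, average by data size."""
    total = sum(len(d) for d in clients)
    for _ in range(rounds):
        local_ws = [local_update(global_w, d) for d in clients]
        # Aggregation: only model parameters leave the clients, never raw data.
        global_w = sum(len(d) * w for d, w in zip(clients, local_ws)) / total
    return global_w


if __name__ == "__main__":
    # Two clients whose private data both follow y = 2x.
    clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
    print(fed_avg(0.0, clients))
```

Privacy mechanisms such as secure aggregation or differential privacy would wrap the aggregation step; they are omitted here to keep the sketch short.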
- [340] arXiv:2504.17704 [pdf, html, other]
-
Title: Safety in Large Reasoning Models: A Survey
Subjects: Computation and Language (cs.CL)
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.
- [341] arXiv:2504.17705 [pdf, html, other]
-
Title: LUIDA: Large-scale Unified Infrastructure for Digital Assessments based on Commercial Metaverse Platform
Subjects: Human-Computer Interaction (cs.HC)
Online experiments using metaverse platforms have gained significant traction in Human-Computer Interaction and Virtual Reality (VR) research. However, current research workflows are highly fragmented, as researchers must use separate tools for system implementation, participant recruitment, experiment execution, and data collection, reducing consistency and increasing workload. We present LUIDA (Large-scale Unified Infrastructure for Digital Assessments), a metaverse-based framework that integrates these fragmented processes. LUIDA automatically allocates interconnected virtual environments for parallel experiment execution and provides implementation templates adaptable to various VR research domains, requiring minimal metaverse development expertise. Our evaluation included two studies using a prototype built on Cluster, the commercial metaverse platform. First, VR researchers using LUIDA to develop and run experiments reported high usability scores (SUS: 73.75) and moderate workload (NASA-TLX: 24.11) for overall usage, with interviews confirming streamlined workflows compared to traditional laboratory experiments. Second, we conducted three replicated experiments with public Cluster users, each recruiting approximately 200 participants within one week. These experiments produced results that closely matched the original studies, validating the experimental integrity of LUIDA across research domains. After technical refinements, we plan to release LUIDA as an open platform, providing a standardized protocol to improve research efficiency and experimental reproducibility in VR studies.
- [342] arXiv:2504.17708 [pdf, html, other]
-
Title: Pushing the frontiers of subexponential FPT time for Feedback Vertex Set
Comments: To appear in the proceedings of ICALP 2025
Subjects: Data Structures and Algorithms (cs.DS)
The paper deals with the Feedback Vertex Set problem parameterized by the solution size. Given a graph $G$ and a parameter $k$, one has to decide if there is a set $S$ of at most $k$ vertices such that $G-S$ is acyclic. Assuming the Exponential Time Hypothesis, it is known that FVS cannot be solved in time $2^{o(k)}n^{\mathcal{O}(1)}$ in general graphs. To overcome this, many recent results considered FVS restricted to particular intersection graph classes and provided such $2^{o(k)}n^{\mathcal{O}(1)}$ algorithms.
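The decision problem defined above can be checked directly on small instances. The sketch below verifies whether a candidate set $S$ leaves an acyclic graph and, as a baseline only, tries all vertex subsets of size at most $k$. This brute force is exponential and is not the algorithm of the paper; the function names are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def is_acyclic(edges: Iterable[Edge]) -> bool:
    """Union-find cycle test for an undirected graph given as an edge list."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:  # u and v already connected, so this edge closes a cycle
            return False
        parent[ru] = rv
    return True


def is_feedback_vertex_set(edges: List[Edge], s: Set[int]) -> bool:
    """True if deleting the vertices in s leaves a forest (G - S is acyclic)."""
    remaining = [(u, v) for u, v in edges if u not in s and v not in s]
    return is_acyclic(remaining)


def has_fvs_of_size(edges: List[Edge], k: int) -> bool:
    """Brute-force decision: does some set of at most k vertices hit all cycles?"""
    vertices = sorted({v for e in edges for v in e})
    for size in range(k + 1):
        for cand in combinations(vertices, size):
            if is_feedback_vertex_set(edges, set(cand)):
                return True
    return False


if __name__ == "__main__":
    triangle_with_tail = [(0, 1), (1, 2), (0, 2), (2, 3)]
    print(is_feedback_vertex_set(triangle_with_tail, {2}))
    print(has_fvs_of_size(triangle_with_tail, 0))
```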
In this paper we provide generic conditions on a graph class for the existence of an algorithm solving FVS in subexponential FPT time, i.e. time $2^{k^\varepsilon} \mathop{\rm poly}(n)$, for some $\varepsilon<1$, where $n$ denotes the number of vertices of the instance and $k$ the parameter. On the one hand this result unifies algorithms that have been proposed over the years for several graph classes such as planar graphs, map graphs, unit-disk graphs, pseudo-disk graphs, and string graphs of bounded edge-degree. On the other hand it extends the tractability horizon of FVS to new classes that are not amenable to previously used techniques, in particular intersection graphs of "thin" objects like segment graphs or more generally $s$-string graphs.
- [343] arXiv:2504.17709 [pdf, html, other]
-
Title: Fault Diagnosis in New Wind Turbines using Knowledge from Existing Turbines by Generative Domain Adaptation
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Intelligent condition monitoring of wind turbines is essential for reducing downtimes. Machine learning models trained on wind turbine operation data are commonly used to detect anomalies and, eventually, operation faults. However, data-driven normal behavior models (NBMs) require a substantial amount of training data, as NBMs trained with scarce data may result in unreliable fault diagnosis. To overcome this limitation, we present a novel generative deep learning approach to make SCADA samples from one wind turbine lacking training data resemble SCADA data from wind turbines with representative training data. Through CycleGAN-based domain mapping, our method enables the application of an NBM trained on an existing wind turbine to one with severely limited data. We demonstrate our approach on field data mapping SCADA samples across 7 substantially different wind turbines. Our findings show significantly improved fault diagnosis in wind turbines with scarce data. Our method achieves the most similar anomaly scores to an NBM trained with abundant data, outperforming NBMs trained on scarce training data with improvements of +10.3% in F1-score when 1 month of training data is available and +16.8% when 2 weeks are available. The domain mapping approach outperforms conventional fine-tuning at all considered degrees of data scarcity, ranging from 1 to 8 weeks of training data. The proposed technique enables earlier and more reliable fault diagnosis in newly installed wind farms, demonstrating a novel and promising research direction to improve anomaly detection when faced with training data scarcity.
- [344] arXiv:2504.17712 [pdf, html, other]
-
Title: Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields
Subjects: Computer Vision and Pattern Recognition (cs.CV)
StyleGAN has demonstrated the ability of GANs to synthesize highly realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in W latent space, which is more expressive than Z latent space. However, W space still has restricted expressivity since it does not control the feature synthesis directly; also the feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolutional neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis at synthesis time.
- [345] arXiv:2504.17716 [pdf, html, other]
-
Title: Online metric TSP
Subjects: Data Structures and Algorithms (cs.DS)
In the online metric traveling salesperson problem, $n$ points of a metric space arrive one by one and have to be placed (immediately and irrevocably) into empty cells of a size-$n$ array. The goal is to minimize the sum of distances between consecutive points in the array. This problem was introduced by Abrahamsen, Bercea, Beretta, Klausen, and Kozma [ESA'24] as a generalization of the online sorting problem, which was introduced by Aamand, Abrahamsen, Beretta, and Kleist [SODA'23] as a tool in their study of online geometric packing problems.
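To make the objective concrete, the sketch below computes the cost of a filled array and runs a naive online placement rule on the unit interval. The rule (aim for the cell proportional to the point's value, then fall back to the nearest empty cell) is a hypothetical baseline chosen for illustration; it is not one of the algorithms from the cited works and carries no competitive-ratio guarantee.

```python
from typing import Callable, List, Optional


def array_cost(cells: List[float], dist: Callable[[float, float], float]) -> float:
    """Sum of distances between consecutive points in the filled array."""
    return sum(dist(a, b) for a, b in zip(cells, cells[1:]))


def proportional_placement(points: List[float]) -> List[float]:
    """Place points from [0, 1] one by one, immediately and irrevocably."""
    n = len(points)
    cells: List[Optional[float]] = [None] * n
    for p in points:
        target = min(int(p * n), n - 1)  # cell suggested by the point's value
        empty = [i for i in range(n) if cells[i] is None]
        idx = min(empty, key=lambda i: abs(i - target))  # nearest empty cell
        cells[idx] = p
    return [c for c in cells if c is not None]


if __name__ == "__main__":
    arrival_order = [0.9, 0.1, 0.5, 0.3]
    filled = proportional_placement(arrival_order)
    print(filled, array_cost(filled, lambda a, b: abs(a - b)))
```

Passing a different `dist` function to `array_cost` gives the cost in other metrics, for example the uniform metric where every distance is 0 or 1.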
Online metric TSP has been studied for a range of fixed metric spaces. For 1-dimensional Euclidean space, the problem is equivalent to online sorting, where an optimal competitive ratio of $\Theta(\sqrt n)$ is known. For $d$-dimensional Euclidean space, the best-known upper bound is $O(2^{d} \sqrt{dn\log n})$, leaving a gap to the $\Omega(\sqrt n)$ lower bound. Finally, for the uniform metric, where all distances are 0 or 1, the optimal competitive ratio is known to be $\Theta(\log n)$.
We study the problem for a general metric space, presenting an algorithm with competitive ratio $O(\sqrt n)$. In particular, we close the gap for $d$-dimensional Euclidean space, completely removing the dependence on dimension. One might hope to simultaneously guarantee competitive ratio $O(\sqrt n)$ in general and $O(\log n)$ for the uniform metric, but we show that this is impossible.
- [346] arXiv:2504.17717 [pdf, html, other]
-
Title: Early Detection of Multidrug Resistance Using Multivariate Time Series Analysis and Interpretable Patient-Similarity Representations
Authors: Óscar Escudero-Arnanz, Antonio G. Marques, Inmaculada Mora-Jiménez, Joaquín Álvarez-Rodríguez, Cristina Soguero-Ruiz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Background and Objectives: Multidrug Resistance (MDR) is a critical global health issue, causing increased hospital stays, healthcare costs, and mortality. This study proposes an interpretable Machine Learning (ML) framework for MDR prediction, aiming for both accurate inference and enhanced explainability.
Methods: Patients are modeled as Multivariate Time Series (MTS), capturing clinical progression and patient-to-patient interactions. Similarity among patients is quantified using MTS-based methods: descriptive statistics, Dynamic Time Warping, and Time Cluster Kernel. These similarity measures serve as inputs for MDR classification via Logistic Regression, Random Forest, and Support Vector Machines, with dimensionality reduction and kernel transformations improving model performance. For explainability, patient similarity networks are constructed from these metrics. Spectral clustering and t-SNE are applied to identify MDR-related subgroups and visualize high-risk clusters, enabling insight into clinically relevant patterns.
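Of the similarity measures listed in the Methods, Dynamic Time Warping is the easiest to show compactly. The following is a textbook dynamic-programming sketch for multivariate series with a Euclidean local cost; the paper's exact variant, windowing, and normalization are not stated in the abstract, so these choices are assumptions.

```python
import math
from typing import List, Sequence

Series = Sequence[Sequence[float]]  # one feature vector per time step


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Local cost between two time steps."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(s: Series, t: Series) -> float:
    """Classic O(len(s) * len(t)) DTW with steps (i-1, j), (i, j-1), (i-1, j-1)."""
    n, m = len(s), len(t)
    inf = float("inf")
    d: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclidean(s[i - 1], t[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


if __name__ == "__main__":
    patient_a = [[0.0], [1.0], [2.0]]
    patient_b = [[0.0], [1.0], [1.0], [2.0]]  # same trajectory, one step repeated
    print(dtw_distance(patient_a, patient_b))
```

A matrix of pairwise distances computed this way could then be turned into a kernel or a similarity graph, which is the role the abstract assigns to these measures.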
Results: The framework was validated on ICU Electronic Health Records from the University Hospital of Fuenlabrada, achieving an AUC of 81%. It outperforms baseline ML and deep learning models by leveraging graph-based patient similarity. The approach identifies key risk factors -- prolonged antibiotic use, invasive procedures, co-infections, and extended ICU stays -- and reveals clinically meaningful clusters. Code and results are available at this https URL.
Conclusions: Patient similarity representations combined with graph-based analysis provide accurate MDR prediction and interpretable insights. This method supports early detection, risk factor identification, and patient stratification, highlighting the potential of explainable ML in critical care.
- [347] arXiv:2504.17720 [pdf, html, other]
-
Title: Multilingual Performance Biases of Large Language Models in Education
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
- [348] arXiv:2504.17721 [pdf, html, other]
-
Title: Conformal Segmentation in Industrial Surface Defect Detection with Statistical Guarantees
Comments: Under Review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In industrial settings, surface defects on steel can significantly compromise its service life and elevate potential safety risks. Traditional defect detection methods predominantly rely on manual inspection, which suffers from low efficiency and high costs. Although automated defect detection approaches based on Convolutional Neural Networks (e.g., Mask R-CNN) have advanced rapidly, their reliability remains challenged due to data annotation uncertainties during deep model training and overfitting issues. These limitations may lead to detection deviations when processing new test samples, rendering automated detection processes unreliable. To address this challenge, we first evaluate the detection model's practical performance through calibration data that satisfies the independent and identically distributed (i.i.d.) condition with test data. Specifically, we define a loss function for each calibration sample to quantify detection error rates, such as the complement of recall rate and false discovery rate. Subsequently, we derive a statistically rigorous threshold based on a user-defined risk level to identify high-probability defective pixels in test images, thereby constructing prediction sets (e.g., defect regions). This methodology ensures that the expected error rate (mean error rate) on the test set remains strictly bounded by the predefined risk level. Additionally, we observe a negative correlation between the average prediction set size and the risk level on the test set, establishing a statistically rigorous metric for assessing detection model uncertainty. Furthermore, our study demonstrates robust and efficient control over the expected test set error rate across varying calibration-to-test partitioning ratios, validating the method's adaptability and operational effectiveness.
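The calibration step described in this abstract resembles conformal risk control, so a minimal sketch in that style may help. It assumes the loss is the complement of recall, bounded in [0, 1], and that the threshold is chosen as the largest grid value whose finite-sample-corrected calibration risk stays below the risk level. Whether the paper uses exactly this correction is an assumption, and all names are hypothetical.

```python
from typing import List

Image = List[float]  # flattened per-pixel defect probabilities
Mask = List[int]     # flattened ground truth, 1 = defective pixel


def miss_rate(probs: Image, labels: Mask, t: float) -> float:
    """Complement of recall when the prediction set is {pixels with prob >= t}."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    missed = sum(1 for p, y in zip(probs, labels) if y == 1 and p < t)
    return missed / positives


def calibrate_threshold(cal_probs: List[Image], cal_labels: List[Mask],
                        alpha: float, grid: int = 101) -> float:
    """Largest threshold t on the grid with (n/(n+1)) * risk(t) + 1/(n+1) <= alpha.

    The miss rate never decreases as t grows, so scanning from high to low
    and stopping at the first feasible value returns the largest feasible t.
    """
    n = len(cal_probs)
    for k in range(grid - 1, -1, -1):
        t = k / (grid - 1)
        risk = sum(miss_rate(p, y, t) for p, y in zip(cal_probs, cal_labels)) / n
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return t
    return 0.0  # t = 0 keeps every pixel, so no defect pixel is missed


if __name__ == "__main__":
    probs = [[0.9, 0.6, 0.2]] * 9   # toy calibration set of 9 identical images
    labels = [[1, 1, 0]] * 9
    print(calibrate_threshold(probs, labels, alpha=0.2))
```

A stricter risk level (smaller `alpha`) forces a lower threshold and therefore larger prediction sets, which matches the negative correlation between set size and risk level reported in the abstract.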
- [349] arXiv:2504.17723 [pdf, html, other]
-
Title: Towards Robust LLMs: an Adversarial Robustness Measurement Framework
Comments: 17 pages, 5 figures
Subjects: Machine Learning (cs.LG)
The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. By comparing RoMA's estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
- [350] arXiv:2504.17725 [pdf, html, other]
-
Title: STGen: A Novel Lightweight IoT Testbed for Generating Sensor Traffic for the Experimentation of IoT Protocol and its Application in Hybrid Network
Comments: 23 Pages, 12 Figures, Submitted to ACM Transactions on Sensor Networks
Subjects: Networking and Internet Architecture (cs.NI)
A Wireless Sensor Network (WSN) is a network that does not rely on a fixed infrastructure and consists of numerous sensors, such as temperature, humidity, GPS, and cameras, equipped with onboard processors that manage and monitor the environment in a specific area. As a result, building a real sensor network testbed for verifying, validating, or experimenting with a newly designed protocol presents considerable challenges in adapting a laboratory scenario due to the significant financial and logistical barriers, such as the need for specialized hardware and large-scale deployments. Additionally, WSN suffers from severe constraints such as restricted power supply, short communication range, limited bandwidth availability, and restricted memory storage. Addressing these challenges, this work presents a flexible testbed solution named STGen that enables researchers to experiment with IoT protocols in a hybrid environment that emulates WSN implementations with the physical Internet through a dedicated physical server named STGen core, which receives sensor traffic and processes it for further actions. The STGen testbed is lightweight in memory usage and easy to deploy. Most importantly, STGen supports large-scale distributed systems, facilitates experimentation with IoT protocols, and enables integration with back-end services for big data analytics and statistical insights. The key feature of STGen is the integration of real-world IoT protocols and their applications with WSN. Its modular and lightweight design makes STGen efficient and enables it to outperform other popular testbeds, such as Gotham and GothX, reducing memory usage by 89%. While GothX takes approximately 26 minutes to establish a large topology with four VM nodes and 498 Docker nodes, STGen requires only 1.645 seconds to initialize the platform with 500 sensor nodes.
- [351] arXiv:2504.17728 [pdf, html, other]
-
Title: CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
Comments: Source Code: this https URL
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Recently, photo-realistic novel view synthesis from multi-view images, using methods such as neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS), has garnered widespread attention due to its superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capture of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, but they typically require capturing multi-view sharp images with different exposure times at fixed camera positions during exposure, which is time-consuming and challenging in practice. For more flexible data acquisition, we propose a one-stage method, CasualHDRSplat, to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. CasualHDRSplat contains a unified differentiable physical imaging model which first applies a continuous-time trajectory constraint to the imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and the sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at this https URL
- [352] arXiv:2504.17729 [pdf, html, other]
-
Title: Fully-Mixed Virtual Element Method for the Biot Problem
Subjects: Numerical Analysis (math.NA)
Poroelasticity describes the interaction of deformation and fluid flow in saturated porous media. A fully-mixed formulation of Biot's poroelasticity problem has the advantage of producing a better approximation of the Darcy velocity and stress field, as well as satisfying local mass and momentum conservation. In this work, we focus on a novel four-fields Virtual Element discretization of Biot's equations. The stress symmetry is strongly imposed in the definition of the discrete space, thus avoiding the use of an additional Lagrange multiplier. A complete a priori analysis is performed, showing the robustness of the proposed numerical method with respect to limiting material properties. The first order convergence of the lowest-order fully-discrete numerical method, which is obtained by coupling the spatial approximation with the backward Euler time-advancing scheme, is confirmed by a complete 3D numerical validation. A well known poroelasticity benchmark is also considered to assess the robustness properties and computational performance.
- [353] arXiv:2504.17732 [pdf, html, other]
-
Title: DPMambaIR:All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model's adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69 dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
- [354] arXiv:2504.17735 [pdf, html, other]
-
Title: EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor
Authors: Akhil Padmanabha, Saravanan Govindarajan, Hwanmun Kim, Sergio Ortiz, Rahul Rajan, Doruk Senkal, Sneha Kadetotad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Human activity recognition (HAR) on smartglasses has various use cases, including health/fitness tracking and input for context-aware AI assistants. However, current approaches for egocentric activity recognition suffer from low performance or are resource-intensive. In this work, we introduce a resource (memory, compute, power, sample) efficient machine learning algorithm, EgoCHARM, for recognizing both high level and low level activities using a single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our hierarchical algorithm employs a semi-supervised learning strategy, requiring primarily high level activity labels for training, to learn generalizable low level motion embeddings that can be effectively utilized for low level activity recognition. We evaluate our method on 9 high level and 3 low level activities achieving 0.826 and 0.855 F1 scores on high level and low level activity recognition respectively, with just 63k high level and 22k low level model parameters, allowing the low level encoder to be deployed directly on current IMU chips with compute. Lastly, we present results and insights from a sensitivity analysis and highlight the opportunities and limitations of activity recognition using egocentric IMUs.
- [355] arXiv:2504.17736 [pdf, html, other]
-
Title: Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies
Subjects: Systems and Control (eess.SY)
Exosuits have recently been developed as alternatives to rigid exoskeletons and are increasingly adopted for both upper and lower limb therapy and assistance in clinical and home environments. Many cable-driven exosuits have been developed but little has been published on their electromechanical designs and performance. Therefore, this paper presents a comprehensive design and performance analysis of a two degree of freedom tendon driver unit (TDU) for cable-driven wearable exosuits. Detailed methodologies are presented to benchmark the functionality of the TDU. A static torque output test compares the commanded and measured torques. A velocity control test evaluates the attenuation and phase shift across velocities. A noise test evaluates how loud the TDU is for the wearer under different speeds. A thermal stress test captures the cooling performance of the TDU to ensure safe operation at higher loads. Finally, a battery endurance test evaluates the runtime of the TDU under various loading conditions to inform the usable time. To demonstrate these tests, a modular TDU system for cable-driven applications is introduced, which allows components such as motors, pulleys, and sensors to be adapted based on the requirements of the intended application. By sharing detailed methodologies and performance results, this study aims to provide a TDU design that may be leveraged by others and resources for researchers and engineers to better document the capabilities of their TDU designs.
- [356] arXiv:2504.17739 [pdf, html, other]
-
Title: Interpretable Early Detection of Parkinson's Disease through Speech Analysis
Subjects: Machine Learning (cs.LG)
Parkinson's disease is a progressive neurodegenerative disorder affecting motor and non-motor functions, with speech impairments among its earliest symptoms. Speech impairments offer a valuable diagnostic opportunity, with machine learning advances providing promising tools for timely detection. In this research, we propose a deep learning approach for early Parkinson's disease detection from speech recordings, which also highlights the vocal segments driving predictions to enhance interpretability. This approach seeks to associate predictive speech patterns with articulatory features, providing a basis for interpreting underlying neuromuscular impairments. We evaluated our approach using the Italian Parkinson's Voice and Speech Database, containing 831 audio recordings from 65 participants, including both healthy individuals and patients. Our approach showed competitive classification performance compared to state-of-the-art methods, while providing enhanced interpretability by identifying key speech features influencing predictions.
- [357] arXiv:2504.17740 [pdf, html, other]
-
Title: Embedding Empirical Distributions for Computing Optimal Transport Maps
Subjects: Machine Learning (cs.LG)
Distributional data have become increasingly prominent in modern signal processing, highlighting the necessity of computing optimal transport (OT) maps across multiple probability distributions. Nevertheless, recent studies on neural OT methods predominantly focused on the efficient computation of a single map between two distributions. To address this challenge, we introduce a novel approach to learning transport maps for new empirical distributions. Specifically, we employ the transformer architecture to produce embeddings from distributional data of varying length; these embeddings are then fed into a hypernetwork to generate neural OT maps. Various numerical experiments were conducted to validate the embeddings and the generated OT maps. The model implementation and the code are provided on this https URL.
- [358] arXiv:2504.17743 [pdf, html, other]
-
Title: Realization of Temporally Connected Graphs Based on Degree Sequences
Subjects: Data Structures and Algorithms (cs.DS)
Given an undirected graph $G$, the problem of deciding whether $G$ admits a simple and proper time-labeling that makes it temporally connected is known to be NP-hard (Göbel et al., 1991). In this article, we relax this problem and ask whether a given degree sequence can be realized as a temporally connected graph. Our main results are a complete characterization of the feasible cases, and a recognition algorithm that runs in $O(n)$ time for graphical degree sequences (realized as simple temporal graphs) and in $O(n+m)$ time for multigraphical degree sequences (realized as non-simple temporal graphs, where the number of time labels on an edge corresponds to the multiplicity of the edge in the multigraph). In fact, these algorithms can be made constructive at essentially no cost. Namely, we give a constructive $O(n+m)$ time algorithm that outputs, for a given (multi)graphical degree sequence $\mathbf{d}$, a temporally connected graph whose underlying (multi)graph is a realization of $\mathbf{d}$, if one exists.
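For readers unfamiliar with the term, a degree sequence is graphical when some simple graph realizes it. The classical Erdős–Gallai condition tests this and is sketched below in a simple quadratic form. It is background only: it does not test temporal connectivity and is not the linear-time recognition algorithm announced in the abstract.

```python
from typing import List


def is_graphical(degrees: List[int]) -> bool:
    """Erdős–Gallai test: can a simple undirected graph have these degrees?"""
    d = sorted(degrees, reverse=True)
    if any(x < 0 for x in d) or sum(d) % 2 == 1:
        return False
    n = len(d)
    for k in range(1, n + 1):
        lhs = sum(d[:k])                                   # k largest degrees
        rhs = k * (k - 1) + sum(min(x, k) for x in d[k:])  # most edges they can absorb
        if lhs > rhs:
            return False
    return True


if __name__ == "__main__":
    print(is_graphical([3, 3, 3, 3]))  # degrees of the complete graph on 4 vertices
    print(is_graphical([3, 3, 1, 1]))
```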
- [359] arXiv:2504.17748 [pdf, html, other]
-
Title: Robotic Task Ambiguity Resolution via Natural Language Interaction
Subjects: Robotics (cs.RO)
Language-conditioned policies have recently gained substantial adoption in robotics as they allow users to specify tasks using natural language, making them highly versatile. While much research has focused on improving the action prediction of language-conditioned policies, reasoning about task descriptions has been largely overlooked. Ambiguous task descriptions often lead to downstream policy failures due to misinterpretation by the robotic agent. To address this challenge, we introduce AmbResVLM, a novel method that grounds language goals in the observed scene and explicitly reasons about task ambiguity. We extensively evaluate its effectiveness in both simulated and real-world domains, demonstrating superior task ambiguity detection and resolution compared to recent state-of-the-art baselines. Finally, real robot experiments show that our model improves the performance of downstream robot policies, increasing the average success rate from 69.6% to 97.1%. We make the data, code, and trained models publicly available at this https URL.
- [360] arXiv:2504.17749 [pdf, html, other]
-
Title: MSGCN: Multiplex Spatial Graph Convolution Network for Interlayer Link Weight Prediction
Subjects: Machine Learning (cs.LG)
Graph Neural Networks (GNNs) have been widely used for various learning tasks, ranging from node classification to link prediction. They have demonstrated excellent performance in multiple domains involving graph-structured data. However, an important category of learning tasks, namely link weight prediction, has received less emphasis due to its increased complexity compared to binary link classification. Link weight prediction becomes even more challenging when considering multilayer networks, where nodes can be interconnected across multiple layers. To address these challenges, we propose a new method named Multiplex Spatial Graph Convolution Network (MSGCN), which spatially embeds information across multiple layers to predict interlayer link weights. The MSGCN model generalizes spatial graph convolution to multiplex networks and captures the geometric structure of nodes across multiple layers. Extensive experiments using data with known interlayer link information show that the MSGCN model has robust, accurate, and generalizable link weight prediction performance across a wide variety of multiplex network structures.
- [361] arXiv:2504.17751 [pdf, html, other]
-
Title: Revisiting Reset Mechanisms in Spiking Neural Networks for Sequential Modeling: Specialized Discretization for Binary Activated RNN
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
In the field of image recognition, spiking neural networks (SNNs) have achieved performance comparable to conventional artificial neural networks (ANNs). In such applications, SNNs essentially function as traditional neural networks with quantized activation values. This article focuses on an alternative perspective, viewing SNNs as binary-activated recurrent neural networks (RNNs) for sequential modeling. From this viewpoint, current SNN architectures face several fundamental challenges in sequence modeling: (1) traditional models lack effective memory mechanisms for long-range sequence modeling; (2) the biologically inspired components in SNNs (such as reset mechanisms and refractory period applications) remain theoretically under-explored for sequence tasks; (3) the RNN-like computational paradigm in SNNs prevents parallel training across different time steps. To address these challenges, this study conducts a systematic analysis of the fundamental mechanisms underlying reset operations and refractory periods in binary-activated RNN-based SNN sequence models. We re-examine whether such biological mechanisms are strictly necessary for generating sparse spiking patterns, provide new theoretical explanations and insights, and ultimately propose the fixed-refractory-period SNN architecture for sequence modeling.
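To make the terms concrete, here is a minimal sketch of a discrete-time leaky integrate-and-fire neuron treated as a binary-activated recurrent unit, with a hard reset and a fixed refractory period. It is a textbook-style illustration and not the architecture proposed in the paper. The name `lif_run` and the parameters `decay`, `threshold`, and `refractory_steps` (with their default values) are assumptions chosen for the example.

```python
def lif_run(inputs, decay=0.9, threshold=1.0, refractory_steps=2):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    Returns the binary spike train (a list of 0/1), one entry per time step.
    """
    v = 0.0        # membrane potential, i.e. the recurrent hidden state
    cooldown = 0   # remaining refractory steps
    spikes = []
    for x in inputs:
        if cooldown > 0:
            # Refractory period: the input is ignored and no spike is emitted.
            cooldown -= 1
            spikes.append(0)
            continue
        v = decay * v + x              # leaky integration
        if v >= threshold:
            spikes.append(1)           # binary activation
            v = 0.0                    # hard reset after a spike
            cooldown = refractory_steps
        else:
            spikes.append(0)
    return spikes


print(lif_run([1.0] * 6))                      # refractory period thins the spike train
print(lif_run([1.0] * 6, refractory_steps=0))  # without it, every step spikes
```

The loop shows why such a model resembles an RNN: the state `v` at each step depends on the previous step, so the time steps cannot be evaluated independently in this naive form.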
- [362] arXiv:2504.17752 [pdf, html, other]
-
Title: Disaggregated Deep Learning via In-Physics Computing at Radio Frequency
Comments: 11 pages, 4 figures. Supplementary Information: 54 pages, 20 figures, 1 table
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
Modern edge devices, such as cameras, drones, and Internet-of-Things nodes, rely on deep learning to enable a wide range of intelligent applications, including object recognition, environment perception, and autonomous navigation. However, deploying deep learning models directly on the often resource-constrained edge devices demands significant memory footprints and computational power for real-time inference using traditional digital computing architectures. In this paper, we present WISE, a novel computing architecture for wireless edge networks designed to overcome energy constraints in deep learning inference. WISE achieves this goal through two key innovations: disaggregated model access via wireless broadcasting and in-physics computation of general complex-valued matrix-vector multiplications directly at radio frequency. Using a software-defined radio platform with wirelessly broadcast model weights over the air, we demonstrate that WISE achieves 95.7% image classification accuracy with ultra-low operation power of 6.0 fJ/MAC per client, corresponding to a computation efficiency of 165.8 TOPS/W. This approach enables energy-efficient deep learning inference on wirelessly connected edge devices, achieving more than two orders of magnitude improvement in efficiency compared to traditional digital computing.
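The two efficiency figures quoted above are related by a unit conversion, which the short sketch below works through. It assumes that one MAC is counted as one operation; the abstract does not state the counting convention, so this is an assumption, as are the helper names `tops_per_watt` and `fj_per_op`.

```python
def tops_per_watt(femtojoules_per_op):
    """Convert energy per operation (fJ/op) into tera-operations per second per watt."""
    joules_per_op = femtojoules_per_op * 1e-15
    ops_per_joule = 1.0 / joules_per_op  # 1 W = 1 J/s, so ops/J equals (ops/s) per W
    return ops_per_joule / 1e12


def fj_per_op(tops_w):
    """Inverse conversion: TOPS/W back to femtojoules per operation."""
    return 1e15 / (tops_w * 1e12)


print(round(tops_per_watt(6.0), 1))  # efficiency implied by 6.0 fJ per operation
print(round(fj_per_op(165.8), 2))    # energy per operation implied by 165.8 TOPS/W
```

Under that assumption, 6.0 fJ per operation gives roughly 166.7 TOPS/W, and 165.8 TOPS/W corresponds to about 6.03 fJ per operation, so the two reported numbers agree up to rounding. Counting a MAC as two operations would double the TOPS/W figure and no longer match.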
- [363] arXiv:2504.17753 [pdf, html, other]
-
Title: Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT
Anuja Tayal, Devika Salunke, Barbara Di Eugenio, Paula Allen-Meares, Eulalia Puig Abril, Olga Garcia, Carolyn Dickens, Andrew Boyd
Subjects: Computation and Language (cs.CL)
Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.
- [364] arXiv:2504.17756 [pdf, html, other]
-
Title: On the Degree Automatability of Sum-of-Squares Proofs
Subjects: Computational Complexity (cs.CC); Optimization and Control (math.OC)
The Sum-of-Squares (SoS) hierarchy, also known as the Lasserre hierarchy, has emerged as a promising tool in optimization. However, it remains unclear whether fixed-degree SoS proofs can be automated [O'Donnell (2017)]. Indeed, there are examples of polynomial systems with bounded coefficients that admit low-degree SoS proofs, but these proofs necessarily involve numbers with an exponential number of bits, implying that low-degree SoS proofs cannot always be found efficiently.
A sufficient condition derived from the Nullstellensatz proof system [Raghavendra and Weitz (2017)] identifies cases where bit complexity issues can be circumvented. One of the main problems left open by Raghavendra and Weitz is proving any result for refutations, as their condition applies only to polynomial systems with a large set of solutions.
In this work, we broaden the class of polynomial systems for which degree-$d$ SoS proofs can be automated. To achieve this, we develop a new criterion and demonstrate how it applies to polynomial systems beyond the scope of Raghavendra and Weitz's result. In particular, we establish a separation for instances arising from Constraint Satisfaction Problems (CSPs). Moreover, our result extends to refutations, establishing that polynomial-time refutation is possible for broad classes of polynomial-time solvable constraint problems, highlighting a first advancement in this area.
- [365] arXiv:2504.17759 [pdf, html, other]
-
Title: Identity Control Plane: The Unifying Layer for Zero Trust Infrastructure
Comments: Part of the Zero Trust Identity Foundations series. Authored Jan 2025. Introduces the Identity Control Plane (ICP) as a unifying layer for SPIFFE, brokered automation, and ABAC policy. 10 pages, 1 figure, 1 table. IEEE format. Keywords: Zero Trust, SPIFFE, WIMSE, Identity Control Plane, ABAC, CI/CD Security
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
This paper introduces the Identity Control Plane (ICP), an architectural framework for enforcing identity-aware Zero Trust access across human users, workloads, and automation systems. The ICP model unifies SPIFFE-based workload identity, OIDC/SAML user identity, and scoped automation credentials via broker-issued transaction tokens. We propose a composable enforcement layer using ABAC policy engines (e.g., OPA, Cedar), aligned with IETF WIMSE drafts and OAuth transaction tokens. The paper includes architectural components, integration patterns, use cases, a comparative analysis with current models, and theorized performance metrics. A FedRAMP and SLSA compliance mapping is also presented. This is a theoretical infrastructure architecture paper intended for security researchers and platform architects. No prior version of this work has been published.
- [366] arXiv:2504.17761 [pdf, html, other]
-
Title: Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
Comments: code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
- [367] arXiv:2504.17768 [pdf, html, other]
-
Title: The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
- [368] arXiv:2504.17771 [pdf, html, other]
-
Title: Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control
Haochen Wang, Zhiwei Shi, Chengxi Zhu, Yafei Qiao, Cheng Zhang, Fan Yang, Pengjie Ren, Lan Lu, Dong Xuan
Comments: Accepted to ICRA 2025. Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excellent control policies for challenging agile robot tasks, such as sports robotics. However, no existing work has harmonized learning-based policies with model-based methods to reduce training complexity and ensure the safety and stability of agile badminton robot control. In this paper, we introduce \ourmethod, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for the arm policy. We introduce a physics-informed "IL+RL" training framework for the learning-based arm policy. In this training framework, a model-based strategy with privileged information is used to guide arm policy training during both the IL and RL phases. In addition, we train the critic model during the IL phase to alleviate the performance drop when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving a 94.5% success rate against the serving machine and a 90.7% success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks such as agile catching and table tennis. Our project website: this https URL.
- [369] arXiv:2504.17776 [pdf, html, other]
-
Title: Fitting Tree Metrics and Ultrametrics in Data Streams
Comments: Accepted for publication in the 52nd EATCS International Colloquium on Automata, Languages, and Programming (ICALP)
Subjects: Data Structures and Algorithms (cs.DS)
Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Given a positive distance function $D:\binom{V}{2}\rightarrow\mathbb{R}_{>0}$, the goal is to find a tree (or ultrametric) $T$ including all elements of set $V$ such that the difference between the distances among vertices in $T$ and those specified by $D$ is minimized. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from $V$ (with $|V|=n$), defined by the function $D$, can arrive in an arbitrary order. We study these problems under various distance norms:
For the $\ell_0$ objective, we provide a single-pass polynomial-time $\tilde{O}(n)$-space $O(1)$ approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time.
Next, we show that the algorithm for $\ell_0$ implies an $O(\Delta/\delta)$ approximation for the $\ell_1$ objective, where $\Delta$ is the maximum and $\delta$ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when $\Delta/\delta=O(n)$.
For the $\ell_\infty$ objective, we provide a complete characterization of the ultrametric fitting problem. We present a single-pass polynomial-time $\tilde{O}(n)$-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. We also show that, with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics.
Finally, we extend the results for all these objectives to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.
- [370] arXiv:2504.17780 [pdf, html, other]
-
Title: Replay to Remember: Retaining Domain Knowledge in Streaming Language Models
Sneh Pillai (University of Massachusetts Dartmouth)
Comments: 8 pages, 3 figures, 3 tables
Subjects: Machine Learning (cs.LG)
Continual learning in large language models (LLMs) typically encounters the critical challenge of catastrophic forgetting, where previously acquired knowledge deteriorates upon exposure to new data. While techniques like replay buffers and parameter-efficient tuning (e.g., Low-Rank Adaptation or LoRA) have been proposed, few studies investigate real-time domain adaptation under strict computational and data-stream constraints. In this paper, we demonstrate a lightweight method combining LoRA and a minimal replay mechanism in a realistic streaming setting across three diverse knowledge domains: medical question answering, genetics, and law. Using perplexity, semantic similarity, and GPT-based human-like evaluation metrics, we quantify the model's adaptation, forgetting, and recovery over time. Our experiments reveal that while catastrophic forgetting naturally occurs, even minimal replay significantly stabilizes and partially restores domain-specific knowledge. This study contributes practical insights for deploying adaptable LLMs in resource-constrained, real-world scenarios.
- [371] arXiv:2504.17782 [pdf, html, other]
-
Title: Unleashing the Power of Natural Audio Featuring Multiple Sound Sources
Comments: Work in Progress
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at this https URL.
- [372] arXiv:2504.17784 [pdf, html, other]
-
Title: Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation
Comments: Published at Robotics: Science and Systems (RSS) 2025
Subjects: Robotics (cs.RO)
Bimanual manipulation is a challenging yet crucial robotic capability, demanding precise spatial localization and versatile motion trajectories, which pose significant challenges to existing approaches. Existing approaches fall into two categories: keyframe-based strategies, which predict gripper poses in keyframes and execute them via motion planners, and continuous control methods, which estimate actions sequentially at each timestep. The keyframe-based method lacks inter-frame supervision, struggling to perform consistently or execute curved motions, while the continuous method suffers from weaker spatial perception. To address these issues, this paper introduces an end-to-end framework, PPI (keyPose and Pointflow Interface), which integrates the prediction of target gripper poses and object pointflow with continuous action estimation. These interfaces enable the model to effectively attend to the target manipulation area, while the overall framework guides diverse and collision-free trajectories. By combining interface predictions with continuous action estimation, PPI demonstrates superior performance in diverse bimanual manipulation tasks, providing enhanced spatial localization and satisfying flexibility in handling movement restrictions. In extensive evaluations, PPI significantly outperforms prior methods in both simulated and real-world experiments, achieving state-of-the-art performance with a +16.1% improvement on the RLBench2 simulation benchmark and an average gain of +27.5% across four challenging real-world tasks. Notably, PPI exhibits strong stability, high precision, and remarkable generalization capabilities in real-world scenarios. Project page: this https URL
- [373] arXiv:2504.17785 [pdf, html, other]
-
Title: Silenzio: Secure Non-Interactive Outsourced MLP Training
Subjects: Cryptography and Security (cs.CR)
Outsourcing ML training to cloud providers presents a compelling opportunity for resource-constrained clients, while it simultaneously bears inherent privacy risks, especially for highly sensitive training data. We introduce Silenzio, the first fully non-interactive outsourcing scheme for the training of multi-layer perceptrons that achieves 128-bit security using FHE. Unlike traditional MPC-based protocols that necessitate interactive communication between the client and server(s) or non-collusion assumptions among multiple servers, Silenzio enables the fire-and-forget paradigm without such assumptions. In this approach, the client encrypts the training data once, and the cloud server performs the training without any further interaction.
Silenzio operates over low-bitwidth integers, never exceeding 8 bits, to mitigate the computational overhead of FHE. Our approach features a novel low-bitwidth matrix multiplication that leverages input-dependent residue number systems and a Karatsuba-inspired multiplication routine, ensuring that no intermediate FHE-processed value overflows 8 bits. Starting from an RNS-to-MRNS conversion process, we propose an efficient block-scaling mechanism, which approximately shifts encrypted tensor values to the user-specified most significant bits. To instantiate the backpropagation of the error, Silenzio introduces a low-bitwidth and TFHE-friendly gradient computation for the cross-entropy loss.
Implemented using the state-of-the-art Concrete library, we evaluate Silenzio on standard MLP training tasks regarding runtime as well as model performance and achieve classification accuracy similar to that of MLPs trained using standard PyTorch with 32-bit floating-point computations. Our open-source implementation represents a significant advancement in privacy-preserving ML, providing a new baseline for secure and non-interactive outsourced MLP training.
- [374] arXiv:2504.17787 [pdf, html, other]
-
Title: The Fourth Monocular Depth Estimation Challenge
Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, Weijie Chen, Baobei Xu, Fengyu Sun, Di Xie, Jiang Zhu, Mykola Lavreniuk, Haining Guan, Qun Wu, Yupei Zeng, Chao Lu, Huanran Wang, Guangyuan Zhou, Haotian Zhang, Jianxiong Wang, Qiang Rao, Chunjie Wang, Xiao Liu, Zhiqiang Lou, Hualie Jiang, Yihao Chen, Rui Xu, Minglang Tan, Zihan Qin, Yifan Mao, Jiayang Liu, Jialei Xu, Yifan Yang, Wenbo Zhao, Junjun Jiang, Xianming Liu, Mingshuai Zhao, Anlong Ming, Wu Chen, Feng Xue, Mengying Yu, Shida Gao, Xiangfeng Wang, Gbenga Omotara, Ramy Farag, Jacket Demby, Seyed Mohamad Ali Tousi, Guilherme N DeSouza, Tuan-Anh Yang, Minh-Quang Nguyen, Thien-Phuc Tran, Albert Luginov, Muhammad Shahzad
Comments: To appear in CVPRW2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
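A least-squares alignment with two degrees of freedom can be read as fitting a scale and a shift that map a prediction onto the reference before scoring. The sketch below shows that generic closed-form fit on flat lists of values. The challenge's exact protocol (for example, whether alignment happens in depth or disparity space, and how invalid pixels are masked) is not described in the abstract, so none of that is modeled here, and the name `align_scale_shift` is hypothetical.

```python
def align_scale_shift(pred, target):
    """Return (s, t) minimizing sum((s * p + t - g) ** 2) over paired values."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_p = sum(pred) / n
    mean_g = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant, so the scale is undetermined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, target))
    s = cov / var_p            # optimal scale
    t = mean_g - s * mean_p    # optimal shift
    return s, t


pred = [1.0, 2.0, 3.0, 4.0]
target = [2.0 * p + 0.5 for p in pred]  # reference differs by a scale and a shift
s, t = align_scale_shift(pred, target)
aligned = [s * p + t for p in pred]     # compare `aligned` with `target` when scoring
```

Because the fit absorbs any global scale and offset, a method that predicts affine-invariant outputs is not penalized for the unknown scale and shift, which is the stated purpose of the revised protocol.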
- [375] arXiv:2504.17788 [pdf, html, other]
-
Title: Dynamic Camera Poses and Where to Find Them
Comments: Accepted to CVPR 2025. Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
- [376] arXiv:2504.17789 [pdf, html, other]
-
Title: Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI-benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
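The two operations named above can be pictured as a pure rearrangement: merge each small spatial window of tokens into one wider token, then split it back later. The sketch below shows only that rearrangement on nested Python lists. Any learned layers the actual method uses around these steps are not described in the abstract and are not modeled, and the function names, the window size `s`, and the channel ordering are assumptions made for illustration.

```python
def token_shuffle(grid, s):
    """Merge each s x s window of tokens into one token by concatenating channels.

    grid: H x W nested list of channel vectors (lists); H and W must be divisible by s.
    Returns an (H/s) x (W/s) grid whose tokens have s*s times as many channels.
    """
    h, w = len(grid), len(grid[0])
    if h % s or w % s:
        raise ValueError("grid size must be divisible by the window size")
    out = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            merged = []
            for di in range(s):
                for dj in range(s):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


def token_unshuffle(grid, s):
    """Inverse of token_shuffle: split each merged token back into an s x s window."""
    h, w = len(grid), len(grid[0])
    c = len(grid[0][0]) // (s * s)  # channels per original token
    out = [[None] * (w * s) for _ in range(h * s)]
    for i in range(h):
        for j in range(w):
            tok = grid[i][j]
            for di in range(s):
                for dj in range(s):
                    k = (di * s + dj) * c
                    out[i * s + di][j * s + dj] = tok[k:k + c]
    return out


# A 4 x 4 grid of 2-channel tokens; each token stores its own (row, column).
grid = [[[r, c] for c in range(4)] for r in range(4)]
shuffled = token_shuffle(grid, 2)         # 2 x 2 grid of 8-channel tokens
restored = token_unshuffle(shuffled, 2)   # back to the 4 x 4 layout
```

With window size `s`, the sequence length drops by a factor of `s * s` while each token grows wider by the same factor, which is how a rearrangement of this kind trades channel width for fewer tokens.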
- [377] arXiv:2504.17791 [pdf, html, other]
-
Title: LiDPM: Rethinking Point Diffusion for Lidar Scene Completion
Comments: Accepted to IEEE IV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. It contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is this https URL.
New submissions (showing 377 of 377 entries)
- [378] arXiv:2504.16940 (cross-list from q-bio.NC) [pdf, other]
-
Title: Can deep neural networks learn biological vision?
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Deep neural networks (DNNs) once showed increasing alignment with primate neural responses as they improved on computer vision benchmarks. This trend raised the exciting possibility that better models of biological vision would come as a byproduct of the deep learning revolution in artificial intelligence. However, the trend has reversed over recent years as DNNs have scaled to human or superhuman recognition accuracy, a divergence that may stem from modern DNNs learning to rely on different visual features than primates to solve tasks. Where will better computational models of biological vision come from? We propose that vision science must break from artificial intelligence to develop algorithms that are designed with biological visual systems in mind instead of internet data benchmarks. We predict that the next generation of deep learning models of biological vision will be trained with data diets, training routines, and objectives that are closer to those that shape human vision than those that are in use today.
- [379] arXiv:2504.16941 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: Mathematical Modeling of Protein Structures: A Cohomology-Based Approach to the Flagellar Motor
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Algebraic Topology (math.AT)
This study presents a novel mathematical model derived from cohomology, leveraging the KEEL-proven theorem that establishes cohomology as tautological, generated by boundary classes of curves with fixed dual graphs. Simplicial complexes are constructed using skew-commutative graded algebra, and the structure theorem is applied to connect distinct homologies, enabling precise interpretations of the resulting geometric forms. The proposed model is utilized for protein structure analysis and prediction, with a specific application to the Flagellar Motor structure. This approach offers new insights into the geometric and algebraic foundations of biological macromolecular modeling, highlighting its potential for advancement in structural biology.
- [380] arXiv:2504.16945 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Graph Percolation as Decision Threshold for Risk Management in Cross-Country Thermal Soaring
Subjects: Physics and Society (physics.soc-ph); Social and Information Networks (cs.SI)
Long range flight by fixed-wing aircraft without propulsion systems can be accomplished by "soaring" -- exploiting randomly located updrafts to gain altitude which is expended in gliding flight. As the location of updrafts is uncertain and cannot be determined except through in situ observation, aircraft exploiting this energy source are at risk of failing to find a subsequent updraft. Determining when an updraft must be exploited to continue flight is essential to managing risk and optimizing speed. Graph percolation offers a theoretical explanation for this risk, and a framework for evaluating it using information available to the operator of a soaring aircraft in flight. The utility of graph percolation as a risk measure is examined by analyzing flight logs from human soaring pilots. This analysis indicates that, in sport soaring, pilots rarely operate in a condition which does not satisfy graph percolation, identifies an apparent desired minimum node degree, and shows that pilots accept reduced climb rates in order to maintain percolation.
- [381] arXiv:2504.16979 (cross-list from q-bio.QM) [pdf, other]
-
Title: Automating tumor-infiltrating lymphocyte assessment in breast cancer histopathology images using QuPath: a transparent and accessible machine learning pipeline
Masoud Tafavvoghi, Lars Ailo Bongo, André Berli Delgado, Nikita Shvetsov, Anders Sildnes, Line Moi, Lill-Tove Rasmussen Busund, Kajsa Møllersen
Comments: 16 pages, 9 figures, 3 tables
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In this study, we built an end-to-end tumor-infiltrating lymphocytes (TILs) assessment pipeline within QuPath, demonstrating the potential of easily accessible tools to perform complex tasks in a fully automatic fashion. First, we trained a pixel classifier to segment tumor, tumor-associated stroma, and other tissue compartments in breast cancer H&E-stained whole-slide images (WSI) to isolate tumor-associated stroma for subsequent analysis. Next, we applied a pre-trained StarDist deep learning model in QuPath for cell detection and used the extracted cell features to train a binary classifier distinguishing TILs from other cells. To evaluate our TILs assessment pipeline, we calculated the TIL density in each WSI and categorized them as low, medium, or high TIL levels. Our pipeline was evaluated against pathologist-assigned TIL scores, achieving a Cohen's kappa of 0.71 on the external test set, corroborating previous research findings. These results confirm that existing software can offer a practical solution for the assessment of TILs in H&E-stained WSIs of breast cancer.
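The agreement statistic quoted above can be illustrated with a minimal sketch of unweighted Cohen's kappa, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name and the toy labels are hypothetical, and the abstract does not say whether the authors used a weighted variant, so this should be read as an illustration of the statistic rather than their evaluation code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): per-slide TIL categories from a pipeline and a pathologist.
pipeline = ["low", "low", "medium", "high", "high", "medium"]
pathologist = ["low", "medium", "medium", "high", "high", "low"]
kappa = cohens_kappa(pipeline, pathologist)
```

With these toy labels the raters agree on four of six slides and the chance agreement is one third, so kappa works out to 0.5.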
- [382] arXiv:2504.17029 (cross-list from astro-ph.IM) [pdf, html, other]
-
Title: Fried Parameter Estimation from Single Wavefront Sensor Image with Artificial Neural Networks
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Atmospheric turbulence degrades the quality of astronomical observations in ground-based telescopes, leading to distorted and blurry images. Adaptive Optics (AO) systems are designed to counteract these effects, using atmospheric measurements captured by a wavefront sensor to make real-time corrections to the incoming wavefront. The Fried parameter, r0, characterises the strength of atmospheric turbulence and is an essential control parameter for optimising the performance of AO systems and more recently sky profiling for Free Space Optical (FSO) communication channels. In this paper, we develop a novel data-driven approach, adapting machine learning methods from computer vision for Fried parameter estimation from a single Shack-Hartmann or pyramid wavefront sensor image. Using these data-driven methods, we present a detailed simulation-based evaluation of our approach using the open-source COMPASS AO simulation tool to evaluate both the Shack-Hartmann and pyramid wavefront sensors. Our evaluation is over a range of guide star magnitudes, and realistic noise, atmospheric and instrument conditions. Remarkably, we are able to develop a single network-based estimator that is accurate in both open and closed-loop AO configurations. Our method accurately estimates the Fried parameter from a single WFS image directly from AO telemetry to a few millimetres. Our approach is suitable for real time control, exhibiting 0.83ms r0 inference times on retail NVIDIA RTX 3090 GPU hardware, and thereby demonstrating a compelling economic solution for use in real-time instrument control.
- [383] arXiv:2504.17041 (cross-list from math.LO) [pdf, html, other]
-
Title: Feasibility of Primality in Bounded Arithmetic
Subjects: Logic (math.LO); Computational Complexity (cs.CC)
We prove the correctness of the AKS algorithm \cite{AKS} within the bounded arithmetic theory $T^{count}_2$ or, equivalently, the first-order consequence of the theory $VTC^0$ expanded by the smash function, which we denote by $VTC^0_2$. Our approach initially demonstrates the correctness within the theory $S^1_2 + iWPHP$ augmented by two algebraic axioms and then shows that they are provable in $VTC^0_2$. The two axioms are: a generalized version of Fermat's Little Theorem and an axiom adding a new function symbol which injectively maps roots of polynomials over a definable finite field to numbers bounded by the degree of the given polynomial. To obtain our main result, we also give new formalizations of parts of number theory and algebra:
$\bullet$ In $PV_1$: We formalize Legendre's Formula on the prime factorization of $n!$, key properties of the Combinatorial Number System and the existence of cyclotomic polynomials over the finite fields $Z/p$.
$\bullet$ In $S^1_2$: We prove the inequality $lcm(1,\dots, 2n) \geq 2^n$.
$\bullet$ In $VTC^0$: We verify the correctness of the Kung--Sieveking algorithm for polynomial division.
- [384] arXiv:2504.17077 (cross-list from physics.optics) [pdf, html, other]
-
Title: Physics-guided and fabrication-aware inverse design of photonic devices using diffusion models
Comments: 25 pages, 7 figures
Subjects: Optics (physics.optics); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Designing free-form photonic devices is fundamentally challenging due to the vast number of possible geometries and the complex requirements of fabrication constraints. Traditional inverse-design approaches--whether driven by human intuition, global optimization, or adjoint-based gradient methods--often involve intricate binarization and filtering steps, while recent deep learning strategies demand prohibitively large numbers of simulations (10^5 to 10^6). To overcome these limitations, we present AdjointDiffusion, a physics-guided framework that integrates adjoint sensitivity gradients into the sampling process of diffusion models. AdjointDiffusion begins by training a diffusion network on a synthetic, fabrication-aware dataset of binary masks. During inference, we compute the adjoint gradient of a candidate structure and inject this physics-based guidance at each denoising step, steering the generative process toward high figure-of-merit (FoM) solutions without additional post-processing. We demonstrate our method on two canonical photonic design problems--a bent waveguide and a CMOS image sensor color router--and show that our method consistently outperforms state-of-the-art nonlinear optimizers (such as MMA and SLSQP) in both efficiency and manufacturability, while using orders of magnitude fewer simulations (approximately 2 x 10^2) than pure deep learning approaches (approximately 10^5 to 10^6). By eliminating complex binarization schedules and minimizing simulation overhead, AdjointDiffusion offers a streamlined, simulation-efficient, and fabrication-aware pipeline for next-generation photonic device design. Our open-source implementation is available at this https URL.
- [385] arXiv:2504.17093 (cross-list from math.OC) [pdf, html, other]
-
Title: Singular Arcs in Optimal Control: Closed-loop Implementations without Workarounds
Comments: Submitted to CDC 2025
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Singular arcs emerge in the solutions of Optimal Control Problems (OCPs) when the optimal inputs on some finite time intervals cannot be directly obtained via the optimality conditions. Solving OCPs with singular arcs often requires tailored treatments, suitable for offline trajectory optimization. This approach can become increasingly impractical for online closed-loop implementations, especially for large-scale engineering problems. Recent developments in Integrated Residual Methods (IRM) have indicated their suitability for handling singular arcs; the convergence of error measures in IRM automatically suppresses singular arc-induced fluctuations and leads to non-fluctuating solutions more suitable for practical problems. Through several examples, we demonstrate the advantages of solving OCPs with singular arcs using IRM under an economic model predictive control framework. In particular, the following observations are made: (i) IRM does not require special treatment for singular arcs, (ii) it solves the OCPs reliably with singular arc fluctuation suppressed, and (iii) the closed-loop results closely match the analytic optimal solutions.
- [386] arXiv:2504.17102 (cross-list from math.OC) [pdf, other]
-
Title: Neural Contraction Metrics with Formal Guarantees for Discrete-Time Nonlinear Dynamical Systems
Comments: Accepted by L4DC 2025
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Contraction metrics are crucial in control theory because they provide a powerful framework for analyzing stability, robustness, and convergence of various dynamical systems. However, identifying these metrics for complex nonlinear systems remains an open challenge due to the lack of scalable and effective tools. This paper explores the approach of learning verifiable contraction metrics parametrized as neural networks (NNs) for discrete-time nonlinear dynamical systems. While prior works on formal verification of contraction metrics for general nonlinear systems have focused on convex optimization methods (e.g., linear matrix inequalities) under the assumption of continuously differentiable dynamics, the growing prevalence of NN-based controllers, often utilizing ReLU activations, introduces challenges due to the non-smooth nature of the resulting closed-loop dynamics. To bridge this gap, we establish a new sufficient condition for establishing formal neural contraction metrics for general discrete-time nonlinear systems assuming only the continuity of the dynamics. We show that from a computational perspective, our sufficient condition can be efficiently verified using the state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN, which scales up non-convex neural network verification via novel integration of symbolic linear bound propagation and branch-and-bound. Built upon our analysis tool, we further develop a learning method for synthesizing neural contraction metrics from sampled data. Finally, our approach is validated through the successful synthesis and verification of NN contraction metrics for various nonlinear examples.
- [387] arXiv:2504.17112 (cross-list from stat.ML) [pdf, html, other]
-
Title: Physics-informed features in supervised machine learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Supervised machine learning involves approximating an unknown functional relationship from a limited dataset of features and corresponding labels. The classical approach to feature-based machine learning typically relies on applying linear regression to standardized features, without considering their physical meaning. This may limit model explainability, particularly in scientific applications. This study proposes a physics-informed approach to feature-based machine learning that constructs non-linear feature maps informed by physical laws and dimensional analysis. These maps enhance model interpretability and, when physical laws are unknown, allow for the identification of relevant mechanisms through feature ranking. The method aims to improve both predictive performance in regression tasks and classification skill scores by integrating domain knowledge into the learning process, while also enabling the potential discovery of new physical equations within the context of explainable machine learning.
- [388] arXiv:2504.17114 (cross-list from eess.IV) [pdf, html, other]
-
Title: Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation
Comments: The code is available under this https URL
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Accurate kinetic analysis of [$^{18}$F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [$^{18}$F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of $13.39\%$ for the liver and $10.42\%$ for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.
- [389] arXiv:2504.17116 (cross-list from quant-ph) [pdf, html, other]
-
Title: OneAdapt: Adaptive Compilation for Resource-Constrained Photonic One-Way Quantum Computing
Subjects: Quantum Physics (quant-ph); Hardware Architecture (cs.AR)
Measurement-based quantum computing (MBQC), a.k.a. one-way quantum computing (1WQC), is a universal quantum computing model, which is particularly well-suited for photonic platforms. In this model, computation is driven by measurements on an entangled state, which serves as an intermediate representation (IR) between program and hardware. However, compilers built on previous IRs lack the adaptability to the resource constraints of photonic quantum computers. In this work, we propose a novel IR with new optimization passes. Based on this, we realize a resource-adaptive compiler that minimizes the required hardware size and execution time while restricting the requirement for fusion devices within an adaptive limit. Moreover, our optimization can be integrated with Quantum Error Correction (QEC) to improve the efficiency of photonic fault-tolerant quantum computing (FTQC).
- [390] arXiv:2504.17122 (cross-list from eess.IV) [pdf, html, other]
-
Title: Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET
Comments: The code is available at: this https URL
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Dynamic positron emission tomography (PET) with [$^{18}$F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [$^{18}$F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
- [391] arXiv:2504.17124 (cross-list from physics.app-ph) [pdf, html, other]
-
Title: Demonstration of an AI-driven workflow for dynamic x-ray spectroscopy
Subjects: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
X-ray absorption near edge structure (XANES) spectroscopy is a powerful technique for characterizing the chemical state and symmetry of individual elements within materials, but requires collecting data at many energy points, which can be time-consuming. While adaptive sampling methods exist for efficiently collecting spectroscopic data, they often lack domain-specific knowledge about XANES spectra structure. Here we demonstrate a knowledge-injected Bayesian optimization approach for adaptive XANES data collection that incorporates understanding of spectral features like absorption edges and pre-edge peaks. We show this method accurately reconstructs the absorption edge of XANES spectra using only 15-20% of the measurement points typically needed for conventional sampling, while maintaining the ability to determine the x-ray energy of the sharp peak after the absorption edge with errors less than 0.03 eV, the absorption edge with errors less than 0.1 eV, and overall root-mean-square errors less than 0.005 compared to traditionally sampled spectra. Our experiments on battery materials and catalysts demonstrate the method's effectiveness for both static and dynamic XANES measurements, improving data collection efficiency and enabling better time resolution for tracking chemical changes. This approach advances the degree of automation in XANES experiments, reducing the common errors of under- or over-sampling points near the absorption edge and enabling dynamic experiments that require high temporal resolution or limited measurement time.
- [392] arXiv:2504.17133 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Technologies for Beyond 5G and 6G Networks: Applications, Opportunities, and Challenges
Engin Zeydan, Chamitha De Alwis, Rabia Khan, Yekta Turk, Abdullah Aydeger, Thippa Reddy Gadekallu, Madhusanka Liyanage
Comments: 30 pages, 6 figures, 10 tables
Subjects: Quantum Physics (quant-ph); Networking and Internet Architecture (cs.NI)
As the world prepares for the advent of 6G networks, quantum technologies are becoming critical enablers of the next generation of communication systems. This survey paper investigates the convergence of quantum technologies and 6G networks, focusing on their applications, opportunities and challenges. We begin with an examination of the motivations for integrating quantum technologies into 6G, investigating the potential to overcome the limits of classical computing and cryptography. We then highlight key research gaps, particularly in quantum communication, quantum computing integration and security enhancement. A comprehensive overview of quantum technologies relevant to 6G, including quantum communication devices, quantum computing paradigms, and hybrid quantum-classical approaches is provided. A particular focus is on the role of quantum technologies in enhancing 6G Radio Access Networks (RAN), 6G core and edge network optimization, and 6G security. The survey paper also explores the application of quantum cryptography with a focus on Quantum Key Distribution (QKD), Quantum Secure Direct Communication (QSDC) and quantum-resistant cryptographic algorithms and assesses their implementation challenges and potential impact on 6G networks. We also discuss the significant challenges associated with integrating quantum technologies into existing communications infrastructures, including issues of technological maturity, standardization, and economic considerations. Finally, we summarize the lessons learned from current research and outline future research directions to guide the ongoing development of quantum-enabled 6G networks.
- [393] arXiv:2504.17142 (cross-list from physics.comp-ph) [pdf, html, other]
-
Title: Reinforcement learning framework for the mechanical design of microelectronic components under multiphysics constraints
Comments: 27 pages of main text, 15 figures
Subjects: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
This study focuses on the development of reinforcement learning based techniques for the design of microelectronic components under multiphysics constraints. While traditional design approaches based on global optimization are effective when dealing with a small number of design parameters, different techniques are needed as the complexity of the solution space and of the constraints increases. This is an important reason why the design and optimization of microelectronic components (characterized by a large solution space and multiphysics constraints) is very challenging for traditional methods. By taking as prototypical elements an application-specific integrated circuit (ASIC) and a heterogeneously integrated (HI) interposer, we develop and numerically test an optimization framework based on reinforcement learning (RL). More specifically, we consider the optimization of the bonded interconnect geometry for an ASIC chip as well as the placement of components on an HI interposer while satisfying thermoelastic and design constraints. This placement problem is particularly interesting because it features a high-dimensional solution space.
- [394] arXiv:2504.17154 (cross-list from math.OC) [pdf, html, other]
-
Title: Advancing Frontiers of Path Integral Theory for Stochastic Optimal Control
Subjects: Optimization and Control (math.OC); Robotics (cs.RO); Systems and Control (eess.SY)
Stochastic Optimal Control (SOC) problems arise in systems influenced by uncertainty, such as autonomous robots or financial models. Traditional methods like dynamic programming are often intractable for high-dimensional, nonlinear systems due to the curse of dimensionality. This dissertation explores the path integral control framework as a scalable, sampling-based alternative. By reformulating SOC problems as expectations over stochastic trajectories, it enables efficient policy synthesis via Monte Carlo sampling and supports real-time implementation through GPU parallelization.
We apply this framework to six classes of SOC problems: Chance-Constrained SOC, Stochastic Differential Games, Deceptive Control, Task Hierarchical Control, Risk Mitigation of Stealthy Attacks, and Discrete-Time LQR. A sample complexity analysis for the discrete-time case is also provided. These contributions establish a foundation for simulator-driven autonomy in complex, uncertain environments.
- [395] arXiv:2504.17166 (cross-list from stat.ML) [pdf, html, other]
-
Title: Causal rule ensemble approach for multi-arm data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, which can provide statistical evidence for precision medicine. While most existing methods focus on binary treatment situations, real-world applications often involve multiple interventions. However, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limit their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensemble, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, the performance of our method was evaluated against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieved lower bias and higher estimation accuracy compared with those of existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.
- [396] arXiv:2504.17237 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum-Enhanced Change Detection and Joint Communication-Detection
Comments: 9 pages, 5 figures. To be submitted to Physical Review A. Conference version accepted by ISIT 2025
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
Quick detection of transmittance changes in an optical channel is crucial for secure communication. We demonstrate that pre-shared entanglement using two-mode squeezed vacuum states significantly reduces detection latency compared to classical and entanglement-augmented coherent-state probes. The change detection latency is inversely proportional to the quantum relative entropy (QRE), which goes to infinity in the absence of thermal noise, suggesting idealized instantaneous detection. However, in realistic scenarios, we show that QRE scales logarithmically with the inverse of the thermal noise mean photon number. We propose a receiver that achieves this scaling and quantify its performance gains over existing methods. Additionally, we explore the fundamental trade-off between communication capacity and change detection latency, highlighting how pre-shared entanglement enhances both.
- [397] arXiv:2504.17255 (cross-list from eess.IV) [pdf, other]
-
Title: 3D Deep-learning-based Segmentation of Human Skin Sweat Glands and Their 3D Morphological Response to Temperature Variations
Shaoyu Pei, Renxiong Wu, Hao Zheng, Lang Qin, Shuaichen Lin, Yuxing Gan, Wenjing Huang, Zhixuan Wang, Mohan Qin, Yong Liu, Guangming Ni
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Optics (physics.optics)
Skin, the primary regulator of heat exchange, relies on sweat glands for thermoregulation. Alterations in sweat gland morphology play a crucial role in various pathological conditions and clinical diagnoses. Current methods for observing sweat gland morphology are limited by their two-dimensional, in vitro, and destructive nature, underscoring the urgent need for real-time, non-invasive, quantifiable technologies. We propose a novel three-dimensional (3D) transformer-based multi-object segmentation framework, integrating a sliding window approach, joint spatial-channel attention mechanism, and architectural heterogeneity between shallow and deep layers. Our proposed network enables precise 3D sweat gland segmentation from skin volume data captured by optical coherence tomography (OCT). For the first time, subtle variations of sweat gland 3D morphology in response to temperature changes have been visualized and quantified. Our approach establishes a benchmark for normal sweat gland morphology and provides a real-time, non-invasive tool for quantifying 3D structural parameters. This enables the study of individual variability and pathological changes in sweat gland structure, advancing dermatological research and clinical applications, including thermoregulation and bromhidrosis treatment.
- [398] arXiv:2504.17286 (cross-list from math.CO) [pdf, html, other]
-
Title: Vertex evaluation of multiplex graphs using Forman Curvature
Comments: 16 pages, 9 figures
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
Identifying vertices that play a central role is a fundamental problem in network analysis. Although traditional centrality measures have been widely used for this purpose, the growing complexity of contemporary networks necessitates more sophisticated indicators. Forman curvature has recently emerged as a promising approach. In this paper, we define Forman curvature for multilayer networks, a class of complex networks characterized by multiple types of connections or layers between nodes, which are increasingly used to model intricate real-world phenomena. We establish the key properties of Forman curvature in the context of multilayer networks and demonstrate its utility for identifying vertices that hold central positions within these networks. Furthermore, we show that Forman curvature can also serve as an effective tool for the structural classification of entire multilayer networks.
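For orientation, the single-layer starting point can be sketched as follows. This assumes the common combinatorial form of Forman curvature for an unweighted graph without higher-dimensional cells, F(u, v) = 4 - deg(u) - deg(v), and scores a vertex by the sum of the curvatures of its incident edges. Both the function names and the vertex-aggregation convention are assumptions for illustration; the multilayer definition introduced in the paper is not reproduced here.

```python
from collections import defaultdict


def forman_edge_curvature(edges):
    """Forman curvature 4 - deg(u) - deg(v) for each edge of a simple unweighted graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


def forman_vertex_curvature(edges):
    """Vertex score: sum of Forman curvature over incident edges (one possible convention)."""
    totals = defaultdict(int)
    for (u, v), curvature in forman_edge_curvature(edges).items():
        totals[u] += curvature
        totals[v] += curvature
    return dict(totals)


# A star with four leaves: the hub gets a strongly negative score.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
star_scores = forman_vertex_curvature(star)

# A path on three vertices: every edge has curvature 4 - 1 - 2 = 1.
path = [(0, 1), (1, 2)]
path_scores = forman_vertex_curvature(path)
```

Under this convention, hub-like vertices receive the most negative scores, which is the intuition behind using curvature as a centrality-style indicator.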
- [399] arXiv:2504.17321 (cross-list from physics.geo-ph) [pdf, html, other]
-
Title: Dargana: fine-tuning EarthPT for dynamic tree canopy mapping from space
Comments: 9 pages, 6 figures, spotlight at 'Tackling Climate Change with Machine Learning', ICLR 2025
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
We present Dargana, a fine-tuned variant of the EarthPT time-series foundation model that achieves specialisation using <3% of its pre-training data volume and 5% of its pre-training compute. Dargana is fine-tuned to generate regularly updated classification of tree canopy cover at 10m resolution, distinguishing conifer and broadleaved tree types. Using Cornwall, UK, as a test case, the model achieves a pixel-level ROC-AUC of 0.98 and a PR-AUC of 0.83 on unseen satellite imagery. Dargana can identify fine structures like hedgerows and coppice below the training sample limit, and can track temporal changes to canopy cover such as new woodland establishment. Our results demonstrate how pre-trained Large Observation Models like EarthPT can be specialised for granular, dynamic land cover monitoring from space, providing a valuable, scalable tool for natural capital management and conservation.
- [400] arXiv:2504.17379 (cross-list from eess.IV) [pdf, html, other]
-
Title: A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer-based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks.
- [401] arXiv:2504.17384 (cross-list from physics.geo-ph) [pdf, html, other]
-
Title: On the workflow, opportunities and challenges of developing foundation model in geophysics
Subjects: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Foundation models, as a mainstream technology in artificial intelligence, have demonstrated immense potential across various domains in recent years, particularly in handling complex tasks and multimodal data. In the field of geophysics, although the application of foundation models is gradually expanding, there is currently a lack of comprehensive reviews discussing the full workflow of integrating foundation models with geophysical data. To address this gap, this paper presents a complete framework that systematically explores the entire process of developing foundation models in conjunction with geophysical data. From data collection and preprocessing to model architecture selection, pre-training strategies, and model deployment, we provide a detailed analysis of the key techniques and methodologies at each stage. In particular, considering the diversity, complexity, and physical consistency constraints of geophysical data, we discuss targeted solutions to address these challenges. Furthermore, we discuss how to leverage the transfer learning capabilities of foundation models to reduce reliance on labeled data, enhance computational efficiency, and incorporate physical constraints into model training, thereby improving physical consistency and interpretability. Through a comprehensive summary and analysis of the current technological landscape, this paper not only fills the gap in the geophysics domain regarding a full-process review of foundation models but also offers valuable practical guidance for their application in geophysical data analysis, driving innovation and advancement in the field.
- [402] arXiv:2504.17417 (cross-list from math.OC) [pdf, html, other]
-
Title: Obtaining Structural Network Controllability with Higher-Order Local Dynamics
Comments: Submitted to Transactions on Control of Network Systems
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We consider a network of identical, first-order linear systems, and investigate how replacing a subset of the systems composing the network with higher-order ones, either taken to be generic or specifically designed, may affect its controllability. After establishing a correspondence between state controllability in networks of first-order systems and output controllability in networks of higher-order systems, we show that adding higher-order dynamics may require significantly fewer subsystem modifications to achieve structural controllability, when compared to first-order heterogeneous subsystems. Furthermore, we characterize the topology of networks (which we call X-networks) in which the introduction of heterogeneous local dynamics is not necessary for structural output controllability, as the latter can be attained by suitable higher-order subsystems with homogeneous internal dynamics.
- [403] arXiv:2504.17420 (cross-list from physics.geo-ph) [pdf, html, other]
-
Title: HydroStartML: A combined machine learning and physics-based approach to reduce hydrological model spin-up time
Comments: 13 pages, 14 figures. To be published in Advances in Water Resources
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Finding the initial depth-to-water table (DTWT) configuration of a catchment is a critical challenge when simulating the hydrological cycle with integrated models, significantly impacting simulation outcomes. Traditionally, this involves iterative spin-up computations, where the model runs under constant atmospheric settings until steady-state is achieved. These so-called model spin-ups are computationally expensive, often requiring many years of simulated time, particularly when the initial DTWT configuration is far from steady state.
To accelerate the model spin-up process we developed HydroStartML, a machine learning emulator trained on steady-state DTWT configurations across the contiguous United States. HydroStartML predicts, based on available data like conductivity and surface slopes, a DTWT configuration of the respective watershed, which can be used as an initial DTWT.
Our results show that initializing spin-up computations with HydroStartML predictions leads to faster convergence than with other initial configurations like spatially constant DTWTs. The emulator accurately predicts configurations close to steady state, even for terrain configurations not seen in training, and allows especially significant reductions in computational spin-up effort in regions with deep DTWTs. This work opens the door for hybrid approaches that blend machine learning and traditional simulation, enhancing predictive accuracy and efficiency in hydrology for improving water resource management and understanding complex environmental interactions.
- [404] arXiv:2504.17458 (cross-list from math.CO) [pdf, html, other]
-
Title: Boundedness and Separation in the Graph Covering Number Framework
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
For a graph class $\mathcal G$ and a graph $H$, the four $\mathcal G$-covering numbers of $H$, namely global ${\rm cn}_{g}^{\mathcal{G}}(H)$, union ${\rm cn}_{u}^{\mathcal{G}}(H)$, local ${\rm cn}_{l}^{\mathcal{G}}(H)$, and folded ${\rm cn}_{f}^{\mathcal{G}}(H)$, each measure in a slightly different way how well $H$ can be covered with graphs from $\mathcal G$. For every $\mathcal G$ and $H$ it holds that
\[ {\rm cn}_{g}^{\mathcal{G}}(H) \geq {\rm cn}_{u}^{\mathcal{G}}(H) \geq {\rm cn}_{l}^{\mathcal{G}}(H) \geq {\rm cn}_{f}^{\mathcal{G}}(H) \]
and in general the two sides of each inequality can be arbitrarily far apart. We investigate structural properties of graph classes $\mathcal G$ and $\mathcal H$ such that for all graphs $H \in \mathcal{H}$, a larger $\mathcal G$-covering number of $H$ can be bounded in terms of a smaller $\mathcal G$-covering number of $H$. For example, we prove that if $\mathcal G$ is hereditary and the chromatic number of graphs in $\mathcal H$ is bounded, then there exists a function $f$ (called a binding function) such that for all $H \in \mathcal{H}$ it holds ${\rm cn}_{u}^{\mathcal{G}}(H) \leq f({\rm cn}_{g}^{\mathcal{G}}(H))$.
For $\mathcal G$ we consider graph classes that are component-closed, hereditary, monotone, sparse, or of bounded chromatic number. For $\mathcal H$ we consider graph classes that are sparse, $M$-minor-free, of bounded chromatic number, or of bounded treewidth. For each combination and every pair of $\mathcal G$-covering numbers, we either give a binding function $f$ or provide an example of such $\mathcal{G},\mathcal{H}$ for which no binding function exists.
- [405] arXiv:2504.17538 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: SimFLEX: a methodology for comparative analysis of urban areas for implementing new on-demand feeder bus services
Subjects: Physics and Society (physics.soc-ph); Computers and Society (cs.CY)
On-demand feeder bus services present an innovative solution to urban mobility challenges, yet their success depends on thorough assessment and strategic planning. Despite their potential, a comprehensive framework for evaluating feasibility and identifying suitable service areas remains underdeveloped. Simulation Framework for Feeder Location Evaluation (SimFLEX) uses spatial, demographic, and transport-specific data to run microsimulations and compute key performance indicators (KPIs), including service attractiveness, waiting time reduction, and added value. SimFLEX employs multiple replications to estimate demand and mode choices and integrates OpenTripPlanner (OTP) for public transport routing and ExMAS for calculating shared trip attributes and KPIs. For each demand scenario, we model the traveler learning process using the method of successive averages (MSA), stabilizing the system. After stabilization, we calculate KPIs for comparative and sensitivity analyses. We applied SimFLEX to compare two remote urban areas in Krakow, Poland, Bronowice and Skotniki, both candidates for service launch. Our analysis revealed notable differences between the analyzed areas: Skotniki exhibited higher service attractiveness (up to 30%) and added value (up to 7%), while Bronowice showed greater potential for reducing waiting times (by nearly 77%). To assess the reliability of our model output, we conducted a sensitivity analysis across a range of alternative-specific constants (ASC). The results consistently confirmed Skotniki as the superior candidate for service implementation. SimFLEX is publicly available and applicable to various use cases, and it can help policymakers estimate the performance of a new service in a given area. It can integrate alternative models and approaches, making it a versatile tool for policymakers and urban planners to enhance urban mobility.
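As a rough illustration of the method of successive averages named in the abstract above, the sketch below applies the textbook MSA update x_{k+1} = x_k + (y_k - x_k) / k to a toy feedback loop. The function names and the toy waiting-time response are hypothetical and are not taken from SimFLEX.

```python
def msa(response, x0, iterations=50):
    """Method of successive averages: x_{k+1} = x_k + (y_k - x_k) / k."""
    x = x0
    for k in range(1, iterations + 1):
        y = response(x)        # outcome experienced given the current expectation
        x = x + (y - x) / k    # step size 1/k averages the outcomes seen so far
    return x


def toy_waiting_time(expected_wait):
    # Hypothetical feedback (assumption): a longer expected wait deters
    # travelers, which in turn shortens the wait they actually experience.
    return 10.0 - 0.5 * expected_wait


stable_wait = msa(toy_waiting_time, x0=0.0, iterations=200)
```

With this toy response the iterates settle near the fixed point 20/3, where the expected and the experienced waiting time coincide.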
- [406] arXiv:2504.17546 (cross-list from stat.CO) [pdf, html, other]
-
Title: An introduction to R package `mvs`
Comments: 15 pages, 4 figures. Package vignette corresponding to this https URL
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
In biomedical science, a set of objects or persons can often be described by multiple distinct sets of features obtained from different data sources or modalities (called "multi-view data"). Classical machine learning methods ignore the multi-view structure of such data, limiting model interpretability and performance. The R package `mvs` provides methods that were designed specifically for dealing with multi-view data, based on the multi-view stacking (MVS) framework. MVS is a form of supervised (machine) learning used to train multi-view classification or prediction models. MVS works by training a learning algorithm on each view separately, estimating the predictive power of each view-specific model through cross-validation, and then using another learning algorithm to assign weights to the view-specific models based on their estimated predictions. MVS is a form of ensemble learning, dividing the large multi-view learning problem into smaller sub-problems. Most of these sub-problems can be solved in parallel, making it computationally attractive. Additionally, the number of features of the sub-problems is greatly reduced compared with the full multi-view learning problem. This makes MVS especially useful when the total number of features is larger than the number of observations (i.e., high-dimensional data). MVS can still be applied even if the sub-problems are themselves high-dimensional by adding suitable penalty terms to the learning algorithms. Furthermore, MVS can be used to automatically select the views which are most important for prediction. The R package `mvs` makes fitting MVS models, including such penalty terms, easily and openly accessible. `mvs` allows for the fitting of stacked models with any number of levels, with different penalty terms, different outcome distributions, and provides several options for missing data handling.
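The stacking procedure described in the abstract above (view-specific learners, cross-validated predictions, then a second learner that weights the views) can be sketched in a few lines. This is a Python illustration under strong simplifying assumptions, not the API of the R package `mvs`: each view has a single feature, the base learner is an ordinary least-squares line, and the meta-learner is nonnegative least squares fitted by projected gradient descent. All function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x for a single-feature view."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cv_predictions(xs, ys, folds=4):
    """Out-of-fold predictions of one view-specific model."""
    preds = [0.0] * len(xs)
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), folds):
            preds[i] = a + b * xs[i]
    return preds


def fit_meta_weights(z, ys, lr=0.001, steps=5000):
    """Nonnegative view weights via projected gradient descent on squared error."""
    n, v = len(ys), len(z)
    w = [1.0 / v] * v
    for _ in range(steps):
        resid = [sum(w[j] * z[j][i] for j in range(v)) - ys[i] for i in range(n)]
        for j in range(v):
            grad = 2.0 / n * sum(resid[i] * z[j][i] for i in range(n))
            w[j] = max(0.0, w[j] - lr * grad)  # project onto w_j >= 0
    return w


# Toy data (assumption): view 0 determines the outcome, view 1 is mostly noise.
ys = [float(i) for i in range(20)]
views = [
    [2.0 * y + 1.0 for y in ys],
    [float((7 * i) % 5) for i in range(20)],
]
z = [cv_predictions(xs, ys) for xs in views]
weights = fit_meta_weights(z, ys)
```

In this toy setting the meta-learner puts nearly all weight on the informative view and drives the other weight towards zero, which is the mechanism behind the view selection mentioned in the abstract.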
- [407] arXiv:2504.17548 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Autoencoder for Multivariate Time Series Anomaly Detection
Authors: Kilian Tscharke, Maximilian Wendlinger, Afrae Ahouzi, Pallavi Bhardwaj, Kaweh Amoi-Taleghani, Michael Schrödl-Baumann, Pascal Debus
Comments: Submitted to IEEE International Conference on Quantum Computing and Engineering (QCE) 2025
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Anomaly Detection (AD) defines the task of identifying observations or events that deviate from typical - or normal - patterns, a critical capability in IT security for recognizing incidents such as system misconfigurations, malware infections, or cyberattacks. In enterprise environments like SAP HANA Cloud systems, this task often involves monitoring high-dimensional, multivariate time series (MTS) derived from telemetry and log data. With the advent of quantum machine learning offering efficient calculations in high-dimensional latent spaces, many avenues open for dealing with such complex data. One approach is the Quantum Autoencoder (QAE), an emerging and promising method with potential for application in both data compression and AD. However, prior applications of QAEs to time series AD have been restricted to univariate data, limiting their relevance for real-world enterprise systems. In this work, we introduce a novel QAE-based framework designed specifically for MTS AD towards enterprise scale. We theoretically develop and experimentally validate the architecture, demonstrating that our QAE achieves performance competitive with neural-network-based autoencoders while requiring fewer trainable parameters. We evaluate our model on datasets that closely reflect SAP system telemetry and show that the proposed QAE is a viable and efficient alternative for semisupervised AD in real-world enterprise settings.
- [408] arXiv:2504.17596 (cross-list from math.OC) [pdf, html, other]
-
Title: Rescaling and unconstrained minimisation of convex quadratic maps
Comments: 19 pages, 9 figures
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
We investigate the properties of a class of piecewise-fractional maps arising from the introduction of an invariance under rescaling into convex quadratic maps. The subsequent maps are quasiconvex, and pseudoconvex on specific convex cones; they can be optimised via exact line search along admissible directions, and the iterates then inherit a bidimensional optimality property. We study the minimisation of such relaxed maps via coordinate descents with gradient-based rules, placing a special emphasis on coordinate directions verifying a maximum-alignment property in the reproducing kernel Hilbert spaces related to the underlying positive-semidefinite matrices. In this setting, we illustrate that accounting for the optimal rescaling of the iterates can in certain situations substantially accelerate the unconstrained minimisation of convex quadratic maps.
- [409] arXiv:2504.17622 (cross-list from stat.ML) [pdf, html, other]
-
Title: Likelihood-Free Variational Autoencoders
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Variational Autoencoders (VAEs) typically rely on a probabilistic decoder with a predefined likelihood, most commonly an isotropic Gaussian, to model the data conditional on latent variables. While convenient for optimization, this choice often leads to likelihood misspecification, resulting in blurry reconstructions and poor data fidelity, especially for high-dimensional data such as images. In this work, we propose \textit{EnVAE}, a novel likelihood-free generative framework that has a deterministic decoder and employs the energy score -- a proper scoring rule -- to build the reconstruction loss. This enables likelihood-free inference without requiring explicit parametric density functions. To address the computational inefficiency of the energy score, we introduce a fast variant, \textit{FEnVAE}, based on the local smoothness of the decoder and the sharpness of the posterior distribution of latent variables. This yields an efficient single-sample training objective that integrates seamlessly into existing VAE pipelines with minimal overhead. Empirical results on standard benchmarks demonstrate that \textit{EnVAE} achieves superior reconstruction and generation quality compared to likelihood-based baselines. Our framework offers a general, scalable, and statistically principled alternative for flexible and nonparametric distribution learning in generative modeling.
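For readers unfamiliar with the energy score mentioned in the abstract above, the sketch below computes its standard sample-based estimate in the negatively oriented form ES(P, y) = E||X - y|| - 0.5 E||X - X'||, where lower is better. It is a generic illustration of the scoring rule, not the EnVAE or FEnVAE training objective; the function name is hypothetical.

```python
import math


def energy_score(samples, y):
    """Monte Carlo estimate of ES(P, y) = E||X - y|| - 0.5 * E||X - X'||."""
    m = len(samples)
    # Average distance between the model's samples and the observation.
    fit = sum(math.dist(x, y) for x in samples) / m
    # Average pairwise distance among samples (includes the zero i == j terms).
    spread = sum(math.dist(a, b) for a in samples for b in samples) / (m * m)
    return fit - 0.5 * spread


score = energy_score([(0.0, 0.0), (2.0, 0.0)], (1.0, 0.0))
```

The double sum over samples is quadratic in the number of samples, which is the kind of cost that motivates a faster variant.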
- [410] arXiv:2504.17624 (cross-list from q-bio.BM) [pdf, other]
-
Title: Deciphering the unique dynamic activation pathway in a G protein-coupled receptor enables unveiling biased signaling and identifying cryptic allosteric sites in conformational intermediates
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Neurotensin receptor 1 (NTSR1), a member of the Class A G protein-coupled receptor superfamily, plays an important role in modulating dopaminergic neuronal activity and eliciting opioid-independent analgesia. Recent studies suggest that promoting $\beta$-arrestin-biased signaling in NTSR1 may diminish drugs of abuse, such as psychostimulants, thereby offering a potential avenue for treating human addiction-related disorders. In this study, we utilized a novel computational and experimental approach that combined nudged elastic band-based molecular dynamics simulations, Markov state models, temporal communication network analysis, site-directed mutagenesis, and conformational biosensors, to explore the intricate mechanisms underlying NTSR1 activation and biased signaling. Our study reveals a dynamic stepwise transition mechanism and activated transmission network associated with NTSR1 activation. It also yields valuable insights into the complex interplay between the unique polar network, non-conserved ion locks, and aromatic clusters in NTSR1 signaling. Moreover, we identified a cryptic allosteric site located in the intracellular region of the receptor that exists in an intermediate state within the activation pathway. Collectively, these findings contribute to a more profound understanding of NTSR1 activation and biased signaling at the atomic level, thereby providing a potential strategy for the development of NTSR1 allosteric modulators in the realm of G protein-coupled receptor biology, biophysics, and medicine.
- [411] arXiv:2504.17628 (cross-list from eess.IV) [pdf, other]
-
Title: Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization
Authors: Abderrachid Hamrani, Daniela Leizaola, Renato Sousa, Jose P. Ponce, Stanley Mathis, David G. Armstrong, Anuradha Godavarty
Comments: 12 pages, 8 figures, journal article
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68% and the highest precision of 94.69% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75%, significantly surpassing FUSegNet's 45%. The model's text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
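The two overlap metrics quoted in the abstract above, IoU and the Dice similarity coefficient (DSC), can be computed from binary masks as in the minimal sketch below. The flat-list mask representation, the function name, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def iou_and_dice(pred, truth):
    """Intersection over union and Dice coefficient for flat binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_n = sum(1 for p in pred if p)
    truth_n = sum(1 for t in truth if t)
    union = pred_n + truth_n - inter
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_n + truth_n) if pred_n + truth_n else 1.0
    return iou, dice


iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
```

The two metrics are monotonically related through DSC = 2 IoU / (1 + IoU), so DSC is never smaller than IoU on the same pair of masks.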
- [412] arXiv:2504.17650 (cross-list from quant-ph) [pdf, html, other]
-
Title: Near-Term Pseudorandom and Pseudoresource Quantum States
Comments: 17 pages, 1 figure
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR)
A pseudorandom quantum state (PRS) is an ensemble of quantum states indistinguishable from Haar-random states to observers with efficient quantum computers. It allows one to substitute the costly Haar-random state with efficiently preparable PRS as a resource for cryptographic protocols, while also finding applications in quantum learning theory, black hole physics, many-body thermalization, quantum foundations, and quantum chaos. All existing constructions of PRS equate the notion of efficiency to quantum computers whose runtime is bounded by a polynomial in the input size. In this work, we relax the notion of efficiency for PRS with respect to observers with near-term quantum computers implementing algorithms with runtime that scales slower than polynomial-time. We introduce the $\mathbf{T}$-PRS which is indistinguishable to quantum algorithms with runtime $\mathbf{T}(n)$ that grows slower than polynomials in the input size $n$. We give a set of reasonable conditions that a $\mathbf{T}$-PRS must satisfy and give two constructions by using quantum-secure pseudorandom functions and pseudorandom functions. For $\mathbf{T}(n)$ being linearithmic, linear, polylogarithmic, and logarithmic functions, we characterize the amount of quantum resources a $\mathbf{T}$-PRS must possess, particularly on its coherence, entanglement, and magic. Our quantum resource characterization applies generally to any two state ensembles that are indistinguishable to observers with computational power $\mathbf{T}(n)$, giving a general necessary condition of whether a low-resource ensemble can mimic a high-resource ensemble, forming a $\mathbf{T}$-pseudoresource pair. We demonstrate how the necessary amount of resource decreases as the observer's computational power is more restricted, giving a $\mathbf{T}$-pseudoresource pair with larger resource gap for more computationally limited observers.
- [413] arXiv:2504.17676 (cross-list from eess.SP) [pdf, other]
-
Title: UNILoc: Unified Localization Combining Model-Based Geometry and Unsupervised Learning
Comments: 6 pages, submitted to IEEE conference
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Accurate mobile device localization is critical for emerging 5G/6G applications such as autonomous vehicles and augmented reality. In this paper, we propose a unified localization method that integrates model-based and machine learning (ML)-based methods to reap their respective advantages by exploiting available map information. In order to avoid supervised learning, we generate training labels automatically via optimal transport (OT) by fusing geometric estimates with building layouts. Ray-tracing based simulations are carried out to demonstrate that the proposed method significantly improves positioning accuracy for both line-of-sight (LoS) users (compared to ML-based methods) and non-line-of-sight (NLoS) users (compared to model-based methods). Remarkably, the unified method is able to achieve competitive overall performance with the fully-supervised fingerprinting, while eliminating the need for cumbersome labeled data measurement and collection.
- [414] arXiv:2504.17690 (cross-list from quant-ph) [pdf, html, other]
-
Title: On the Generalization of Adversarially Trained Quantum Classifiers
Comments: 22 pages, 6 figures
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Quantum classifiers are vulnerable to adversarial attacks that manipulate their input classical or quantum data. A promising countermeasure is adversarial training, where quantum classifiers are trained by using an attack-aware, adversarial loss function. This work establishes novel bounds on the generalization error of adversarially trained quantum classifiers when tested in the presence of perturbation-constrained adversaries. The bounds quantify the excess generalization error incurred to ensure robustness to adversarial attacks as scaling with the training sample size $m$ as $1/\sqrt{m}$, while yielding insights into the impact of the quantum embedding. For quantum binary classifiers employing \textit{rotation embedding}, we find that, in the presence of adversarial attacks on classical inputs $\mathbf{x}$, the increase in sample complexity due to adversarial training over conventional training vanishes in the limit of high dimensional inputs $\mathbf{x}$. In contrast, when the adversary can directly attack the quantum state $\rho(\mathbf{x})$ encoding the input $\mathbf{x}$, the excess generalization error depends on the choice of embedding only through its Hilbert space dimension. The results are also extended to multi-class classifiers. We validate our theoretical findings with numerical experiments.
- [415] arXiv:2504.17710 (cross-list from physics.plasm-ph) [pdf, other]
-
Title: Plasma State Monitoring and Disruption Characterization using Multimodal VAEs
Authors: Yoeri Poels, Alessandro Pau, Christian Donner, Giulio Romanelli, Olivier Sauter, Cristina Venturini, Vlado Menkovski, the TCV team, the WPTE team
Subjects: Plasma Physics (physics.plasm-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
When a plasma disrupts in a tokamak, significant heat and electromagnetic loads are deposited onto the surrounding device components. These forces scale with plasma current and magnetic field strength, making disruptions one of the key challenges for future devices. Unfortunately, disruptions are not fully understood, with many different underlying causes that are difficult to anticipate. Data-driven models have shown success in predicting them, but they only provide limited interpretability. On the other hand, large-scale statistical analyses have been a great asset to understanding disruptive patterns. In this paper, we leverage data-driven methods to find an interpretable representation of the plasma state for disruption characterization. Specifically, we use a latent variable model to represent diagnostic measurements as a low-dimensional, latent representation. We build upon the Variational Autoencoder (VAE) framework, and extend it for (1) continuous projections of plasma trajectories; (2) a multimodal structure to separate operating regimes; and (3) separation with respect to disruptive regimes. Subsequently, we can identify continuous indicators for the disruption rate and the disruptivity based on statistical properties of measurement data. The proposed method is demonstrated using a dataset of approximately 1600 TCV discharges, selecting for flat-top disruptions or regular terminations. We evaluate the method with respect to (1) the identified disruption risk and its correlation with other plasma properties; (2) the ability to distinguish different types of disruptions; and (3) downstream analyses. For the latter, we conduct a demonstrative study on identifying parameters connected to disruptions using counterfactual-like analysis. Overall, the method can adequately identify distinct operating regimes characterized by varying proximity to disruptions in an interpretable manner.
- [416] arXiv:2504.17718 (cross-list from math.OC) [pdf, html, other]
-
Title: Recursive feasibility for stochastic MPC and the rationale behind fixing flat tires
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
In this paper, we address the problem of designing stochastic model predictive control (SMPC) schemes for linear systems affected by unbounded disturbances. The contribution of the paper is rooted in a measured-state initialization strategy. First, due to the nonzero probability of violating chance-constraints in the case of unbounded noise, we introduce ellipsoidal-based probabilistic reachable sets and we include constraint relaxations to recover recursive feasibility conditioned on the measured state. Second, we prove that the solution of this novel SMPC scheme guarantees closed-loop chance constraints satisfaction under minimum relaxation. Last, we demonstrate that, in expectation, the need to relax the constraints vanishes over time, so that the closed-loop trajectories are steered towards the unconstrained LQR invariant region. This novel SMPC scheme is proven to satisfy recursive feasibility conditioned on the state realization, and its superiority with respect to open-loop initialization schemes is shown through numerical examples.
- [417] arXiv:2504.17719 (cross-list from stat.ML) [pdf, html, other]
-
Title: Evaluating Uncertainty in Deep Gaussian Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at this https URL.
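The Expected Calibration Error used in the abstract above is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below follows that common definition with equal-width bins; the bin count and function name are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Four predictions made with 90% confidence, of which three are correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Here the model is overconfident by 0.9 - 0.75, so the estimate is about 0.15; a perfectly calibrated model would score 0.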
- [418] arXiv:2504.17724 (cross-list from eess.SP) [pdf, html, other]
-
Title: Unsupervised EEG-based decoding of absolute auditory attention with canonical correlation analysis
Subjects: Signal Processing (eess.SP); Sound (cs.SD)
We propose a fully unsupervised algorithm that detects from electroencephalography (EEG) recordings when a subject actively listens to sound, versus when the sound is ignored. This problem is known as absolute auditory attention decoding (aAAD). We propose an unsupervised discriminative CCA model for feature extraction and combine it with an unsupervised classifier called minimally informed linear discriminant analysis (MILDA) for aAAD classification. Remarkably, the proposed unsupervised algorithm performs significantly better than a state-of-the-art supervised model. A key reason is that the unsupervised algorithm can successfully adapt to the non-stationary test data at a low computational cost. This opens the door to the analysis of the auditory attention of a subject using EEG signals with a model that automatically tunes itself to the subject without requiring an arduous supervised training session beforehand.
- [419] arXiv:2504.17790 (cross-list from quant-ph) [pdf, html, other]
Title: Quantum Error Correction with Girth-16 Non-Binary LDPC Codes via Affine Permutation Construction
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
We propose a method for constructing quantum error-correcting codes based on non-binary low-density parity-check codes with girth 16. In conventional constructions using circulant permutation matrices, the girth is upper-bounded by 12, which limits the suppression of harmful short cycles. Our construction employs affine permutation matrices and a randomized sequential selection procedure designed to eliminate short cycles, which are known to limit decoding performance.
Joint belief propagation decoding is applied over depolarizing channels. Numerical experiments confirm that the proposed codes reduce the number of low-weight codewords in $C_X \setminus C_Z^\perp$ and $C_Z \setminus C_X^\perp$, and thus have the potential to suppress error floors. In addition, we obtain a significantly improved upper bound on the minimum distance, which we conjecture to be tight.
Cross submissions (showing 42 of 42 entries)
- [420] arXiv:2001.00078 (replaced) [pdf, other]
Title: Regulatory Markets for AI Safety
Subjects: Computers and Society (cs.CY); General Economics (econ.GN)
We propose a new model for regulation to achieve AI safety: global regulatory markets. We first sketch the model in general terms and provide an overview of the costs and benefits of this approach. We then demonstrate how the model might work in practice: responding to the risk of adversarial attacks on AI models employed in commercial drones.
- [421] arXiv:2212.11478 (replaced) [pdf, html, other]
Title: Runtime Performance of Evolutionary Algorithms for the Chance-constrained Makespan Scheduling Problem
Subjects: Neural and Evolutionary Computing (cs.NE)
The Makespan Scheduling problem is an extensively studied NP-hard problem, and its simplest version looks for an allocation approach for a set of jobs with deterministic processing times to two identical machines such that the makespan is minimized. However, in real life scenarios, the actual processing time of each job may be stochastic around the expected value with a variance, under the influence of external factors, and the actual processing times of these jobs may be correlated with covariances. Thus within this paper, we propose a chance-constrained version of the Makespan Scheduling problem and investigate the theoretical performance of the classical Randomized Local Search and (1+1) EA for it. More specifically, we first study two variants of the Chance-constrained Makespan Scheduling problem and their computational complexities, then separately analyze the expected runtime of the two algorithms to obtain an optimal solution or almost optimal solution to the instances of the two variants. In addition, we investigate the experimental performance of the two algorithms for the two variants.
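To make the setting concrete, here is a minimal sketch of the classical (1+1) EA on the simplest deterministic two-machine makespan problem mentioned at the start of the abstract above. It does not model the chance constraint, variances, or covariances studied in the paper; the function names, iteration budget, and seed handling are illustrative assumptions.

```python
import random

def makespan(assignment, times):
    """Makespan of a two-machine schedule; assignment[i] == 1 puts job i on machine 1."""
    load1 = sum(t for a, t in zip(assignment, times) if a == 1)
    return max(load1, sum(times) - load1)

def one_plus_one_ea(times, iterations=2000, seed=0):
    """(1+1) EA: flip each bit with probability 1/n, keep the child if it is not worse."""
    rng = random.Random(seed)
    n = len(times)
    current = [rng.randint(0, 1) for _ in range(n)]
    best = makespan(current, times)
    for _ in range(iterations):
        # Standard bit mutation: each job switches machine independently.
        child = [1 - b if rng.random() < 1.0 / n else b for b in current]
        value = makespan(child, times)
        if value <= best:  # accepting ties lets the search drift across plateaus
            current, best = child, value
    return current, best
```

Randomized Local Search differs only in the mutation step: it flips exactly one uniformly chosen bit per iteration instead of flipping each bit independently.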
- [422] arXiv:2303.04442 (replaced) [pdf, other]
Title: Aczel-Mendler Bisimulations in a Regular Category
Comments: Submission to the CALCO 2023 special LMCS issue
Subjects: Logic in Computer Science (cs.LO)
Aczel-Mendler bisimulations are a coalgebraic extension of a variety of computational relations between systems. It is usual to assume that the underlying category satisfies some form of the axiom of choice, so that the collection of bisimulations enjoys desirable properties, such as closure under composition. In this paper, we accommodate the definition in general regular categories and toposes. We show that this general definition: 1) is closed under composition without using the axiom of choice, 2) coincides with other types of coalgebraic formulations under milder conditions, 3) coincides with the usual definition when the category satisfies the regular axiom of choice. In particular, the case of toposes heavily relies on power-objects, for which we recover some favourable properties along the way. Finally, we describe several examples in Stone spaces, toposes for name-passing, and modules over a ring.
- [423] arXiv:2306.13962 (replaced) [pdf, html, other]
Title: QoS-based Beamforming and Compression Design for Cooperative Cellular Networks via Lagrangian Duality
Comments: 20 pages, 7 figures, accepted for publication in IEEE Transactions on Signal Processing
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Optimization and Control (math.OC)
This paper considers the quality-of-service (QoS)-based joint beamforming and compression design problem in the downlink cooperative cellular network, where multiple relay-like base stations (BSs), connected to the central processor via rate-limited fronthaul links, cooperatively transmit messages to the users. The problem of interest is formulated as the minimization of the total transmit power of the BSs, subject to all users' signal-to-interference-plus-noise ratio (SINR) constraints and all BSs' fronthaul rate constraints. In this paper, we first show that there is no duality gap between the considered joint optimization problem and its Lagrangian dual by showing the tightness of its semidefinite relaxation (SDR). Then, we propose an efficient algorithm based on the above duality result for solving the considered problem. The proposed algorithm judiciously exploits the special structure of the enhanced Karush-Kuhn-Tucker (KKT) conditions of the considered problem and finds the solution that satisfies the enhanced KKT conditions via two fixed point iterations. Two key features of the proposed algorithm are: (1) it is able to detect whether the considered problem is feasible or not and find its globally optimal solution when it is feasible; (2) it is highly efficient because both of the fixed point iterations in the proposed algorithm are linearly convergent and evaluating the functions in the fixed point iterations is computationally cheap. Numerical results show the global optimality and efficiency of the proposed algorithm.
- [424] arXiv:2308.04867 (replaced) [pdf, html, other]
Title: Learning Type-Generalized Actions for Symbolic Planning
Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Symbolic planning is a powerful technique to solve complex tasks that require long sequences of actions and can equip an intelligent agent with complex behavior. The downside of this approach is the necessity for suitable symbolic representations describing the state of the environment as well as the actions that can change it. Traditionally such representations are carefully hand-designed by experts for distinct problem domains, which limits their transferability to different problems and environment complexities. In this paper, we propose a novel concept to generalize symbolic actions using a given entity hierarchy and observed similar behavior. In a simulated grid-based kitchen environment, we show that type-generalized actions can be learned from few observations and generalize to novel situations. Incorporating an additional on-the-fly generalization mechanism during planning, unseen task combinations, involving longer sequences, novel entities and unexpected environment behavior, can be solved.
- [425] arXiv:2308.08903 (replaced) [pdf, html, other]
Title: The Incentive Guarantees Behind Nash Welfare in Divisible Resources Allocation
Comments: published in WINE'23 and Artificial Intelligence
Subjects: Computer Science and Game Theory (cs.GT)
We study the problem of allocating divisible resources among $n$ agents, hopefully in a fair and efficient manner. With the presence of strategic agents, additional incentive guarantees are also necessary, and the problem of designing fair and efficient mechanisms becomes much less tractable. While there are flourishing positive results against strategic agents for homogeneous divisible items, very few of them are known to hold in cake cutting.
We show that the Maximum Nash Welfare (MNW) mechanism, which provides desirable fairness and efficiency guarantees and achieves an incentive ratio of $2$ for homogeneous divisible items, also has an incentive ratio of $2$ in cake cutting. Remarkably, this result holds even without the free disposal assumption, which is hard to get rid of in the design of truthful cake cutting mechanisms.
Moreover, we show that, for cake cutting, the Partial Allocation (PA) mechanism proposed by Cole et al. (EC'13), which is truthful and $1/e$-MNW for homogeneous divisible items, has an incentive ratio in $[e^{1 / e}, e]$ and, when randomization is allowed, can be made truthful in expectation. Given two alternatives for a trade-off between incentive ratio and Nash welfare provided by the MNW and PA mechanisms, we establish an interpolation between them for both cake cutting and homogeneous divisible items.
Finally, we study the optimal incentive ratio achievable by envy-free cake cutting mechanisms. We first give an envy-free mechanism for two agents with an incentive ratio of $4 / 3$. Then, we show that any envy-free cake cutting mechanism with the connected pieces constraint has an incentive ratio of $\Theta(n)$.
- [426] arXiv:2308.12452 (replaced) [pdf, html, other]
Title: ARF-Plus: Controlling Perceptual Factors in Artistic Radiance Fields for 3D Scene Stylization
Comments: Accepted at WACV 2025. The published version is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
The radiance fields style transfer is an emerging field that has recently gained popularity as a means of 3D scene stylization, thanks to the outstanding performance of neural radiance fields in 3D reconstruction and view synthesis. We highlight a research gap in radiance fields style transfer, the lack of sufficient perceptual controllability, motivated by the existing concept in the 2D image style transfer. In this paper, we present ARF-Plus, a 3D neural style transfer framework offering manageable control over perceptual factors, to systematically explore the perceptual controllability in 3D scene stylization. Four distinct types of controls - color preservation control, (style pattern) scale control, spatial (selective stylization area) control, and depth enhancement control - are proposed and integrated into this framework. Results from real-world datasets, both quantitative and qualitative, show that the four types of controls in our ARF-Plus framework successfully accomplish their corresponding perceptual controls when stylizing 3D scenes. These techniques work well for individual style inputs as well as for the simultaneous application of multiple styles within a scene. This unlocks a realm of limitless possibilities, allowing customized modifications of stylization effects and flexible merging of the strengths of different styles, ultimately enabling the creation of novel and eye-catching stylistic effects on 3D scenes.
- [427] arXiv:2310.07263 (replaced) [pdf, html, other]
Title: CoPAL: Corrective Planning of Robot Actions with Large Language Models
Authors: Frank Joublin, Antonello Ceravola, Pavel Smirnov, Felix Ocker, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Stephan Hasler, Daniel Tanneberg, Michael Gienger
Comments: IEEE International Conference on Robotics and Automation (ICRA) 2024
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
In the pursuit of fully autonomous robotic systems capable of taking over tasks traditionally performed by humans, the complexity of open-world environments poses a considerable challenge. Addressing this imperative, this study contributes to the field of Large Language Models (LLMs) applied to task and motion planning for robots. We propose a system architecture that orchestrates a seamless interplay between multiple cognitive levels, encompassing reasoning, planning, and motion generation. At its core lies a novel replanning strategy that handles physically grounded, logical, and semantic errors in the generated plans. We demonstrate the efficacy of the proposed feedback architecture, particularly its impact on executability, correctness, and time complexity via empirical evaluation in the context of a simulation and two intricate real-world scenarios: blocks world, barman and pizza preparation.
- [428] arXiv:2311.00081 (replaced) [pdf, html, other]
Title: Convolution Quadrature for the quasilinear subdiffusion equation
Subjects: Numerical Analysis (math.NA)
We construct a Convolution Quadrature (CQ) scheme for the quasilinear subdiffusion equation of order $\alpha$ and supply it with the fast and oblivious implementation. In particular, we find a condition for the CQ to be admissible and discretize the spatial part of the equation with the Finite Element Method. We prove the unconditional stability and convergence of the scheme and find a bound on the error. Our estimates are globally optimal for all $0<\alpha<1$ and pointwise for $\alpha\geq 1/2$ in the sense that they reduce to the well-known results for the linear equation. For the semilinear case, our estimates are optimal both globally and locally. As a passing result, we also obtain a discrete Grönwall inequality for the CQ, which is a crucial ingredient in our convergence proof based on the energy method. The paper is concluded with numerical examples verifying convergence and computation time reduction when using fast and oblivious quadrature.
- [429] arXiv:2311.07283 (replaced) [pdf, html, other]
Title: Predictive and prescriptive analytics for multi-site modelling of frail and elderly patient services
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Many economies are challenged by the effects of an ageing population, particularly in sectors where resource capacity planning is critical, such as healthcare. This research addresses the operational challenges of bed and staffing capacity planning in hospital wards by using predictive and prescriptive analytical methods, both individually and in tandem. We applied these methodologies to a study of 165,000 patients across a network of 11 hospitals in the UK. Predictive modelling, specifically Classification and Regression Trees, forecasts patient length of stay (LOS) based on clinical and demographic data. On the prescriptive side, deterministic and two-stage stochastic optimisation models determine optimal bed and staff planning strategies to minimise costs. Linking the predictive models with the prescriptive optimisation models generates demand forecasts that inform the optimisation process, providing accurate and practical solutions. The results demonstrate that this integrated approach captures real-world variations in patient LOS and offers a 7% cost saving compared to average-based planning. This approach helps healthcare managers make robust decisions by incorporating patient-specific characteristics, improving capacity allocation, and mitigating risks associated with demand variability. Consequently, this combined methodology can be broadly extended across various sectors facing similar challenges, showcasing the versatility and effectiveness of integrating predictive and prescriptive analytics.
- [430] arXiv:2311.11762 (replaced) [pdf, html, other]
Title: MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations
Comments: Daniel Bogdoll and Yitian Yang contributed equally. Accepted for publication at IV 2025
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
World models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems. However, most works focus on camera data, with only a few that leverage lidar data or combine both to better represent autonomous vehicle sensor setups. In addition, raw sensor predictions are less actionable than 3D occupancy predictions, but there are no works examining the effects of combining both multimodal sensor data and 3D occupancy prediction. In this work, we perform a set of experiments with a MUltimodal World Model with Geometric VOxel representations (MUVO) to evaluate different sensor fusion strategies to better understand the effects on sensor data prediction. We also analyze potential weaknesses of current sensor fusion approaches and examine the benefits of additionally predicting 3D occupancy.
- [431] arXiv:2311.17684 (replaced) [pdf, html, other]
Title: Online posting effects: Unveiling the non-linear journeys of users in depression communities on Reddit
Authors: Virginia Morini, Salvatore Citraro, Elena Sajno, Maria Sansoni, Giuseppe Riva, Massimo Stella, Giulio Rossetti
Comments: updated final published version
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
Social media platforms have become pivotal as self-help forums, enabling individuals to share personal experiences and seek support. However, on topics as sensitive as depression, what are the consequences of online self-disclosure? Here, we delve into the dynamics of mental health discourse on various Reddit boards focused on depression. To this aim, we introduce a data-informed framework reconstructing online dynamics from 303k users interacting over two years. Through user-generated content, we identify 4 distinct clusters representing different psychological states. Our analysis unveils online posting effects: a user can transition to another psychological state after online exposure to peers' emotional/semantic content. As described by conditional Markov chains and different levels of social exposure, users' transitions reveal navigation through both positive and negative phases in a spiral rather than a linear progression. Interpreted in light of psychological literature, related particularly to the Patient Health Engagement (PHE) model, our findings can provide evidence that the type and layout of online social interactions have an impact on users' "journeys" when posting about depression.
- [432] arXiv:2312.00113 (replaced) [pdf, html, other]
Title: Event-based Continuous Color Video Decompression from Single Frames
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present ContinuityCam, a novel approach to generate a continuous video from a single static RGB image and an event camera stream. Conventional cameras struggle with high-speed motion capture due to bandwidth and dynamic range limitations. Event cameras are ideal sensors to solve this problem because they encode compressed change information at high temporal resolution. In this work, we tackle the problem of event-based continuous color video decompression, pairing single static color frames and event data to reconstruct temporally continuous videos. Our approach combines continuous long-range motion modeling with a neural synthesis model, enabling frame prediction at arbitrary times within the events. Our method only requires an initial image, thus increasing the robustness to sudden motions and light changes, minimizing the prediction latency, and decreasing bandwidth usage. We also introduce a novel single-lens beamsplitter setup that acquires aligned images and events, and a novel and challenging Event Extreme Decompression Dataset (E2D2) that tests the method in various lighting and motion profiles. We thoroughly evaluate our method by benchmarking color frame reconstruction, outperforming the baseline methods by 3.61 dB in PSNR and by a 33% decrease in LPIPS, as well as showing superior results on two downstream tasks.
- [433] arXiv:2312.10925 (replaced) [pdf, html, other]
Title: Delving Deeper Into Astromorphic Transformers
Subjects: Neural and Evolutionary Computing (cs.NE)
Preliminary attempts at incorporating the critical role of astrocytes - cells that constitute more than 50% of human brain cells - in brain-inspired neuromorphic computing remain in their infancy. This paper seeks to delve deeper into various key aspects of neuron-synapse-astrocyte interactions to mimic self-attention mechanisms in Transformers. The cross-layer perspective explored in this work involves bioplausible modeling of Hebbian and presynaptic plasticities in neuron-astrocyte networks, incorporating effects of non-linearities and feedback along with algorithmic formulations to map the neuron-astrocyte computations to the self-attention mechanism, and evaluating the impact of incorporating bio-realistic effects from the machine learning application side. Our analysis on sentiment and image classification tasks (IMDB and CIFAR10 datasets) highlights the advantages of Astromorphic Transformers, offering improved accuracy and learning speed. Furthermore, the model demonstrates strong natural language generation capabilities on the WikiText-2 dataset, achieving better perplexity compared to conventional models, thus showcasing enhanced generalization and stability across diverse machine learning tasks.
- [434] arXiv:2312.14425 (replaced) [pdf, html, other]
Title: Coriolis Factorizations and their Connections to Riemannian Geometry
Comments: working draft; comments welcome
Subjects: Systems and Control (eess.SY)
Many energy-based control strategies for mechanical systems require the choice of a Coriolis factorization satisfying a skew-symmetry property. This paper (a) explores if and when a control designer has flexibility in this choice, (b) develops a canonical choice related to the Christoffel symbols, and (c) describes how to efficiently perform control computations with it for constrained mechanical systems. We link the choice of a Coriolis factorization to the notion of an affine connection on the configuration manifold and show how properties of the connection relate with the associated factorization. In particular, the factorization based on the Christoffel symbols is linked with a torsion-free property that can limit the twisting of system trajectories during passivity-based control. We then develop a way to induce Coriolis factorizations for constrained mechanisms from unconstrained ones, which provides a pathway to use the theory for efficient control computations with high-dimensional systems such as humanoids and quadruped robots with open- and closed-chain mechanisms. A collection of algorithms is provided (and made available open source) to support the recursive computation of passivity-based control laws, adaptation laws, and regressor matrices in future applications.
- [435] arXiv:2401.15159 (replaced) [pdf, html, other]
Title: RABBIT: A Robot-Assisted Bed Bathing System with Multimodal Perception and Integrated Compliance
Authors: Rishabh Madan, Skyler Valdez, David Kim, Sujie Fang, Luoyan Zhong, Diego Virtue, Tapomayukh Bhattacharjee
Comments: 10 pages, 8 figures, 19th Annual ACM/IEEE International Conference on Human Robot Interaction (HRI)
Subjects: Robotics (cs.RO)
This paper introduces RABBIT, a novel robot-assisted bed bathing system designed to address the growing need for assistive technologies in personal hygiene tasks. It combines multimodal perception and dual (software and hardware) compliance to perform safe and comfortable physical human-robot interaction. Using RGB and thermal imaging to segment dry, soapy, and wet skin regions accurately, RABBIT can effectively execute washing, rinsing, and drying tasks in line with expert caregiving practices. Our system includes custom-designed motion primitives inspired by human caregiving techniques, and a novel compliant end-effector called Scrubby, optimized for gentle and effective interactions. We conducted a user study with 12 participants, including one participant with severe mobility limitations, demonstrating the system's effectiveness and perceived comfort. Supplementary material and videos can be found on our website this https URL.
- [436] arXiv:2402.04869 (replaced) [pdf, html, other]
Title: Learning by Doing: An Online Causal Reinforcement Learning Framework with Causal-Aware Policy
Authors: Ruichu Cai, Siyang Huang, Jie Qiao, Wei Chen, Yan Zeng, Keli Zhang, Fuchun Sun, Yang Yu, Zhifeng Hao
Comments: Accepted by Science China Information Sciences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
As a key component of intuitive cognition and reasoning solutions in human intelligence, causal knowledge provides great potential for reinforcement learning (RL) agents' interpretability towards decision-making by helping reduce the search space. However, there is still a considerable gap in discovering and incorporating causality into RL, which hinders the rapid development of causal RL. In this paper, we consider explicitly modeling the generation process of states with the causal graphical model, based on which we augment the policy. We formulate the causal structure updating into the RL interaction process with active intervention learning of the environment. To optimize the derived objective, we propose a framework with theoretical performance guarantees that alternates between two steps: using interventions for causal structure learning during exploration and using the learned causal structure for policy guidance during exploitation. Due to the lack of public benchmarks that allow direct intervention in the state space, we design the root cause localization task in our simulated fault alarm environment and then empirically show the effectiveness and robustness of the proposed method against state-of-the-art baselines. Theoretical analysis shows that our performance improvement is attributable to the virtuous cycle of causal-guided policy learning and causal structure learning, which aligns with our experimental results. Codes are available at this https URL.
- [437] arXiv:2402.14781 (replaced) [pdf, html, other]
Title: Effective Bayesian Causal Inference via Structural Marginalisation and Autoregressive Orders
Comments: 9 pages + references + appendices (37 pages total)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
The traditional two-stage approach to causal inference first identifies a single causal model (or equivalence class of models), which is then used to answer causal queries. However, this neglects any epistemic model uncertainty. In contrast, Bayesian causal inference does incorporate epistemic uncertainty into query estimates via Bayesian marginalisation (posterior averaging) over all causal models. While principled, this marginalisation over entire causal models, i.e., both causal structures (graphs) and mechanisms, poses a tremendous computational challenge. In this work, we address this challenge by decomposing structure marginalisation into the marginalisation over (i) causal orders and (ii) directed acyclic graphs (DAGs) given an order. We can marginalise the latter in closed form by limiting the number of parents per variable and utilising Gaussian processes to model mechanisms. To marginalise over orders, we use a sampling-based approximation, for which we devise a novel auto-regressive distribution over causal orders (ARCO). Our method outperforms state-of-the-art in structure learning on simulated non-linear additive noise benchmarks, and yields competitive results on real-world data. Furthermore, we can accurately infer interventional distributions and average causal effects.
- [438] arXiv:2402.16201 (replaced) [pdf, html, other]
Title: Honeybee: Byzantine Tolerant Decentralized Peer Sampling with Verifiable Random Walks
Comments: 32 pages; acmsmall-conf
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
Popular blockchains today have hundreds of thousands of nodes and need to be able to support sophisticated scaling solutions, such as sharding, data availability sampling, and layer-2 methods. Designing secure and efficient peer-to-peer (p2p) networking protocols at these scales to support the tight demands of the upper layer crypto-economic primitives is a highly non-trivial endeavor. We identify decentralized, uniform random sampling of nodes as a fundamental capability necessary for building robust p2p networks in emerging blockchain networks. Sampling algorithms used in practice today (primarily for address discovery) rely on either distributed hash tables (e.g., Kademlia) or sharing addresses with neighbors (e.g., GossipSub), and are not secure in a Sybil setting. We present Honeybee, a decentralized algorithm for sampling nodes that uses verifiable random walks and table consistency checks. Honeybee is secure against attacks even in the presence of an overwhelming number of Byzantine nodes (e.g., $\geq50\%$ of the network). We evaluate Honeybee through experiments and show that the quality of sampling achieved by Honeybee is significantly better compared to the state-of-the-art. Our proposed algorithm has implications for network design in both full nodes and light nodes.
- [439] arXiv:2403.06573 (replaced) [pdf, other]
Title: Enhancing Industrial Flexibility and Market Participation in Cement Manufacturing Through Optimized Production Scheduling
Authors: Sebastián Rojas-Innocenti, Enrique Baeyens, Alejandro Martín-Crespo, Sergio Saludes-Rodil, Fernando Frechoso-Escudero
Subjects: Systems and Control (eess.SY)
The growing share of variable renewable energy (VRE) sources in power systems is increasing the need for short-term operational flexibility, particularly from large industrial electricity consumers. This study proposes a practical, two-stage optimization framework to unlock this flexibility in cement manufacturing and support participation in electricity balancing markets. In Stage 1, a mixed-integer linear programming (MILP) model minimizes electricity procurement costs by optimally scheduling the raw milling subsystem. In Stage 2, a flexibility assessment model evaluates profitable deviations, targeting participation in Spain's manual Frequency Restoration Reserve (mFRR) market. A real-world case study in a Spanish cement plant (including PV and battery storage) shows that flexibility services can yield monthly revenues of up to 800 EUR and paybacks as short as six years. This framework offers a replicable pathway for industrial flexibility in energy-intensive sectors.
- [440] arXiv:2403.11743 (replaced) [pdf, html, other]
Title: PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks
Authors: Philip Matthias Winter, Maria Wimmer, David Major, Dimitrios Lenis, Astrid Berg, Theresa Neubauer, Gaia Romana De Paolis, Johannes Novotny, Sophia Ulonska, Katja Bühler
Comments: This is the author's accepted manuscript of a paper published in Lecture Notes in Computer Science (LNCS), volume 15297, Proceedings of DAGM GCPR 2024. 25 pages, 7 figures
Journal-ref: LNCS, volume 15297, 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This work addresses flexibility in deep learning by means of transductive reasoning. For adaptation to new data and tasks, e.g., in continual learning, existing methods typically involve tuning learnable parameters or complete re-training from scratch, rendering such approaches inflexible in practice. We argue that the notion of separating computation from memory by means of transduction can act as a stepping stone for solving these issues. We therefore propose PARMESAN (parameter-free memory search and transduction), a scalable method which leverages a memory module for solving dense prediction tasks. At inference, hidden representations in memory are searched to find corresponding patterns. In contrast to other methods that rely on continuous training of learnable parameters, PARMESAN learns via memory consolidation simply by modifying stored contents. Our method is compatible with commonly used architectures and canonically transfers to 1D, 2D, and 3D grid-based data. The capabilities of our approach are demonstrated on the complex task of continual learning. PARMESAN learns 3-4 orders of magnitude faster than established baselines while being on par in terms of predictive performance, hardware-efficiency, and knowledge retention.
- [441] arXiv:2403.12533 (replaced) [pdf, other]
-
Title: To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group Interactions
Daniel Tanneberg, Felix Ocker, Stephan Hasler, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Heiko Wersing, Bernhard Sendhoff, Michael Gienger
Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
How can a robot provide unobtrusive physical support within a group of humans? We present Attentive Support, a novel interaction concept for robots to support a group of humans. It combines scene perception, dialogue acquisition, situation understanding, and behavior generation with the common-sense reasoning capabilities of Large Language Models (LLMs). In addition to following user instructions, Attentive Support is capable of deciding when and how to support the humans, and when to remain silent to not disturb the group. With a diverse set of scenarios, we show and evaluate the robot's attentive behavior, which supports and helps the humans when required, while not disturbing if no help is needed.
- [442] arXiv:2404.00146 (replaced) [pdf, html, other]
-
Title: Fast OMP for Exact Recovery and Sparse Approximation
Comments: It has been published in ICPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Orthogonal Matching Pursuit (OMP) has been a powerful method in sparse signal recovery and approximation. However, OMP suffers from computational issues when the signal has a large number of non-zeros. This paper advances OMP on two fronts: it offers a fast algorithm for the orthogonal projection of the input signal at each iteration, and a new selection criterion for making the greedy choice, which reduces the number of iterations it takes to recover the signal. The proposed modifications to OMP directly reduce the computational complexity. Experimental results show significant improvement over the classical OMP in computation time. The paper also provides a sufficient condition for exact recovery under the new greedy choice criterion. For general signals that may not have sparse representations, the paper provides a bound for the approximation error. The approximation error is of the same order as that of OMP but is obtained in fewer iterations and less time.
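For orientation, the classical OMP baseline mentioned in this abstract can be sketched as below. This is a minimal pure-Python illustration only: it does not implement the paper's fast projection or its new selection criterion, and the function name `omp_support`, the assumption of unit-norm atoms, and the Gram-Schmidt residual update are choices made here for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def omp_support(atoms, signal, k, tol=1e-12):
    """Classical OMP sketch: greedily select up to k atoms (assumed unit-norm).

    Returns the selected atom indices (in selection order) and the final residual.
    """
    residual = list(signal)
    basis = []    # orthonormal basis for the span of the selected atoms
    support = []
    for _ in range(k):
        if math.sqrt(dot(residual, residual)) <= tol:
            break
        candidates = [j for j in range(len(atoms)) if j not in support]
        if not candidates:
            break
        # Greedy choice: the atom most correlated with the current residual.
        j = max(candidates, key=lambda i: abs(dot(atoms[i], residual)))
        # Gram-Schmidt: orthogonalise the chosen atom against the current basis.
        q = list(atoms[j])
        for b in basis:
            c = dot(b, q)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm <= tol:
            break  # chosen atom already lies in the span of the selected atoms
        q = [qi / norm for qi in q]
        support.append(j)
        basis.append(q)
        # Orthogonal projection step: remove the new direction from the residual.
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return support, residual
```

Each iteration costs one pass over the dictionary plus one orthogonalisation, which is why the per-iteration projection and the number of iterations are the two natural places to look for savings.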
- [443] arXiv:2404.00507 (replaced) [pdf, html, other]
-
Title: THEMIS: Time, Heterogeneity, and Energy Minded Scheduling for Fair Multi-Tenant Use in FPGAs
Comments: 12 Pages, 8 Figures, 3 Tables
Subjects: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC)
Using correct design metrics and understanding the limitations of the underlying technology is critical to developing effective scheduling algorithms. Unfortunately, existing scheduling techniques used incorrect metrics and made unrealistic assumptions for fair scheduling of multi-tenant FPGAs, where each tenant is meant to share approximately the same number of resources both spatially and temporally.
This paper introduces an enhanced fair scheduling algorithm for multi-tenant FPGA use, addressing previous metric and assumption issues, with three specific improvements. First, our method ensures spatiotemporal fairness by considering both spatial and temporal aspects, addressing the limitation of prior work that assumed uniform task latency. Second, we incorporate energy considerations into fairness by adjusting scheduling intervals and accounting for energy overhead, thereby balancing energy efficiency with fairness. Third, we acknowledge overlooked aspects of FPGA multi-tenancy, including heterogeneous regions and the constraints on dynamically merging/splitting partially reconfigurable regions. We develop and evaluate our improved fair scheduling algorithm with these three enhancements. Inspired by the Greek goddess of law and personification of justice, we name our fair scheduling solution THEMIS: Time, Heterogeneity, and Energy Minded Scheduling.
We used the Xilinx Zedboard XC7Z020 to quantify our approach's savings. Compared to previous algorithms, our improved scheduling algorithm enhances fairness between 24.2-98.4% and allows a trade-off between 55.3× in energy vs. 69.3× in fairness. The paper thus informs cloud providers about future scheduling optimizations for fairness with related challenges and opportunities.
- [444] arXiv:2404.16495 (replaced) [pdf, html, other]
-
Title: T-Explainer: A Model-Agnostic Explainability Framework Based on Gradients
Comments: Copyright 2025 IEEE. All rights reserved, including rights for text, data mining and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. Article accepted for publication in IEEE Intelligent Systems. This author's version includes the supplementary material. Content may change prior to final publication
Subjects: Machine Learning (cs.LG)
The development of machine learning applications has increased significantly in recent years, motivated by the remarkable ability of learning-powered systems to discover and generalize intricate patterns hidden in massive datasets. Modern learning models, while powerful, often exhibit a complexity level that renders them opaque black boxes, lacking transparency and hindering our understanding of their decision-making processes. Opacity challenges the practical application of machine learning, especially in critical domains requiring informed decisions. Explainable Artificial Intelligence (XAI) addresses that challenge, unraveling the complexity of black boxes by providing explanations. Feature attribution/importance XAI stands out for its ability to delineate the significance of input features in predictions. However, most attribution methods have limitations, such as instability, where similar or even identical instances yield divergent explanations. This work introduces T-Explainer, a novel additive attribution explainer based on the Taylor expansion that offers desirable properties such as local accuracy and consistency. We demonstrate T-Explainer's effectiveness and stability over multiple runs in quantitative benchmark experiments against well-known attribution methods. Additionally, we provide several tools to evaluate and visualize explanations, turning T-Explainer into a comprehensive XAI framework.
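To make the idea of a Taylor-based additive attribution concrete, here is a generic first-order sketch. It is not T-Explainer's actual procedure, which the abstract does not spell out; the function `taylor_attributions`, the baseline argument, the central-difference gradient, and the toy `linear_model` are all assumptions introduced for illustration.

```python
def taylor_attributions(f, x, baseline, h=1e-5):
    """First-order Taylor attribution: (df/dx_i at x) * (x_i - baseline_i).

    The partial derivatives are estimated with central finite differences,
    so f is treated as a black box (model-agnostic).
    """
    attributions = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad_i = (f(up) - f(down)) / (2 * h)
        attributions.append(grad_i * (x[i] - baseline[i]))
    return attributions

def linear_model(v):
    # Hypothetical toy model used only to exercise the sketch.
    return 2.0 * v[0] - 3.0 * v[1] + 1.0
```

For a linear model the attributions sum exactly to `f(x) - f(baseline)`, which is the additive "local accuracy" property in its simplest form; for nonlinear models a first-order expansion only approximates that difference.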
- [445] arXiv:2404.17230 (replaced) [pdf, html, other]
-
Title: ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
Comments: 13 pages in total
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into a user-specified area. The motivation for ObjectAdd is twofold: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate real-world use, ObjectAdd maintains accurate image consistency after adding objects, with technical innovations in: (1) embedding-level concatenation to ensure that text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure that objects land in the user-specified area; (3) prompted image inpainting in an attention refocusing & object expansion fashion to ensure that the rest of the image stays the same. With a text-prompted image, ObjectAdd allows users to specify a box and an object, and achieves: (1) adding the object inside the box area; (2) exact content outside the box area; (3) flawless fusion between the two areas.
- [446] arXiv:2404.17325 (replaced) [pdf, html, other]
-
Title: Towards Scalable Multi-Chip Wireless Networks with Near-Field Time Reversal
Ama Bandara, Fátima Rodríguez-Galán, Pau Talarn, Elana Pereira de Santana, Evgenii Vinogradov, Peter Haring Bolívar, Eduard Alarcón, Sergi Abadal
Subjects: Emerging Technologies (cs.ET); Signal Processing (eess.SP)
The concept of Wireless Network-on-Chip (WNoC) has emerged as a potential solution to address the escalating communication demands of modern computing systems due to its low-latency, versatility, and reconfigurability. However, for WNoC to fulfill its potential, it is essential to establish multiple high-speed wireless links across chips. Unfortunately, the compact and enclosed nature of computing packages introduces significant challenges in the form of Co-Channel Interference and Inter-Symbol Interference, which not only hinder the deployment of multiple spatial channels but also severely restrict the symbol rate of each individual channel. In this paper, we posit that Time Reversal (TR) could be effective in addressing both impairments in this static scenario thanks to its spatiotemporal focusing capabilities even in the near field. Through comprehensive full-wave simulations and bit error rate analysis in multiple scenarios and at multiple frequency bands, we provide evidence that TR can increase the symbol rate by an order of magnitude, enabling the deployment of multiple concurrent links and achieving aggregate speeds exceeding 100 Gb/s. Finally, we evaluate the impact of reducing the sampling rate of the TR filter on the achievable speeds, paving the way to practical TR-based wireless communications at the chip scale.
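The temporal focusing that Time Reversal relies on can be illustrated with a toy calculation: filtering with the time-reversed channel impulse response turns the end-to-end response into the channel's autocorrelation, which peaks at a single tap. The sketch below assumes a real-valued, static, single-link channel and uses hypothetical helper names; it is not the paper's full-wave simulation setup.

```python
def convolve(a, b):
    """Plain discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def time_reversal_response(channel):
    """End-to-end response when the transmitter pre-filters with the
    time-reversed impulse response (real-valued channel assumed)."""
    tr_filter = channel[::-1]
    return convolve(tr_filter, channel)
```

For a three-tap channel `[1.0, 0.5, 0.25]` the combined response is symmetric, and its centre tap equals the channel energy (the sum of squared taps). That concentration of energy in one tap is what reduces inter-symbol interference.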
- [447] arXiv:2404.18896 (replaced) [pdf, html, other]
-
Title: Overcoming Knowledge Barriers: Online Imitation Learning from Visual Observation with Pretrained World Models
Comments: Accepted at TMLR
Subjects: Machine Learning (cs.LG)
Pretraining and finetuning models has become increasingly popular in decision-making. But there are still serious impediments in Imitation Learning from Observation (ILfO) with pretrained models. This study identifies two primary obstacles: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB). The EKB emerges due to the pretrained models' limitations in handling novel observations, which leads to inaccurate action inference. Conversely, the DKB stems from the reliance on limited demonstration datasets, restricting the model's adaptability across diverse scenarios. We propose separate solutions to overcome each barrier and apply them to Action Inference by Maximising Evidence (AIME), a state-of-the-art algorithm. This new algorithm, AIME-NoB, integrates online interactions and a data-driven regulariser to mitigate the EKB. Additionally, it uses a surrogate reward function to broaden the policy's supported states, addressing the DKB. Our experiments on vision-based control tasks from the DeepMind Control Suite and MetaWorld benchmarks show that AIME-NoB significantly improves sample efficiency and converged performance, presenting a robust framework for overcoming the challenges in ILfO with pretrained models. Code available at this https URL.
- [448] arXiv:2405.02237 (replaced) [pdf, html, other]
-
Title: Analysis and improvement of a semi-Lagrangian exponential scheme for the shallow-water equations on the rotating sphere
Comments: 37 pages, 12 figures
Subjects: Numerical Analysis (math.NA)
In this work, we study and extend a class of semi-Lagrangian exponential methods, which combine exponential time integration techniques, suitable for integrating stiff linear terms, with a semi-Lagrangian treatment of nonlinear advection terms. Partial differential equations involving both processes arise for instance in atmospheric circulation models. Through a truncation error analysis, we show that previously formulated semi-Lagrangian exponential schemes are limited to first-order accuracy due to the approximation of the integration factor acting on the discretization of the linear term; we then formulate a new discretization leading to second-order accuracy. Also, a detailed stability study is conducted to compare several Eulerian and semi-Lagrangian exponential schemes, as well as a well-established semi-Lagrangian semi-implicit method, which is used in operational atmospheric models. Numerical simulations of the shallow-water equations on the rotating sphere are performed to assess the orders of convergence, stability properties, and computational cost of each method. The proposed second-order semi-Lagrangian exponential method was shown to be more stable and accurate than the previously formulated schemes of the same class at the expense of larger wall-clock times; however, the method is more stable and has a similar cost compared to the well-established semi-Lagrangian semi-implicit method; therefore, it is a competitive candidate for potential operational applications in atmospheric circulation modeling.
- [449] arXiv:2405.04605 (replaced) [pdf, other]
-
Title: AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Michael R. Harowicz, Kyle J. Lafata, Tina D. Tailor, Joseph Y. Lo
Comments: 2 tables, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Lung cancer remains the leading cause of cancer-related mortality worldwide, and early detection through low-dose computed tomography (LDCT) has shown significant promise in reducing death rates. With the growing integration of artificial intelligence (AI) into medical imaging, the development and evaluation of robust AI models require access to large, well-annotated datasets. In this study, we introduce the utility of the Duke Lung Cancer Screening (DLCS) Dataset, the largest open-access LDCT dataset with over 2,000 scans and 3,000 expert-verified nodules. We benchmark deep learning models for both 3D nodule detection and lung cancer classification across internal and external datasets including LUNA16, LUNA25, and NLST-3D+. For detection, we develop two MONAI-based RetinaNet models (DLCSDmD and LUNA16-mD), evaluated using the Competition Performance Metric (CPM). For classification, we compare five models: state-of-the-art pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), a randomly initialized ResNet50, and our proposed Strategic Warm-Start++ (SWS++) model. SWS++ uses curated candidate patches to pretrain a classification backbone within the same detection pipeline, enabling task-relevant feature learning. Our models demonstrated strong generalizability, with SWS++ achieving comparable or superior performance to existing foundational models across multiple datasets (AUC: 0.71 to 0.90). All code, models, and data are publicly released to promote reproducibility and collaboration. This work establishes a standardized benchmarking resource for lung cancer AI research, supporting future efforts in model development, validation, and clinical translation.
- [450] arXiv:2405.05235 (replaced) [pdf, html, other]
-
Title: RACH Traffic Prediction in Massive Machine Type Communications
Journal-ref: H. Mehri, H. Mehrpouyan and H. Chen, "RACH Traffic Prediction in Massive Machine Type Communications," in IEEE Transactions on Machine Learning in Communications and Networking, vol. 3, pp. 315-331, 2025
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Traffic pattern prediction has emerged as a promising approach for efficiently managing and mitigating the impacts of event-driven bursty traffic in massive machine-type communication (mMTC) networks. However, achieving accurate predictions of bursty traffic remains a non-trivial task due to the inherent randomness of events, and these challenges intensify within live network environments. Consequently, there is a compelling need to design a lightweight and agile framework capable of assimilating continuously collected data from the network and accurately forecasting bursty traffic in mMTC networks. This paper addresses these challenges by presenting a machine learning-based framework tailored for forecasting bursty traffic in multi-channel slotted ALOHA networks. The proposed machine learning network comprises long short-term memory (LSTM) and a DenseNet with feed-forward neural network (FFNN) layers, where the residual connections enhance the training ability of the machine learning network in capturing complicated patterns. Furthermore, we develop a new low-complexity online prediction algorithm that updates the states of the LSTM network by leveraging frequently collected data from the mMTC network. Simulation results and complexity analysis demonstrate the superiority of our proposed algorithm in terms of both accuracy and complexity, making it well-suited for time-critical live scenarios. We evaluate the performance of the proposed framework in a network with a single base station and thousands of devices organized into groups with distinct traffic-generating characteristics. Comprehensive evaluations and simulations indicate that our proposed machine learning approach achieves a remarkable $52\%$ higher accuracy in long-term predictions compared to traditional methods, without imposing additional processing load on the system.
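As background on why bursty arrivals matter in multi-channel slotted ALOHA, the standard textbook contention model can be written in a few lines. The sketch assumes each contending device picks one of the channels uniformly at random in a slot and succeeds only if no other device picked the same channel; the function name and this simplified model are assumptions for illustration and are not taken from the paper.

```python
def expected_successes(n_devices, n_channels):
    """Expected number of collision-free transmissions in one slot.

    A given device succeeds when none of the other n-1 devices chose its
    channel, which happens with probability (1 - 1/M)^(n-1).
    """
    if n_devices == 0:
        return 0.0
    return n_devices * (1.0 - 1.0 / n_channels) ** (n_devices - 1)
```

Under this model throughput peaks when the number of contenders is close to the number of channels and collapses under a burst, which is the kind of situation a traffic predictor would help a base station anticipate.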
- [451] arXiv:2405.12519 (replaced) [pdf, html, other]
-
Title: MAGE: Model-Level Graph Neural Networks Explanations via Motif-based Graph Generation
Comments: arXiv admin note: text overlap with arXiv:2405.08419 The Thirteenth International Conference on Learning Representations 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Graph Neural Networks (GNNs) have shown remarkable success in molecular tasks, yet their interpretability remains challenging. Traditional model-level explanation methods like XGNN and GNNInterpreter often fail to identify valid substructures like rings, leading to questionable interpretability. This limitation stems from XGNN's atom-by-atom approach and GNNInterpreter's reliance on average graph embeddings, which overlook the essential structural elements crucial for molecules. To address these gaps, we introduce an innovative Motif-bAsed GNN Explainer (MAGE) that uses motifs as fundamental units for generating explanations. Our approach begins with extracting potential motifs through a motif decomposition technique. Then, we utilize an attention-based learning method to identify class-specific motifs. Finally, we employ a motif-based graph generator for each class to create molecular graph explanations based on these class-specific motifs. This novel method not only incorporates critical substructures into the explanations but also guarantees their validity, yielding results that are human-understandable. Our proposed method's effectiveness is demonstrated through quantitative and qualitative assessments conducted on six real-world molecular datasets.
- [452] arXiv:2405.13901 (replaced) [pdf, html, other]
-
Title: Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Central to the Transformer architectures' effectiveness is the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space. However, training the attention weights of queries, keys, and values is non-trivial from a state of random initialization. In this paper, we propose two methods. (i) We first address the initialization problem of Vision Transformers by introducing a simple, yet highly innovative, initialization approach utilizing discrete cosine transform (DCT) coefficients. Our proposed DCT-based attention initialization marks a significant gain compared to traditional initialization strategies, offering a robust foundation for the attention mechanism. Our experiments reveal that the DCT-based initialization enhances the accuracy of Vision Transformers in classification tasks. (ii) We also recognize that since DCT effectively decorrelates image information in the frequency domain, this decorrelation is useful for compression because it allows the quantization step to discard many of the higher-frequency components. Based on this observation, we propose a novel DCT-based compression technique for the attention function of Vision Transformers. Since high-frequency DCT coefficients usually correspond to noise, we truncate the high-frequency DCT components of the input patches. Our DCT-based compression reduces the size of weight matrices for queries, keys, and values. While maintaining the same level of accuracy, our DCT-compressed Swin Transformers obtain a considerable decrease in the computational overhead.
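The truncation step described in (ii) can be illustrated on a one-dimensional signal. The sketch below implements an orthonormal DCT-II and its inverse in pure Python and zeroes all but the first `keep` coefficients; the 1D setting and the helper names `dct`, `idct` and `lowpass` are simplifying assumptions, and the paper's treatment of image patches and attention weights is not reproduced here.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factors.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of the orthonormal DCT-II (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]

def lowpass(x, keep):
    """Keep the `keep` lowest-frequency DCT coefficients and reconstruct."""
    coeffs = dct(x)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    return idct(truncated)
```

Storing only the first `keep` coefficients shrinks the representation from `n` values to `keep` values, which is the sense in which truncation reduces the size of whatever consumes those inputs.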
- [453] arXiv:2406.04724 (replaced) [pdf, html, other]
-
Title: On Minimizing Adversarial Counterfactual Error in Adversarial RL
Comments: Presented at ICLR 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Deep Reinforcement Learning (DRL) policies are highly susceptible to adversarial noise in observations, which poses significant risks in safety-critical scenarios. The challenge inherent to adversarial perturbations is that by altering the information observed by the agent, the state becomes only partially observable. Existing approaches address this by either enforcing consistent actions across nearby states or maximizing the worst-case value within adversarially perturbed observations. However, the former suffers from performance degradation when attacks succeed, while the latter tends to be overly conservative, leading to suboptimal performance in benign settings. We hypothesize that these limitations stem from their failing to account for partial observability directly. To this end, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), defined on the beliefs about the true state and balancing value optimization with robustness. To make ACoE scalable in model-free settings, we propose the theoretically-grounded surrogate objective Cumulative-ACoE (C-ACoE). Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges, offering a promising direction for improving robustness in DRL under adversarial conditions. Our code is available at this https URL.
- [454] arXiv:2406.05904 (replaced) [pdf, html, other]
-
Title: Aegis: Tethering a Blockchain with Primary-Chain Stake
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)
Blockchains implement decentralized monetary systems and applications. Recent advancements enable what we call tethering a blockchain to a primary blockchain, securing the tethered chain by nodes that post primary-chain tokens as collateral. The collateral ensures nodes behave as intended, until they withdraw it. Unlike a Proof of Stake blockchain which uses its own token as collateral, using primary-chain tokens shields the tethered chain from the volatility of its own token.
State-of-the-art tethered blockchains either rely on centralization, or make extreme assumptions: that all communication is synchronous, that operators remain correct even post-withdrawal, or that withdrawals can be indefinitely delayed by tethered-chain failures.
We prove that with partial synchrony, there is no solution to the problem. However, under the standard assumptions that communication with the primary chain is synchronous and communication among the tethered chain nodes is partially synchronous, there is a solution. We present a tethered-chain protocol called Aegis. Aegis uses references from its blocks to primary blocks to define committees, checkpoints on the primary chain to perpetuate decisions, and resets to establish new committees when previous ones become obsolete. It ensures safety at all times and rapid progress when latency among Aegis nodes is low.
- [455] arXiv:2406.06225 (replaced) [pdf, other]
-
Title: Siren -- Advancing Cybersecurity through Deception and Adaptive Analysis
Comments: 14 pages, 5 figures, 13th Computing Conference 2025 - London, United Kingdom
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Siren represents a pioneering research effort aimed at fortifying cybersecurity through strategic integration of deception, machine learning, and proactive threat analysis. Drawing inspiration from mythical sirens, this project employs sophisticated methods to lure potential threats into controlled environments. The system features a dynamic machine learning model for real-time analysis and classification, ensuring continuous adaptability to emerging cyber threats. The architectural framework includes a link monitoring proxy, a purpose-built machine learning model for dynamic link analysis, and a honeypot enriched with simulated user interactions to intensify threat engagement. Data protection within the honeypot is fortified with probabilistic encryption. Additionally, the incorporation of simulated user activity extends the system's capacity to capture and learn from potential attackers even after user disengagement. Overall, Siren introduces a paradigm shift in cybersecurity, transforming traditional defense mechanisms into proactive systems that actively engage and learn from potential adversaries. The research strives to enhance user protection while yielding valuable insights for ongoing refinement in response to the evolving landscape of cybersecurity threats.
- [456] arXiv:2406.07494 (replaced) [pdf, html, other]
-
Title: CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization
Comments: Published in the Journal of Artificial Intelligence Research (JAIR) (this https URL)
Journal-ref: Journal of Artificial Intelligence Research (JAIR), Vol. 82, 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although reviews have been conducted on this topic, there is a lack of comprehensive work detailing the challenges of dialogue summarization, unifying the differing understanding of the task, and aligning proposed techniques, datasets, and evaluation metrics with the challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialogue summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically overly rely on BART-based encoder-decoder models. We find that while some challenges, like language, have seen considerable progress, mainly due to training methods, others, such as comprehension, factuality, and salience, remain difficult and hold significant research opportunities. We investigate how these approaches are typically assessed, covering the datasets for the subdomains of dialogue (e.g., meeting, medical), the established automatic metrics and human evaluation approaches for assessing scores and annotator agreement. We observe that only a few datasets span all subdomains. The ROUGE metric is the most used, while human evaluation is frequently reported without sufficient detail on inter-annotator agreement and annotation guidelines. Additionally, we discuss the possible implications of the recently explored large language models and conclude that despite a potential shift in relevance and difficulty, our described challenge taxonomy remains relevant.
- [457] arXiv:2406.09656 (replaced) [pdf, html, other]
-
Title: RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often rely on Retinex theory. Nevertheless, most of them perform poorly on more complicated datasets like LOL-v2 while consuming too many computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even more time-consuming and tedious. In this paper, we propose a more accurate, concise, one-stage framework based on Retinex theory, RSEND. RSEND first divides the low-light image into the illumination map and reflectance map, then captures the important details in the illumination map and performs light enhancement. After this step, it refines the enhanced gray-scale image and performs element-wise multiplication with the reflectance map. By denoising the output of the previous step, it obtains the final result. In all the steps, RSEND utilizes a Squeeze-and-Excitation network to better capture the details. Comprehensive quantitative and qualitative experiments show that our Efficient Retinex model significantly outperforms other CNN-based models, achieving a PSNR improvement ranging from 0.44 dB to 4.2 dB on different datasets, and even outperforms transformer-based models on the LOL-v2-real dataset.
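The Retinex decomposition that this pipeline builds on (image = reflectance × illumination, element-wise) can be shown on a single RGB pixel. In the sketch below the illumination is estimated as the maximum channel value and brightened with a fixed gamma curve; both choices, and the names `decompose` and `enhance`, are stand-in assumptions for illustration, whereas RSEND learns these steps with a network.

```python
def decompose(pixel, eps=1e-6):
    """Split an RGB pixel (values in [0, 1]) into illumination and reflectance.

    Illumination is approximated by the brightest channel (an assumption);
    reflectance is the pixel divided element-wise by that illumination.
    """
    illumination = max(max(pixel), eps)
    reflectance = tuple(c / illumination for c in pixel)
    return illumination, reflectance

def enhance(pixel, gamma=0.5):
    """Brighten the illumination with a gamma curve, then recombine
    with the unchanged reflectance by element-wise multiplication."""
    illumination, reflectance = decompose(pixel)
    brightened = illumination ** gamma  # gamma < 1 lifts values in (0, 1)
    return tuple(r * brightened for r in reflectance)
```

Because only the illumination is modified, the ratios between colour channels are preserved, which is the practical appeal of enhancing in the Retinex domain.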
- [458] arXiv:2406.10479 (replaced) [pdf, html, other]
-
Title: Unlocking Large Language Model's Planning Capabilities with Maximum Diversity Fine-tuning
Comments: 8 pages of main paper, 2 pages of references
Subjects: Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated impressive task-solving capabilities through prompting techniques and system designs, including solving planning tasks (e.g., math proofs, basic travel planning) when sufficient data is available online and used during pre-training. However, for planning tasks with limited prior data (e.g., blocks world, advanced travel planning), the performance of LLMs, including proprietary models like GPT and Gemini, is poor. This paper investigates the impact of fine-tuning on the planning capabilities of LLMs, revealing that LLMs can achieve strong performance in planning through substantial (tens of thousands of specific examples) fine-tuning. Yet, this process incurs high economic, time, and computational costs for each planning problem variation. To address this, we propose Clustering-Based Maximum Diversity Sampling (CMDS), which selects diverse and representative data to enhance sample efficiency and the model's generalization capability. Extensive evaluations demonstrate that CMDS-l, a baseline method combining CMDS with language embeddings, outperforms random sampling. Furthermore, we introduce a novel algorithm, CMDS-g, which encodes planning task instances with their graph representations into the embedding space. Empirical results show that CMDS-g consistently outperforms baseline methods across various scales and multiple benchmark domains.
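One generic way to read "clustering-based maximum diversity sampling" is: embed the candidate examples, cluster the embeddings, and keep one representative per cluster. The sketch below does this with a small k-means in pure Python. The abstract does not give CMDS's actual procedure, so the deterministic first-k initialisation, the nearest-to-centroid selection rule, and the names `kmeans` and `select_diverse` are all assumptions made for illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Tiny k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(i)
        for j, members in enumerate(clusters):
            if members:
                dim = len(points[0])
                centroids[j] = [
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, clusters

def select_diverse(points, k):
    """Return the index of the point closest to each cluster centroid."""
    centroids, clusters = kmeans(points, k)
    picks = []
    for j, members in enumerate(clusters):
        if members:
            picks.append(min(members, key=lambda i: dist2(points[i], centroids[j])))
    return sorted(picks)
```

With two well-separated groups of embeddings and `k=2`, the selection returns one example from each group, so a small fine-tuning budget is spread across distinct regions of the embedding space.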
- [459] arXiv:2406.14088 (replaced) [pdf, html, other]
-
Title: ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
Comments: 11 pages (20 pages with references and the appendix), 17 figures. Accepted by MLSys 25
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for empowering large language model (LLM) applications. Compared with the supervised training process of LLMs, the RLHF training process is much more sophisticated, requiring a diverse range of computation workloads with intricate dependencies between multiple LLM instances. Therefore, simply adopting the fixed parallelization strategies from supervised training for LLMs can be insufficient for RLHF and result in low training efficiency. To overcome this limitation, we propose a novel technique named parameter ReaLlocation, which dynamically adapts the parallelization strategies for different workloads during training by redistributing LLM parameters across the training cluster. Building upon this idea, we introduce ReaL, a pioneering system for efficient RLHF training. ReaL introduces the concept of an execution plan, which defines a fine-grained resource allocation and parallelization strategy particularly designed for RLHF training. Based on this concept, ReaL employs a tailored search algorithm with a lightweight run-time estimator to automatically discover an efficient execution plan for a given RLHF experiment. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaL on the LLaMA models with up to 70 billion parameters and 128 GPUs. The experimental results demonstrate that ReaL achieves speedups of up to $3.58\times$ compared to baseline methods. Furthermore, the execution plans generated by ReaL exhibit an average of $81\%$ performance improvement over heuristic approaches based on Megatron-LM in the long-context scenario. The source code of ReaL is publicly available at this https URL.
- [460] arXiv:2406.15231 (replaced) [pdf, html, other]
-
Title: Synthetic Lyrics Detection Across Languages and Genres
Comments: Published in the TrustNLP Workshop at NAACL 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In recent years, the use of large language models (LLMs) to generate music content, particularly lyrics, has gained in popularity. These advances provide valuable tools for artists and enhance their creative processes, but they also raise concerns about copyright violations, consumer satisfaction, and content spamming. Previous research has explored content detection in various domains. However, no work has focused on the text modality, lyrics, in music. To address this gap, we curated a diverse dataset of real and synthetic lyrics from multiple languages, music genres, and artists. The generation pipeline was validated using both humans and automated methods. We performed a thorough evaluation of existing synthetic text detection approaches on lyrics, a previously unexplored data type. We also investigated methods to adapt the best-performing features to lyrics through unsupervised domain adaptation. Following both music and industrial constraints, we examined how well these approaches generalize across languages, scale with data availability, handle multilingual language content, and perform on novel genres in few-shot settings. Our findings show promising results that could inform policy decisions around AI-generated music and enhance transparency for users.
- [461] arXiv:2406.16627 (replaced) [pdf, html, other]
-
Title: A Random Integration Algorithm for High-dimensional Function Spaces
Subjects: Numerical Analysis (math.NA)
We introduce a novel random integration algorithm that boasts both high convergence order and polynomial tractability for functions characterized by sparse frequencies or rapidly decaying Fourier coefficients. Specifically, for integration in the periodic isotropic Sobolev space and the isotropic Sobolev space with compact support, our approach attains a nearly optimal root mean square error (RMSE) bound. In contrast to previous nearly optimal algorithms, our method exhibits polynomial tractability, ensuring that the number of samples does not scale exponentially with increasing dimensions. Our integration algorithm also enjoys a nearly optimal bound for the weighted Korobov space. Furthermore, the algorithm can be applied without the need for prior knowledge of weights, distinguishing it from the component-by-component algorithm. For integration in the Wiener algebra, the sample complexity of our algorithm is independent of the decay rate of Fourier coefficients. The effectiveness of the integration is confirmed through numerical experiments.
- [462] arXiv:2406.17276 (replaced) [pdf, html, other]
-
Title: OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Comments: Published in TACL
Subjects: Computation and Language (cs.CL)
Autoregressive language models demonstrate excellent performance in various scenarios. However, their inference efficiency is limited by the one-step-one-word generation mode, which has become a pressing problem recently as models grow increasingly large. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It searches for the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at this https URL.
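The idea of choosing a draft tree that maximizes expected acceptance length can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes the draft model's path probability (the product of token probabilities from the root) is a usable proxy for the probability that a node is accepted, so the expected acceptance length of a tree is the sum of its nodes' path probabilities. The names `build_draft_tree`, `draft_probs`, and `toy_draft` are hypothetical.

```python
import heapq


def build_draft_tree(draft_probs, budget):
    """Greedily select up to `budget` nodes with the largest path probabilities.

    draft_probs(prefix) is a hypothetical stand-in for a draft model: it maps a
    tuple of tokens to a dict {next_token: probability}.
    Returns (nodes, expected_length), where nodes is a list of
    (prefix, path_probability) in selection order.
    """
    heap = []  # max-heap emulated with negated path probabilities
    for token, prob in draft_probs(()).items():
        heapq.heappush(heap, (-prob, (token,)))

    chosen = []
    while heap and len(chosen) < budget:
        neg_prob, prefix = heapq.heappop(heap)
        chosen.append((prefix, -neg_prob))
        # Children are only reachable after their parent is selected, and a
        # child's path probability never exceeds its parent's, so the selected
        # nodes always form a connected tree.
        for token, prob in draft_probs(prefix).items():
            heapq.heappush(heap, (neg_prob * prob, prefix + (token,)))

    # Under the proxy assumption, expected acceptance length is the sum of
    # path probabilities over the selected nodes.
    expected_length = sum(prob for _, prob in chosen)
    return chosen, expected_length


def toy_draft(prefix):
    """Context-independent toy distribution, for illustration only."""
    return {"a": 0.6, "b": 0.3, "c": 0.1}


nodes, expected_length = build_draft_tree(toy_draft, budget=4)
```

With the toy distribution, the greedy selection prefers a deep chain of the most likely token plus one sibling, which is the kind of input-dependent shape a fixed heuristic tree cannot provide.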
- [463] arXiv:2407.06013 (replaced) [pdf, html, other]
-
Title: Revisit the Arimoto-Blahut algorithm: New Analysis with Approximation
Subjects: Information Theory (cs.IT)
By the seminal paper of Claude Shannon \cite{Shannon48}, the computation of the capacity of a discrete memoryless channel has been considered as one of the most important and fundamental problems in Information Theory. Nearly 50 years ago, Arimoto and Blahut independently proposed identical algorithms to solve this problem in their seminal papers \cite{Arimoto1972AnAF, Blahut1972ComputationOC}. The Arimoto-Blahut algorithm was proven to converge to the capacity of the channel as $t \to \infty$ with the convergence rate upper bounded by $O\left(\log(m)/t\right)$, where $m$ is the size of the input distribution, and being inverse exponential when there is a unique solution in the interior of the input probability simplex \cite{Arimoto1972AnAF}. Recently it was proved, in \cite{Nakagawa2020AnalysisOT}, that the convergence rate is at worst inverse linear $O(1/t)$ in some specific cases.
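For reference, the classical Arimoto-Blahut iteration discussed above can be sketched as follows. This is a minimal textbook-style version, not the analysis or code of the paper; the function name `blahut_arimoto`, the fixed iteration count, and the use of natural logarithms (capacity in nats) are choices made here for illustration.

```python
import math


def blahut_arimoto(W, iterations=200):
    """Approximate the capacity (in nats) of a discrete memoryless channel.

    W[x][y] is the transition probability P(y | x); each row sums to 1.
    Returns (capacity_estimate, input_distribution).
    """
    m, n = len(W), len(W[0])
    p = [1.0 / m] * m  # start from the uniform input distribution

    def divergences(p):
        # q(y) = sum_x p(x) W(y|x) is the induced output distribution.
        q = [sum(p[x] * W[x][y] for x in range(m)) for y in range(n)]
        # d[x] = KL divergence between W(.|x) and q.
        return [
            sum(W[x][y] * math.log(W[x][y] / q[y]) for y in range(n) if W[x][y] > 0)
            for x in range(m)
        ]

    for _ in range(iterations):
        d = divergences(p)
        weights = [p[x] * math.exp(d[x]) for x in range(m)]  # multiplicative update
        total = sum(weights)
        p = [w / total for w in weights]

    d = divergences(p)
    capacity = sum(p[x] * d[x] for x in range(m))  # mutual information at p
    return capacity, p


# Binary symmetric channel with crossover probability 0.1.
capacity, p_opt = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
```

Each iteration costs $O(mn)$ operations, which is the per-iteration factor that appears in the total complexity bound quoted in the abstract.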
In this paper, we revisit this fundamental algorithm, looking at the rate of convergence to the capacity and the time complexity, given $m,n$, where $n$ is the size of the output of the channel, focusing on the approximation of the capacity. We prove that the rate of convergence to an $\varepsilon$-optimal solution, for any sufficiently small constant $\varepsilon > 0$, is inverse exponential, $O\left(\log(m)/c^t\right)$ for a constant $c > 1$, requiring at most $O\left(\log \left(\log (m)/\varepsilon\right)\right)$ iterations, which implies a total complexity of $O\left(m n\log \left(\log (m)/\varepsilon\right)\right)$ for the algorithm.
- [464] arXiv:2407.14306 (replaced) [pdf, html, other]
-
Title: Label-Free Model Failure Detection for Lidar-based Point Cloud Segmentation
Comments: Daniel Bogdoll, Finn Sartoris, and Vincent Geppert contributed equally. Accepted for publication at IV 2025
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Autonomous vehicles drive millions of miles on the road each year. Under such circumstances, deployed machine learning models are prone to failure both in seemingly normal situations and in the presence of outliers. However, in the training phase, they are only evaluated on small validation and test sets, which are unable to reveal model failures due to their limited scenario coverage. While it is difficult and expensive to acquire large and representative labeled datasets for evaluation, large-scale unlabeled datasets are typically available. In this work, we introduce label-free model failure detection for lidar-based point cloud segmentation, taking advantage of the abundance of unlabeled data available. We leverage different data characteristics by training a supervised and self-supervised stream for the same task to detect failure modes. We perform a large-scale qualitative analysis and present LidarCODA, the first publicly available dataset with labeled anomalies in real-world lidar data, for an extensive quantitative analysis.
- [465] arXiv:2407.21266 (replaced) [pdf, html, other]
-
Title: DDU-Net: A Domain Decomposition-Based CNN for High-Resolution Image Segmentation on Multiple GPUs
Journal-ref: IEEE Access 13 (2015) 66967-66983
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
The segmentation of ultra-high resolution images poses challenges such as loss of spatial information or computational inefficiency. In this work, a novel approach that combines encoder-decoder architectures with domain decomposition strategies to address these challenges is proposed. Specifically, a domain decomposition-based U-Net (DDU-Net) architecture is introduced, which partitions input images into non-overlapping patches that can be processed independently on separate devices. A communication network is added to facilitate inter-patch information exchange to enhance the understanding of spatial context. Experimental validation is performed on a synthetic dataset that is designed to measure the effectiveness of the communication network. Then, the performance is tested on the DeepGlobe land cover classification dataset as a real-world benchmark data set. The results demonstrate that the approach, which includes inter-patch communication for images divided into $16\times16$ non-overlapping subimages, achieves a $2-3\,\%$ higher intersection over union (IoU) score compared to the same network without inter-patch communication. The performance of the network which includes communication is equivalent to that of a baseline U-Net trained on the full image, showing that our model provides an effective solution for segmenting ultra-high-resolution images while preserving spatial context. The code is available at this https URL.
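The two basic ingredients named above, splitting an image into a grid of non-overlapping subimages and scoring a segmentation by intersection over union, can be sketched in a few lines. This is only an illustration on nested lists, not the DDU-Net code; `split_into_patches` and `iou` are hypothetical helper names, and the sketch assumes the image dimensions divide evenly by the grid size.

```python
def split_into_patches(image, rows, cols):
    """Split a 2D list into a rows x cols grid of non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // rows, width // cols
    return [
        [
            [line[c * pw:(c + 1) * pw] for line in image[r * ph:(r + 1) * ph]]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def iou(pred, target, cls):
    """Intersection over union for one class between two label maps."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p == cls and t == cls
            union += p == cls or t == cls
    return intersection / union if union else float("nan")


# A 4x4 image split into a 2x2 grid; each patch could go to its own device.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2, 2)
```

Because the patches do not overlap, each one can be processed independently; the communication network described in the abstract is what restores context across patch borders.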
- [466] arXiv:2408.00693 (replaced) [pdf, html, other]
-
Title: Superlinear Convergence of GMRES for clustered eigenvalues and its application to least squares problems
Comments: 15 pages, 9 figures
Subjects: Numerical Analysis (math.NA)
The objective of this paper is to understand the superlinear convergence behavior of the GMRES method when the coefficient matrix has clustered eigenvalues. In order to understand the phenomenon, we analyze the convergence using the Vandermonde matrix which is defined using the eigenvalues of the coefficient matrix. Although eigenvalues alone cannot explain the convergence, they may provide an upper bound of the residual, together with the right hand side vector and the eigenvectors of the coefficient matrix. We show that when the coefficient matrix is diagonalizable, if the eigenvalues of the coefficient matrix are clustered, the upper bound of the convergence curve shows superlinear convergence, when the norm of the matrix obtained by decomposing the right hand side vector into the eigenvector components is not so large. We apply the analysis to explain the convergence of inner-iteration preconditioned GMRES for least squares problems.
- [467] arXiv:2408.01246 (replaced) [pdf, html, other]
-
Title: MapComp: A Secure View-based Collaborative Analytics Framework for Join-Group-Aggregation
Authors: Xinyu Peng, Feng Han, Li Peng, Weiran Liu, Zheng Yan, Kai Kang, Xinyuan Zhang, Guoxing Wei, Jianling Sun, Jinfei Liu, Lin Qu
Subjects: Cryptography and Security (cs.CR)
This paper introduces MapComp, a novel view-based framework to facilitate join-group-aggregation (JGA) queries for secure collaborative analytics. Through specially crafted materialized views for join and a novel design of group-aggregation (GA) protocols, MapComp removes duplicated join workload and expedites subsequent GA, improving the efficiency of JGA query execution. To support continuous data updates, our materialized view offers a payload-independence feature and significantly improves the efficiency of view refreshing, free of MPC overhead. This feature also allows further acceleration for GA, where we devise multiple novel protocols that outperform prior works. Our rigorous experiments demonstrate a significant advantage of MapComp, achieving up to a 308.9x efficiency improvement compared to the baseline in the real-world query simulation.
- [468] arXiv:2408.02657 (replaced) [pdf, html, other]
-
Title: Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, Peng Gao
Comments: Code available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only Autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on the strong image generation capabilities, we further explore Ominiponent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multi-modal generalist. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like text-to-image/multiview generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multi-turn visual question answering, showing the rosy potential of the technical direction. Codes and checkpoints are available at this https URL.
- [469] arXiv:2408.03404 (replaced) [pdf, other]
-
Title: Set2Seq Transformer: Temporal and Positional-Aware Set Representations for Sequential Multiple-Instance Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Sequential multiple-instance learning involves learning representations of sets distributed across discrete timesteps. In many real-world applications, modeling both the internal structure of sets and their temporal relationships across time is essential for capturing complex underlying patterns. However, existing methods either focus on learning set representations at a static level, ignoring temporal dynamics, or treat sequences as ordered lists of individual elements, lacking explicit mechanisms to represent sets. In this work, we propose Set2Seq Transformer, a novel architecture that jointly models permutation-invariant set structure and temporal dependencies by learning temporal and positional-aware representations of sets within a sequence in an end-to-end multimodal manner. We evaluate our Set2Seq Transformer on two tasks that require modeling both set structure alongside temporal and positional patterns, but differ significantly in domain, modality, and objective. First, we consider a fine-art analysis task, modeling artists' oeuvres for predicting artistic success using a novel dataset, WikiArt-Seq2Rank. Second, we utilize our Set2Seq Transformer for a short-term wildfire danger forecasting task. Through extensive experimentation, we show that our Set2Seq Transformer significantly improves over traditional static multiple-instance learning methods by effectively learning permutation-invariant set, temporal, and positional-aware representations across diverse domains, modalities, and tasks. We will release both the dataset and model implementations on GitHub.
- [470] arXiv:2408.03624 (replaced) [pdf, html, other]
-
Title: AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging
Comments: Accepted by IEEE Transactions on Mobile Computing (TMC)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Ramp merging is one of the bottlenecks in traffic systems, which commonly causes traffic congestion, accidents, and severe carbon emissions. In order to address this essential issue and enhance the safety and efficiency of connected and autonomous vehicles (CAVs) at multi-lane merging zones, we propose a novel collaborative decision-making framework, named AgentsCoMerge, to leverage large language models (LLMs). Specifically, we first design a scene observation and understanding module to allow an agent to capture the traffic environment. Then we propose a hierarchical planning module to enable the agent to make decisions and plan trajectories based on the observation and the agent's own state. In addition, in order to facilitate collaboration among multiple agents, we introduce a communication module to enable the surrounding agents to exchange necessary information and coordinate their actions. Finally, we develop a reinforcement reflection guided training paradigm to further enhance the decision-making capability of the framework. Extensive experiments are conducted to evaluate the performance of our proposed method, demonstrating its superior efficiency and effectiveness for multi-agent collaborative decision-making under various ramp merging scenarios.
- [471] arXiv:2408.04728 (replaced) [pdf, other]
-
Title: HotStuff-1: Linear Consensus with One-Phase Speculation
Comments: 38 pages, 10 figures
Subjects: Databases (cs.DB)
This paper introduces HotStuff-1, a BFT consensus protocol that improves the latency of HotStuff-2 by two network hops while maintaining linear communication complexity against faults. Furthermore, HotStuff-1 incorporates an incentive-compatible leader rotation design that motivates leaders to propose transactions promptly. HotStuff-1 achieves a reduction of two network hops by speculatively sending clients early confirmations after one phase of the protocol. Introducing speculation into streamlined protocols is challenging because, unlike stable-leader protocols, these protocols cannot stop the consensus and recover from failures. Thus, we identify the prefix speculation dilemma in the context of streamlined protocols; HotStuff-1 is the first streamlined protocol to resolve it. HotStuff-1 embodies an additional mechanism, slotting, that thwarts delays caused by (1) rationally-incentivized leaders and (2) malicious leaders inclined to sabotage others' progress. The slotting mechanism allows leaders to dynamically drive as many decisions as allowed by network transmission delays before view timers expire, thus mitigating both threats.
- [472] arXiv:2408.14837 (replaced) [pdf, html, other]
-
Title: Diffusion Models Are Real-Time Game Engines
Comments: ICLR 2025. Project page: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.
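The PSNR figure quoted above follows the standard definition $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. A minimal sketch of that computation is given below, assuming frames are equally sized 2D lists of 8-bit intensities; the abstract does not state the value range, and `psnr` is a hypothetical helper name rather than anything from the GameNGen code.

```python
import math


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


predicted = [[10, 20], [30, 40]]
reference = [[12, 18], [33, 40]]
score = psnr(predicted, reference)
```

Under the 8-bit assumption, a PSNR of 29.4 dB corresponds to a mean squared error of roughly 75 per pixel.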
- [473] arXiv:2408.16965 (replaced) [pdf, html, other]
-
Title: Contrastive Learning with Synthetic Positives
Comments: 8 pages, conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained as the nearest neighbor algorithm primarily identifies "easy" positive pairs, where the representations are already closely located in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images with distinct backgrounds yet similar semantic content to the anchor image. These images are considered "hard" positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to a performance improvement of over 2% and 1% in linear evaluation compared to the previous NNCLR and All4One methods across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art results. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.
- [474] arXiv:2408.17348 (replaced) [pdf, html, other]
-
Title: Robust Model Predictive Control Exploiting Monotonicity Properties
Comments: Accepted as a technical note in "IEEE Transactions on Automatic Control", Early access DOI: https://doi.org/10.1109/TAC.2025.3558137, Code: this https URL
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Robust model predictive control algorithms are essential for addressing unavoidable errors due to the uncertainty in predicting real-world systems. However, the formulation of such algorithms typically results in a trade-off between conservatism and computational complexity. Monotone systems facilitate the efficient computation of reachable sets and thus the straightforward formulation of a robust model predictive control approach optimizing over open-loop predictions. We present an approach based on the division of reachable sets to incorporate feedback in the predictions, resulting in less conservative strategies. The concept of mixed-monotonicity enables an extension of our methodology to non-monotone systems. The potential of the proposed approaches is demonstrated through a nonlinear high-dimensional chemical tank reactor cascade case study.
- [475] arXiv:2409.02313 (replaced) [pdf, other]
-
Title: On the Benefits of Memory for Modeling Time-Dependent PDEs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving PDEs. For time-dependent PDEs, many approaches are Markovian -- the evolution of the trained system only depends on the current state, and not the past states. In this work, we investigate the benefits of using memory for modeling time-dependent PDEs: that is, when past states are explicitly used to predict the future. Motivated by the Mori-Zwanzig theory of model reduction, we theoretically exhibit examples of simple (even linear) PDEs, in which a solution that uses memory is arbitrarily better than a Markovian solution. Additionally, we introduce Memory Neural Operator (MemNO), a neural operator architecture that combines recent state space models (specifically, S4) and Fourier Neural Operators (FNOs) to effectively model memory. We empirically demonstrate that when the PDEs are supplied in low resolution or contain observation noise at train and test time, MemNO significantly outperforms the baselines without memory -- with up to 6x reduction in test error. Furthermore, we show that this benefit is particularly pronounced when the PDE solutions have significant high-frequency Fourier modes (e.g., low-viscosity fluid dynamics) and we construct a challenging benchmark dataset consisting of such PDEs.
- [476] arXiv:2409.06601 (replaced) [pdf, html, other]
-
Title: LaMsS: When Large Language Models Meet Self-Skepticism
Comments: 11 pages, 6 figures, Published at ICLR 2025 Workshop on Scaling Self-Improving Foundation Models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Hallucination is a major challenge for large language models (LLMs), preventing their further application in some fields. The skeptical thinking of humankind could help LLMs with self-cognition and self-reflection and alleviate their hallucinations. Inspired by this consideration, we propose a novel approach called LaMsS, which combines the semantic understanding capability of LLMs with self-skepticism. By introducing a series of skepticism tokens and adding them to the vocabulary, we conduct both pretraining and finetuning, which allow the LLM to decode each normal token followed by a skeptical token, representing different skepticism levels. By calculating the response skepticism given a query, one can define a new self-aware LLM which is only willing to answer when its skepticism level is lower than the threshold. By examining the accuracy, AUC and AP of willingly answering questions, we demonstrate that LaMsS achieves better performance than baselines on both multi-choice questions and open-domain question-answering benchmarks, and can generalize to multi-task and out-of-domain settings. Our study sheds some light on self-skepticism modeling in further artificial intelligence research. Project code and model checkpoints can be found at this https URL.
- [477] arXiv:2409.08141 (replaced) [pdf, html, other]
-
Title: Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Subjects: Hardware Architecture (cs.AR); Operating Systems (cs.OS)
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts for asynchrony between cores and device.
In this paper we question this wisdom in the light of two trends: modern and emerging cache-coherent interconnects like CXL3.0, and workloads, particularly microservices and serverless computing. Like some others before us, we argue that the assumptions of the DMA-based model are obsolete, and in many use-cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, delivers a more efficient system.
However, we push this idea much further. We show, in a real hardware implementation, the gains in latency for fine-grained communication achievable using an open cache-coherence protocol which exposes cache transitions to a smart device, and that throughput is competitive with DMA over modern interconnects. We also demonstrate three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions, comparing our use of coherence with both traditional DMA-style interaction and a highly-optimized implementation using memory-mapped programmed I/O over PCIe.
- [478] arXiv:2409.09451 (replaced) [pdf, html, other]
-
Title: On the Generalizability of Foundation Models for Crop Type Mapping
Authors: Yi-Chia Chang, Adam J. Stewart, Favyen Bastani, Piper Wolters, Shreya Kannan, George R. Huber, Jingtong Wang, Arindam Banerjee
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Foundation models pre-trained using self-supervised learning have shown powerful transfer learning capabilities on various downstream tasks, including language understanding, text generation, and image recognition. The Earth observation (EO) field has produced several foundation models pre-trained directly on multispectral satellite imagery for applications like precision agriculture, wildfire and drought monitoring, and natural disaster response. However, few studies have investigated the ability of these models to generalize to new geographic locations, and potential concerns of geospatial bias -- models trained on data-rich developed nations not transferring well to data-scarce developing nations -- remain. We investigate the ability of popular EO foundation models to transfer to new geographic regions in the agricultural domain, where differences in farming practices and class imbalance make transfer learning particularly challenging. We first select five crop classification datasets across five continents, normalizing for dataset size and harmonizing classes to focus on four major cereal grains: maize, soybean, rice, and wheat. We then compare three popular foundation models, pre-trained on SSL4EO-S12, SatlasPretrain, and ImageNet, using in-distribution (ID) and out-of-distribution (OOD) evaluation. Experiments show that pre-trained weights designed explicitly for Sentinel-2, such as SSL4EO-S12, outperform general pre-trained weights like ImageNet. Furthermore, while only 100 labeled images are sufficient for achieving high overall accuracy, 900 images are required to achieve high average accuracy due to class imbalance. All harmonized datasets and experimental code are open-source and available for download.
- [479] arXiv:2409.10136 (replaced) [pdf, html, other]
-
Title: Count2Multiply: Reliable In-Memory High-Radix Counting
Authors: João Paulo Cardoso de Lima, Benjamin Franklin Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones
Comments: 13 pages
Subjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volatile memory technologies. However, current CIM solutions face latency and reliability challenges. CIM fidelity lags considerably behind access fidelity. Furthermore, bulk-bitwise CIM, although highly parallelized, requires long latency for operations like multiplication and addition, due to their bit-serial computation. This paper presents Count2Multiply, a technology-agnostic digital CIM approach to perform multiplication, addition and other operations using high-radix, massively parallel counting enabled by CIM bulk-bitwise logic operations. Designed to meet fault tolerance requirements, Count2Multiply integrates traditional row-wise error correction codes, such as Hamming and BCH, to address the high error rates in existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. However, we note that the Count2Multiply architecture is compatible with other functionally complete CIM proposals. Compared to the state-of-the-art in-DRAM CIM method, Count2Multiply achieves up to 10x speedup, 8x higher GOPS/Watt, and 9.5x higher GOPS/area, while outperforming GPU for vector-matrix multiplications.
- [480] arXiv:2409.11242 (replaced) [pdf, html, other]
-
Title: Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
Comments: Published at ICLR 2025 (Oral)
Subjects: Computation and Language (cs.CL)
LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for improved Trust-Score performance. 26 out of 27 models aligned using Trust-Align substantially outperform competitive baselines on ASQA, QAMPARI, and ELI5. Specifically, in LLaMA-3-8b, Trust-Align outperforms FRONT on ASQA (up 12.56), QAMPARI (up 36.04), and ELI5 (up 17.69). Trust-Align also significantly enhances models' ability to correctly refuse and provide quality citations. We also demonstrate the effectiveness of Trust-Align across different open-weight models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release our code at this https URL.
- [481] arXiv:2409.17685 (replaced) [pdf, html, other]
Title: Feature-to-Image Data Augmentation: Improving Model Feature Extraction with Cluster-Guided Synthetic Samples
Comments: 10 pages, 6 figures, 6 tables
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many real-world applications, particularly in medical and low-resource domains, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This study introduces FICAug, a novel feature-to-image data augmentation framework designed to improve model generalization under limited data conditions by generating structured synthetic samples.
FICAug first operates in the feature space, where original data are clustered using the k-means algorithm. Within pure-label clusters, synthetic data are generated through Gaussian sampling to increase diversity while maintaining label consistency. These synthetic features are then projected back into the image domain using a generative neural network, and a convolutional neural network is trained on the reconstructed images to learn enhanced representations.
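The feature-space stage described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names `kmeans` and `augment_pure_clusters`, the per-feature (diagonal) Gaussian, and the parameters `k` and `per_cluster` are all assumptions, and the generative network that maps features back to images is omitted.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, cluster index per row)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign


def augment_pure_clusters(X, y, k=4, per_cluster=10, seed=0):
    """Sample synthetic features only from clusters whose members share one label."""
    rng = np.random.default_rng(seed)
    _, assign = kmeans(X, k, seed=seed)
    new_X, new_y = [], []
    for j in range(k):
        members = X[assign == j]
        labels = np.unique(y[assign == j])
        # Skip empty, singleton, or mixed-label clusters.
        if len(members) < 2 or len(labels) != 1:
            continue
        # Assumption: independent Gaussian per feature, fitted to the cluster.
        mean, std = members.mean(axis=0), members.std(axis=0)
        new_X.append(rng.normal(mean, std, size=(per_cluster, X.shape[1])))
        new_y.append(np.full(per_cluster, labels[0]))
    if not new_X:
        return np.empty((0, X.shape[1])), np.empty((0,), dtype=y.dtype)
    return np.vstack(new_X), np.concatenate(new_y)
```

Restricting sampling to pure-label clusters is what keeps the synthetic labels consistent: a sample drawn from a mixed cluster would have no unambiguous label to inherit.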
Experimental results demonstrate that FICAug significantly improves classification accuracy. In feature space, it achieved a cross-validation accuracy of 84.09%, while training a ResNet-18 model on the reconstructed images further boosted performance to 88.63%, illustrating the effectiveness of the proposed framework in extracting new and task-relevant features.
- [482] arXiv:2409.18986 (replaced) [pdf, html, other]
Title: Lab-AI: Using Retrieval Augmentation to Enhance Language Models for Personalized Lab Test Interpretation in Clinical Medicine
Authors: Xiaoyu Wang, Haoyong Ouyang, Balu Bhasuran, Xiao Luo, Karim Hanna, Mia Liza A. Lustria, Carl Yang, Zhe He
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Accurate interpretation of lab results is crucial in clinical medicine, yet most patient portals use universal normal ranges, ignoring conditional factors like age and gender. This study introduces Lab-AI, an interactive system that offers personalized normal ranges using retrieval-augmented generation (RAG) from credible health sources. Lab-AI has two modules: factor retrieval and normal range retrieval. We tested these on 122 lab tests: 40 with conditional factors and 82 without. For tests with factors, normal ranges depend on patient-specific information. Our results show GPT-4-turbo with RAG achieved a 0.948 F1 score for factor retrieval and 0.995 accuracy for normal range retrieval. GPT-4-turbo with RAG outperformed the best non-RAG system by 33.5% in factor retrieval and showed 132% and 100% improvements in question-level and lab-level performance, respectively, for normal range retrieval. These findings highlight Lab-AI's potential to enhance patient understanding of lab results.
- [483] arXiv:2409.19151 (replaced) [pdf, html, other]
Title: Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?
Comments: Accepted at ICLR 2025 (Spotlight)
Subjects: Computation and Language (cs.CL)
Extremely low-resource (XLR) languages lack substantial corpora for training NLP models, motivating the use of all available resources such as dictionaries and grammar books. Machine Translation from One Book (Tanzer et al., 2024) suggests that prompting long-context LLMs with one grammar book enables English-Kalamang translation, an XLR language unseen by LLMs - a noteworthy case of linguistics helping an NLP task. We investigate the source of this translation ability, finding almost all improvements stem from the book's parallel examples rather than its grammatical explanations. We find similar results for Nepali and Guarani, seen low-resource languages, and we achieve performance comparable to an LLM with a grammar book by simply fine-tuning an encoder-decoder translation model. We then investigate where grammar books help by testing two linguistic tasks, grammaticality judgment and gloss prediction, and we explore what kind of grammatical knowledge helps by introducing a typological feature prompt that achieves leading results on these more relevant tasks. We thus emphasise the importance of task-appropriate data for XLR languages: parallel examples for translation, and grammatical data for linguistic tasks. As we find no evidence that long-context LLMs can make effective use of grammatical explanations for XLR translation, we conclude data collection for multilingual XLR tasks such as translation is best focused on parallel data over linguistic description.
- [484] arXiv:2410.01131 (replaced) [pdf, html, other]
Title: nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
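To make the geometry concrete, here is a small NumPy sketch of a hidden state that stays on the unit hypersphere while a block nudges it toward a suggested direction. The abstract only states that vectors are unit-norm and that each layer contributes a displacement; the interpolate-then-renormalize form, the step size `alpha`, and the function names below are assumptions for illustration, not the paper's exact update.

```python
import numpy as np


def unit_norm(x, axis=-1, eps=1e-8):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def hypersphere_step(h, block_out, alpha):
    """Move h toward the block's (normalized) output, then project back onto the sphere.

    h:         (..., d) unit-norm hidden states
    block_out: (..., d) raw output of an attention or MLP block
    alpha:     step size; 0 leaves h unchanged, 1 jumps to the block's direction
    """
    target = unit_norm(block_out)
    return unit_norm(h + alpha * (target - h))
```

With `alpha` between 0 and 1, each call takes a bounded step along the sphere, so the hidden state never changes scale across layers.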
- [485] arXiv:2410.01952 (replaced) [pdf, html, other]
Title: TypedThinker: Diversify Large Language Model Reasoning with Typed Thinking
Comments: work in progress
Subjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated strong reasoning capabilities in solving complex problems. However, current approaches primarily enhance reasoning through the elaboration of thoughts while neglecting the diversity of reasoning types. LLMs typically employ deductive reasoning, proceeding step-by-step from given conditions, which limits their exploration during problem-solving. Our analysis reveals that certain problems are exclusively solvable through specific reasoning strategies like inductive, abductive, or analogical reasoning. However, incorporating diverse reasoning approaches presents two key challenges: identifying the appropriate reasoning type for each problem and exploiting this approach during problem-solving. Therefore, we propose TypedThinker, which predicts suitable reasoning types based on the problem and their previous effectiveness and provides relevant demonstrations to guide LLMs in applying these strategies. Experimental results show significant improvements across multiple benchmarks, with performance gains of 3.4% for Mistral 7B, 6.5% for LLaMA3 8B, and 7% for Qwen 2 7B on logical and mathematical reasoning tasks. TypedThinker enhances LLM reasoning without requiring knowledge distillation from larger models. It can be integrated into more advanced systems like GPT-4o or specialized models like MetaMath to diversify their reasoning approaches and improve their problem-solving capabilities.
- [486] arXiv:2410.02703 (replaced) [pdf, html, other]
Title: Selective Attention Improves Transformer
Comments: ICLR 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention consistently improves language modeling and downstream task performance in a variety of model sizes and context lengths. For example, transformers trained with the language modeling objective on C4 with selective attention perform language modeling equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without it, at the same validation perplexity.
- [487] arXiv:2410.03058 (replaced) [pdf, html, other]
Title: DiffKillR: Killing and Recreating Diffeomorphisms for Cell Annotation in Dense Microscopy Images
Authors: Chen Liu, Danqi Liao, Alejandro Parada-Mayorga, Alejandro Ribeiro, Marcello DiStasio, Smita Krishnaswamy
Comments: ICASSP 2025, Oral Presentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The proliferation of digital microscopy images, driven by advances in automated whole slide scanning, presents significant opportunities for biomedical research and clinical diagnostics. However, accurately annotating densely packed information in these images remains a major challenge. To address this, we introduce DiffKillR, a novel framework that reframes cell annotation as the combination of archetype matching and image registration tasks. DiffKillR employs two complementary neural networks: one that learns a diffeomorphism-invariant feature space for robust cell matching and another that computes the precise warping field between cells for annotation mapping. Using a small set of annotated archetypes, DiffKillR efficiently propagates annotations across large microscopy images, reducing the need for extensive manual labeling. More importantly, it is suitable for any type of pixel-level annotation. We will discuss the theoretical properties of DiffKillR and validate it on three microscopy tasks, demonstrating its advantages over existing supervised, semi-supervised, and unsupervised methods. The code is available at this https URL.
- [488] arXiv:2410.04612 (replaced) [pdf, html, other]
Title: Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at this https URL, and models trained by REFUEL can be found at this https URL.
- [489] arXiv:2410.05401 (replaced) [pdf, html, other]
Title: Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Facebook advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group, achieving an overall accuracy of 88.55%. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. In addition to evaluating the effectiveness of LLMs in detecting microtargeted messaging, we conduct a comprehensive fairness analysis to identify potential biases in model predictions. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of senior citizens and male audiences. By showcasing the efficacy of LLMs in dissecting and explaining targeted communication strategies and by highlighting fairness concerns, this study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.
- [490] arXiv:2410.06515 (replaced) [pdf, html, other]
Title: Understanding Practitioners' Expectations on Clear Code Review Comments
Comments: Accepted by 34th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2025)
Subjects: Software Engineering (cs.SE)
The code review comment (CRC) is pivotal in the process of modern code review. It provides reviewers with the opportunity to identify potential bugs, offer constructive feedback, and suggest improvements. Clear and concise code review comments (CRCs) facilitate communication between developers and are crucial to the correct understanding of the identified issues and proposed solutions. Despite the importance of CRCs' clarity, there is still a lack of guidelines on what constitutes good clarity and how to evaluate it. In this paper, we conduct a comprehensive study on understanding and evaluating the clarity of CRCs. We first derive a set of attributes related to the clarity of CRCs, namely RIE attributes (i.e., Relevance, Informativeness, and Expression), as well as their corresponding evaluation criteria based on our literature review and survey with practitioners. We then investigate the clarity of CRCs in open-source projects written in nine programming languages and find that a large portion (i.e., 28.8%) of the CRCs lack clarity in at least one of the attributes. Finally, we explore the potential of automatically evaluating the clarity of CRCs by proposing ClearCRC. Experimental results show that ClearCRC with pre-trained language models is promising for effective evaluation of the clarity of CRCs, achieving a balanced accuracy up to 73.04% and an F-1 score up to 94.61%.
- [491] arXiv:2410.08135 (replaced) [pdf, html, other]
Title: State Feedback System Level Synthesis in Continuous Time
Comments: 8 pages, 6 figures, conference
Subjects: Systems and Control (eess.SY)
System level synthesis (SLS) is a controller parameterization technique that facilitates synthesis of structured distributed controllers via convex optimization. Past results on SLS are primarily in the discrete-time setting; this paper extends SLS to the continuous-time setting. We translate the parametrization and associated constraints to continuous-time, and propose a controller design procedure consisting of two steps: (1) selection of poles and (2) optimization over closed-loop responses. We provide SLS parameterizations for continuous-time $\mathcal{H}_2$ and $\mathcal{H}_\infty$ control, and show that the proposed procedure allows us to design structured $\mathcal{H}_2$ and $\mathcal{H}_\infty$ controllers via convex optimization. Furthermore, the proposed procedure preserves the scalability and local-disturbance-rejection features of the original discrete-time SLS framework. We verify our findings in simulation -- on a grid of 9 nodes governed by linearized swing equations, our structured distributed controllers perform similarly to the optimal centralized controllers.
- [492] arXiv:2410.10797 (replaced) [pdf, html, other]
Title: MEV Capture Through Time-Advantaged Arbitrage
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
As blockchains begin processing significant economic activity, the ability to include and order transactions inevitably becomes highly valuable, a concept known as Maximal Extractable Value (MEV). This makes effective mechanisms for transaction inclusion and ordering, and thereby the extraction of MEV, a key aspect of blockchain design. Beyond traditional approaches such as ordering in a first-come-first-serve manner or using priority fees, a recent proposal suggests auctioning off a time advantage for transaction inclusion. In this paper, we investigate this time advantage mechanism, focusing specifically on arbitrage opportunities on Automated Market Makers (AMMs), one of the largest sources of MEV today. We analyze the optimal strategy for a time-advantaged arbitrageur and compare the profits generated by various MEV extraction methods. Finally, we explore how AMMs can be adapted in the time advantage setting to capture a portion of the MEV.
- [493] arXiv:2410.11539 (replaced) [pdf, html, other]
Title: Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations
Subjects: Machine Learning (cs.LG)
Foundational Models are an emerging and widely used technique in GenAI. These models are distinguished by their scalability and the ease with which they can be adapted through the exploitation of Transfer Learning. The availability of high computational power and large datasets has supported their development, achieving a high generalization capacity due to the enormous and heterogeneous amounts of data used in their initial training. These characteristics contribute to a solid base that can be adapted or adjusted to a wide range of tasks, increasing their applicability. This study proposes the methodology LLIAM, a straightforward adaptation of a kind of FM, Large Language Models, for the Time Series Forecasting task. An adequate time-series prompting schema and Low-Rank Adaptations are used to enhance the knowledge of the model with diverse time series datasets, known as the fine-tuning phase. A study divided into two stages has been performed to evaluate the effectiveness of the proposed methodology. Initially, a comparison was made between the performance of LLIAM and different state-of-the-art DL algorithms, including Recurrent Neural Networks and Temporal Convolutional Networks, as well as an LLM-based method, TimeLLM. Following this, a zero-shot study is presented in order to evaluate the generalization capacity of the proposed methodology with time series datasets from unknown domains not considered in the model training. The outcomes of this investigation demonstrate the efficacy of LLIAM, highlighting that this straightforward and general approach can attain competent results without the necessity for applying complex modifications. This work also encourages the use of available resources (such as these pre-trained models) and efficient fine-tuning techniques to avoid unnecessary and costly training, narrowing the gap between the goals of traditional AI and Green AI.
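For readers unfamiliar with Low-Rank Adaptation, the following NumPy sketch shows the general idea on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. It illustrates the standard LoRA formulation rather than LLIAM itself; the class name `LoRALinear` and the default `rank` and `alpha` values are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen, pretrained
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, zero at start
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, in_dim). The low-rank path adds scale * x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only A and B are updated during fine-tuning.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and fine-tuning touches `rank * (in_dim + out_dim)` values instead of `in_dim * out_dim`.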
- [494] arXiv:2410.11894 (replaced) [pdf, html, other]
Title: Automated Discovery of Operable Dynamics from Videos
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Chaotic Dynamics (nlin.CD)
Dynamical systems form the foundation of scientific discovery, traditionally modeled with predefined state variables such as the angle and angular velocity, and differential equations such as the equation of motion for a single pendulum. We introduce a framework that automatically discovers a low-dimensional and operable representation of system dynamics, including a set of compact state variables that preserve the smoothness of the system dynamics and a differentiable vector field, directly from video without requiring prior domain-specific knowledge. The prominence and effectiveness of the proposed approach are demonstrated through both quantitative and qualitative analyses of a range of dynamical systems, including the identification of stable equilibria, the prediction of natural frequencies, and the detection of chaotic and limit cycle behaviors. The results highlight the potential of our data-driven approach to advance automated scientific discovery.
- [495] arXiv:2410.19878 (replaced) [pdf, html, other]
Title: Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large models, as predicted by scaling laws, have made groundbreaking progress in many fields, particularly in natural language generation tasks, where they have approached or even surpassed human levels. However, the unprecedented scale of their parameters brings significant computational and storage costs. These large models require substantial computational resources and GPU memory to operate. When adapting large models to specific downstream tasks, their massive parameter scale poses a significant challenge in fine-tuning on hardware platforms with limited computational power and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) offers a practical solution by efficiently adjusting the parameters of large pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts the parameters of pre-trained large models to adapt to specific tasks or domains, minimizing the introduction of additional parameters and the computational resources required. This review mainly introduces the preliminary knowledge of PEFT, the core ideas and principles of various PEFT algorithms, the applications of PEFT, and potential future research directions. We believe this review will help interested readers quickly grasp the PEFT methodology, thereby accelerating its development and innovation.
- [496] arXiv:2410.22716 (replaced) [pdf, html, other]
Title: Exposing Cross-Platform Coordinated Inauthentic Activity in the Run-Up to the 2024 U.S. Election
Comments: HUMANS Lab -- Working Paper No. 2024.7 -- The 2024 Election Integrity Initiative -- University of Southern California - Updated Version of WWW '25 Submission
Subjects: Social and Information Networks (cs.SI)
Coordinated information operations remain a persistent challenge on social media, despite platform efforts to curb them. While previous research has primarily focused on identifying these operations within individual platforms, this study shows that coordination frequently transcends platform boundaries. Leveraging newly collected data of online conversations related to the 2024 U.S. Election across $\mathbb{X}$ (formerly, Twitter), Facebook, and Telegram, we construct similarity networks to detect coordinated communities exhibiting suspicious sharing behaviors within and across platforms. Proposing an advanced coordination detection model, we reveal evidence of potential foreign interference, with Russian-affiliated media being systematically promoted across Telegram and $\mathbb{X}$. Our analysis also uncovers substantial intra- and cross-platform coordinated inauthentic activity, driving the spread of highly partisan, low-credibility, and conspiratorial content. These findings highlight the urgent need for regulatory measures that extend beyond individual platforms to effectively address the growing challenge of cross-platform coordinated influence campaigns.
- [497] arXiv:2410.24117 (replaced) [pdf, html, other]
Title: AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation
Authors: Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, Reyhaneh Jabbarvand
Comments: Published in FSE 2025
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Code translation transforms programs from one programming language (PL) to another. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve and cannot generalize to other PLs. Recent studies have explored the automation of code translation using Large Language Models (LLMs). One key observation is that such techniques may work well for crafted benchmarks but fail to generalize to the scale and complexity of real-world projects with dependencies, custom types, PL-specific features, etc. We propose AlphaTrans, a neuro-symbolic approach to automate repository-level code translation. AlphaTrans translates both source and test code, and employs multiple levels of validation to ensure the translation preserves the functionality of the source program. To break down the problem for LLMs, AlphaTrans leverages program analysis to decompose the program into fragments and translates them in the reverse call order. We leveraged AlphaTrans to translate ten real-world open-source projects consisting of <836, 8575, 2719> classes, methods, and tests. AlphaTrans breaks down these projects into 17874 fragments and translates the entire repository. 96.40% of the translated fragments are syntactically correct, and AlphaTrans validates the translations' runtime behavior and functional correctness for 27.03% and 25.14% of fragments. On average, the integrated translation and validation take 34 hours to translate a project, showing its scalability in practice. For the incorrect translations, AlphaTrans generates a report including existing translation, stack trace, test errors, or assertion failures. We provided these artifacts to two developers to fix the translation bugs in four projects. They were able to fix the issues in 20.1 hours on average and achieve all passing tests.
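One way to read "reverse call order" is that callees are translated before the callers that depend on them, so each fragment's dependencies already exist in the target language. The sketch below shows that ordering with Python's standard `graphlib`; the function name `reverse_call_order` and the toy call graph are hypothetical, and this is an interpretation of the abstract rather than AlphaTrans's actual scheduler.

```python
from graphlib import TopologicalSorter


def reverse_call_order(call_graph):
    """Return fragments ordered so that every callee precedes its callers.

    call_graph maps each caller to the set of fragments it calls.
    TopologicalSorter treats the mapped values as predecessors, so callees
    are emitted first. A recursive cycle raises graphlib.CycleError and
    would need separate handling (for example, translating the cycle as a unit).
    """
    return list(TopologicalSorter(call_graph).static_order())


# Hypothetical call graph: main calls parse and run, both of which call helper.
example_graph = {
    "main": {"parse", "run"},
    "parse": {"helper"},
    "run": {"helper"},
    "helper": set(),
}
order = reverse_call_order(example_graph)
```

In `order`, `helper` comes before `parse` and `run`, and `main` comes last; the relative order of `parse` and `run` is not fixed because neither depends on the other.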
- [498] arXiv:2411.03416 (replaced) [pdf, html, other]
Title: Efficient Iterative Proximal Variational Inference Motion Planning
Comments: 13 pages
Subjects: Robotics (cs.RO)
Motion planning under uncertainty can be cast as a stochastic optimal control problem where the optimal posterior distribution has an explicit form. To approximate this posterior, this work frames an optimization problem in the space of Gaussian distributions by solving a Variational Inference (VI) problem in the path distribution space. For linear-Gaussian stochastic dynamics, we propose a proximal algorithm to solve for an optimal Gaussian proposal iteratively. The computational bottleneck is evaluating the gradients with respect to the proposal over a dense trajectory. We exploit the sparse motion planning factor graph and Gaussian Belief Propagation (GBP), allowing for parallel computing of these gradients on Graphics Processing Units (GPUs). We term this novel paradigm Parallel Gaussian Variational Inference Motion Planning (P-GVIMP). Building on the efficient algorithm for linear Gaussian systems, we then propose an iterative paradigm based on Statistical Linear Regression (SLR) techniques to solve motion planning for nonlinear stochastic systems, where P-GVIMP serves as a sub-routine for the linearized time-varying system. We validate the proposed framework on various robotic systems, demonstrating significant speedups achieved by leveraging parallel computation and successful planning solutions for nonlinear systems under uncertainty. An open-source implementation is available at this https URL.
- [499] arXiv:2411.04950 (replaced) [pdf, html, other]
Title: Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach
Subjects: Computation and Language (cs.CL)
We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties--such as thematic continuity--on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that supervised and neural models are more prone to false positives--mistaking shared themes and cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess and interpret classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts, where conventional methods may conflate theme with style. Our results demonstrate that controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.
- [500] arXiv:2411.11114 (replaced) [pdf, html, other]
Title: JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Comments: 17 pages, 11 figures
Subjects: Cryptography and Security (cs.CR)
Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior (e.g., the degree to which the model refuses to respond) of LLMs by analyzing representation shifts in their latent space caused by jailbreak prompts or identifying key neurons that contribute to the success of jailbreak attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation linking circuit failures to representational changes, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both representation (which reveals how jailbreaks alter the model's harmfulness perception) and circuit perspectives (which uncovers the causes of these deceptions by identifying key circuits contributing to the vulnerability), tracking their evolution throughout the entire response generation process. We then conduct an in-depth evaluation of jailbreak behavior on five mainstream LLMs under seven jailbreak strategies. Our evaluation reveals that jailbreak prompts amplify components that reinforce affirmative responses while suppressing those that produce refusal. This manipulation shifts model representations toward safe clusters to deceive the LLM, leading it to provide detailed responses instead of refusals. Notably, we find a strong and consistent correlation between representation deception and activation shift of key circuits across diverse jailbreak methods and multiple LLMs.
- [501] arXiv:2411.14423 (replaced) [pdf, html, other]
-
Title: PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
Comments: CVPR 2025. Homepage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles. However, existing methods are constrained to basic material types with limited predictable parameters, making them insufficient to represent the complexity of real-world materials. We introduce PhysFlow, a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation. Our method utilizes multi-modal models to identify material types and initialize material parameters through image queries, while simultaneously inferring 3D Gaussian splats for detailed scene representation. We further refine these material parameters using video diffusion with a differentiable Material Point Method (MPM) and optical flow guidance rather than render loss or Score Distillation Sampling (SDS) loss. This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios, advancing both accuracy and flexibility in physics-based simulations.
- [502] arXiv:2411.16206 (replaced) [pdf, html, other]
-
Title: A Simple and Efficient Approach to Batch Bayesian Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Extending Bayesian optimization to batch evaluation can enable the designer to make the most use of parallel computing technology. However, most current batch approaches do not scale well with the batch size. That is, their performance deteriorates dramatically as the batch size increases. To address this issue, we propose a simple and efficient approach to extend Bayesian optimization to large-scale batch evaluation in this work. Different from existing batch approaches, the idea of the new approach is to draw a batch of axis-aligned subspaces of the original problem and select one acquisition point from each subspace. To achieve this, we propose the expected subspace improvement criterion to measure the amount of improvement that a candidate point can achieve within a certain axis-aligned subspace. By optimizing these expected subspace improvement functions simultaneously, we can get a batch of query points for parallel evaluation. Numerical experiments show that our proposed approach can speed up convergence significantly when compared with the sequential Bayesian optimization algorithm, and performs very competitively when compared with seven batch Bayesian optimization algorithms. A Matlab implementation of the proposed approach is available at this https URL.
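The subspace-drawing idea in this abstract can be illustrated with a short sketch. It is not the authors' implementation (which is in Matlab): the helper names `draw_axis_aligned_subspaces` and `embed_in_subspace`, the fixed subspace dimension `sub_dim`, and the choice to hold inactive coordinates at an incumbent point are all assumptions made for illustration, and the expected subspace improvement criterion itself is not shown.

```python
import random


def draw_axis_aligned_subspaces(dim, batch_size, sub_dim, seed=None):
    """Draw `batch_size` random axis-aligned subspaces of a `dim`-dimensional problem.

    Each subspace is a sorted list of `sub_dim` coordinate indices that are
    free to vary. Hypothetical helper; the names are assumptions.
    """
    if not 1 <= sub_dim <= dim:
        raise ValueError("sub_dim must be between 1 and dim")
    rng = random.Random(seed)
    return [sorted(rng.sample(range(dim), sub_dim)) for _ in range(batch_size)]


def embed_in_subspace(incumbent, subspace, values):
    """Build a full-dimensional candidate point.

    Free coordinates take `values`; every other coordinate stays at the
    incumbent (assumed here to be the best point found so far).
    """
    candidate = list(incumbent)
    for index, value in zip(subspace, values):
        candidate[index] = value
    return candidate


# One candidate per subspace, ready to be scored by an acquisition function.
subspaces = draw_axis_aligned_subspaces(dim=6, batch_size=4, sub_dim=2, seed=0)
incumbent = [0.5] * 6
candidates = [embed_in_subspace(incumbent, s, [0.1, 0.9]) for s in subspaces]
```

In a full method, each subspace would get its own acquisition optimization instead of the fixed values `[0.1, 0.9]` used here.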
- [503] arXiv:2411.16718 (replaced) [pdf, html, other]
-
Title: Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Journal-ref: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V correlates with human evaluations over 5x more strongly than existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
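To give a rough feel for checking a video against a temporal specification, the sketch below evaluates three common temporal operators over a finite sequence of frames. It is not the NeuS-V pipeline: it assumes each frame has already been reduced to a set of hard proposition labels (the labels and the example trace are hypothetical), and it uses simple finite-trace semantics rather than an automaton.

```python
def always(trace, prop):
    """True if `prop` holds in every frame of the trace."""
    return all(prop in frame for frame in trace)


def eventually(trace, prop):
    """True if `prop` holds in at least one frame of the trace."""
    return any(prop in frame for frame in trace)


def until(trace, p, q):
    """True if `q` holds in some frame and `p` holds in every earlier frame."""
    for frame in trace:
        if q in frame:
            return True
        if p not in frame:
            return False
    return False  # q never occurred within the finite trace


# Hypothetical per-frame labels for a prompt such as
# "a car keeps moving until the light turns red".
trace = [{"car_moving"}, {"car_moving"}, {"car_stopped", "light_red"}]

satisfied = until(trace, "car_moving", "light_red")
```

Here `satisfied` is `True` because the car is labeled as moving in every frame before the red light first appears.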
- [504] arXiv:2411.17973 (replaced) [pdf, html, other]
-
Title: Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The forest serves as the most significant terrestrial carbon stock mechanism, effectively reducing atmospheric CO2 concentrations and mitigating climate change. Remote sensing provides high data accuracy and enables large-scale observations. Optical images facilitate long-term monitoring, which is crucial for future carbon stock estimation studies. This study focuses on Huize County, Qujing City, Yunnan Province, China, utilizing GF-1 WFV satellite imagery. The KD-VGG and KD-UNet modules were introduced for initial feature extraction, and the improved implicit diffusion model (IIDM) was proposed. The results showed: (1) The VGG module improved initial feature extraction, raising accuracy and reducing inference time with optimized model parameters. (2) The Cross-attention + MLPs module enabled effective feature fusion, establishing critical relationships between global and local features and achieving high-accuracy estimation. (3) The IIDM model, a novel contribution, demonstrated the highest estimation accuracy with an RMSE of 12.17%, an improvement of 41.69% to 42.33% over the regression model. In carbon stock estimation, the generative model excelled in extracting deeper features, significantly outperforming other models and demonstrating the feasibility of AI-generated content in quantitative remote sensing. The 16-meter resolution estimates provide a robust basis for tailoring forest carbon sink regulations, enhancing regional carbon stock management.
- [505] arXiv:2411.19736 (replaced) [pdf, html, other]
-
Title: Higher order error estimates for regularization of inverse problems under non-additive noise
Subjects: Numerical Analysis (math.NA); Functional Analysis (math.FA)
In this work we derive higher order error estimates for inverse problems distorted by non-additive noise, in terms of Bregman distances. The results are obtained by means of a novel source condition, inspired by the dual problem. Specifically, we focus on variational regularization having the Kullback-Leibler divergence as data-fidelity, and a convex penalty term. In this framework, we provide an interpretation of the new source condition, and present error estimates also when a variational formulation of the source condition is employed. We show that this approach can be extended to variational regularization that incorporates more general convex data fidelities.
- [506] arXiv:2412.00247 (replaced) [pdf, html, other]
-
Title: WiReSens Toolkit: An Open-source Platform towards Accessible Wireless Tactile Sensing
Subjects: Human-Computer Interaction (cs.HC)
Past research has widely explored the design and fabrication of resistive matrix-based tactile sensors as a means of creating touch-sensitive devices. However, developing portable, adaptive, and long-lasting tactile sensing systems that incorporate these sensors remains challenging for individuals having limited prior experience with them. To address this, we developed the WiReSens Toolkit, an open-source platform for accessible wireless tactile sensing. Central to our approach is adaptive hardware for interfacing with resistive sensors and a web-based GUI that mediates access to complex functionalities for developing scalable tactile sensing systems, including 1) multi-device programming and wireless visualization across three distinct communication protocols, 2) autocalibration methods for adaptive sensitivity, and 3) intermittent data transmission for low-power operation. We validated the toolkit's usability through a user study with 11 novice participants, who, on average, successfully configured a tactile sensor with over 95% accuracy in under five minutes, calibrated sensors 10x faster than baseline methods, and demonstrated enhanced tactile data sense-making.
- [507] arXiv:2412.02107 (replaced) [pdf, other]
-
Title: Efficient, Portable, Census-Polymorphic Choreographic Programming
Comments: Presenting at PLDI25
Subjects: Programming Languages (cs.PL)
Choreographic programming (CP) is a paradigm for implementing distributed systems that uses a single global program to define the actions and interactions of all participants. Library-level CP implementations, like HasChor, integrate well with mainstream programming languages but have several limitations: Their conditionals require extra communication; they require specific host-language features (e.g., monads); and they lack support for programming patterns that are essential for implementing realistic distributed applications.
We make three contributions to library-level CP to specifically address these challenges. First, we propose and formalize conclaves and multiply-located values, which enable efficient conditionals in library-level CP without redundant communication. Second, we propose end-point projection as dependency injection, a design pattern that enables library-level CP in host languages without support for monads. Third, we propose census polymorphism, a technique for abstracting over the number of participants in a choreography. We demonstrate these contributions via implementations in Haskell, Rust, and TypeScript.
- [508] arXiv:2412.02868 (replaced) [pdf, html, other]
-
Title: Enhancing LLMs with Smart Preprocessing for EHR Analysis
Authors: Yixiang Qu, Yifan Dai, Shilin Yu, Pradham Tanikella, Travis Schrank, Trevor Hackman, Didong Li, Di Wu
Subjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing; however, their application in sensitive domains such as healthcare, especially in processing Electronic Health Records (EHRs), is constrained by limited computational resources and privacy concerns. This paper introduces a compact LLM framework optimized for local deployment in environments with stringent privacy requirements and restricted access to high-performance GPUs. Our approach leverages simple yet powerful preprocessing techniques, including regular expressions (regex) and Retrieval-Augmented Generation (RAG), to extract and highlight critical information from clinical notes. By pre-filtering long, unstructured text, we enhance the performance of smaller LLMs on EHR-related tasks. Our framework is evaluated using zero-shot and few-shot learning paradigms on both private and publicly available datasets (MIMIC-IV), with additional comparisons against fine-tuned LLMs on MIMIC-IV. Experimental results demonstrate that our preprocessing strategy significantly improves the performance of smaller LLMs, making them well-suited for privacy-sensitive and resource-constrained applications. This study offers valuable insights into optimizing LLM performance for local, secure, and efficient healthcare applications. It provides practical guidance for real-world deployment of LLMs while tackling challenges related to privacy, computational feasibility, and clinical applicability.
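The regex pre-filtering step mentioned in this abstract might look something like the following sketch. The function name `prefilter_note`, the keyword list, the naive sentence splitter, and the example note are all hypothetical; the paper's actual patterns and its RAG component are not described in the abstract and are not reproduced here.

```python
import re


def prefilter_note(note, keywords):
    """Keep only the sentences of a clinical note that mention a keyword.

    Hypothetical helper: sentence splitting is a naive regex on ., ! and ?,
    and matching is case-insensitive on literal keywords.
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return [s for s in sentences if pattern.search(s)]


# Made-up example note; not taken from any dataset.
note = (
    "Patient seen in clinic. Biopsy showed squamous cell carcinoma. "
    "Follow up in two weeks. Tumor stage T2N0."
)
relevant = prefilter_note(note, ["carcinoma", "stage"])
```

Only the shortened `relevant` text would then be passed to the smaller LLM, which is the point of pre-filtering long, unstructured notes.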
- [509] arXiv:2412.04646 (replaced) [pdf, html, other]
-
Title: Online Hitting Sets for Disks of Bounded Radii
Comments: 31 pages and 19 figures
Subjects: Computational Geometry (cs.CG)
We present algorithms for the online minimum hitting set problem in geometric range spaces: Given a set $P$ of $n$ points in the plane and a sequence of geometric objects that arrive one-by-one, we need to maintain a hitting set at all times. For disks of radii in the interval $[1,M]$, we present an $O(\log M \log n)$-competitive algorithm. This result generalizes from disks to positive homothets of any convex body in the plane with scaling factors in the interval $[1,M]$. As a main technical tool, we reduce the problem to the online hitting set problem for a finite subset of integer points and bottomless rectangles. Specifically, for a given $N>1$, we present an $O(\log N)$-competitive algorithm for the variant where $P$ is a subset of an $N\times N$ section of the integer lattice, and the geometric objects are bottomless rectangles.
- [510] arXiv:2412.08802 (replaced) [pdf, html, other]
-
Title: jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Authors: Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao
Comments: 30 pages, 1-10 main paper, 10-12 refs, 12-30 benchmarks
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in single-mode text tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also provides flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at this https URL.
- [511] arXiv:2412.10706 (replaced) [pdf, html, other]
-
Title: SHIFT Planner: Speedy Hybrid Iterative Field and Segmented Trajectory Optimization with IKD-tree for Uniform Lightweight Coverage
Subjects: Robotics (cs.RO)
This paper introduces a comprehensive planning and navigation framework that addresses these limitations by integrating semantic mapping, adaptive coverage planning, dynamic obstacle avoidance and precise trajectory tracking. Our framework begins by generating panoptic occupancy local semantic maps and accurate localization information from data aligned between a monocular camera, IMU, and GPS. This information is combined with input terrain point clouds or preloaded terrain information to initialize the planning process. We propose the Radiant Field-Informed Coverage Planning algorithm, which utilizes a diffusion field model to dynamically adjust the robot's coverage trajectory and speed based on environmental attributes such as dirtiness and dryness. By modeling the spatial influence of the robot's actions using a Gaussian field, the algorithm ensures a speed-optimized, uniform coverage trajectory while adapting to varying environmental conditions.
- [512] arXiv:2412.10892 (replaced) [pdf, html, other]
-
Title: Know Unreported Roadway Incidents in Real-time: Early Traffic Anomaly Detection
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
This research aims to know traffic anomalies as early as possible. A traffic anomaly refers to a generic incident on the road that influences traffic flow and calls for urgent traffic management measures. "Knowing" the occurrence of a traffic anomaly is twofold: it means either detecting the anomaly before it is reported anywhere, or predicting it before it actually occurs on the road (e.g., non-recurrent traffic breakdown). Either way, the objective is to inform traffic operators of unreported incidents in real time and as early as possible. The key is to stay ahead of the curve. Time is of the essence.
Conventional automatic incident detection (AID) methods often struggle with early detection due to their limited consideration of spatial effects and early-stage characteristics. Therefore, we propose a deep learning framework utilizing prior domain knowledge and model-designing strategies. This allows the model to detect a broader range of anomalies, not only incidents that significantly influence traffic flow but also early characteristics of incidents along with historically unreported anomalies. We specifically design the model to target the early-stage detection/prediction of an incident. Additionally, unlike most conventional AID studies, our method is highly scalable and generalizable, as it is fully automated with no manual selection of historical reports required, relies solely on widely available low-cost data, and requires no additional detectors. The experimental results across numerous road segments on different maps demonstrate that our model leads to more effective and early anomaly detection.
- [513] arXiv:2412.11003 (replaced) [pdf, html, other]
-
Title: Optimal Rates for Robust Stochastic Convex Optimization
Comments: The 6th annual Symposium on Foundations of Responsible Computing (FORC 2025)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the $\epsilon$-contamination model, where an adversary can inspect and replace up to an $\epsilon$-fraction of the samples, a fundamental open problem is determining the optimal rates for robust stochastic convex optimization (SCO) under such contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the $\epsilon$-contamination model. Our approach improves over existing algorithms, which are not only suboptimal but also require stringent assumptions, including Lipschitz continuity and smoothness of individual sample functions. By contrast, our optimal algorithms do not require these stringent assumptions, assuming only population-level smoothness of the loss. Moreover, our algorithms can be adapted to handle the case in which the covariance parameter is unknown, and can be extended to nonsmooth population risks via convolutional smoothing. We complement our algorithmic developments with a tight information-theoretic lower bound for robust SCO.
- [514] arXiv:2412.11496 (replaced) [pdf, html, other]
-
Title: Capacity of Hierarchical Secure Coded Gradient Aggregation with Straggling Communication Links
Subjects: Information Theory (cs.IT)
The growing privacy concerns in distributed learning have led to the widespread adoption of secure aggregation techniques in distributed machine learning systems, such as federated learning. Motivated by a coded gradient aggregation problem in a user-helper-master hierarchical network setting with straggling communication links, we formulate a new secure hierarchical coded gradient aggregation problem. In our setting, \( K \) users communicate with the master through an intermediate layer of \( N \) helpers, who can communicate with each other. With a resiliency threshold of \( N_r \) for straggling communication links, and at most \( T \) colluding helpers and any number of colluding users, the master aims to recover the sum of all users' gradients while remaining unaware of any individual gradient that exceeds the expected sum. In addition, helpers cannot infer more about users' gradients than what is already known by the colluding users. We propose an achievable scheme where users' upload messages are based on a globally known Vandermonde matrix, and helper communication is facilitated using an extended Vandermonde matrix with special structural properties. A matching converse bound is also derived, establishing the optimal result for this hierarchical coded gradient aggregation problem.
- [515] arXiv:2412.15576 (replaced) [pdf, html, other]
-
Title: QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning
Authors: Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
Comments: Accepted to ICRA 2025; Github page: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is this https URL.
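Mapping continuous action values onto a smaller set of discrete representative vectors can be pictured as nearest-neighbour vector quantization, sketched below. This is an assumption about how such a mapping could work, not the paper's ACD procedure, which the abstract does not specify; the codebook, the two-dimensional actions, and the function name `nearest_code` are hypothetical.

```python
import math


def nearest_code(action, codebook):
    """Return the index of the codebook vector closest to `action` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(action, codebook[i]))


# Hypothetical codebook of representative action vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Continuous actions are replaced by discrete token indices ...
actions = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
tokens = [nearest_code(a, codebook) for a in actions]

# ... and decoded back to representative vectors when executed.
decoded = [codebook[t] for t in tokens]
```

The discrete `tokens` are what a language model would predict; the loss of precision is the distance between each action and its decoded vector.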
- [516] arXiv:2412.15921 (replaced) [pdf, html, other]
-
Title: Less is More: Towards Green Code Large Language Models via Unified Structural Pruning
Comments: UNDER REVIEW
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The extensive application of Large Language Models (LLMs) in generative coding tasks has raised concerns due to their high computational demands and energy consumption. Unlike previous structural pruning methods designed for classification models that deal with low-dimensional classification logits, generative Code LLMs produce high-dimensional token logit sequences, making traditional pruning objectives inherently limited. Moreover, existing single-component pruning approaches further constrain the effectiveness when applied to generative Code LLMs. In response, we propose Flab-Pruner, an innovative unified structural pruning method that combines vocabulary, layer, and Feed-Forward Network (FFN) pruning. This approach effectively reduces model parameters while maintaining performance. Additionally, we introduce a customized code instruction data strategy for coding tasks to enhance the performance recovery efficiency of the pruned model. Through extensive evaluations on three state-of-the-art Code LLMs across multiple generative coding tasks, the results demonstrate that Flab-Pruner retains 97% of the original performance after pruning 22% of the parameters and achieves the same or even better performance after post-training. The pruned models exhibit significant improvements in storage, GPU usage, computational efficiency, and environmental impact, while maintaining robustness. Our research provides a sustainable solution for green software engineering and promotes the efficient deployment of LLMs in real-world generative coding intelligence applications.
- [517] arXiv:2412.16195 (replaced) [pdf, other]
-
Title: Machine Learning-Based Automated Assessment of Intracorporeal Suturing in Laparoscopic Fundoplication
Authors: Shekhar Madhav Khairnar, Huu Phong Nguyen, Alexis Desir, Carla Holcomb, Daniel J. Scott, Ganesh Sankaranarayanan
Comments: 17 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Automated assessment of surgical skills using artificial intelligence (AI) provides trainees with instantaneous feedback. After bimanual tool motions are captured, derived kinematic metrics are reliable predictors of performance in laparoscopic tasks. Implementing automated tool tracking requires time-intensive human annotation. We developed AI-based tool tracking using the Segment Anything Model (SAM) to eliminate the need for human annotators. Here, we describe a study evaluating the usefulness of our tool tracking model in automated assessment during a laparoscopic suturing task in the fundoplication procedure. An automated tool tracking model was applied to recorded videos of Nissen fundoplication on porcine bowel. Surgeons were grouped as novices (PGY1-2) and experts (PGY3-5, attendings). The beginning and end of each suturing step were segmented, and motions of the left and right tools were extracted. A low-pass filter with a 24 Hz cut-off frequency removed noise. Performance was assessed using supervised and unsupervised models, and an ablation study compared results. Kinematic features--RMS velocity, RMS acceleration, RMS jerk, total path length, and Bimanual Dexterity--were extracted and analyzed using Logistic Regression, Random Forest, Support Vector Classifier, and XGBoost. PCA was performed for feature reduction. For unsupervised learning, a Denoising Autoencoder (DAE) model with classifiers, such as a 1-D CNN and traditional models, was trained. Data were extracted for 28 participants (9 novices, 19 experts). Supervised learning with PCA and Random Forest achieved an accuracy of 0.795 and an F1 score of 0.778. The unsupervised 1-D CNN achieved superior results with an accuracy of 0.817 and an F1 score of 0.806, eliminating the need for kinematic feature computation. We demonstrated an AI model capable of automated performance classification, independent of human annotation.
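Several of the kinematic features named in this abstract (total path length and RMS velocity, acceleration, and jerk) can be computed from tracked tool positions with finite differences, as in the sketch below. It is an illustration only: the sampling interval `dt`, the 2-D positions, and the helper names are assumptions, and the low-pass filtering and Bimanual Dexterity metric from the study are not shown.

```python
import math


def path_length(points):
    """Total distance travelled along consecutive tracked positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def finite_difference(samples, dt):
    """First-order time derivative of a sequence of equally spaced vectors."""
    return [
        tuple((b_i - a_i) / dt for a_i, b_i in zip(a, b))
        for a, b in zip(samples, samples[1:])
    ]


def rms_magnitude(vectors):
    """Root mean square of the Euclidean norms of a sequence of vectors."""
    return math.sqrt(sum(sum(c * c for c in v) for v in vectors) / len(vectors))


# Hypothetical tool-tip positions sampled every 0.5 s (constant velocity).
dt = 0.5
positions = [(0.0, 0.0), (1.5, 2.0), (3.0, 4.0), (4.5, 6.0)]

velocity = finite_difference(positions, dt)
acceleration = finite_difference(velocity, dt)
jerk = finite_difference(acceleration, dt)

features = {
    "path_length": path_length(positions),
    "rms_velocity": rms_magnitude(velocity),
    "rms_acceleration": rms_magnitude(acceleration),
    "rms_jerk": rms_magnitude(jerk),
}
```

For this constant-velocity example the path length is 7.5 and the RMS velocity is 5.0, while acceleration and jerk are zero; real tool trajectories would be noisier, which is why the study filters the signal first.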
- [518] arXiv:2501.01163 (replaced) [pdf, html, other]
-
Title: 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Comments: Accepted by CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines (such as offline multi-view feature extraction or additional task-specific heads), 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text description. This versatile OST is empowered by hybrid pretraining to obtain perception priors and is leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks.
- [519] arXiv:2501.02917 (replaced) [pdf, html, other]
-
Title: On Achievable Rates Over Noisy Nanopore Channels
Comments: 35 pages, 2 figures
Subjects: Information Theory (cs.IT)
In this paper, we consider a recent channel model of a nanopore sequencer proposed by McBain, Viterbo, and Saunderson (2024), termed the noisy nanopore channel (NNC). In essence, an NNC is a duplication channel with structured, Markov inputs, that is corrupted by memoryless noise. We first discuss a (tight) lower bound on the capacity of the NNC in the absence of random noise. Next, we present lower and upper bounds on the channel capacity of general noisy nanopore channels. We then consider two interesting regimes of operation of an NNC: first, where the memory of the input process is large and the random noise introduces erasures, and second, where the rate of measurements of the electric current (also called the sampling rate) is high. For these regimes, we show that it is possible to achieve information rates close to the noise-free capacity, using low-complexity encoding and decoding schemes. In particular, our decoder for the regime of high sampling rates makes use of a change-point detection procedure -- a subroutine of immediate relevance for practitioners.
- [520] arXiv:2501.03888 (replaced) [pdf, html, other]
-
Title: Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and Editable Policies
Comments: AAMAS 2025 (with Appendix)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Although deep reinforcement learning has been shown to be effective, the model's black-box nature presents barriers to direct policy interpretation. To address this problem, we propose a neuro-symbolic approach called neural DNF-MT for end-to-end policy learning. The differentiable nature of the neural DNF-MT model enables the use of deep actor-critic algorithms for training. At the same time, its architecture is designed so that trained models can be directly translated into interpretable policies expressed as standard (bivalent or probabilistic) logic programs. Moreover, additional layers can be included to extract abstract features from complex observations, acting as a form of predicate invention. The logic representations are highly interpretable, and we show how the bivalent representations of deterministic policies can be edited and incorporated back into a neural model, facilitating manual intervention and adaptation of learned policies. We evaluate our approach on a range of tasks requiring learning deterministic or stochastic behaviours from various forms of observations. Our empirical results show that our neural DNF-MT model performs at the level of competing black-box methods whilst providing interpretable policies.
- [521] arXiv:2501.05255 (replaced) [pdf, html, other]
-
Title: CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing
Authors: Yewei Song, Xunzhu Tang, Cedric Lothritz, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Andrey Boytsov, Ulrick Ble, Anne Goujon
Journal-ref: The 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
API-driven chatbot systems are increasingly integral to software engineering applications, yet their effectiveness hinges on accurately generating and executing API calls. This is particularly challenging in scenarios requiring multi-step interactions with complex parameterization and nested API dependencies. Addressing these challenges, this work contributes to the evaluation and assessment of AI-based software development through three key advancements: (1) the introduction of a novel dataset specifically designed for benchmarking API function selection, parameter generation, and nested API execution; (2) an empirical evaluation of state-of-the-art language models, analyzing their performance across varying task complexities in API function generation and parameter accuracy; and (3) a hybrid approach to API routing, combining general-purpose large language models for API selection with fine-tuned models and prompt engineering for parameter generation. These innovations significantly improve API execution in chatbot systems, offering practical methodologies for enhancing software design, testing, and operational workflows in real-world software engineering contexts.
- [522] arXiv:2501.06141 (replaced) [pdf, html, other]
-
Title: Emergent Symbol-like Number Variables in Artificial Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
What types of numeric representations emerge in neural systems? What would a satisfying answer to this question look like? In this work, we interpret Neural Network (NN) solutions to sequence-based counting tasks through a variety of lenses. We seek to understand how well we can understand NNs through the lens of interpretable Symbolic Algorithms (SAs), where SAs are defined by precise, abstract, mutable variables used to perform computations. We use GRUs, LSTMs, and Transformers trained using Next Token Prediction (NTP) on numeric tasks where the solutions to the tasks depend on numeric information only latent in the task structure. We show through multiple causal and theoretical methods that we can interpret NNs' raw activity through the lens of simplified SAs when we frame the neural activity in terms of interpretable subspaces rather than individual neurons. Depending on the analysis, however, these interpretations can be graded, existing on a continuum, highlighting the philosophical question of what it means to "interpret" neural activity, and motivating us to introduce Alignment Functions to add flexibility to the existing Distributed Alignment Search (DAS) method. Through our specific analyses we show the importance of causal interventions for NN interpretability; we show that recurrent models develop graded, symbol-like number variables within their neural activity; we introduce a generalization of DAS to frame NN activity in terms of linear functions of interpretable variables; and we show that Transformers must use anti-Markovian solutions -- solutions that avoid using cumulative, Markovian hidden states -- in the absence of sufficient attention layers. We use our results to encourage interpreting NNs at the level of neural subspaces through the lens of SAs.
- [523] arXiv:2501.06164 (replaced) [pdf, html, other]
-
Title: Model Alignment Search
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
When can we say that two neural systems are the same? The answer to this question is goal-dependent, and it is often addressed through correlative methods such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA). We find ourselves chiefly interested in the relationship between representations and behavior, asking ourselves how we can isolate specific functional aspects of representational similarity to relate our measures to behavior -- avoiding cause vs. correlation pitfalls in the process. In this work, we introduce Model Alignment Search (MAS), a method for causally exploring distributed representational similarity as it relates to behavior. The method learns invertible linear transformations that find an aligned subspace between two distributed networks' representations where functional information can be isolated and manipulated. We first show that the method can be used to transfer values of specific causal variables -- such as the number of items in a counting task -- between networks with different training seeds and different architectures. We then explore open questions in number cognition by comparing different types of numeric representations in models trained on structurally different tasks, we explore differences between MAS and preexisting functional similarity methods, and lastly, we introduce a counterfactual latent auxiliary loss that helps shape functionally relevant alignments even in cases where we do not have causal access to one of the two models for training.
- [524] arXiv:2501.07421 (replaced) [pdf, html, other]
-
Title: Empirical Comparison of Four Stereoscopic Depth Sensing Cameras for Robotics Applications
Journal-ref: IEEE Access 13 (2025) 67564-67577
Subjects: Robotics (cs.RO)
Depth sensing is an essential technology in robotics and many other fields. Many depth sensing (or RGB-D) cameras are available on the market and selecting the best one for your application can be challenging. In this work, we tested four stereoscopic RGB-D cameras that sense the distance by using two images from slightly different views. We empirically compared four cameras (Intel RealSense D435, Intel RealSense D455, StereoLabs ZED 2, and Luxonis OAK-D Pro) in three scenarios: (i) planar surface perception, (ii) plastic doll perception, (iii) household object perception (YCB dataset). We recorded and evaluated more than 3,000 RGB-D frames for each camera. For table-top robotics scenarios with distance to objects up to one meter, the best performance is provided by the D435 camera that is able to perceive with an error under 1 cm in all of the tested scenarios. For longer distances, the other three models perform better, making them more suitable for some mobile robotics applications. OAK-D Pro additionally offers integrated AI modules (e.g., object and human keypoint detection). ZED 2 is overall the best camera which is able to keep the error under 3 cm even at 4 meters. However, it is not a standalone device and requires a computer with a GPU for depth data acquisition. All data (more than 12,000 RGB-D frames) are publicly available at this https URL.
- [525] arXiv:2501.10100 (replaced) [pdf, other]
-
Title: Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
- [526] arXiv:2501.11695 (replaced) [pdf, html, other]
-
Title: Spatially-Delineated Domain-Adapted AI Classification: An Application for Oncology Data
Journal-ref: SIAM International Conference on Data Mining 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Given multi-type point maps from different place-types (e.g., tumor regions), our objective is to develop a classifier trained on the source place-type to accurately distinguish between two classes of the target place-type based on their point arrangements. This problem is societally important for many applications, such as generating clinical hypotheses for designing new immunotherapies for cancer treatment. The challenge lies in the spatial variability, the inherent heterogeneity and variation observed in spatial properties or arrangements across different locations (i.e., place-types). Previous techniques focus on self-supervised tasks to learn domain-invariant features and mitigate domain differences; however, they often neglect the underlying spatial arrangements among data points, leading to significant discrepancies across different place-types. We explore a novel multi-task self-learning framework that targets spatial arrangements, such as spatial mix-up masking and spatial contrastive predictive coding, for spatially-delineated domain-adapted AI classification. Experimental results on real-world datasets (e.g., oncology data) show that the proposed framework provides higher prediction accuracy than baseline methods.
- [527] arXiv:2501.12489 (replaced) [pdf, html, other]
-
Title: Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting
Josh Bruegger, Diana Ioana Catana, Vanja Macovaz, Matias Valdenegro-Toro, Matthia Sabatelli, Marco Zullich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The attribution of the author of an art piece is typically a laborious manual process, usually relying on subjective evaluations of expert figures. However, there are some situations in which quantitative features of the artwork can support these evaluations. The extraction of these features can sometimes be automated, for instance, with the use of Machine Learning (ML) techniques. An example of these features is represented by repeated, mechanically impressed patterns, called punches, present chiefly in 13th and 14th-century panel paintings from Tuscany. Previous research in art history showcased a strong connection between the shapes of punches and specific artists or workshops, suggesting the possibility of using these quantitative cues to support the attribution. In the present work, we first collect a dataset of large-scale images of these panel paintings. Then, using YOLOv10, a recent and popular object detection model, we train an ML pipeline to perform object detection on the punches contained in the images. Due to the large size of the images, the detection procedure is split across multiple frames by adopting a sliding-window approach with overlaps, after which the predictions are combined for the whole image using a custom non-maximal suppression routine. Our results indicate how art historians working in the field can reliably use our method for the identification and extraction of punches.
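The sliding-window-with-overlap and prediction-merging steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the detector itself is left out, the names `tile_origins`, `nms`, and `merge_tile_detections` are hypothetical, and a standard greedy IoU-based suppression stands in for the paper's custom routine, whose details the abstract does not give.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, score); this box layout is an assumption for the sketch.
Box = Tuple[float, float, float, float, float]


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of overlapping windows that cover [0, length)."""
    assert 0 <= overlap < tile, "overlap must be smaller than the tile size"
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last window sits flush with the border
    return origins


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_thr: float = 0.5) -> List[Box]:
    """Greedy suppression: keep the highest-scoring box of each overlapping group."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


def merge_tile_detections(
    per_tile: List[Tuple[Tuple[int, int], List[Box]]], iou_thr: float = 0.5
) -> List[Box]:
    """Shift tile-local boxes into full-image coordinates, then deduplicate."""
    global_boxes: List[Box] = []
    for (ox, oy), boxes in per_tile:
        for x1, y1, x2, y2, score in boxes:
            global_boxes.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
    return nms(global_boxes, iou_thr)
```

In this sketch, an object lying in the overlap between two windows is detected twice, and the merge step keeps only the higher-scoring copy once both boxes are expressed in full-image coordinates.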
- [528] arXiv:2501.14050 (replaced) [pdf, html, other]
-
Title: GraphRAG under Fire
Comments: 13 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
GraphRAG advances retrieval-augmented generation (RAG) by structuring external knowledge as multi-scale knowledge graphs, enabling language models to integrate both broad context and granular details in their generation. While GraphRAG has demonstrated success across domains, its security implications remain largely unexplored. To bridge this gap, this work examines GraphRAG's vulnerability to poisoning attacks, uncovering an intriguing security paradox: compared to conventional RAG, GraphRAG's graph-based indexing and retrieval enhance resilience against simple poisoning attacks; yet, the same features also create new attack surfaces. We present GRAGPoison, a novel attack that exploits shared relations in the underlying knowledge graph to craft poisoning text capable of compromising multiple queries simultaneously. GRAGPoison employs three key strategies: i) relation injection to introduce false knowledge, ii) relation enhancement to amplify poisoning influence, and iii) narrative generation to embed malicious content within coherent text. Empirical evaluation across diverse datasets and models shows that GRAGPoison substantially outperforms existing attacks in terms of effectiveness (up to 98% success rate) and scalability (using less than 68% poisoning text) on various GraphRAG-based systems. We also explore potential defensive measures and their limitations, identifying promising directions for future research.
- [529] arXiv:2501.14936 (replaced) [pdf, other]
-
Title: Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing
Comments: arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The integration of contextual embeddings into the optimization processes of large language models is an advancement in natural language processing. The Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient adjustment mechanism, incorporating contextual embeddings directly into the optimization process. This approach facilitates real-time parameter adjustments, enhancing task-specific generalization even in the presence of sparse or noisy data inputs. The mathematical foundation of this framework relies on gradient descent modifications, where contextual embeddings are derived from a supplementary neural network trained to map input features to optimal adaptation gradients. By employing differential geometry principles, high-dimensional input dependencies are encoded into low-dimensional gradient manifolds, enabling efficient adaptation without necessitating the retraining of the entire model. Empirical evaluations demonstrate that the proposed framework consistently outperforms baseline models across various metrics, including accuracy, robustness to noise, and computational efficiency. The integration of context-specific embeddings allows for a more complex understanding of language, thereby improving the model's ability to handle diverse linguistic phenomena. Furthermore, the computational efficiency achieved through this method demonstrates its scalability for large-scale language models operating under diverse constraints.
- [530] arXiv:2501.15857 (replaced) [pdf, other]
-
Title: Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
Comments: Accepted by ICLR 2025
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns $B = f(A)$ from one source and $C = g(B)$ from another, they can deduce $C = g(B) = g(f(A))$ even without encountering $ABC$ together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.
- [531] arXiv:2501.16205 (replaced) [pdf, html, other]
-
Title: EPOCH: Enabling Preemption Operation for Context Saving in Heterogeneous FPGA Systems
Comments: 13 Pages, 7 Figures, 3 Tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
FPGAs are increasingly used in multi-tenant cloud environments to offload compute-intensive tasks from the main CPU. The operating system (OS) plays a vital role in identifying tasks suitable for offloading and coordinating between the CPU and FPGA for seamless task execution. The OS leverages preemption to manage CPU efficiently and balance CPU time; however, preempting tasks running on FPGAs without context loss remains challenging. Despite growing reliance on FPGAs, vendors have yet to deliver a solution that fully preserves and restores task context.
This paper presents EPOCH, the first out-of-the-box framework to seamlessly preserve the state of tasks running on multi-tenant cloud FPGAs. EPOCH enables interrupting a tenant's execution at any arbitrary clock cycle, capturing its state, and saving this 'state snapshot' in off-chip memory with fine-grain granularity. Subsequently, when task resumption is required, EPOCH can resume execution from the saved 'state snapshot', eliminating the need to restart the task from scratch. EPOCH automates intricate processes, shields users from complexities, and synchronizes all underlying logic in a common clock domain, mitigating timing violations and ensuring seamless handling of interruptions.
EPOCH proficiently captures the state of fundamental FPGA elements, such as look-up tables, flip-flops, block RAMs, and digital signal processing units. On real hardware, ZynQ-XC7Z020 SoC, the proposed solution achieves context save and restore operations per frame in 62.2 µs and 67.4 µs, respectively.
- [532] arXiv:2501.16312 (replaced) [pdf, html, other]
-
Title: LinPrim: Linear Primitives for Differentiable Volumetric Rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Volumetric rendering has become central to modern novel view synthesis methods, which use differentiable rendering to optimize 3D scene representations directly from observed views. While many recent works build on NeRF or 3D Gaussians, we explore an alternative volumetric scene representation. More specifically, we introduce two new scene representations based on linear primitives - octahedra and tetrahedra - both of which define homogeneous volumes bounded by triangular faces. To optimize these primitives, we present a differentiable rasterizer that runs efficiently on GPUs, allowing end-to-end gradient-based optimization while maintaining real-time rendering capabilities. Through experiments on real-world datasets, we demonstrate comparable performance to state-of-the-art volumetric methods while requiring fewer primitives to achieve similar reconstruction fidelity. Our findings deepen the understanding of 3D representations by providing insights into the fidelity and performance characteristics of transparent polyhedra and suggest that adopting novel primitives can expand the available design space.
- [533] arXiv:2501.18374 (replaced) [pdf, html, other]
-
Title: Proofs for Folklore Theorems on the Radon-Nikodym Derivative
Comments: Submitted to the IEEE Information Theory Workshop 2025, 6 pages
Subjects: Information Theory (cs.IT); History and Overview (math.HO); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper, rigorous statements and formal proofs are presented for both foundational and advanced folklore theorems on the Radon-Nikodym derivative. The cases of conditional and marginal probability measures are carefully considered, which leads to an identity involving the sum of mutual and lautum information suggesting a new interpretation for such a sum.
- [534] arXiv:2502.00473 (replaced) [pdf, html, other]
-
Title: Weak-to-Strong Diffusion with Reflection
Comments: 23 pages, 23 figures, 15 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to an inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated difference between existing weak and strong models (i.e., weak-to-strong difference) to bridge the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with weak-to-strong difference, we show theoretically that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD achieves an HPSv2 winning rate of up to 90% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong differences further solidify its practical utility and deployability.
- [535] arXiv:2502.01015 (replaced) [pdf, html, other]
-
Title: Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach
Comments: 27 pages, 11 figures
Subjects: Machine Learning (cs.LG)
Task vectors, which are derived from the difference between pre-trained and fine-tuned model weights, enable flexible task adaptation and model merging through arithmetic operations such as addition and negation. However, existing approaches often rely on heuristics with limited theoretical support, frequently leading to performance gaps compared to direct task fine-tuning. Meanwhile, although it is easy to manipulate saved task vectors with arithmetic for different purposes, such compositional flexibility demands high memory usage, especially when dealing with a huge number of tasks, limiting scalability. This work addresses these issues with a theoretically grounded framework that explains task vector arithmetic and introduces the task vector bases framework. Building upon existing task arithmetic literature, our method significantly reduces the memory cost for downstream arithmetic with little effort, while achieving competitive performance and maintaining compositional advantage, providing a practical solution for large-scale task arithmetic. The code is available at this https URL.
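As a rough illustration of the task arithmetic this abstract builds on, the sketch below forms a task vector as fine-tuned weights minus pre-trained weights and then applies addition and negation. It does not implement the paper's task vector bases; the weight layout (a dict of flat float lists) and the names `task_vector` and `apply_task_vectors` are assumptions made for the example.

```python
from typing import Dict, List

# Toy weight format: parameter name -> flat list of floats (an assumption).
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Element-wise difference: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def apply_task_vectors(
    pretrained: Weights, vectors: List[Weights], coeffs: List[float]
) -> Weights:
    """Add scaled task vectors to the pre-trained weights.

    A coefficient of +1 adds a task; a coefficient of -1 negates it.
    """
    edited = {name: list(values) for name, values in pretrained.items()}
    for vector, coeff in zip(vectors, coeffs):
        for name, delta in vector.items():
            edited[name] = [w + coeff * d for w, d in zip(edited[name], delta)]
    return edited
```

Every task needs its own saved vector of the same size as the model, which is the memory cost the abstract refers to when the number of tasks grows.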
- [536] arXiv:2502.01673 (replaced) [pdf, html, other]
-
Title: Multilingual State Space Models for Structured Question Answering in Indic Languages
Comments: Accepted at NAACL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
- [537] arXiv:2502.02309 (replaced) [pdf, html, other]
-
Title: Review of Demographic Fairness in Face Recognition
Comments: under review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Demographic fairness in face recognition (FR) has emerged as a critical area of research, given its impact on fairness, equity, and reliability across diverse applications. As FR technologies are increasingly deployed globally, disparities in performance across demographic groups, such as race, ethnicity, and gender, have garnered significant attention. These biases not only compromise the credibility of FR systems but also raise ethical concerns, especially when these technologies are employed in sensitive domains. This review consolidates extensive research efforts providing a comprehensive overview of the multifaceted aspects of demographic fairness in FR.
We systematically examine the primary causes, datasets, assessment metrics, and mitigation approaches associated with demographic disparities in FR. By categorizing key contributions in these areas, this work provides a structured approach to understanding and addressing the complexity of this issue. Finally, we highlight current advancements and identify emerging challenges that need further investigation. This article aims to provide researchers with a unified perspective on the state-of-the-art while emphasizing the critical need for equitable and trustworthy FR systems.
- [538] arXiv:2502.05346 (replaced) [pdf, other]
-
Title: Probabilistic Subspace Manifolds for Contextual Inference in Large Language Models
Christopher Nightingale, Dominic Lavington, Jonathan Thistlethwaite, Sebastian Penhaligon, Thomas Belinski, David Boldo
Comments: arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship
Subjects: Computation and Language (cs.CL)
Representing token embeddings as probability distributions over learned manifolds allows for more flexible contextual inference, reducing representational rigidity while enhancing semantic granularity. Comparative evaluations demonstrate that probabilistic embeddings improve neighborhood consistency and decrease redundancy, ensuring that token relationships remain more structurally coherent across fine-tuning iterations. The integration of probabilistic subspaces within attention mechanisms facilitates more adaptive contextual weighting, enabling models to capture latent dependencies that would otherwise be obscured in conventional embeddings. Experimental results highlight increased robustness against adversarial modifications, with probabilistic embeddings preserving contextual integrity even under perturbation-based evaluation scenarios. Performance assessments indicate that probabilistic representations achieve greater adaptability in domain-specific applications, mitigating the need for extensive retraining when shifting across linguistic domains. Computational trade-offs remain within operationally feasible limits, with marginal increases in inference latency balanced against the benefits of enhanced representation stability and contextual expressiveness. The capacity to encode structured uncertainty provides advantages in generative modeling tasks, particularly where maintaining coherence across extended sequences requires a representation framework capable of handling ambiguous or context-dependent linguistic constructs.
- [539] arXiv:2502.05384 (replaced) [pdf, html, other]
-
Title: Demonstrating CavePI: Autonomous Exploration of Underwater Caves by Semantic Guidance
Comments: V4, 17 pages
Subjects: Robotics (cs.RO)
Enabling autonomous robots to safely and efficiently navigate, explore, and map underwater caves is of significant importance to water resource management, hydrogeology, archaeology, and marine robotics. In this work, we demonstrate the system design and algorithmic integration of a visual servoing framework for semantically guided autonomous underwater cave exploration. We present the hardware and edge-AI design considerations to deploy this framework on a novel AUV (Autonomous Underwater Vehicle) named CavePI. The guided navigation is driven by a computationally light yet robust deep visual perception module, delivering a rich semantic understanding of the environment. Subsequently, a robust control mechanism enables CavePI to track the semantic guides and navigate within complex cave structures. We evaluate the system through field experiments in natural underwater caves and spring-water sites and further validate its ROS (Robot Operating System)-based digital twin in a simulation environment. Our results highlight how these integrated design choices facilitate reliable navigation under feature-deprived, GPS-denied, and low-visibility conditions.
- [540] arXiv:2502.06425 (replaced) [pdf, html, other]
-
Title: Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs
Comments: Accepted to The ACM Web Conference (WWW) 2025 Short Paper Track
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Large language models (LLMs) are increasingly utilized in domains such as finance, healthcare, and interpersonal relationships to provide advice tailored to user traits and contexts. However, this personalization often relies on sensitive data, raising critical privacy concerns and necessitating data minimization. To address these challenges, we propose a framework that integrates zero-knowledge proof (ZKP) technology, specifically zkVM, with LLM-based chatbots. This integration enables privacy-preserving data sharing by verifying user traits without disclosing sensitive information. Our research introduces both an architecture and a prompting strategy for this approach. Through empirical evaluation, we clarify the current constraints and performance limitations of both zkVM and the proposed prompting strategy, thereby demonstrating their practical feasibility in real-world scenarios.
- [541] arXiv:2502.07983 (replaced) [pdf, html, other]
-
Title: Welzijn.AI: Developing Responsible Conversational AI for Elderly Care through Stakeholder Involvement
Subjects: Computers and Society (cs.CY)
We present Welzijn.AI as a new digital solution for monitoring (mental) well-being in elderly populations, and illustrate how development of systems like Welzijn.AI can align with guidelines on responsible AI development. Three evaluations with different stakeholders were designed to disclose new perspectives on the strengths, weaknesses, design characteristics, and value requirements of Welzijn.AI. Evaluations concerned expert panels and involved patient federations, general practitioners, researchers, and the elderly themselves. Panels concerned interviews, a co-creation session, and feedback on a proof-of-concept implementation. Interview results were summarized in terms of Welzijn.AI's strengths, weaknesses, opportunities and threats. The co-creation session ranked a variety of value requirements of Welzijn.AI with the Hundred Dollar Method. User evaluation comprised analysing proportions of (dis)agreement on statements targeting Welzijn.AI's design characteristics, and ranking desired social characteristics. Experts in the panel interviews acknowledged Welzijn.AI's potential to combat loneliness and extract patterns from elderly behaviour. The proof-of-concept evaluation complemented the design characteristics most appealing to the elderly to potentially achieve this: empathetic and varying interactions. Stakeholders also link the technology to the implementation context: it could help activate an individual's social network, but support should also be available to empower users. Yet, non-elderly and elderly experts also disclose challenges in properly understanding the application; non-elderly experts also highlight issues concerning privacy. In sum, incorporating all stakeholder perspectives in system development remains challenging. Still, our results benefit researchers, policy makers, and health professionals that aim to improve elderly care with technology.
- [542] arXiv:2502.08659 (replaced) [pdf, html, other]
-
Title: Deployment-friendly Lane-changing Intention Prediction Powered by Brain-inspired Spiking Neural Networks
Subjects: Robotics (cs.RO)
Accurate and real-time prediction of surrounding vehicles' lane-changing intentions is a critical challenge in deploying safe and efficient autonomous driving systems in open-world scenarios. Existing high-performing methods remain hard to deploy due to their high computational cost, long training times, and excessive memory requirements. Here, we propose an efficient lane-changing intention prediction approach based on brain-inspired Spiking Neural Networks (SNN). By leveraging the event-driven nature of SNN, the proposed approach enables us to encode the vehicle's states in a more efficient manner. Comparison experiments conducted on HighD and NGSIM datasets demonstrate that our method significantly improves training efficiency and reduces deployment costs while maintaining comparable prediction accuracy. Particularly, compared to the baseline, our approach reduces training time by 75% and memory usage by 99.9%. These results validate the efficiency and reliability of our method in lane-changing predictions, highlighting its potential for safe and efficient autonomous driving systems while offering significant advantages in deployment, including reduced training time, lower memory usage, and faster inference.
- [543] arXiv:2502.09395 (replaced) [pdf, html, other]
-
Title: Robot Pouring: Identifying Causes of Spillage and Selecting Alternative Action Parameters Using Probabilistic Actual Causation
Comments: 20 pages, 13 figures
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
In everyday life, we perform tasks (e.g., cooking or cleaning) that involve a large variety of objects and goals. When confronted with an unexpected or unwanted outcome, we take corrective actions and try again until achieving the desired result. The reasoning performed to identify a cause of the observed outcome and to select an appropriate corrective action is a crucial aspect of human reasoning for successful task execution. Central to this reasoning is the assumption that a factor is responsible for producing the observed outcome. In this paper, we investigate the use of probabilistic actual causation to determine whether a factor is the cause of an observed undesired outcome. Furthermore, we show how the actual causation probabilities can be used to find alternative actions to change the outcome. We apply the probabilistic actual causation analysis to a robot pouring task. When spillage occurs, the analysis indicates whether a task parameter is the cause and how it should be changed to avoid spillage. The analysis requires a causal graph of the task and the corresponding conditional probability distributions. To fulfill these requirements, we perform a complete causal modeling procedure (i.e., task analysis, definition of variables, determination of the causal graph structure, and estimation of conditional probability distributions) using data from a realistic simulation of the robot pouring task, covering a large combinatorial space of task parameters. Based on the results, we discuss the implications of the variables' representation and how the alternative actions suggested by the actual causation analysis would compare to the alternative solutions proposed by a human observer. The practical use of the analysis of probabilistic actual causation to select alternative action parameters is demonstrated.
- [544] arXiv:2502.09884 (replaced) [pdf, html, other]
-
Title: Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first non-asymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior works.
- [545] arXiv:2502.10527 (replaced) [pdf, other]
-
Title: Algorithms and Hardness for Estimating Statistical Similarity
Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, N. V. Vinodchandran
Comments: There is an error in the proof of Lemma 23, which invalidates Theorems 11 and 8. The rest of our results hold true
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)
We study the problem of computing statistical similarity between probability distributions. For distributions $P$ and $Q$ over a finite sample space, their statistical similarity is defined as $S_{\mathrm{stat}}(P, Q) := \sum_{x} \min(P(x), Q(x))$. Statistical similarity is a basic measure of similarity between distributions, with several natural interpretations, and captures the Bayes error in prediction and hypothesis testing problems. Recent work has established that, somewhat surprisingly, even for the simple class of product distributions, exactly computing statistical similarity is $\#\mathsf{P}$-hard. This motivates the question of designing approximation algorithms for statistical similarity. Our primary contribution is a Fully Polynomial-Time deterministic Approximation Scheme (FPTAS) for estimating statistical similarity between two product distributions. To obtain this result, we introduce a new variant of the Knapsack problem, which we call the Masked Knapsack problem, and design an FPTAS to estimate the number of solutions of a multidimensional version of this problem. This new technical contribution could be of independent interest. Furthermore, we also establish a complementary hardness result. We show that it is $\mathsf{NP}$-hard to estimate statistical similarity when $P$ and $Q$ are Bayes net distributions of in-degree $2$.
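As a small illustration of the definition above, the sketch below evaluates $S_{\mathrm{stat}}(P, Q) = \sum_{x} \min(P(x), Q(x))$ by brute force for two product distributions over $\{0,1\}^n$. It is not the FPTAS from the abstract: it enumerates all $2^n$ outcomes and is only usable for small $n$. The function names and the Bernoulli-bias representation are assumptions made for this example. It also checks the standard identity $S_{\mathrm{stat}}(P, Q) = 1 - d_{\mathrm{TV}}(P, Q)$.

```python
from itertools import product
from math import prod


def product_pmf(biases, x):
    """Probability of bit-string x when coordinate i is Bernoulli(biases[i])."""
    return prod(p if bit else 1.0 - p for p, bit in zip(biases, x))


def statistical_similarity(p_biases, q_biases):
    """Brute-force S_stat(P, Q) = sum_x min(P(x), Q(x)) over {0,1}^n.

    Exponential in n; an illustrative baseline, not an approximation scheme.
    """
    n = len(p_biases)
    return sum(
        min(product_pmf(p_biases, x), product_pmf(q_biases, x))
        for x in product((0, 1), repeat=n)
    )


def total_variation(p_biases, q_biases):
    """Brute-force total variation distance between the same two distributions."""
    n = len(p_biases)
    return 0.5 * sum(
        abs(product_pmf(p_biases, x) - product_pmf(q_biases, x))
        for x in product((0, 1), repeat=n)
    )


if __name__ == "__main__":
    P = [0.2, 0.7, 0.5]  # hypothetical coordinate biases
    Q = [0.6, 0.4, 0.5]
    s = statistical_similarity(P, Q)
    tv = total_variation(P, Q)
    print(f"S_stat = {s:.6f}, 1 - TV = {1.0 - tv:.6f}")
```

The exponential enumeration is exactly the cost that an approximation scheme for product distributions aims to avoid.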
- [546] arXiv:2502.11569 (replaced) [pdf, other]
-
Title: Towards Reasoning Ability of Small Language Models
Comments: # fixed some typos, added public slm reasoning leaderboard
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale ($\sim$100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability. However, there is a lack of systematic study on the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: Can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies in small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression. They can serve as efficient alternatives to LLMs for reasoning-intensive tasks.
- [547] arXiv:2502.11658 (replaced) [pdf, html, other]
-
Title: "I'm not for sale" -- Perceptions and limited awareness of privacy risks by digital natives about location data
Comments: Accepted for publication at ICWSM2025
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Although mobile devices benefit users in their daily lives in numerous ways, they also raise several privacy concerns. For instance, they can reveal sensitive information that can be inferred from location data. This location data is shared through service providers as well as mobile applications. Understanding how and with whom users share their location data, as well as how users perceive the underlying privacy risks, is important for designing usable privacy-enhancing technologies. In this work, we perform a quantitative and qualitative analysis of smartphone users' awareness, perception and self-reported behavior towards location data-sharing through a survey of n=99 young adult participants (i.e., digital natives). We compare stated practices with actual behaviors to better understand their mental models, and survey participants' understanding of privacy risks before and after the inspection of location traces and the information that can be inferred therefrom.
Our empirical results show that participants have risky privacy practices: about 54% of participants underestimate the number of mobile applications to which they have granted access to their data, and 33% forget or do not think of revoking access to their data. Also, by using a demonstrator to perform inferences from location data, we observe that slightly more than half of participants (57%) are surprised by the extent of potentially inferred information, and that 47% intend to reduce access to their data via permissions as a result of using the demonstrator. Last, a majority of participants have little knowledge of the tools to better protect themselves, but are nonetheless willing to follow suggestions to improve privacy (51%). Educating people, including digital natives, about privacy risks through transparency tools seems a promising approach.
- [548] arXiv:2502.12117 (replaced) [pdf, other]
-
Title: The Role of Prescreening in Auctions with Predictions
Subjects: Computer Science and Game Theory (cs.GT)
Auctioneers often use closed auctions to create scarcity and prestige, aiming to intensify competition among a select group of high-status bidders. Advances in machine learning and AI make this strategy increasingly viable, enabling cost-effective identification of capable participants. In this paper, we develop a theoretical model to assess whether such practice can be justified from an economic perspective. We consider a setting in which bidders have i.i.d. private valuations, and the auction designer observes a noisy predictor of each bidder's valuation, which is assumed to be fully informative with some probability. Based on this noisy predictor, the designer determines how many bidders to admit -- a process we refer to as prescreening. We show that an auction with prescreening is equivalent to a standard auction (i.e., without prescreening) in which bidder valuations are correlated. Notably, the standard notion of affiliation commonly assumed in the auction literature does not generally hold in this equivalent formulation. We characterize conditions for the existence of symmetric and strictly monotone equilibrium strategies across three classical auction formats: all-pay, first-price, and second-price auctions. Our results demonstrate that prescreening with noisy predictors can significantly enhance revenue in all-pay auctions; in fact, with a perfect predictor, admitting only two bidders is optimal. By contrast, in both first-price and second-price auctions, admitting all bidders remains revenue-maximizing.
- [549] arXiv:2502.13881 (replaced) [pdf, html, other]
-
Title: PSCon: Product Search Through Conversations
Jie Zou, Mohammad Aliannejadi, Evangelos Kanoulas, Shuxi Han, Heli Ma, Zheng Wang, Yang Yang, Heng Tao Shen
Comments: 11 pages. Accepted by SIGIR 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Conversational Product Search (CPS) systems interact with users via natural language to offer personalized and context-aware product lists. However, most existing research on CPS is limited to simulated conversations, due to the lack of a real CPS dataset driven by human-like language. Moreover, existing conversational datasets for e-commerce are constructed for a particular market or a particular language and thus cannot support cross-market and multilingual usage. In this paper, we propose a CPS data collection protocol and create a new CPS dataset, called PSCon, which assists product search through conversations with human-like language. The dataset is collected by a coached human-human data collection protocol and is available for dual markets and two languages. By formulating the task of CPS, the dataset allows for comprehensive and in-depth research on six subtasks: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. Moreover, we present a concise analysis of the dataset and propose a benchmark model on the proposed CPS dataset. Our proposed dataset and model will be helpful for facilitating future research on CPS.
- [550] arXiv:2502.15192 (replaced) [pdf, html, other]
-
Title: SPAARC: Spatial Proximity and Association based prefetching for Augmented Reality in edge Cache
Subjects: Emerging Technologies (cs.ET); Distributed, Parallel, and Cluster Computing (cs.DC)
Mobile Augmented Reality (MAR) applications face performance challenges due to their high computational demands and need for low-latency responses. Traditional approaches like on-device storage or reactive data fetching from the cloud often result in limited AR experiences or unacceptable lag. Edge caching, which caches AR objects closer to the user, provides a promising solution. However, existing edge caching approaches do not consider AR-specific features such as AR object sizes, user interactions, and physical location. This paper investigates how to further optimize edge caching by employing AR-aware prefetching techniques. We present SPAARC, a Spatial Proximity and Association-based Prefetching policy specifically designed for MAR Caches. SPAARC intelligently prioritizes the caching of virtual objects based on their association with other similar objects and the user's proximity to them. It also considers the recency of associations and uses a lazy fetching strategy to efficiently manage edge resources and maximize Quality of Experience (QoE).
Through extensive evaluation using both synthetic and real-world workloads, we demonstrate that SPAARC significantly improves cache hit rates compared to standard caching algorithms, achieving gains ranging from 3% to 40% while reducing the need for on-demand data retrieval from the cloud. Further, we present an adaptive tuning algorithm that automatically tunes SPAARC parameters to achieve optimal performance. Our findings demonstrate the potential of SPAARC to substantially enhance the user experience in MAR applications by ensuring the timely availability of virtual objects.
- [551] arXiv:2502.17060 (replaced) [pdf, html, other]
-
Title: Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The massive increase in data volume and dataset availability compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting the highest-quality data for analysis greatly increases task accuracy and efficiency, it is still a hard task, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Dataset similarity is assessed by projecting each dataset to a vector embedding representation. The vectorization process is performed using our proposed deep learning model NumTabData2Vec, which takes a whole dataset and projects it into a lower-dimensional vector embedding representation space. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios to a lower-dimensional vector embedding representation and distinguish between them.
- [552] arXiv:2502.17086 (replaced) [pdf, html, other]
-
Title: Automatically Evaluating the Paper Reviewing Capability of Large Language Models
Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, Juho Kim
Subjects: Computation and Language (cs.CL)
Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While the insights are valuable, conducting the analysis is challenging due to the considerable time and effort required, especially given the rapid pace of LLM developments. To address the challenge, we developed an automatic evaluation pipeline to assess the LLMs' paper review capability by comparing them with expert-generated reviews. By constructing a dataset consisting of 676 OpenReview papers, we examined the agreement between LLMs and experts in their strength and weakness identifications. The results showed that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables a scalable evaluation of LLMs' paper review capability over time.
- [553] arXiv:2502.17196 (replaced) [pdf, html, other]
-
Title: Disentangling Visual Transformers: Patch-level Interpretability for Image Classification
Comments: CVPR 2025 official version. Main manuscript + supplementary
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose Hindered Transformer (HiT), a novel interpretable-by-design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.
- [554] arXiv:2503.00234 (replaced) [pdf, html, other]
-
Title: Investigating the Relationship Between Debiasing and Artifact Removal using Saliency Maps
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The widespread adoption of machine learning systems has raised critical concerns about fairness and bias, making mitigating harmful biases essential for AI development. In this paper, we investigate the relationship between debiasing and removing artifacts in neural networks for computer vision tasks. First, we introduce a set of novel XAI-based metrics that analyze saliency maps to assess shifts in a model's decision-making process. Then, we demonstrate that successful debiasing methods systematically redirect model focus away from protected attributes. Finally, we show that techniques originally developed for artifact removal can be effectively repurposed for improving fairness. These findings provide evidence for the existence of a bidirectional connection between ensuring fairness and removing artifacts corresponding to protected attributes.
- [555] arXiv:2503.00698 (replaced) [pdf, html, other]
-
Title: Deep Univariate Polynomial and Conformal Approximation
Subjects: Numerical Analysis (math.NA)
A deep approximation is an approximating function defined by composing more than one layer of simple functions. We study deep approximations of functions of one variable using layers consisting of low-degree polynomials or simple conformal transformations. We show that deep approximations to $|x|$ on $[-1,1]$ achieve exponential convergence with respect to the degrees of freedom. Computational experiments suggest that a composite of two and three polynomial layers can give more accurate approximations than a single polynomial with the same number of coefficients. We also study the related problem of reducing the Runge phenomenon by composing polynomials with conformal transformations.
- [556] arXiv:2503.03877 (replaced) [pdf, html, other]
-
Title: CRAFT: Characterizing and Root-Causing Fault Injection Threats at Pre-Silicon
Comments: 6 Pages, 8 Figures, 2 Tables
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Fault injection attacks represent a class of threats that can compromise embedded systems across multiple layers of abstraction, such as system software, instruction set architecture (ISA), microarchitecture, and physical implementation. Early detection of these vulnerabilities and understanding their root causes, along with their propagation from the physical layer to the system software, is critical to secure the cyberinfrastructure. This work presents a comprehensive methodology for conducting controlled fault injection attacks at the pre-silicon level and an analysis of the underlying system for root-causing behavior. As the driving application, we use clock glitch attacks that induce critical misclassification in AI/ML applications. Our study aims to characterize and diagnose the impact of faults within the RISC-V instruction set and pipeline stages, while tracing fault propagation from the circuit level to the AI/ML application software. This analysis resulted in the discovery of two new vulnerabilities through controlled clock glitch parameters. First, we reveal a novel method for causing instruction skips, thereby preventing the loading of critical values from memory. This can cause disruption and affect program continuity and correctness. Second, we demonstrate an attack that converts legal instructions into illegal ones, thereby diverting control flow in a manner exploitable by attackers. Our work underscores the complexity of fault injection attack exploits and emphasizes the importance of preemptive security analysis.
- [557] arXiv:2503.07269 (replaced) [pdf, html, other]
-
Title: SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Seid Muhie Yimam, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine De Kock, Tadesse Destaw Belay, Ibrahim Said Ahmad, Nirmal Surange, Daniela Teodorescu, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino Ali, Vladimir Araujo, Abinew Ali Ayele, Oana Ignat, Alexander Panchenko, Yi Zhou, Saif M. Mohammad
Comments: SemEval2025 Task11 (Task Description Paper). arXiv admin note: text overlap with arXiv:2502.11926
Subjects: Computation and Language (cs.CL)
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and are spoken across various continents. The data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection.
The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, along with findings on the best-performing systems, the most common approaches, and the most effective methods across different tracks and languages. The datasets for this task are publicly available at SemEval2025 Task 11 this https URL
- [558] arXiv:2503.08585 (replaced) [pdf, html, other]
-
Title: HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Comments: Accepted in CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks, which enable our proposed Hierarchical Querying transformer (HierarQ) to effectively capture short and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.
- [559] arXiv:2503.09829 (replaced) [pdf, html, other]
-
Title: SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey
Joohwan Seo, Soochul Yoo, Junwoo Chang, Hyunseok An, Hyunwoo Ryu, Soomi Lee, Arvind Kruthiventy, Jongeun Choi, Roberto Horowitz
Comments: Accepted to International Journal of Control, Automation and Systems (IJCAS)
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or extensive data augmentation. Equivariant neural networks overcome these limitations by explicitly integrating symmetry and invariance into their architectures, leading to improved efficiency and generalization. This tutorial survey reviews a wide range of equivariant deep learning and control methods for robotics, from classic to state-of-the-art, with a focus on SE(3)-equivariant models that leverage the natural 3D rotational and translational symmetries in visual robotic manipulation and control design. Using unified mathematical notation, we begin by reviewing key concepts from group theory, along with matrix Lie groups and Lie algebras. We then introduce foundational group-equivariant neural network design and show how the group-equivariance can be obtained through their structure. Next, we discuss the applications of SE(3)-equivariant neural networks in robotics in terms of imitation learning and reinforcement learning. The SE(3)-equivariant control design is also reviewed from the perspective of geometric control. Finally, we highlight the challenges and future directions of equivariant methods in developing more robust, sample-efficient, and multi-modal real-world robotic systems.
- [560] arXiv:2503.10742 (replaced) [pdf, html, other]
-
Title: Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capacities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining the performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
- [561] arXiv:2503.10894 (replaced) [pdf, html, other]
-
Title: HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
Comments: ICLR 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Mechanistic interpretability has made great strides in identifying neural network features (e.g., directions in hidden activation space) that mediate concepts (e.g., the birth year of a person) and enable predictable manipulation. Distributed alignment search (DAS) leverages supervision from counterfactual data to learn concept features within hidden states, but DAS assumes we can afford to conduct a brute force search over potential feature locations. To address this, we present HyperDAS, a transformer-based hypernetwork architecture that (1) automatically locates the token-positions of the residual stream that a concept is realized in and (2) constructs features of those residual stream vectors for the concept. In experiments with Llama3-8B, HyperDAS achieves state-of-the-art performance on the RAVEL benchmark for disentangling concepts in hidden states. In addition, we review the design decisions we made to mitigate the concern that HyperDAS (like all powerful interpretability methods) might inject new information into the target model rather than faithfully interpreting it.
- [562] arXiv:2503.12117 (replaced) [pdf, html, other]
-
Title: The Resonance Bias Framework: Resonance, Structure, and Arithmetic in Quadrature Error
Comments: 21 pages, 6 figures
Subjects: Numerical Analysis (math.NA); Number Theory (math.NT); Spectral Theory (math.SP)
We study the trapezoidal rule for periodic functions on uniform grids and show that the quadrature error exhibits a rich deterministic structure, beyond traditional asymptotic or statistical interpretations. Focusing on the prototype function $f(x) = \sin^2(2\pi k x)$, we derive an analytical expression for the error governed by a resonance function $\chi_P(y)$, closely related to the Dirichlet kernel, roots of unity, and discrete Fourier analysis on the group $\mathbb{Z}/P\mathbb{Z}$. This function acts as a spectral filter, connecting the integration error to arithmetic properties such as $k/P$ and geometric phase cancellation, visualized as vector averaging on the unit circle. We introduce the Resonance Bias Framework (RBF), a generalization to arbitrary smooth periodic functions, leading to the error representation $B_P[f] = \sum_{k \neq 0} c_k \chi_P(k/P)$. Although this is mathematically equivalent to the classical aliasing sum, it reveals a deeper mechanism: the quadrature error arises from structured resonance rather than random aliasing noise. The RBF thus provides an interpretable framework for understanding integration errors at finite resolution, grounded in number theory and geometry.
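The arithmetic dependence of the error on $k$ and $P$ can be checked numerically. The sketch below applies the $P$-point trapezoidal rule on $[0, 1)$ to $\sin^2(2\pi k x)$, whose exact integral is $1/2$ for integer $k \neq 0$. It does not implement the resonance function from the abstract; the function names and the sign convention (rule minus exact integral) are assumptions for this example. For this integrand the rule is exact unless $P$ divides $2k$, in which case every grid point lands on a zero of the sine and the error is $-1/2$.

```python
import math


def trapezoid_periodic(f, P):
    """P-point trapezoidal rule on [0, 1) for a 1-periodic function f."""
    return sum(f(j / P) for j in range(P)) / P


def quadrature_error(k, P):
    """Trapezoidal estimate minus the exact integral 1/2 of sin^2(2*pi*k*x)."""
    def f(x):
        return math.sin(2.0 * math.pi * k * x) ** 2

    return trapezoid_periodic(f, P) - 0.5


if __name__ == "__main__":
    k = 3  # hypothetical frequency chosen for illustration
    for P in range(1, 11):
        resonant = (2 * k) % P == 0
        print(f"P={P:2d}  error={quadrature_error(k, P):+.3e}  P divides 2k: {resonant}")
```

Running the loop shows the error switching between approximately $0$ and $-1/2$ purely according to whether $P$ divides $2k$, with no gradual decay in between.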
- [563] arXiv:2503.15840 (replaced) [pdf, html, other]
-
Title: Automatic Generation of Safety-compliant Linear Temporal Logic via Large Language Model: A Self-supervised Framework
Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)
Converting high-level tasks described by natural language into formal specifications like Linear Temporal Logic (LTL) is a key step towards providing formal safety guarantees over cyber-physical systems (CPS). While the compliance of the formal specifications themselves against the safety restrictions imposed on CPS is crucial for ensuring safety, most existing works only focus on translation consistency between natural languages and formal specifications. In this paper, we introduce AutoSafeLTL, a self-supervised framework that utilizes large language models (LLMs) to automate the generation of LTL specifications complying with a set of safety restrictions while preserving their logical consistency and semantic accuracy. As a key insight, our framework integrates Language Inclusion check with an automated counterexample-guided modification mechanism to ensure the safety-compliance of the resulting LTL specifications. In particular, we develop 1) an LLM-as-an-Aligner, which performs atomic proposition matching between generated LTL specifications and safety restrictions to enforce semantic alignment; and 2) an LLM-as-a-Critic, which automates LTL specification refinement by interpreting counterexamples derived from Language Inclusion checks. Experimental results demonstrate that our architecture effectively guarantees safety-compliance for the generated LTL specifications, achieving a 0% violation rate against imposed safety restrictions. This shows the potential of our work in synergizing AI and formal verification techniques, enhancing safety-aware specification generation and automatic verification for both AI and critical CPS applications.
- [564] arXiv:2503.19268 (replaced) [pdf, html, other]
-
Title: Privately Evaluating Untrusted Black-Box Functions
Subjects: Data Structures and Algorithms (cs.DS)
We provide tools for sharing sensitive data when the data curator does not know in advance what questions an (untrusted) analyst might ask about the data. The analyst can specify a program that they want the curator to run on the dataset. We model the program as a black-box function $f$. We study differentially private algorithms, called privacy wrappers, that, given black-box access to a real-valued function $f$ and a sensitive dataset $x$, output an accurate approximation to $f(x)$. The dataset $x$ is modeled as a finite subset of a possibly infinite set $U$, in which each entry represents data of one individual. A privacy wrapper calls $f$ on the dataset $x$ and on some subsets of $x$ and returns either an approximation to $f(x)$ or a nonresponse symbol $\perp$. The wrapper may also use additional information (that is, parameters) provided by the analyst, but differential privacy is required for all values of these parameters. Correct setting of these parameters will ensure better accuracy of the wrapper. The bottleneck in the running time of our wrappers is the number of calls to $f$, which we refer to as queries. Our goal is to design wrappers with high accuracy and low query complexity. We introduce a novel setting, the automated sensitivity detection setting, where the analyst supplies the black-box function $f$ and the intended (finite) range of $f$. In the previously considered setting, the claimed sensitivity bound setting, the analyst supplies additional parameters that describe the sensitivity of $f$. We design privacy wrappers for both settings and show that our wrappers are nearly optimal in terms of accuracy, locality (i.e., the depth of the local neighborhood of the dataset $x$ they explore), and query complexity. In the claimed sensitivity bound setting, we provide the first accuracy guarantees that have no dependence on the size of the universe $U$.
- [565] arXiv:2503.20322 (replaced) [pdf, html, other]
-
Title: Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Authors: Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit their real-world application. To address this issue, recent efforts aim to compress the visual features to reduce the computational costs of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially for difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates an MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) module that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while also achieving a +0.74% performance gain. In addition, the generalization ability of DPN is validated on the existing high-resolution MLLM called LLaVA-HR. The source code will be released at this https URL.
- [566] arXiv:2503.21073 (replaced) [pdf, html, other]
-
Title: Shared Global and Local Geometry of Language Model Embeddings
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Researchers have recently suggested that models share common representations. In our work, we find that token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we introduce Emb2Emb, a simple method to transfer steering vectors from one language model to another, despite the two models having different dimensions.
- [567] arXiv:2503.21495 (replaced) [pdf, html, other]
-
Title: Adaptive Resampling with Bootstrap for Noisy Multi-Objective Optimization Problems
Comments: 14 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The challenge of noisy multi-objective optimization lies in the constant trade-off between exploring new decision points and improving the precision of known points through resampling. This decision should take into account both the variability of the objective functions and the current estimate of a point in relation to the Pareto front. Since the amount and distribution of noise are generally unknown, it is desirable for a decision function to be highly adaptive to the properties of the optimization problem. This paper presents a resampling decision function that incorporates the stochastic nature of the optimization problem by using bootstrapping and the probability of dominance. The distribution-free estimation of the probability of dominance is achieved using bootstrap estimates of the means. To make the procedure applicable even with very few observations, we transfer the distribution observed at other decision points. The efficiency of this resampling approach is demonstrated by applying it in the NSGA-II algorithm with a sequential resampling procedure under multiple noise variations.
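A distribution-free estimate of the probability of dominance can be sketched with a plain bootstrap over the observed objective vectors. This is a minimal illustration under stated assumptions, not the paper's procedure: all function names are hypothetical, objectives are assumed to be minimized, and the transfer of distributions between decision points described in the abstract is not modeled.

```python
import random

def bootstrap_means(samples, rng):
    """Resample observations with replacement; return per-objective means."""
    n = len(samples)
    draw = [samples[rng.randrange(n)] for _ in range(n)]
    n_obj = len(samples[0])
    return [sum(obs[i] for obs in draw) / n for i in range(n_obj)]

def dominates(a, b):
    """Pareto dominance for minimization: no worse everywhere, better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better

def probability_of_dominance(samples_a, samples_b, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates in which the mean of A dominates the mean of B."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        mean_a = bootstrap_means(samples_a, rng)
        mean_b = bootstrap_means(samples_b, rng)
        if dominates(mean_a, mean_b):
            hits += 1
    return hits / n_boot

if __name__ == "__main__":
    # Hypothetical noisy evaluations of two decision points (two objectives each).
    point_a = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1)]
    point_b = [(1.1, 2.2), (0.8, 2.4), (1.3, 1.8)]
    print(probability_of_dominance(point_a, point_b))
```

A resampling rule could then spend further evaluations on pairs whose estimated probability is far from both 0 and 1, since those are the comparisons the current data cannot settle.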
- [568] arXiv:2503.22093 (replaced) [pdf, html, other]
-
Title: How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Comments: 4 pages, accepted by ToM@AAAI25
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and mental states, remains underexplored. We propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset of 30 images and evaluated the performance of four VLMs of varying sizes. Our results show that the GPT-4 model outperformed all the others, with only one smaller model, GPT-4o-mini, achieving comparable performance. We observed that VLMs often struggle to infer intentions in complex scenarios such as bullying or cheating. Our findings reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues. The dataset is available at this https URL.
- [569] arXiv:2504.01422 (replaced) [pdf, html, other]
-
Title: Optimization of BLE Broadcast Mode in Offline Finding Network
Subjects: Networking and Internet Architecture (cs.NI)
In the Offline Finding Network (OFN), offline Bluetooth tags broadcast to the surrounding area, and finder devices that receive the broadcast signal upload location information to Internet of Things (IoT) cloud servers, thereby enabling offline finding of lost items. This process is essentially a Bluetooth Low Energy (BLE) neighbor discovery process (NDP). In this process, the variety of Bluetooth scan modes, which result from different scan interval and scan window settings, affects the latency with which finder devices discover the tag's broadcast packets. To optimize the experience of searching for lost devices, we propose the CPBIS-mechanism, a certain proportion broadcast-intervals screening mechanism that calculates the two most suitable broadcast intervals and their proportion for offline tags. This reduces discovery latency in the BLE NDP, improves the discovery success rate, and further enhances the user experience. To our knowledge, we are the first to propose a comprehensive solution for configuring the broadcast interval parameters of advertisers in BLE NDP, particularly for configurations involving two or more broadcast intervals. We evaluated the results obtained by CPBIS on the nRF52832 chip. The data show that the CPBIS-mechanism achieves relatively low discovery latencies for multiple scan modes.
- [570] arXiv:2504.01482 (replaced) [pdf, html, other]
-
Title: A Robust Model-Based Approach for Continuous-Time Policy Evaluation with Unknown Lévy Process Dynamics
Comments: 28 pages, 9 figures
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
This paper develops a model-based framework for continuous-time policy evaluation (CTPE) in reinforcement learning, incorporating both Brownian and Lévy noise to model stochastic dynamics influenced by rare and extreme events. Our approach formulates the policy evaluation problem as solving a partial integro-differential equation (PIDE) for the value function with unknown coefficients. A key challenge in this setting is accurately recovering the unknown coefficients in the stochastic dynamics, particularly when driven by Lévy processes with heavy tail effects. To address this, we propose a robust numerical approach that effectively handles both unbiased and censored trajectory datasets. This method combines maximum likelihood estimation with an iterative tail correction mechanism, improving the stability and accuracy of coefficient recovery. Additionally, we establish a theoretical bound for the policy evaluation error based on coefficient recovery error. Through numerical experiments, we demonstrate the effectiveness and robustness of our method in recovering heavy-tailed Lévy dynamics and verify the theoretical error analysis in policy evaluation.
- [571] arXiv:2504.02269 (replaced) [pdf, html, other]
-
Title: Engineering Artificial Intelligence: Framework, Challenges, and Future Direction
Comments: The paper submitted to the Journal Machine Learning: Engineering has been accepted
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Over the past ten years, the application of artificial intelligence (AI) and machine learning (ML) in engineering domains has gained significant popularity, showcasing their potential in data-driven contexts. However, the complexity and diversity of engineering problems often require the development of domain-specific AI approaches, which are frequently hindered by a lack of systematic methodologies, scalability, and robustness during the development process. To address this gap, this paper introduces the "ABCDE" as the key elements of Engineering AI and proposes a unified, systematic engineering AI ecosystem framework, including eight essential layers, along with attributes, goals, and applications, to guide the development and deployment of AI solutions for specific engineering needs. Additionally, key challenges are examined, and eight future research directions are highlighted. By providing a comprehensive perspective, this paper aims to advance the strategic implementation of AI, fostering the development of next-generation engineering AI solutions.
- [572] arXiv:2504.02441 (replaced) [pdf, html, other]
-
Title: Cognitive Memory in Large Language Models
Comments: 37 pages, 9 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.
- [573] arXiv:2504.03515 (replaced) [pdf, html, other]
-
Title: Dexterous Manipulation through Imitation Learning: A Survey
Authors: Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang, Zeng-Guang Hou, Hong Zhang
Comments: 22 pages, 5 figures
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning, details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.
- [574] arXiv:2504.04318 (replaced) [pdf, html, other]
-
Title: Variational Self-Supervised Learning
Comments: NeurIPS 2025 - SSL Workshop Submission
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs that rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of Gaussian KL divergence. We further introduce cosine-based formulations of KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.
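The analytical tractability mentioned above refers to the closed-form KL divergence between Gaussians. The sketch below shows that standard formula for diagonal Gaussians; treating the student output as q and the teacher output as p is an assumption made here for illustration, the function name is hypothetical, and the cosine-based variants from the abstract are not reproduced.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions.

    Per dimension: log(sigma_p / sigma_q)
                   + (sigma_q^2 + (mu_q - mu_p)^2) / (2 * sigma_p^2) - 1/2
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

if __name__ == "__main__":
    # Hypothetical student posterior (q) and teacher-defined prior (p).
    student_mu, student_sigma = [0.1, -0.2], [1.0, 0.8]
    teacher_mu, teacher_sigma = [0.0, 0.0], [1.0, 1.0]
    print(gaussian_kl(student_mu, student_sigma, teacher_mu, teacher_sigma))
```

Because the expression is available in closed form, no sampling is needed to evaluate or differentiate this term.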
- [575] arXiv:2504.05058 (replaced) [pdf, html, other]
-
Title: Not All Data Are Unlearned Equally
Subjects: Computation and Language (cs.CL)
Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this "all data is equal" assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability-based and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.
- [576] arXiv:2504.06314 (replaced) [pdf, other]
-
Title: Beyond authorship: Analyzing contributions in PLOS ONE and the challenges of appropriate attribution
Journal-ref: Abdelghani Maddi, Jaime A. Teixeira da Silva. Beyond authorship: Analyzing contributions in PLOS ONE and the challenges of appropriate attribution[J]. Journal of Data and Information Science, 2024
Subjects: Digital Libraries (cs.DL); Computers and Society (cs.CY)
This study aims to evaluate the accuracy of authorship attributions in scientific publications, focusing on the fairness and precision of individual contributions within academic works. The study analyzes 81,823 publications from the journal PLOS ONE, covering the period from January 2018 to June 2023. It examines the authorship attributions within these publications to try and determine the prevalence of inappropriate authorship. It also investigates the demographic and professional profiles of affected authors, exploring trends and potential factors contributing to inaccuracies in authorship. Surprisingly, 9.14% of articles feature at least one author with inappropriate authorship, affecting over 14,000 individuals (2.56% of the sample). Inappropriate authorship is more concentrated in Asia, Africa, and specific European countries like Italy. Established researchers with significant publication records and those affiliated with companies or nonprofits show higher instances of potential monetary authorship. Our findings are based on contributions as declared by the authors, which implies a degree of trust in their transparency. However, this reliance on self-reporting may introduce biases or inaccuracies into the dataset. Further research could employ additional verification methods to enhance the reliability of the findings. These findings have significant implications for journal publishers, highlighting the necessity for robust control mechanisms to ensure the integrity of authorship attributions. Moreover, researchers must exercise discernment in determining when to acknowledge a contributor and when to include them in the author list. Addressing these issues is crucial for maintaining the credibility and fairness of academic publications.
- [577] arXiv:2504.06398 (replaced) [pdf, html, other]
-
Title: Sharpness-Aware Parameter Selection for Machine Unlearning
Subjects: Machine Learning (cs.LG)
It often happens that sensitive personal information, such as credit card numbers or passwords, is mistakenly incorporated in the training of machine learning models and needs to be removed afterwards. Removing such information from a trained model is a complex task that requires partially reversing the training process. Various machine unlearning techniques have been proposed in the literature to address this problem. Most of the proposed methods revolve around removing individual data samples from a trained model. Another, less explored direction is when the features or labels of a group of data samples need to be reverted. While the existing methods for these tasks perform unlearning by updating the whole set of model parameters or only the last layer of the model, we show that there is a subset of model parameters that contributes the most to unlearning the target features. More precisely, the model parameters with the largest corresponding diagonal values in the Hessian matrix (computed at the learned model parameters) contribute the most to the unlearning task. By selecting these parameters and updating them during the unlearning stage, we can make the most progress in unlearning. We provide theoretical justifications for the proposed strategy by connecting it to sharpness-aware minimization and robust unlearning. We empirically show the effectiveness of the proposed strategy in improving the efficacy of unlearning at a low computational cost.
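The selection rule described above (pick the parameters with the largest Hessian diagonal entries) can be sketched on a toy problem. Everything in this example is an assumption for illustration: the helper names are hypothetical, the diagonal is estimated by central finite differences rather than by whatever estimator the paper uses, and the quadratic loss merely stands in for a trained model's loss.

```python
def hessian_diagonal(loss, params, h=1e-3):
    """Estimate the diagonal of the Hessian of `loss` at `params`
    with central second differences."""
    base = loss(params)
    diag = []
    for i in range(len(params)):
        up = list(params)
        up[i] += h
        down = list(params)
        down[i] -= h
        diag.append((loss(up) - 2.0 * base + loss(down)) / (h * h))
    return diag

def select_sharpest(diag, k):
    """Indices of the k parameters with the largest diagonal Hessian values."""
    order = sorted(range(len(diag)), key=lambda i: diag[i], reverse=True)
    return sorted(order[:k])

if __name__ == "__main__":
    # Toy loss: sum_i c_i * w_i^2, whose Hessian diagonal is 2 * c_i.
    curvature = [1.0, 10.0, 0.1, 5.0]

    def toy_loss(w):
        return sum(c * x * x for c, x in zip(curvature, w))

    weights = [0.5, -0.3, 0.8, 0.1]
    diag = hessian_diagonal(toy_loss, weights)
    print(select_sharpest(diag, k=2))
```

Only the selected indices would then be updated during the unlearning stage, with the remaining parameters left frozen.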
- [578] arXiv:2504.06768 (replaced) [pdf, html, other]
-
Title: FedMerge: Federated Personalization via Model Merging
Subjects: Machine Learning (cs.LG)
One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there have been advances in FL to train multiple global models for better personalization, they provide only limited choices to clients, so local finetuning is still indispensable. In this paper, we propose a novel "FedMerge" approach that can create a personalized model per client by simply merging multiple global models with automatically optimized and customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of global models and the merging weights for each client. Unlike existing FL approaches where the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting the local training and re-initializing it to a global model, the merged model aligns better with each client's task and data distribution, smoothing the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge on three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
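The merging step can be pictured as a per-client weighted combination of the global models' parameters. The sketch below assumes flat parameter lists and fixed, given weights; the function name and the numbers are hypothetical, and the joint optimization of global models and merging weights described in the abstract is not shown.

```python
def merge_models(global_models, weights):
    """Return the weighted sum of several global models' parameter vectors.

    global_models: list of equally long parameter lists
    weights:       one merging weight per global model (client-specific)
    """
    if len(global_models) != len(weights):
        raise ValueError("need exactly one weight per global model")
    dim = len(global_models[0])
    if any(len(model) != dim for model in global_models):
        raise ValueError("all global models must have the same size")
    return [
        sum(w * model[i] for w, model in zip(weights, global_models))
        for i in range(dim)
    ]

if __name__ == "__main__":
    # Two hypothetical global models and two clients with different weights.
    globals_ = [[1.0, 2.0], [3.0, 4.0]]
    client_weights = {"client_a": [0.25, 0.75], "client_b": [0.9, 0.1]}
    for name, w in client_weights.items():
        # The server would send only this merged model to the client.
        print(name, merge_models(globals_, w))
```

Under this picture each client receives a single model of the usual size, regardless of how many global models the server maintains.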
- [579] arXiv:2504.07315 (replaced) [pdf, html, other]
-
Title: Multilingual MFA: Forced Alignment on Low-Resource Related Languages
Journal-ref: ComputEl8, 2025
Subjects: Computation and Language (cs.CL)
We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and adapt a large English model, evaluating results against seen data, unseen data (seen language), and unseen data and language. Results indicate benefits of adapting the English baseline model for previously unseen languages.
- [580] arXiv:2504.07687 (replaced) [pdf, html, other]
-
Title: FMNV: A Dataset of Media-Published News Videos for Fake News Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
News media, particularly video-based platforms, have become deeply embedded in daily life, concurrently amplifying risks of misinformation dissemination. Consequently, multimodal fake news detection has garnered significant research attention. However, existing datasets predominantly comprise user-generated videos characterized by crude editing and limited public engagement, whereas professionally crafted fake news videos disseminated by media outlets, often politically or virally motivated, pose substantially greater societal harm. To address this gap, we construct FMNV, a novel dataset exclusively composed of news videos published by media organizations. Through empirical analysis of existing datasets and our curated collection, we categorize fake news videos into four distinct types. Building upon this taxonomy, we employ Large Language Models (LLMs) to automatically generate deceptive content by manipulating authentic media-published news videos. Furthermore, we propose FMNVD, a baseline model featuring a dual-stream architecture integrating CLIP and Faster R-CNN for video feature extraction, enhanced by co-attention mechanisms for feature refinement and multimodal aggregation. Comparative experiments demonstrate both the generalization capability of FMNV across multiple baselines and the superior detection efficacy of FMNVD. This work establishes critical benchmarks for detecting high-impact fake news in media ecosystems while advancing methodologies for cross-modal inconsistency analysis.
- [581] arXiv:2504.07804 (replaced) [pdf, html, other]
-
Title: Function-Correcting Codes for Locally Bounded Functions
Comments: The title has been updated
Subjects: Information Theory (cs.IT)
In this paper, we introduce a class of functions that assume only a limited number $\lambda$ of values within a given Hamming $\rho$-ball and call them locally $(\rho, \lambda)$-bounded functions. We develop function-correcting codes (FCCs) for these functions and propose an upper bound on the redundancy of FCCs. The bound is based on the minimum length of an error-correcting code with a given number of codewords and a minimum distance. Furthermore, we provide a sufficient optimality condition for FCCs when $\lambda =4$. We also demonstrate that any function can be represented as a locally $(\rho, \lambda)$-bounded function, illustrating this with a representation of Hamming weight distribution functions. Furthermore, we present another construction of function-correcting codes for Hamming weight distribution functions.
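The notion of local boundedness can be made concrete by brute force on short binary words. The sketch below assumes the reading that a function is locally $(\rho, \lambda)$-bounded when it takes at most $\lambda$ distinct values on every Hamming ball of radius $\rho$; the helper names are hypothetical and the enumeration is only practical for small lengths.

```python
from itertools import combinations, product

def hamming_ball(x, rho):
    """Yield every binary word within Hamming distance rho of x."""
    n = len(x)
    for r in range(rho + 1):
        for positions in combinations(range(n), r):
            y = list(x)
            for p in positions:
                y[p] ^= 1  # flip the chosen bit
            yield tuple(y)

def local_bound(f, n, rho):
    """Smallest lambda such that f takes at most lambda values on every rho-ball."""
    return max(
        len({f(y) for y in hamming_ball(x, rho)})
        for x in product((0, 1), repeat=n)
    )

if __name__ == "__main__":
    # Hamming weight of a binary word; `sum` works on tuples of 0/1 entries.
    hamming_weight = sum
    for rho in (1, 2):
        print(rho, local_bound(hamming_weight, n=4, rho=rho))
```

For the Hamming weight function, flipping at most $\rho$ bits changes the weight by at most $\rho$, so the number of distinct values on a ball is bounded by $2\rho + 1$, which the enumeration reproduces for these small cases.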
- [582] arXiv:2504.08703 (replaced) [pdf, html, other]
-
Title: SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
Authors: Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, Laurent Callot
Comments: 20 pages, 6 figures, corrected author name spelling
Subjects: Software Engineering (cs.SE)
Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. We provide a task and repository-stratified subsample (SWE-PolyBench500) and release an evaluation harness allowing for fully automated evaluation. To enable a more comprehensive comparison of coding agents, this work also presents a novel set of metrics rooted in syntax tree analysis. We evaluate leading open source coding agents on SWE-PolyBench, revealing their strengths and limitations across languages, task types, and complexity classes. Our experiments show that current agents exhibit uneven performances across languages and struggle with complex problems while showing higher performance on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering. Our datasets and code are available at: this https URL
- [583] arXiv:2504.09818 (replaced) [pdf, html, other]
-
Title: Transferable text data distillation by trajectory matching
Subjects: Computation and Language (cs.CL)
In the realm of large language models (LLMs), training costs rise as model size increases, so there is an urgent need to minimize the data size in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full data set, and it offers better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM structures (i.e., OPT to Llama).
- [584] arXiv:2504.09924 (replaced) [pdf, html, other]
-
Title: Passive Channel Charting: Locating Passive Targets using Wi-Fi Channel State Information
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
We propose passive channel charting, an extension of channel charting to passive target localization. As in conventional channel charting, we follow a dimensionality reduction approach to reconstruct a physically interpretable map of target positions from similarities in high-dimensional channel state information. We show that algorithms and neural network architectures developed in the context of channel charting with active mobile transmitters can be straightforwardly applied to the passive case, where we assume a scenario with static transmitters and receivers and a mobile target. We evaluate our method on a channel state information dataset collected indoors with a distributed setup of ESPARGOS Wi-Fi sensing antenna arrays. This scenario can be interpreted as either a multi-static or passive radar system. We demonstrate that passive channel charting outperforms a baseline based on classical triangulation in terms of localization accuracy. We discuss our results and highlight some unsolved issues related to the proposed concept.
- [585] arXiv:2504.10323 (replaced) [pdf, html, other]
-
Title: Rel: A Programming Language for Relational Data
Authors: Molham Aref, Paolo Guagliardo, George Kastrinis, Leonid Libkin, Victor Marsault, Wim Martens, Mary McGrath, Filip Murlak, Nathaniel Nystrom, Liat Peterfreund, Allison Rogers, Cristina Sirangelo, Domagoj Vrgoc, David Zhao, Abdul Zreika
Subjects: Databases (cs.DB); Programming Languages (cs.PL)
From the moment of their inception, languages for relational data have been described as sublanguages embedded in a host programming language. Rel is a new relational language whose key design goal is to go beyond this paradigm with features that allow for programming in the large, making it possible to fully describe end to end application semantics. With the new approach we can model the semantics of entire enterprise applications relationally, which helps significantly reduce architecture complexity and avoid the well-known impedance mismatch problem. This paradigm shift is enabled by 50 years of database research, making it possible to revisit the sublanguage/host language paradigm, starting from the fundamental principles. We present the main features of Rel: those that give it the power to express traditional query language operations and those that are designed to grow the language and allow programming in the large.
- [586] arXiv:2504.11336 (replaced) [pdf, html, other]
-
Title: Looking beyond the next token
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans' natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to address this mismatch. We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process, and does not require any other changes to the architecture or training infrastructure. We demonstrate that this technique, Trelawney, and the inference algorithms derived from it allow us to improve performance on several key benchmarks that span planning, algorithmic reasoning, and story generation tasks. Finally, our method naturally enables the generation of long-term goals at no additional cost. We investigate how using the model's goal-generation capability can further improve planning and reasoning. Additionally, we believe Trelawney could potentially open doors to new capabilities beyond the current language modeling paradigm.
- [587] arXiv:2504.11341 (replaced) [pdf, html, other]
-
Title: Evaluating DAO Sustainability and Longevity Through On-Chain Governance Metrics
Subjects: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Social and Information Networks (cs.SI)
Decentralised Autonomous Organisations (DAOs) automate governance and resource allocation through smart contracts, aiming to shift decision-making to distributed token holders. However, many DAOs face sustainability challenges linked to limited user participation, concentrated voting power, and technical design constraints. This paper addresses these issues by identifying research gaps in DAO evaluation and introducing a framework of Key Performance Indicators (KPIs) that capture governance efficiency, financial robustness, decentralisation, and community engagement. We apply the framework to a custom-built dataset of real-world DAOs constructed from on-chain data and analysed using non-parametric methods. The results reveal recurring governance patterns, including low participation rates and high proposer concentration, which may undermine long-term viability. The proposed KPIs offer a replicable, data-driven method for assessing DAO governance structures and identifying potential areas for improvement. These findings support a multidimensional approach to evaluating decentralised systems and provide practical tools for researchers and practitioners working to improve the resilience and effectiveness of DAO-based governance models.
- [588] arXiv:2504.11364 (replaced) [pdf, html, other]
-
Title: Teaching Large Language Models to Reason through Learning and Forgetting
Comments: Code: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it using both successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. While fine-tuning the model with these data might seem straightforward, we identify a critical issue: the model's search capability tends to degrade rapidly if fine-tuning is performed naively. We show that this degradation can be substantially mitigated by employing a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown mathematical reasoning benchmarks show that our approach not only outperforms both standard fine-tuning and inference-time search baselines but also significantly reduces inference time by 180$\times$.
- [589] arXiv:2504.12597 (replaced) [pdf, html, other]
-
Title: GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Authors: Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng
Comments: 10 pages, 8 figures
Subjects: Computation and Language (cs.CL)
Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
- [590] arXiv:2504.13192 (replaced) [pdf, html, other]
-
Title: CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent
Comments: Accepted by KDD 2024
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
- [591] arXiv:2504.13199 (replaced) [pdf, other]
-
Title: Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Objective: This review explores the trustworthiness of multimodal artificial intelligence (AI) systems, specifically focusing on vision-language tasks. It addresses critical challenges related to fairness, transparency, and ethical implications in these systems, providing a comparative analysis of key tasks such as Visual Question Answering (VQA), image captioning, and visual dialogue. Background: Multimodal models, particularly vision-language models, enhance AI capabilities by integrating visual and textual data, mimicking human learning processes. Despite significant advancements, the trustworthiness of these models remains a crucial concern, particularly as AI systems increasingly confront issues regarding fairness, transparency, and ethics. Methods: This review examines research conducted from 2017 to 2024 focusing on the aforementioned core vision-language tasks. It employs a comparative approach to analyze these tasks through the lens of trustworthiness, underlining fairness, explainability, and ethics. This study synthesizes findings from recent literature to identify trends, challenges, and state-of-the-art solutions. Results: Several key findings were highlighted. Transparency: Explainability of vision-language tasks is important for user trust. Techniques such as attention maps and gradient-based methods have successfully addressed this issue. Fairness: Bias mitigation in VQA and visual dialogue systems is essential for ensuring unbiased outcomes across diverse demographic groups. Ethical Implications: Addressing biases in multilingual models and ensuring ethical data handling are critical for the responsible deployment of vision-language systems. Conclusion: This study underscores the importance of integrating fairness, transparency, and ethical considerations in developing vision-language models within a unified framework.
- [592] arXiv:2504.13471 (replaced) [pdf, html, other]
-
Title: From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Authors: Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
Subjects: Computation and Language (cs.CL)
In recent years, Large Language Models (LLMs) have significantly advanced artificial intelligence by optimizing traditional Natural Language Processing (NLP) pipelines, improving performance and generalization. This has spurred their integration into various systems. Many NLP systems, including ours, employ a "one-stage" pipeline directly incorporating LLMs. While effective, this approach incurs substantial costs and latency due to the need for large model parameters to achieve satisfactory outcomes. This paper introduces a three-stage, cost-efficient, end-to-end LLM deployment pipeline (prototyping, knowledge transfer, and model compression) to tackle the cost-performance dilemma in LLM-based frameworks. Our approach yields a super-tiny model optimized for cost and performance in online systems, simplifying the system architecture. In the first stage, complex tasks are transformed into a function call-based, LLM-driven pipeline, yielding an optimal-performance prototype system that serves as a teacher model and produces high-quality data. The second stage combines techniques like rejection fine-tuning, reinforcement learning, and knowledge distillation to transfer knowledge to a smaller 0.5B student model, delivering effective performance at minimal cost. The final stage applies quantization and pruning to compress the model further to 0.4B, achieving ultra-low latency and cost. The framework's modular design and cross-domain capabilities suggest potential applicability in other NLP areas.
- [593] arXiv:2504.13941 (replaced) [pdf, html, other]
-
Title: Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
Authors: Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturina, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Comments: 18 pages, 7 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
- [594] arXiv:2504.14020 (replaced) [pdf, html, other]
-
Title: HyDra: SOT-CAM Based Vector Symbolic Macro for Hyperdimensional Computing
Subjects: Emerging Technologies (cs.ET)
Hyperdimensional computing (HDC) is a brain-inspired paradigm valued for its noise robustness, parallelism, energy efficiency, and low computational overhead. Hardware accelerators are being explored to further enhance its performance, but current solutions are often limited by application specificity and the latency of encoding and similarity search. This paper presents a generalized, reconfigurable on-chip training and inference architecture for HDC, utilizing content-addressable memory (CAM) based on spin-orbit-torque magnetic random-access memory (SOT-MRAM). The proposed SOT-CAM array integrates storage and computation, enabling in-memory execution of key HDC operations: binding (bitwise multiplication), permutation (bit rotation), and efficient similarity search. To mitigate interconnect parasitic effects in similarity search, a four-stage voltage scaling scheme is proposed to ensure accurate Hamming distance representation. Additionally, a novel bit drop method replaces bit rotation during read operations, and an HDC-specific adder reduces energy and area by 1.51x and 1.43x, respectively. Benchmarked at 7nm, the architecture achieves energy reductions of 21.5x, 552.74x, 1.45x, and 282.57x for addition, permutation, multiplication, and search operations, respectively, compared to CMOS-based HDC. Against state-of-the-art HD accelerators, it achieves 2.27x lower energy consumption and outperforms CPU and eGPU implementations by 2702x and 23161x, respectively, with less than 3% drop in accuracy.
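The three HDC operations named in the abstract (binding, permutation, similarity search) can be sketched in software for binary hypervectors. This is only a functional illustration under stated assumptions, not the SOT-CAM hardware design: binding is written as XOR, which on {0,1} vectors plays the role of elementwise multiplication on {+1,-1} vectors, and the dimension, function names, and toy memory are hypothetical.

```python
import random

DIM = 1024  # hypervector length; an arbitrary choice for this sketch

def random_hv(rng):
    """Draw a random binary hypervector."""
    return [rng.randint(0, 1) for _ in range(DIM)]

def bind(a, b):
    """Binding: XOR on {0,1}, i.e. elementwise multiplication on {+1,-1}."""
    return [x ^ y for x, y in zip(a, b)]

def permute(a, shift=1):
    """Permutation: cyclic rotation by `shift` positions."""
    shift %= len(a)
    return a[-shift:] + a[:-shift]

def hamming(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(query, memory):
    """Similarity search: label of the stored vector with the smallest Hamming distance."""
    return min(memory, key=lambda label: hamming(query, memory[label]))

rng = random.Random(0)
key, value = random_hv(rng), random_hv(rng)
assert bind(bind(key, value), key) == value  # XOR binding is its own inverse

memory = {"a": random_hv(rng), "b": random_hv(rng)}
noisy = memory["a"][:]
for i in range(100):  # flip 100 of the 1024 bits
    noisy[i] ^= 1
print(nearest(noisy, memory))
```

The final lookup illustrates the noise robustness the abstract refers to: the query sits at distance 100 from its source, while unrelated random hypervectors of this length differ in roughly half their positions.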
- [595] arXiv:2504.14128 (replaced) [pdf, html, other]
-
Title: TALES: Text Adventure Learning Environment Suite
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate diverse reasoning capabilities. We present results over a range of LLMs, open- and closed-weights, performing a qualitative analysis on the top performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualization of the experiments can be found at this https URL.
- [596] arXiv:2504.14450 (replaced) [pdf, html, other]
-
Title: Causal Disentanglement for Robust Long-tail Medical Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Counterfactual medical image generation effectively addresses data scarcity and enhances the interpretability of medical images. However, due to the complex and diverse pathological features of medical images and the imbalanced class distribution in medical data, generating high-quality and diverse medical images from limited data is significantly challenging. Additionally, it is necessary to fully leverage the information in limited data, such as anatomical structure information, to generate more structurally stable medical images while avoiding distortion or inconsistency. In this paper, in order to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework, which generates independent pathological and structural features based on causal disentanglement and utilizes text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we achieve feature separation through causal disentanglement and analyze the interactions between features. Here, we introduce group supervision to ensure the independence of pathological and identity features. Second, we leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging a large language model to extract lesion severity and location from medical reports. Additionally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.
- [597] arXiv:2504.14539 (replaced) [pdf, other]
-
Title: Should Benevolent Deception be Allowed in EHMI? A Mechanism Explanation Based on Game Theory
Subjects: Human-Computer Interaction (cs.HC)
The application of external human-machine interfaces (EHMI) on autonomous vehicles (AVs) facilitates information exchange. Existing research fails to consider the impact of the sequence of actions, as well as the effects of EHMI applications and deception, raising the question of whether benevolent, well-intentioned deception should be permitted (i.e., misleading statements that are intended to benefit both parties). In this study, we establish a game-theory-based EHMI information disclosure framework for AVs. To account for benevolent deception, the framework divides the decision-making process into three stages, each addressing one key question: whether to disclose, when to disclose, and what type of intention information to disclose. The results show that theoretical advantages of deception exist in certain cases when the AV expects to maximize the safety of the interaction. In 40 out of 484 cases (8.3%), safety can be enhanced through successful deception. Those successful deceptions fall into two categories: 1) in 28 of these cases, the straight-going AV expected the left-turning human-driven vehicle (HV) to yield, while the HV exhibited lower speed and higher acceleration; 2) in 12 of these cases, the AV expected the HV to proceed first, while the HV exhibited higher speed and lower acceleration. We also conducted a VR-based driving simulation experiment, and the results confirmed our conclusion. Additionally, we found that when participants had low trust in the EHMI, its use negatively impacted interaction efficiency. This study aims to analyze the mechanisms of EHMI information disclosure and contribute to the ongoing discourse on the ethical framework governing autonomous driving systems.
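The case counts reported in the abstract can be checked with a few lines of arithmetic. The variable names are illustrative only; the numbers are those stated above.

```python
total_cases = 484
yield_cases = 28     # AV expected the other vehicle to yield
proceed_cases = 12   # AV expected the other vehicle to proceed first

successful = yield_cases + proceed_cases
share_percent = 100 * successful / total_cases

print(successful)               # 40
print(round(share_percent, 1))  # 8.3
```

The two categories add up to the 40 successful deceptions, and 40 of 484 rounds to the reported 8.3%.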
- [598] arXiv:2504.14634 (replaced) [pdf, html, other]
-
Title: Latent Representations for Visual Proprioception in Inexpensive Robots
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.
- [599] arXiv:2504.14732 (replaced) [pdf, html, other]
-
Title: Reinforcement Learning from Multi-level and Episodic Human Feedback
Subjects: Machine Learning (cs.LG)
Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback, expressed as a preference for one behavior over another, to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback offers more coarse but informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on the evaluation of an entire episode. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
- [600] arXiv:2504.14809 (replaced) [pdf, html, other]
-
Title: vApps: Verifiable Applications at Internet Scale
Authors: Isaac Zhang, Kshitij Kulkarni, Tan Li, Daniel Wong, Thomas Kim, John Guibas, Uma Roy, Bryan Pellegrino, Ryan Zarick
Comments: 12 pages, 11 figures
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Blockchain technology promises a decentralized, trustless, and interoperable infrastructure. However, widespread adoption remains hindered by issues such as limited scalability, high transaction costs, and the complexity of maintaining coherent verification logic across different blockchain layers. This paper introduces Verifiable Applications (vApps), a novel development framework designed to streamline the creation and deployment of verifiable blockchain computing applications. vApps offer a unified Rust-based Domain-Specific Language (DSL) within a comprehensive SDK, featuring modular abstractions for verification, proof generation, and inter-chain connectivity. This eases the developer's burden in securing diverse software components, allowing them to focus on application logic. The DSL also ensures that applications can automatically take advantage of specialized precompiles and hardware acceleration to achieve consistently high performance with minimal developer effort, as demonstrated by benchmark results for zero-knowledge virtual machines (zkVMs). Experiments show that native Rust execution eliminates interpretation overhead, delivering up to an 832x cycle count improvement compared to EVM-based approaches. Precompiled circuits can accelerate the proof by more than 95%, while GPU acceleration increases throughput by up to 30x and recursion compresses the proof size by up to 230x, enabling succinct and efficient verification. The framework also supports seamless integration with the Web2 and Web3 systems, enabling developers to focus solely on their application logic. Through modular architecture, robust security guarantees, and composability, vApps pave the way toward a trust-minimized and verifiable Internet-scale application environment.
- [601] arXiv:2504.14837 (replaced) [pdf, html, other]
-
Title: SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation
Subjects: Databases (cs.DB)
A high-quality SQL corpus is essential for intelligent databases. For example, Text-to-SQL requires SQL queries and corresponding natural language questions as training samples. However, collecting such a query corpus remains challenging in practice due to the high cost of manual annotation, which highlights the importance of automatic SQL generation. Despite recent advances, existing generation methods still face limitations in achieving both diversity and cost-effectiveness. Moreover, many methods treat all tables equally during generation, which overlooks schema complexity and leads to under-utilization of structurally rich tables. To address these issues, this paper proposes a multi-agent framework for high-quality and large-scale SQL generation, dubbed SQL-Factory. It decomposes the generation process into three collaborative teams: the Generation Team explores diverse query structures using large language models, the Expansion Team scales promising patterns via lightweight local models, and the Management Team adaptively schedules and evaluates generation based on schema coverage and real-time query quality. This modular framework ensures a balanced trade-off between diversity, scalability, and generation cost. We apply SQL-Factory to four widely used benchmarks and generate over 300,000 executable and broadly distributed SQL queries with less than $200 API cost. Our generated queries achieve higher diversity compared to other methods, and extensive experiments demonstrate that the generated queries significantly improve the model performance in various downstream tasks.
- [602] arXiv:2504.14992 (replaced) [pdf, html, other]
-
Title: Efficient Pretraining Length Scaling
Subjects: Computation and Language (cs.CL)
Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
- [603] arXiv:2504.15007 (replaced) [pdf, html, other]
-
Title: Shifts in Doctors' Eye Movements Between Real and AI-Generated Medical Images
Authors: David C Wong, Bin Wang, Gorkem Durak, Marouane Tliba, Mohamed Amine Kerkouri, Aladine Chetouani, Ahmet Enis Cetin, Cagdas Topel, Nicolo Gennaro, Camila Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H Miller, Amir A Borhani, Hatice Savas, Eric M. Hart, Elizabeth A Krupinski, Ulas Bagci
Comments: This paper was accepted at ETRA 2025 Japan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. In this work, we first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, saccade amplitude, and their joint distribution. These metrics help uncover patterns in attention allocation and diagnostic strategies. Furthermore, we investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images. To achieve this, we examine fixation bias maps, focusing on first, last, short, and longest fixations independently, along with detailed saccade patterns, to quantify differences in gaze distribution and visual saliency between authentic and synthetic images.
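Saccade amplitude and direction, the two quantities named in the abstract, are commonly derived from consecutive fixation positions. The sketch below assumes fixations are given as (x, y) points in a shared coordinate system, takes amplitude as the Euclidean distance in those units, and reports direction in degrees counter-clockwise from the positive x-axis. The function name and sample points are hypothetical, and the paper's own preprocessing may differ (for example, amplitudes expressed in degrees of visual angle, or image coordinates with y pointing down).

```python
import math

def saccades(fixations):
    """Return (amplitude, direction_degrees) for each pair of consecutive fixations."""
    result = []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        amplitude = math.hypot(dx, dy)                      # Euclidean length of the jump
        direction = math.degrees(math.atan2(dy, dx)) % 360  # angle in [0, 360)
        result.append((amplitude, direction))
    return result

fix = [(0, 0), (3, 4), (3, 0)]
for amplitude, direction in saccades(fix):
    print(f"{amplitude:.1f} {direction:.1f}")
# 5.0 53.1
# 4.0 270.0
```

Binning the resulting pairs into a two-dimensional histogram would give one simple estimate of the joint distribution of direction and amplitude.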
- [604] arXiv:2504.15284 (replaced) [pdf, other]
-
Title: EditLord: Learning Code Transformation Rules for Code EditingSubjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps. Thus, they suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing. EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness across critical software engineering and security applications, LM models, and editing modes.
- [605] arXiv:2504.15364 (replaced) [pdf, html, other]
-
Title: KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained EnvironmentsComments: 8 pages, 14 figuresSubjects: Artificial Intelligence (cs.AI)
In this work, we demonstrate that distinctive keys during LLM inference tend to have high attention scores. We explore this phenomenon and propose KeyDiff, a training-free KV cache eviction method based on key similarity. This method facilitates the deployment of LLM-based applications requiring long input prompts in resource-constrained environments with limited memory and compute budgets. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We demonstrate that KeyDiff computes the optimal solution to a KV cache selection problem that maximizes key diversity, providing a theoretical understanding of KeyDiff. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. We demonstrate the effectiveness of KeyDiff across diverse tasks and models, illustrating a performance gap of less than 0.04\% with 8K cache budget ($\sim$ 23\% KV cache reduction) from the non-evicting baseline on the LongBench benchmark for Llama 3.1-8B and Llama 3.2-3B.
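As a rough illustration of eviction driven by key similarity rather than attention scores, the sketch below scores each cached key by its cosine similarity to the mean key and retains the least similar (most distinctive) ones. The scoring rule, the function name `evict_by_key_similarity`, and the toy keys are assumptions made for this example; it is not the KeyDiff implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def evict_by_key_similarity(keys, budget):
    """Return sorted indices of the `budget` keys least similar to the mean key.

    Hypothetical helper: keys that look like the "average" key are treated as
    redundant and evicted first; no attention scores are needed.
    """
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    scores = [cosine(k, mean) for k in keys]
    # Ascending similarity: the most distinctive keys come first and are kept.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i])
    return sorted(ranked[:budget])

# Toy cache: three near-duplicate keys and two distinctive ones.
keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [-1.0, 0.2]]
kept = evict_by_key_similarity(keys, budget=2)
print(kept)
```

With these toy keys the two retained indices are 3 and 4, the keys that point away from the cluster of near-duplicates.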
- [606] arXiv:2504.15396 (replaced) [pdf, html, other]
-
Title: A Quadratic Control Framework for Dynamic SystemsComments: 16 pages, 10 figures, 16 tablesSubjects: Systems and Control (eess.SY)
This article presents a unified approach to quadratic optimal control for both linear and nonlinear discrete-time systems, with a focus on trajectory tracking. The control strategy is based on minimizing a quadratic cost function that penalizes deviations of system states and control inputs from their desired trajectories.
For linear systems, the classical Linear Quadratic Regulator (LQR) solution is derived using dynamic programming, resulting in recursive equations for feedback and feedforward terms. For nonlinear dynamics, the Iterative Linear Quadratic Regulator (iLQR) method is employed, which iteratively linearizes the system and solves a sequence of LQR problems to converge to an optimal policy.
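The dynamic-programming recursion mentioned above can be sketched for the simplest case: a scalar linear system regulated to the origin. This is a minimal illustration under stated assumptions (scalar state, no reference trajectory, so no feedforward term); the function names and numeric values are hypothetical and not taken from the article.

```python
def scalar_lqr(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for x[k+1] = a*x[k] + b*u[k].

    Cost: sum_k (q*x[k]**2 + r*u[k]**2) + q_final*x[N]**2.
    Returns feedback gains K[0..N-1] for the policy u[k] = -K[k]*x[k].
    """
    p = q_final                      # cost-to-go weight at the final step
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)    # optimal gain at this step
        p = q + a * a * p - a * b * p * k    # Riccati update for the step before
        gains.append(k)
    gains.reverse()                  # computed backward in time, stored forward
    return gains

def simulate(a, b, gains, x0):
    """Roll the closed loop forward and return the state trajectory."""
    x = x0
    trajectory = [x]
    for k in gains:
        u = -k * x
        x = a * x + b * u
        trajectory.append(x)
    return trajectory

# Open-loop unstable system (a > 1) driven to the origin by the LQR gains.
gains = scalar_lqr(a=1.2, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=20)
trajectory = simulate(1.2, 1.0, gains, x0=5.0)
print(gains[0], trajectory[-1])
```

An iLQR loop would repeat this backward pass around a linearization of the nonlinear dynamics at each iteration; that outer loop is omitted here.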
To implement this approach, a software service was developed and tested on several canonical models, including: Rayleigh oscillator, inverted pendulum on a moving cart, two-link manipulator, and quadcopter. The results confirm that iLQR enables efficient and accurate trajectory tracking in the presence of nonlinearities.
To further enhance performance, it can be seamlessly integrated with Model Predictive Control (MPC), enabling online adaptation and improved robustness to constraints and system uncertainties.
- [607] arXiv:2504.15416 (replaced) [pdf, html, other]
-
Title: Bare Minimum Mitigations for Autonomous AI DevelopmentJoshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan ZhanComments: 12 pages, 2 figuresSubjects: Computers and Society (cs.CY)
Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis on the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development.
- [608] arXiv:2504.15632 (replaced) [pdf, html, other]
-
Title: A Study on Mixup-Inspired Augmentation Methods for Software Vulnerability DetectionComments: Accepted at EASE 2025, Istanbul, TurkeySubjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Various deep learning (DL) methods have recently been utilized to detect software vulnerabilities. Real-world software vulnerability datasets are rare and hard to acquire, as there is no simple metric for classifying vulnerability. Such datasets are heavily imbalanced, and none of the current datasets are considered huge for DL models. To tackle these problems, a recent work has tried to augment the dataset using the source code and generate realistic single-statement vulnerabilities, which is not quite practical and requires manual checking of the generated vulnerabilities. In this paper, we aim to explore the augmentation of vulnerabilities at the representation level to help current models learn better, which has never been done before to the best of our knowledge. We implement and evaluate five augmentation techniques that augment the embedding of the data and have recently been used for code search, which is a completely different software engineering task. We also introduced a conditioned version of those augmentation methods, which ensures the augmentation does not change the vulnerable section of the vector representation. We show that such augmentation methods can be helpful and increase the F1-score by up to 9.67%, yet they cannot beat Random Oversampling when balancing datasets, which increases the F1-score by 10.82%.
- [609] arXiv:2504.15681 (replaced) [pdf, html, other]
-
Title: Vidi: Large Multimodal Models for Video Understanding and EditingVidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong QuSubjects: Computer Vision and Pattern Recognition (cs.CV)
Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than videos in existing temporal retrieval datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated, 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
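To make the idea of an IoU over multiple time ranges concrete, the sketch below merges each side into disjoint intervals and divides the total overlapping duration by the total covered duration. It is a generic illustration; the function names are hypothetical, and the refined metric defined for VUE-TR may differ.

```python
def merge(ranges):
    """Merge overlapping (start, end) ranges into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_length(ranges):
    """Total duration covered by a list of disjoint ranges."""
    return sum(end - start for start, end in ranges)

def multi_range_iou(predicted, truth):
    """IoU between two sets of time ranges, in the same time unit."""
    predicted, truth = merge(predicted), merge(truth)
    intersection = 0.0
    for p_start, p_end in predicted:
        for t_start, t_end in truth:
            intersection += max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = total_length(predicted) + total_length(truth) - intersection
    return intersection / union if union > 0 else 0.0

# Two predicted ranges against two annotated ranges (seconds).
predicted = [(0, 10), (20, 30)]
truth = [(5, 15), (20, 25)]
score = multi_range_iou(predicted, truth)
print(score)
```

Here the overlap is 10 seconds out of 25 seconds covered, giving a score of 0.4.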
- [610] arXiv:2504.15773 (replaced) [pdf, html, other]
-
Title: Clifford Group Equivariant Diffusion Models for 3D Molecular GenerationComments: 7 pages, 1 figure, 1 tableSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
This paper explores leveraging the Clifford algebra's expressive power for $\mathrm{E}(n)$-equivariant diffusion models. We utilize the geometric products between Clifford multivectors and the rich geometric information encoded in Clifford subspaces in \emph{Clifford Diffusion Models} (CDMs). We extend the diffusion process beyond just Clifford one-vectors to incorporate all higher-grade multivector subspaces. The data is embedded in grade-$k$ subspaces, allowing us to apply latent diffusion across complete multivectors. This enables CDMs to capture the joint distribution across different subspaces of the algebra, incorporating richer geometric information through higher-order features. We provide empirical results for unconditional molecular generation on the QM9 dataset, showing that CDMs provide a promising avenue for generative modeling.
- [611] arXiv:2504.15903 (replaced) [pdf, other]
-
Title: Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature ConsiderationsComments: 60 pages, 25 figuresSubjects: Artificial Intelligence (cs.AI)
Recent advancements in Large Language Models (LLMs) have generated growing interest in their structured reasoning capabilities, particularly in tasks involving abstraction and pattern recognition. The Abstraction and Reasoning Corpus (ARC) benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems. While GPT-4o demonstrates strong performance by solving all ARC tasks under zero-noise conditions, other models like DeepSeek R1 and LLaMA 3.2 fail to solve any, suggesting limitations in their ability to reason beyond simple pattern matching. To explore this gap, we systematically evaluate these models across different noise levels and temperature settings. Our results reveal that the introduction of noise consistently impairs model performance, regardless of architecture. This decline highlights a shared vulnerability: current LLMs, despite showing signs of abstract reasoning, remain highly sensitive to input perturbations. Such fragility raises concerns about their real-world applicability, where noise and uncertainty are common. By comparing how different model architectures respond to these challenges, we offer insights into the structural weaknesses of modern LLMs in reasoning tasks. This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios. Our findings aim to guide future research toward enhancing model generalization, robustness, and alignment with human-like cognitive flexibility.
- [612] arXiv:2504.15909 (replaced) [pdf, html, other]
-
Title: Synergizing RAG and Reasoning: A Systematic ReviewSubjects: Information Retrieval (cs.IR)
Recent breakthroughs in large language models (LLMs), particularly in reasoning capabilities, have propelled Retrieval-Augmented Generation (RAG) to unprecedented levels. By synergizing retrieval mechanisms with advanced reasoning, LLMs can now tackle increasingly complex problems. This paper presents a systematic review of the collaborative interplay between RAG and reasoning, clearly defining "reasoning" within the RAG context. We construct a comprehensive taxonomy encompassing multi-dimensional collaborative objectives, representative paradigms, and technical implementations, and analyze the bidirectional synergy methods. Additionally, we critically evaluate current limitations in RAG assessment, including the absence of intermediate supervision for multi-step reasoning and practical challenges related to cost-risk trade-offs. To bridge theory and practice, we provide practical guidelines tailored to diverse real-world applications. Finally, we identify promising research directions, such as graph-based knowledge integration, hybrid model collaboration, and RL-driven optimization. Overall, this work presents a theoretical framework and practical foundation to advance RAG systems in academia and industry, fostering the next generation of RAG solutions.
- [613] arXiv:2504.15929 (replaced) [pdf, html, other]
-
Title: Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language ModelsComments: 18 pages, 7 figures, 6 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Diagnostic imaging relies on interpreting both images and radiology reports, but the growing data volumes place significant pressure on medical experts, yielding increased errors and workflow backlogs. Medical vision-language models (med-VLMs) have emerged as a powerful framework to efficiently process multimodal imaging data, particularly in chest X-ray (CXR) evaluations, albeit their performance hinges on how well image and text representations are aligned. Existing alignment methods, predominantly based on contrastive learning, prioritize separation between disease classes over segregation of fine-grained pathology attributes like location, size or severity, leading to suboptimal representations. Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning synergistically guided by disease class as well as adjectival and directional pathology descriptors. Unlike common alignment methods that separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations. For this purpose, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, as annotations on pathology attributes are rare in public datasets. For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors. Lastly, we introduce a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
- [614] arXiv:2504.15975 (replaced) [pdf, html, other]
-
Title: A New Graph Grammar Formalism for Robust Syntactic Pattern RecognitionComments: 64 pages, 23 figures. Version 2: mathematical supplement added, 98 pages, 1 figureSubjects: Formal Languages and Automata Theory (cs.FL); Computer Vision and Pattern Recognition (cs.CV)
I introduce a formalism for representing the syntax of recursively structured graph-like patterns. It does not use production rules, like a conventional graph grammar, but represents the syntactic structure in a more direct and declarative way. The grammar and the pattern are both represented as networks, and parsing is seen as the construction of a homomorphism from the pattern to the grammar. The grammars can represent iterative, hierarchical and nested recursive structure in more than one dimension.
This supports a highly parallel style of parsing, in which all aspects of pattern recognition (feature detection, segmentation, parsing, filling in missing symbols, top-down and bottom-up inference) are integrated into a single process, to exploit the synergy between them.
The emphasis of this paper is on underlying theoretical issues, but I also give some example runs to illustrate the error-tolerant parsing of complex recursively structured patterns of 50-1000 symbols, involving variability in geometric relationships, blurry and indistinct symbols, overlapping symbols, cluttered images, and erased patches.
- [615] arXiv:2504.16026 (replaced) [pdf, html, other]
-
Title: Trends in AI SupercomputersSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI's Colossus, used 200,000 AI chips, had a hardware cost of \$7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve $2\times10^{22}$ 16-bit FLOP/s, use two million AI chips, have a hardware cost of \$200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness.
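The 2030 projections can be sanity-checked with a plain doubling-time extrapolation from the March 2025 figures. The sketch below assumes pure exponential growth and treats the horizon as exactly five years; both are simplifications made for this example, and the helper name `extrapolate` is hypothetical.

```python
def extrapolate(value, doubling_time_years, years):
    """Grow `value` exponentially with the given doubling time."""
    return value * 2 ** (years / doubling_time_years)

YEARS = 5  # assumption: March 2025 to 2030 treated as five years

# Hardware cost and power both double every year, per the abstract.
cost_2030_usd = extrapolate(7e9, 1.0, YEARS)
power_2030_watts = extrapolate(300e6, 1.0, YEARS)

# Performance doubles every nine months (0.75 years); growth factor only,
# because the abstract does not state the 2025 leader's FLOP/s.
performance_factor = extrapolate(1.0, 0.75, YEARS)

print(f"cost:  {cost_2030_usd / 1e9:.0f} billion USD")
print(f"power: {power_2030_watts / 1e9:.1f} GW")
print(f"performance growth: {performance_factor:.0f}x")
```

Under these assumptions the cost comes out at about 224 billion USD and the power at about 9.6 GW, in the same range as the rounded 200 billion USD and 9 GW quoted above, while performance grows roughly a hundredfold.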
- [616] arXiv:2504.16057 (replaced) [pdf, html, other]
-
Title: Automated Static Vulnerability Detection via a Holistic Neuro-symbolic ApproachSubjects: Cryptography and Security (cs.CR)
Static vulnerability detection is still a challenging problem and demands excessive human effort, e.g., manual curation of good vulnerability patterns. No prior work, whether classic program analysis or Large Language Model (LLM)-based approaches, has fully automated such vulnerability pattern generation with reasonable detection accuracy. In this paper, we design and implement MoCQ, a novel holistic neuro-symbolic framework that combines the complementary strengths of LLMs and classical static analysis to enable scalable vulnerability detection. The key insight is that MoCQ leverages an LLM to automatically extract vulnerability patterns and translate them into detection queries, and then relies on static analysis to refine such queries in a feedback loop and eventually execute them for analyzing large codebases and mining vulnerabilities. We evaluate MoCQ on seven types of vulnerabilities spanning two programming languages. We found MoCQ-generated queries uncovered at least 12 patterns that were missed by experts. On a ground truth dataset, MoCQ achieved precision and recall comparable to expert-crafted queries. Moreover, MoCQ has identified seven previously unknown vulnerabilities in real-world applications, demonstrating its practical effectiveness. We have responsibly disclosed them to the corresponding developers.
- [617] arXiv:2504.16113 (replaced) [pdf, html, other]
-
Title: AI-Based Vulnerability Analysis of NFT Smart ContractsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
With the rapid growth of the NFT market, the security of smart contracts has become crucial. However, existing AI-based detection models for NFT contract vulnerabilities remain limited due to their complexity, while traditional manual methods are time-consuming and costly. This study proposes an AI-driven approach to detect vulnerabilities in NFT smart contracts.
We collected 16,527 public smart contract codes, classifying them into five vulnerability categories: Risky Mutable Proxy, ERC-721 Reentrancy, Unlimited Minting, Missing Requirements, and Public Burn. Python-processed data was structured into training/test sets. Using the CART algorithm with Gini coefficient evaluation, we built initial decision trees for feature extraction. A random forest model was implemented to improve robustness through random data/feature sampling and multitree integration. GridSearch hyperparameter tuning further optimized the model, with 3D visualizations demonstrating parameter impacts on vulnerability detection.
Results show the random forest model excels in detecting all five vulnerabilities. For example, it identifies Risky Mutable Proxy by analyzing authorization mechanisms and state modifications, while ERC-721 Reentrancy detection relies on external call locations and lock mechanisms. The ensemble approach effectively reduces single-tree overfitting, with stable performance improvements after parameter tuning. This method provides an efficient technical solution for automated NFT contract detection and lays groundwork for scaling AI applications.
- [618] arXiv:2504.16129 (replaced) [pdf, html, other]
-
Title: MARFT: Multi-Agent Reinforcement Fine-TuningComments: 36 pagesSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks requiring multifaceted reasoning and collaboration, from generating high-quality presentation slides to conducting sophisticated scientific research. Meanwhile, reinforcement learning (RL) has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of multi-agent reinforcement learning (MARL) methodologies to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a universal algorithmic framework tailored for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We begin by reviewing the evolution from RL to Reinforcement Fine-Tuning (RFT), setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a novel, LaMAS-oriented formulation of RFT. Central to this work is the presentation of a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and open challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work aims to serve as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: this https URL.
- [619] arXiv:2504.16173 (replaced) [pdf, html, other]
-
Title: FPGA-Based Neural Network Accelerators for Space Applications: A SurveySubjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.
- [620] arXiv:2504.16211 (replaced) [pdf, html, other]
-
Title: One-Point Sampling for Distributed Bandit Convex Optimization with Time-Varying ConstraintsComments: 15 pages, 3 figuresSubjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
This paper considers the distributed bandit convex optimization problem with time-varying constraints. In this problem, the global loss function is the average of all the local convex loss functions, which are unknown beforehand. Each agent iteratively makes its own decision subject to time-varying inequality constraints which can be violated but are fulfilled in the long run. For a uniformly jointly strongly connected time-varying directed graph, a distributed bandit online primal-dual projection algorithm with one-point sampling is proposed. We show that sublinear dynamic network regret and network cumulative constraint violation are achieved if the path-length of the benchmark also increases in a sublinear manner. In addition, an $\mathcal{O}({T^{3/4 + g}})$ static network regret bound and an $\mathcal{O}( {{T^{1 - {g}/2}}} )$ network cumulative constraint violation bound are established, where $T$ is the total number of iterations and $g \in ( {0,1/4} )$ is a trade-off parameter. Moreover, a reduced static network regret bound $\mathcal{O}( {T^{2/3 + 4g /3}} )$ is established for strongly convex local loss functions. Finally, a numerical example is presented to validate the theoretical results.
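The stated bounds trade regret against constraint violation through the parameter $g$. As a quick check of the exponents only (constants and the analysis itself are not reproduced), the sketch below evaluates them exactly for a few values of $g$ in $(0, 1/4)$; the function name `exponents` and the sample values are chosen for illustration.

```python
from fractions import Fraction

def exponents(g):
    """Exponents of T in the three bounds quoted in the abstract."""
    return {
        "static_regret": Fraction(3, 4) + g,                 # O(T^{3/4 + g})
        "constraint_violation": 1 - g / 2,                   # O(T^{1 - g/2})
        "strongly_convex_regret": Fraction(2, 3) + 4 * g / 3,  # O(T^{2/3 + 4g/3})
    }

samples = [Fraction(1, 20), Fraction(1, 8), Fraction(1, 5)]
for g in samples:
    e = exponents(g)
    # Every exponent stays below 1, so each bound is sublinear in T.
    sublinear = all(value < 1 for value in e.values())
    print(g, {name: str(value) for name, value in e.items()}, sublinear)
```

For $g = 1/8$ the exponents are $7/8$, $15/16$ and $5/6$. A larger $g$ loosens the regret bound and tightens the violation bound, and the strongly convex exponent stays below the general one exactly when $g < 1/4$.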
- [621] arXiv:2504.16231 (replaced) [pdf, html, other]
-
Title: Quasitubal Tensor Algebra Over Separable Hilbert SpacesComments: 34 pagesSubjects: Numerical Analysis (math.NA); Functional Analysis (math.FA)
The tubal tensor framework provides a clean and effective algebraic setting for tensor computations, supporting matrix-mimetic features like Singular Value Decomposition and Eckart-Young-like optimality results. Underlying the tubal tensor framework is a view of a tensor as a matrix of finite-sized tubes. In this work, we lay the mathematical and computational foundations for working with tensors with infinite-sized tubes: matrices whose elements are elements from a separable Hilbert space. A key challenge is that the existence of important desired matrix-mimetic features of tubal tensors relies on the existence of a unit element in the ring of tubes. Such a unit element cannot exist for tubes which are elements of an infinite-dimensional Hilbert space. We sidestep this issue by embedding the tubal space in a commutative unital C*-algebra of bounded operators. The resulting quasitubal algebra recovers the structural properties needed for decomposition and low-rank approximation. In addition to laying the theoretical groundwork for working with tubal tensors with infinite-dimensional tubes, we discuss computational aspects of our construction, and provide a numerical illustration where we compute a finite-dimensional approximation to an infinitely-sized synthetic tensor using our theory. We believe our theory opens exciting new avenues for applying the matrix-mimetic tensor framework in the context of inherently infinite-dimensional problems.
- [622] arXiv:2504.16295 (replaced) [pdf, other]
-
Title: Subthreshold Jitter in VR Can Induce Visual DiscomfortSamuel J. Levulis, Kevin W. Rio, Pablo Ramon Soria, James Wilmott, Charlie S. Burlingham, Phillip GuanSubjects: Human-Computer Interaction (cs.HC)
Visual-vestibular conflicts (VVCs) are a primary contributor to visually induced motion sickness (VIMS) in head-mounted displays (HMDs). However, virtual reality (VR) comfort studies often rely on exposing seated or standing users to experiences with high intensity visual motion (such as roller coasters). These drastic VVCs tend to induce pronounced VIMS symptoms that can be reliably detected across individuals using common survey measures. The conclusions from studies using these extreme motion-based conflicts may not accurately generalize to naturalistic use cases in VR where efforts are made to minimize, rather than maximize, VIMS symptoms. In this work, we show that a subthreshold visual-vestibular conflict can induce measurable discomfort during naturalistic, long duration use. We first present a psychophysical study, conducted outside of an HMD, to rigorously identify the perceptual thresholds for sinusoidal noise in render pose (i.e., jitter) resulting in erroneous 3D motion of rendered content. We next introduce subthreshold levels of jitter to a Meta Quest 3 VR HMD and demonstrate that this can induce visual discomfort in participants playing the commercially-available game Cubism across a three-session, repeated-measures study. Importantly, we did not identify statistically significant comfort differences between control and jitter conditions with traditional pre- and post-test comparison of Simulator Sickness Questionnaire (SSQ) scores. Significant differences were only identified using the Motion Illness Symptoms Classification (MISC) survey administered every 10 minutes across each 90 minute session. This highlights the benefits of incorporating time-resolved data points and suggests that lightweight, more frequent surveys may be important tools for measuring visual discomfort in more ecologically-valid scenarios.
- [623] arXiv:2504.16369 (replaced) [pdf, html, other]
-
Title: Fast Online Adaptive Neural MPC via Meta-LearningSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Data-driven model predictive control (MPC) has demonstrated significant potential for improving robot control performance in the presence of model uncertainties. However, existing approaches often require extensive offline data collection and computationally intensive training, limiting their ability to adapt online. To address these challenges, this paper presents a fast online adaptive MPC framework that leverages neural networks integrated with Model-Agnostic Meta-Learning (MAML). Our approach focuses on few-shot adaptation of residual dynamics - capturing the discrepancy between nominal and true system behavior - using minimal online data and gradient steps. By embedding these meta-learned residual models into a computationally efficient L4CasADi-based MPC pipeline, the proposed method enables rapid model correction, enhances predictive accuracy, and improves real-time control performance. We validate the framework through simulation studies on a Van der Pol oscillator, a Cart-Pole system, and a 2D quadrotor. Results show significant gains in adaptation speed and prediction accuracy over both nominal MPC and nominal MPC augmented with a freshly initialized neural network, underscoring the effectiveness of our approach for real-time adaptive robot control.
- [624] arXiv:2504.16427 (replaced) [pdf, html, other]
-
Title: Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive BenchmarkComments: 23 pages, 5 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at this https URL.
- [625] arXiv:2504.16443 (replaced) [pdf, html, other]
-
Title: Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric ShapesComments: 8 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings: regression-based losses like L1/L2 lack correlation with IoU, IoU-based losses are unstable and limited to simple shapes, and task-specific methods are computationally intensive and not generalizable across domains. As a result, the current landscape of parametric shape objective functions has become scattered, with each domain proposing distinct IoU approximations. To address this, we unify the parametric shape optimization objective functions by introducing Marginalized Generalized IoU (MGIoU), a novel loss function that overcomes these challenges by projecting structured convex shapes onto their unique shape normals to compute one-dimensional normalized GIoU. MGIoU offers a simple, efficient, fully differentiable approximation strongly correlated with IoU. We then extend MGIoU to MGIoU+ that supports optimizing unstructured convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization across diverse applications. Experiments on standard benchmarks demonstrate that MGIoU and MGIoU+ consistently outperform existing losses while reducing loss computation latency by 10-40x. Additionally, MGIoU and MGIoU+ satisfy metric properties and scale-invariance, ensuring robustness as an objective function. We further propose MGIoU- for minimizing overlaps in tasks like collision-free trajectory prediction. Code is available at this https URL
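The projection idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`edge_normals`, `giou_1d`, `projected_giou`) are hypothetical, the shapes are assumed to be 2D convex polygons given as vertex arrays, and the plain average over axes stands in for whatever normalization the paper actually uses. It relies on one fact that holds for any two intervals: their GIoU reduces to the signed overlap divided by the length of their hull.

```python
import numpy as np

def edge_normals(poly):
    """Unit normals of the edges of a convex polygon given as an (n, 2) vertex array."""
    edges = np.roll(poly, -1, axis=0) - poly
    normals = np.stack([edges[:, 1], -edges[:, 0]], axis=1)
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

def giou_1d(a, b):
    """GIoU of the two intervals spanned by the projected coordinates a and b.

    For intervals, GIoU equals signed overlap / hull length:
    positive when they overlap, negative (gap / hull) when they are disjoint.
    """
    overlap = min(a.max(), b.max()) - max(a.min(), b.min())
    hull = max(a.max(), b.max()) - min(a.min(), b.min())
    return overlap / hull

def projected_giou(p, q):
    """Average 1D GIoU over the edge normals of both polygons (illustrative only)."""
    axes = np.concatenate([edge_normals(p), edge_normals(q)])
    return float(np.mean([giou_1d(p @ n, q @ n) for n in axes]))

# Hypothetical example shapes: a unit square and two translated copies.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
half_shifted = square + np.array([0.5, 0.0])
far_shifted = square + np.array([3.0, 0.0])

print(projected_giou(square, square))
print(projected_giou(square, half_shifted))
print(projected_giou(square, far_shifted))
```

Because every step is a min, max, or dot product, a score of this kind stays defined and informative even when the shapes do not overlap, which is the property that motivates GIoU-style losses.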
- [626] arXiv:2504.16450 (replaced) [pdf, html, other]
-
Title: An Effective Gram Matrix Characterizes Generalization in Deep NetworksSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an "effective Gram matrix" that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial "residual". Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.
- [627] arXiv:2504.16580 (replaced) [pdf, html, other]
-
Title: Hyper-Transforming Latent Diffusion ModelsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming, a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.
- [628] arXiv:2504.16688 (replaced) [pdf, html, other]
-
Title: A Statistical Evaluation of Indoor LoRaWAN Environment-Aware Propagation for 6G: MLR, ANOVA, and Residual Distribution AnalysisComments: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media. This is the accepted version of the article: To appear in the 2025 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit)Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Modeling path loss in indoor LoRaWAN technology deployments is inherently challenging due to structural obstructions, occupant density and activities, and fluctuating environmental conditions. This study proposes a two-stage approach to capture and analyze these complexities using an extensive dataset of 1,328,334 field measurements collected over six months in a single-floor office at the University of Siegen's Hoelderlinstrasse Campus, Germany. First, we implement a multiple linear regression model that includes traditional propagation metrics (distance, structural walls) and an extension with proposed environmental variables (relative humidity, temperature, carbon dioxide, particulate matter, and barometric pressure). Using analysis of variance, we demonstrate that adding these environmental factors can reduce unexplained variance by 42.32 percent. Secondly, we examine residual distributions by fitting five candidate probability distributions: Normal, Skew-Normal, Cauchy, Student's t, and Gaussian Mixture Models with one to five components. Our results show that a four-component Gaussian Mixture Model captures the residual heterogeneity of indoor signal propagation most accurately, significantly outperforming single-distribution approaches. Given the push toward ultra-reliable, context-aware communications in 6G networks, our analysis shows that environment-aware modeling can substantially improve LoRaWAN network design in dynamic indoor IoT deployments.
- [629] arXiv:2504.16727 (replaced) [pdf, html, other]
-
Title: V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual VariationsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Large Vision Language Models (LVLMs) excel in various vision-language tasks. Yet, their robustness to visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment remains largely underexplored. To bridge this gap, we introduce V$^2$R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability to visual variations, in which even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we present a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural deficiencies, underscoring the need for architectural innovations in future LVLM designs.
- [630] arXiv:2504.16734 (replaced) [pdf, html, other]
-
Title: DYNUS: Uncertainty-aware Trajectory Planner in Dynamic Unknown EnvironmentsKota Kondo, Mason Peterson, Nicholas Rober, Juan Rached Viso, Lucas Jia, Jialin Chen, Harvey Merton, Jonathan P. HowComments: 20 pages, 30 figures, Under review at IEEE Transactions on RoboticsSubjects: Robotics (cs.RO)
This paper introduces DYNUS, an uncertainty-aware trajectory planner designed for dynamic unknown environments. Operating in such settings presents many challenges -- most notably, because the agent cannot predict the ground-truth future paths of obstacles, a previously planned trajectory can become unsafe at any moment, requiring rapid replanning to avoid collisions.
Recently developed planners have used soft-constraint approaches to achieve the necessary fast computation times; however, these methods do not guarantee collision-free paths even with static obstacles. In contrast, hard-constraint methods ensure collision-free safety, but typically have longer computation times.
To address these issues, we propose three key contributions. First, the DYNUS Global Planner (DGP) and Temporal Safe Corridor Generation operate in spatio-temporal space and handle both static and dynamic obstacles in the 3D environment. Second, the Safe Planning Framework leverages a combination of exploratory, safe, and contingency trajectories to flexibly re-route when potential future collisions with dynamic obstacles are detected. Finally, the Fast Hard-Constraint Local Trajectory Formulation uses a variable elimination approach to reduce the problem size and enable faster computation by pre-computing dependencies between free and dependent variables while still ensuring collision-free trajectories.
We evaluated DYNUS in a variety of simulations, including dense forests, confined office spaces, cave systems, and dynamic environments. Our experiments show that DYNUS achieves a success rate of 100% and travel times that are approximately 25.0% faster than state-of-the-art methods. We also evaluated DYNUS on multiple platforms -- a quadrotor, a wheeled robot, and a quadruped -- in both simulation and hardware experiments.
- [631] arXiv:2504.16748 (replaced) [pdf, html, other]
-
Title: Simple Graph Contrastive Learning via Fractional-order Neural Diffusion NetworksYanan Zhao, Feng Ji, Kai Zhao, Xuhao Li, Qiyu Kang, Wenfei Liang, Yahya Alkhatib, Xingchao Jian, Wee Peng TayComments: Submitted to ICMLSubjects: Machine Learning (cs.LG)
Graph Contrastive Learning (GCL) has recently made progress as an unsupervised graph representation learning paradigm. GCL approaches can be categorized into augmentation-based and augmentation-free methods. The former relies on complex data augmentations, while the latter depends on encoders that can generate distinct views of the same input. Both approaches may require negative samples for training. In this paper, we introduce a novel augmentation-free GCL framework based on graph neural diffusion models. Specifically, we utilize learnable encoders governed by Fractional Differential Equations (FDE). Each FDE is characterized by an order parameter of the differential operator. We demonstrate that varying these parameters allows us to produce learnable encoders that generate diverse views, capturing either local or global information, for contrastive learning. Our model does not require negative samples for training and is applicable to both homophilic and heterophilic datasets. We demonstrate its effectiveness across various datasets, achieving state-of-the-art performance.
- [632] arXiv:2504.16871 (replaced) [pdf, html, other]
-
Title: Exploring How LLMs Capture and Represent Domain-Specific KnowledgeMirian Hipolito Garcia, Camille Couturier, Daniel Madrigal Diaz, Ankur Mallick, Anastasios Kyrillidis, Robert Sim, Victor Ruhle, Saravan RajmohanSubjects: Machine Learning (cs.LG)
We study whether Large Language Models (LLMs) inherently capture domain-specific nuances in natural language. Our experiments probe the domain sensitivity of LLMs by examining their ability to distinguish queries from different domains using hidden states generated during the prefill phase. We reveal latent domain-related trajectories that indicate the model's internal recognition of query domains. We also study the robustness of these domain representations to variations in prompt styles and sources. Our approach leverages these representations for model selection, mapping the LLM that best matches the domain trace of the input query (i.e., the model with the highest performance on similar traces). Our findings show that LLMs can differentiate queries for related domains, and that the fine-tuned model is not always the most accurate. Unlike previous work, our interpretations apply to both closed and open-ended generative tasks.
- [633] arXiv:2102.09552 (replaced) [pdf, html, other]
-
Title: Linear Functions to the Extended RealsComments: 23 pagesSubjects: Statistics Theory (math.ST); Computer Science and Game Theory (cs.GT)
This paper investigates functions from $\mathbb{R}^d$ to $\mathbb{R} \cup \{\pm \infty\}$ that satisfy axioms of linearity wherever allowed by extended-value arithmetic. They have a nontrivial structure defined inductively on $d$, and unlike finite linear functions, they require $\Omega(d^2)$ parameters to uniquely identify. In particular they can capture vertical tangent planes to epigraphs: a function (never $-\infty$) is convex if and only if it has an extended-valued subgradient at every point in its effective domain, if and only if it is the supremum of a family of "affine extended" functions. These results are applied to the well-known characterization of proper scoring rules, for the finite-dimensional case: it is carefully and rigorously extended here to a more constructive form. In particular it is investigated when proper scoring rules can be constructed from a given convex function.
- [634] arXiv:2305.04281 (replaced) [pdf, html, other]
-
Title: Analysing Multiscale Clusterings with Persistent HomologyComments: This work was presented at the Dagstuhl Seminar (23192) on "Topological Data Analysis and Applications"Subjects: Algebraic Topology (math.AT); Machine Learning (cs.LG)
In data clustering, it is often desirable to find not just a single partition into clusters but a sequence of partitions that describes the data at different scales (or levels of coarseness). A natural problem then is to analyse and compare the (not necessarily hierarchical) sequences of partitions that underpin such multiscale descriptions. Here, we use tools from topological data analysis and introduce the Multiscale Clustering Filtration (MCF), a well-defined and stable filtration of abstract simplicial complexes that encodes arbitrary cluster assignments in a sequence of partitions across scales of increasing coarseness. We show that the zero-dimensional persistent homology of the MCF measures the degree of hierarchy of this sequence, and the higher-dimensional persistent homology tracks the emergence and resolution of conflicts between cluster assignments across the sequence of partitions. To broaden the theoretical foundations of the MCF, we provide an equivalent construction via a nerve complex filtration, and we show that, in the hierarchical case, the MCF reduces to a Vietoris-Rips filtration of an ultrametric space. Using synthetic data, we then illustrate how the persistence diagram of the MCF provides a feature map that can serve to characterise and classify multiscale clusterings.
- [635] arXiv:2307.11104 (replaced) [pdf, html, other]
-
Title: Pseudorandomness of the Sticky Random WalkComments: 21 pages, 2 figuresSubjects: Probability (math.PR); Computational Complexity (cs.CC); Combinatorics (math.CO); Spectral Theory (math.SP)
We extend the pseudorandomness of random walks on expander graphs using the sticky random walk. Building on prior works, it was recently shown that expander random walks can fool all symmetric functions in total variation distance (TVD) up to an $O(\lambda(\frac{p}{\min f})^{O(p)})$ error, where $\lambda$ is the second largest eigenvalue of the expander, $p$ is the size of the arbitrary alphabet used to label the vertices, and $\min f = \min_{b\in[p]} f_b$, where $f_b$ is the fraction of vertices labeled $b$ in the graph. Golowich and Vadhan conjecture that the dependency on the $(\frac{p}{\min f})^{O(p)}$ term is not tight. In this paper, we resolve the conjecture in the affirmative for a family of expanders. We present a generalization of the sticky random walk for which Golowich and Vadhan predict a TVD upper bound of $O(\lambda p^{O(p)})$ using a Fourier-analytic approach. For this family of graphs, we use a combinatorial approach involving the Krawtchouk functions to derive a strengthened TVD of $O(\lambda)$. Furthermore, we present equivalencies between the generalized sticky random walk, and, using linear-algebraic techniques, show that the generalized sticky random walk parameterizes an infinite family of expander graphs.
- [636] arXiv:2310.16975 (replaced) [pdf, html, other]
-
Title: Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian InferenceComments: 26 pages, 7 tables, 8 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems. Both approaches enable conditional sampling and conditional density estimation, which are core tasks in Bayesian inference, particularly in the simulation-based ("likelihood-free") setting. Our methods represent the target conditional distribution as a transformation of a tractable reference distribution. Obtaining such a transformation, chosen here to be an approximation of the COT map, is computationally challenging even in moderate dimensions. To improve scalability, our numerical algorithms use neural networks to parameterize candidate maps and further exploit the structure of the COT problem. Our static approach approximates the map as the gradient of a partially input-convex neural network. It uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. Our dynamic approach approximates the conditional optimal transport via the flow map of a regularized neural ODE; compared to the static approach, it is slower to train but offers more modeling choices and can lead to faster sampling. We demonstrate both algorithms numerically, comparing them with competing state-of-the-art approaches, using benchmark datasets and simulation-based Bayesian inverse problems.
- [637] arXiv:2401.11679 (replaced) [pdf, html, other]
-
Title: Simulating Nighttime Visible Satellite Imagery of Tropical Cyclones Using Conditional Generative Adversarial NetworksSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Visible (VIS) imagery is important for monitoring Tropical Cyclones (TCs) but is unavailable at night. This study presents a Conditional Generative Adversarial Networks (CGAN) model to generate nighttime VIS imagery with significantly enhanced accuracy and spatial resolution. Our method offers three key improvements compared to existing models. First, we replaced the L1 loss in the pix2pix framework with the Structural Similarity Index Measure (SSIM) loss, which significantly reduced image blurriness. Second, we selected multispectral infrared (IR) bands as input based on a thorough examination of their spectral properties, providing essential physical information for accurate simulation. Third, we incorporated the direction parameters of the sun and the satellite, which addressed the dependence of VIS images on sunlight directions and enabled a much larger training set from continuous daytime data. The model was trained and validated using data from the Advanced Himawari Imager (AHI) in the daytime, achieving statistical results of SSIM = 0.923 and Root Mean Square Error (RMSE) = 0.0299, which significantly surpasses existing models. We also performed a cross-satellite nighttime model validation using the Day/Night Band (DNB) of the Visible/Infrared Imager Radiometer Suite (VIIRS), which yields outstanding results compared to existing models. Our model is operationally applied to generate accurate VIS imagery with arbitrary virtual sunlight directions, significantly contributing to the nighttime monitoring of various meteorological phenomena.
- [638] arXiv:2402.14974 (replaced) [pdf, html, other]
-
Title: Towards Spatially-Lucid AI Classification in Non-Euclidean Space: An Application for MxIF Oncology DataComments: SIAM International Conference on Data Mining (SDM24)Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Given multi-category point sets from different place-types, our goal is to develop a spatially-lucid classifier that can distinguish between two classes based on the arrangements of their points. This problem is important for many applications, such as oncology, for analyzing immune-tumor relationships and designing new immunotherapies. It is challenging due to spatial variability and interpretability needs. Previously proposed techniques require dense training data or have limited ability to handle significant spatial variability within a single place-type. Most importantly, these deep neural network (DNN) approaches are not designed to work in non-Euclidean space, particularly point sets. Existing non-Euclidean DNN methods are limited to one-size-fits-all approaches. We explore a spatial ensemble framework that explicitly uses different training strategies, including weighted-distance learning rate and spatial domain adaptation, on various place-types for spatially-lucid classification. Experimental results on real-world datasets (e.g., MxIF oncology data) show that the proposed framework provides higher prediction accuracy than baseline methods.
- [639] arXiv:2403.10671 (replaced) [pdf, html, other]
-
Title: Variation Due to Regularization Tractably Recovers Bayesian Deep LearningComments: 16 pages, 9 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Uncertainty quantification in deep learning is crucial for safe and reliable decision-making in downstream tasks. Existing methods quantify uncertainty at the last layer or other approximations of the network which may miss some sources of uncertainty in the model. To address this gap, we propose an uncertainty quantification method for large networks based on variation due to regularization. Essentially, predictions that are more (less) sensitive to the regularization of network parameters are less (more, respectively) certain. This principle can be implemented by deterministically tweaking the training loss during the fine-tuning phase and reflects confidence in the output as a function of all layers of the network. We show that regularization variation (RegVar) provides rigorous uncertainty estimates that, in the infinitesimal limit, exactly recover the Laplace approximation in Bayesian deep learning. We demonstrate its success in several deep learning architectures, showing it can scale tractably with the network size while maintaining or improving uncertainty quantification quality. Our experiments across multiple datasets show that RegVar not only identifies uncertain predictions effectively but also provides insights into the stability of learned representations.
- [640] arXiv:2409.01444 (replaced) [pdf, html, other]
-
Title: A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictionsSubjects: Methodology (stat.ME); Machine Learning (cs.LG)
Prediction models need reliable predictive performance as they inform clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. Changes in the distribution of the data impact model performance and there may be important changes between a model's current application and when and where its performance was last evaluated. In health-care, a typical change is a shift in case-mix. For example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital.
This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not.
A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. The causal case-mix framework provides insights for developing, evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task.
- [641] arXiv:2409.18804 (replaced) [pdf, other]
-
Title: Convergence of Diffusion Models Under the Manifold Hypothesis in High-DimensionsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Denoising Diffusion Probabilistic Models (DDPM) are powerful state-of-the-art methods used to generate synthetic data from high-dimensional data distributions and are widely used for image, audio, and video generation as well as many more applications in science and beyond. The manifold hypothesis states that high-dimensional data often lie on lower-dimensional manifolds within the ambient space, and is widely believed to hold in provided examples. While recent results have provided invaluable insight into how diffusion models adapt to the manifold hypothesis, they do not capture the great empirical success of these models, making this a very fruitful research direction.
In this work, we study DDPMs under the manifold hypothesis and prove that they achieve rates independent of the ambient dimension in terms of score learning. In terms of sampling complexity, we obtain rates independent of the ambient dimension w.r.t. the Kullback-Leibler divergence, and $O(\sqrt{D})$ w.r.t. the Wasserstein distance. We do this by developing a new framework connecting diffusion models to the well-studied theory of extrema of Gaussian Processes.
- [642] arXiv:2410.09046 (replaced) [pdf, html, other]
-
Title: Linear Convergence of Diffusion Models Under the Manifold HypothesisSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.
- [643] arXiv:2411.00617 (replaced) [pdf, html, other]
-
Title: Continuous and complete liver vessel segmentation with graph-attention guided diffusionComments: Second versionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Improving connectivity and completeness are the most challenging aspects of liver vessel segmentation, especially for small vessels. These challenges require both learning the continuous vessel geometry and focusing on small vessel detection. However, current methods do not explicitly address these two aspects and cannot generalize well when constrained by inconsistent annotations. Here, we take advantage of the generalization of the diffusion model and explicitly integrate connectivity and completeness in our diffusion-based segmentation model. Specifically, we use a graph-attention module that adds knowledge about vessel geometry. Additionally, we perform the graph-attention at multiple scales, thus focusing on small liver vessels. Our method outperforms five state-of-the-art medical segmentation methods on two public datasets: 3D-ircadb-01 and LiVS.
- [644] arXiv:2411.13922 (replaced) [pdf, other]
-
Title: Exponentially Consistent Nonparametric Linkage-Based Clustering of Data SequencesSubjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from unknown distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
- [645] arXiv:2411.18218 (replaced) [pdf, other]
-
Title: Exponential speed up in Monte Carlo sampling through Radial Updates
Comments: 16 + 12 pages, 5 figures, 1 table, 2 algorithms; v2: revised, published
Subjects: Computational Physics (physics.comp-ph); High Energy Physics - Lattice (hep-lat); Numerical Analysis (math.NA); Computation (stat.CO)
Recently, it has been shown that the hybrid Monte Carlo (HMC) algorithm is guaranteed to converge exponentially to a given target probability distribution $p(x)\propto e^{-V(x)}$ on non-compact spaces if augmented by an appropriate radial update. In this work we present a simple way to derive efficient radial updates meeting the necessary requirements for any potential $V$. We reduce the problem to finding a substitution for the radial direction $||x||=f(z)$ so that the effective potential $V(f(z))$ grows exponentially with $z\rightarrow\pm\infty$. Any additive update of $z$ then leads to the desired convergence. We show that choosing this update from a normal distribution with standard deviation $\sigma\approx 1/\sqrt{d}$ in $d$ dimensions yields very good results. We further generalise the previous results on radial updates to a wide class of Markov chain Monte Carlo (MCMC) algorithms beyond the HMC and we quantify the convergence behaviour of MCMC algorithms with a badly chosen radial update. Finally, we apply the radial update to the sampling of heavy-tailed distributions and achieve a speed up of many orders of magnitude.
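As a rough illustration of the radial update described in this abstract, the Python sketch below assumes the substitution $f(z)=e^{z}$ (chosen here for concreteness; the abstract does not fix a particular $f$), so that an additive step $z\to z+\gamma$ rescales $x$ by $e^{\gamma}$. The Metropolis acceptance rule includes the Jacobian factor $e^{d\gamma}$ of that rescaling. All function names and the Gaussian test potential are hypothetical and not taken from the paper.

```python
import math
import random


def gaussian_potential(x):
    """Test potential V(x) = |x|^2 / 2 (standard normal target); an assumption."""
    return 0.5 * sum(xi * xi for xi in x)


def radial_update(x, potential, sigma, rng):
    """One Metropolis radial step: propose x -> exp(gamma) * x, gamma ~ N(0, sigma^2)."""
    d = len(x)
    gamma = rng.gauss(0.0, sigma)
    proposal = [math.exp(gamma) * xi for xi in x]
    # Log acceptance ratio: -(V(x') - V(x)) plus the log-Jacobian d * gamma.
    log_accept = potential(x) - potential(proposal) + d * gamma
    if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
        return proposal
    return x


def run_radial_chain(x0, potential, n_steps, rng):
    """Apply radial updates only, with sigma = 1 / sqrt(d) as the abstract suggests."""
    sigma = 1.0 / math.sqrt(len(x0))
    chain = [list(x0)]
    for _ in range(n_steps):
        chain.append(radial_update(chain[-1], potential, sigma, rng))
    return chain
```

Used alone, this step only moves the radius and leaves the direction of $x$ unchanged, so in practice it would be interleaved with HMC or another MCMC update.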
- [646] arXiv:2412.07007 (replaced) [pdf, html, other]
-
Title: A Diffuse Domain Approximation with Transmission-Type Boundary Conditions I: Asymptotic Analysis and Numerics
Subjects: Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
Diffuse domain methods (DDMs) have garnered significant attention for approximating solutions to partial differential equations on complex geometries. These methods implicitly represent the geometry by replacing the sharp boundary interface with a diffuse layer of thickness $\varepsilon$, which scales with the minimum grid size. This approach reformulates the original equations on an extended regular domain, incorporating boundary conditions through singular source terms. In this work, we conduct a matched asymptotic analysis of a DDM for a two-sided problem with transmission-type Robin boundary conditions. Our results show that, in one-dimensional space, the solution of the diffuse domain approximation asymptotically converges to the solution of the original problem, with exactly first-order accuracy in $\varepsilon$. Furthermore, we provide numerical simulations that validate and illustrate the analytical result.
- [647] arXiv:2501.12314 (replaced) [pdf, html, other]
-
Title: Uncertainty Quantification With Noise Injection in Neural Networks: A Bayesian Perspective
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Model uncertainty quantification involves measuring and evaluating the uncertainty linked to a model's predictions, helping assess their reliability and confidence. Noise injection is a technique used to enhance the robustness of neural networks by introducing randomness. In this paper, we establish a connection between noise injection and uncertainty quantification from a Bayesian standpoint. We theoretically demonstrate that injecting noise into the weights of a neural network is equivalent to Bayesian inference on a deep Gaussian process. Consequently, we introduce a Monte Carlo Noise Injection (MCNI) method, which involves injecting noise into the parameters during training and performing multiple forward propagations during inference to estimate the uncertainty of the prediction. Through simulation and experiments on regression and classification tasks, our method demonstrates superior performance compared to the baseline model.
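The inference half of the idea above (several noisy forward passes, then summary statistics) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: a single linear unit stands in for a neural network, Gaussian noise is added to the weights at prediction time, and the training-time noise injection described in the abstract is omitted. The function names and default values are hypothetical.

```python
import random
import statistics


def noisy_forward(x, weights, bias, sigma, rng):
    """One forward pass of a linear unit with Gaussian noise added to every weight."""
    noisy = [w + rng.gauss(0.0, sigma) for w in weights]
    return sum(w * xi for w, xi in zip(noisy, x)) + bias


def mc_predict(x, weights, bias, sigma=0.1, n_passes=200, seed=0):
    """Repeat the noisy forward pass and report the sample mean and standard deviation."""
    rng = random.Random(seed)
    samples = [noisy_forward(x, weights, bias, sigma, rng) for _ in range(n_passes)]
    return statistics.fmean(samples), statistics.stdev(samples)
```

The returned standard deviation serves as a simple uncertainty estimate; with `sigma=0.0` the passes coincide and the spread collapses to zero.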
- [648] arXiv:2501.13403 (replaced) [pdf, html, other]
-
Title: ROMA: ROtary and Movable Antenna
Comments: Rotary and movable antennas, multi-user MIMO, spectral efficiency, alternating optimization
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The rotary and movable antenna (ROMA) architecture represents a next-generation multi-antenna technology that enables flexible adjustment of antenna position and array rotation angles of the transceiver. In this letter, we propose a ROMA-aided multi-user MIMO communication system to fully enhance the efficiency and reliability of system transmissions. By deploying ROMA panels at both the transmitter and receiver sides, and jointly optimizing the three-dimensional (3D) rotation angles of each ROMA panel and the relative positions of antenna elements based on the spatial distribution of users and channel state information (CSI), we can achieve the objective of maximizing the average spectral efficiency (SE). Subsequently, we conduct a detailed analysis of the average SE performance of the system under the consideration of maximum ratio (MR) precoding. Due to the non-convexity of the optimization problem in the ROMA multi-user MIMO system, we propose an efficient solution based on an alternating optimization (AO) algorithm. Finally, simulation results demonstrate that the AO-based ROMA architecture can significantly improve the average SE. Furthermore, the performance improvement becomes more pronounced as the size of the movable region and the transmission power increase.
- [649] arXiv:2501.18577 (replaced) [pdf, html, other]
-
Title: Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
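In the simplest case of estimating a population mean, the predict-then-debias idea can be sketched as follows: average the model predictions over the large sample, then add a correction equal to the mean prediction error on the small complete sample. This is an illustration of the general idea under uniform sampling only; it does not reproduce the paper's bootstrap confidence intervals or its treatment of weighted, stratified, or clustered samples, and the function name is hypothetical.

```python
from statistics import fmean


def debiased_mean(preds_all, y_complete, preds_complete):
    """Mean of predictions on the large sample, corrected by the average
    prediction error observed on the complete (fully observed) sample."""
    if len(y_complete) != len(preds_complete):
        raise ValueError("complete-sample outcomes and predictions must align")
    naive = fmean(preds_all)
    bias_correction = fmean(y - p for y, p in zip(y_complete, preds_complete))
    return naive + bias_correction
```

For example, if the predictions average 2.5 on the large sample and underestimate the truth by 1.0 on average in the complete sample, the corrected estimate is 3.5.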
- [650] arXiv:2502.05730 (replaced) [pdf, html, other]
-
Title: Attainability of Two-Point Testing Rates for Finite-Sample Location Estimation
Subjects: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
LeCam's two-point testing method yields perhaps the simplest lower bound for estimating the mean of a distribution: roughly, if it is impossible to well-distinguish a distribution centered at $\mu$ from the same distribution centered at $\mu+\Delta$, then it is impossible to estimate the mean by better than $\Delta/2$. It is setting-dependent whether or not a nearly matching upper bound is attainable. We study the conditions under which the two-point testing lower bound can be attained for univariate mean estimation; both in the setting of location estimation (where the distribution is known up to translation) and adaptive location estimation (unknown distribution). Roughly, we will say an estimate nearly attains the two-point testing lower bound if it incurs error that is at most polylogarithmically larger than the Hellinger modulus of continuity for $\tilde{\Omega}(n)$ samples.
Adaptive location estimation is particularly interesting as some distributions admit much better guarantees than sub-Gaussian rates (e.g. $\operatorname{Unif}(\mu-1,\mu+1)$ permits error $\Theta(\frac{1}{n})$, while the sub-Gaussian rate is $\Theta(\frac{1}{\sqrt{n}})$), yet it is not obvious whether these rates may be adaptively attained by one unified approach. Our main result designs an algorithm that nearly attains the two-point testing rate for mixtures of symmetric, log-concave distributions with a common mean. Moreover, this algorithm runs in near-linear time and is parameter-free. In contrast, we show the two-point testing rate is not nearly attainable even for symmetric, unimodal distributions.
We complement this with results for location estimation, showing the two-point testing rate is nearly attainable for unimodal distributions, but unattainable for symmetric distributions.
- [651] arXiv:2503.03659 (replaced) [pdf, html, other]
-
Title: Conformal prediction of future insurance claims in the regression problem
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
In the current insurance literature, prediction of insurance claims in the regression problem is often performed with a statistical model. This model-based approach may potentially suffer from several drawbacks: (i) model misspecification, (ii) selection effect, and (iii) lack of finite-sample validity. This article addresses these three issues simultaneously by employing conformal prediction -- a general machine learning strategy for valid predictions. The proposed method is both model-free and tuning-parameter-free. It also guarantees finite-sample validity at a pre-assigned coverage probability level. Examples, based on both simulated and real data, are provided to demonstrate the excellent performance of the proposed method and its applications in insurance, especially regarding meeting the solvency capital requirement of European insurance regulation, Solvency II.
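To make the finite-sample coverage guarantee concrete, here is a minimal Python sketch of split conformal prediction, a common variant chosen here for brevity; the article's exact procedure may differ. It assumes absolute residuals from a held-out calibration set are already available, and the function name is hypothetical.

```python
import math


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Prediction interval with target coverage 1 - alpha.

    cal_residuals: absolute errors |y - y_hat| on a calibration set that
    was not used to fit the point predictor.
    """
    n = len(cal_residuals)
    # Rank of the calibration residual used as the interval half-width.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: only the trivial interval is valid.
        return (-math.inf, math.inf)
    half_width = sorted(cal_residuals)[k - 1]
    return (prediction - half_width, prediction + half_width)
```

The coverage statement relies on exchangeability of calibration and test points rather than on a correctly specified claims model, which is what makes the approach model-free.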
- [652] arXiv:2503.17430 (replaced) [pdf, html, other]
-
Title: Long-term excitation energy transfer predicted by a modified convolutional neural networks in the FMO complexes
Comments: 11 pages, 10 figures
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantum Physics (quant-ph)
In machine learning (ML), the risk of recursive strategies overfitting historical data has driven the development of convolutional neural networks (CNNs) in simulating quantum dissipative dynamics. In this work, we propose an efficient CNN scheme incorporating novel redundant time-functions to predict 100 picosecond (ps) excitation energy transfer (EET) in Fenna-Matthews-Olson (FMO) complexes, in which the original time $t$ is normalized by mapping it to the [0, 1] range, allowing different functions to focus on distinct time intervals, thereby effectively capturing the multi-timescale characteristics of EET dynamics. This method simplifies optimization and enhances learning efficiency, and our results demonstrate the accuracy, robustness, and efficiency of our approach in predicting quantum dissipative dynamics.
- [653] arXiv:2503.19175 (replaced) [pdf, html, other]
-
Title: A three-axis Nanopositioner based on Near-Field Acoustic Levitation and Electromagnetic Actuation
Subjects: Applied Physics (physics.app-ph); Systems and Control (eess.SY)
Near-field acoustic levitation (NFAL) enables nanometer-scale positioning resolution and bandwidth exceeding several hundred hertz specifically along the vertical (Z) direction, owing to its high acoustic stiffness and squeeze film damping. However, its application to horizontal (XY) positioning is limited by significantly lower acoustic stiffness and insufficient damping in horizontal directions, resulting in reduced resolution and bandwidth. Moreover, NFAL-based positioning systems typically lack multi-axis actuation capabilities due to challenges in generating multi-directional acoustic forces. This work presents a hybrid positioning approach that overcomes the mentioned limitations by integrating NFAL with electromagnetic actuation. A planar magnetic platform is acoustically levitated, while a coplanar current-carrying coil provides horizontal trapping stiffness more than three orders of magnitude higher than that achievable with acoustic forces alone. Additionally, the coil generates three-dimensional electromagnetic forces, enabling multi-axis positioning capability. Eddy currents induced in a thin copper sheet integrated with the coil enhance horizontal damping by 52 times. We experimentally demonstrate precise 3-axis linear motion with a root mean square (RMS) positioning resolution better than 20 nm along all axes. The system achieves an in-plane motion range of 1.42 mm with a bandwidth of 16 Hz and a Z-axis motion range of 40 micrometers with a positioning bandwidth of 171 Hz.
- [654] arXiv:2504.01650 (replaced) [pdf, html, other]
-
Title: Sparse Gaussian Neural Processes
Comments: Proceedings of the 7th Symposium on Advances in Approximate Bayesian Inference, PMLR, 2025. 25 pages, 6 figures, 5 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Despite significant recent advances in probabilistic meta-learning, it is common for practitioners to avoid using deep learning models due to a comparative lack of interpretability. Instead, many practitioners simply use non-meta-models such as Gaussian processes with interpretable priors, and conduct the tedious procedure of training their model from scratch for each task they encounter. While this is justifiable for tasks with a limited number of data points, the cubic computational cost of exact Gaussian process inference renders this prohibitive when each task has many observations. To remedy this, we introduce a family of models that meta-learn sparse Gaussian process inference. Not only does this enable rapid prediction on new tasks with sparse Gaussian processes, but since our models have clear interpretations as members of the neural process family, it also allows manual elicitation of priors in a neural process for the first time. In meta-learning regimes for which the number of observed tasks is small or for which expert domain knowledge is available, this offers a crucial advantage.
- [655] arXiv:2504.04002 (replaced) [pdf, other]
-
Title: Machine Learning Reveals Composition Dependent Thermal Stability in Halide Perovskites
Abigail R. Hering, Mansha Dubey, Elahe Hosseini, Meghna Srivastava, Yu An, Juan-Pablo Correa-Baena, Houman Homayoun, Marina S. Leite
Comments: 21 pages, 5 figures
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Halide perovskites exhibit unpredictable properties in response to environmental stressors, due to several composition-dependent degradation mechanisms. In this work, we apply data visualization and machine learning (ML) techniques to reveal unexpected correlations between composition, temperature, and material properties while using high throughput, in situ environmental photoluminescence (PL) experiments. Correlation heatmaps show the strong influence of Cs content on film degradation, and dimensionality reduction visualization methods uncover clear composition-based data clusters. An extreme gradient boosting algorithm (XGBoost) effectively forecasts PL features for ten perovskite films with both composition-agnostic (>85% accuracy) and composition-dependent (>75% accuracy) model approaches, while elucidating the relative feature importance of composition (up to 99%). This model validates a previously unseen anti-correlation between Cs content and material thermal stability. Our ML-based framework can be expanded to any perovskite family, significantly reducing the analysis time currently employed to identify stable options for photovoltaics.
- [656] arXiv:2504.07347 (replaced) [pdf, html, other]
-
Title: Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have focused on system-level engineering, little has been explored from a mathematical modeling and queueing perspective.
In this paper, we aim to develop the queueing fundamentals for large language model (LLM) inference, bridging the gap between the queueing theory and LLM system communities. In particular, we study the throughput aspect in LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for an individual LLM inference engine, highlighting 'work-conserving' as a key design principle in practice. In a network of LLM agents, work-conserving scheduling alone is insufficient, particularly when facing specific workload structures and multi-class workflows that require more sophisticated scheduling strategies. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits that the queueing community can offer in improving LLM inference systems and call for more interdisciplinary development.
- [657] arXiv:2504.08469 (replaced) [pdf, html, other]
-
Title: Artifact detection and localization in single-channel mobile EEG for sleep research using deep learning and attention mechanisms
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Artifacts in the electroencephalogram (EEG) degrade signal quality and impact the analysis of brain activity. Current methods for detecting artifacts in sleep EEG rely on simple threshold-based algorithms that require manual intervention, which is time-consuming and impractical due to the vast volume of data that novel mobile recording systems generate. We propose a convolutional neural network (CNN) model incorporating a convolutional block attention module (CNN-CBAM) to detect and identify the location of artifacts in the sleep EEG with attention maps. We benchmarked this model against six other machine learning and signal processing approaches. We trained/tuned all models on 72 manually annotated EEG recordings obtained during home-based monitoring from 18 healthy participants with a mean (SD) age of 68.05 y ($\pm$5.02). We tested them on 26 separate recordings from 6 healthy participants with a mean (SD) age of 68.33 y ($\pm$4.08), which contained artifacts in 4% of epochs. CNN-CBAM achieved the highest area under the receiver operating characteristic curve (0.88), sensitivity (0.81), and specificity (0.86) when compared to the other approaches. The attention maps from CNN-CBAM localized artifacts within the epoch with a sensitivity of 0.71 and specificity of 0.67. This work demonstrates the feasibility of automating the detection and localization of artifacts in wearable sleep EEG.
- [658] arXiv:2504.09655 (replaced) [pdf, other]
-
Title: OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation
Justin Namuk Kim, Yiqiao Liu, Rajath Soans, Keith Persson, Sarah Halek, Michal Tomaszewski, Jianda Yuan, Gregory Goldmacher, Antong Chen
Comments: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-art (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.
- [659] arXiv:2504.12352 (replaced) [pdf, html, other]
-
Title: Deep Generative Model-Based Generation of Synthetic Individual-Specific Brain MRI Segmentations
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
To the best of our knowledge, all existing methods that can generate synthetic brain magnetic resonance imaging (MRI) scans for a specific individual require detailed structural or volumetric information about the individual's brain. However, such brain information is often scarce, expensive, and difficult to obtain. In this paper, we propose the first approach capable of generating synthetic brain MRI segmentations -- specifically, 3D white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) segmentations -- for individuals using their easily obtainable and often readily available demographic, interview, and cognitive test information. Our approach features a novel deep generative model, CSegSynth, which outperforms existing prominent generative models, including conditional variational autoencoder (C-VAE), conditional generative adversarial network (C-GAN), and conditional latent diffusion model (C-LDM). We demonstrate the high quality of our synthetic segmentations through extensive evaluations. Also, in assessing the effectiveness of the individual-specific generation, we achieve superior volume prediction, with mean absolute errors of only 36.44mL, 29.20mL, and 35.51mL between the ground-truth WM, GM, and CSF volumes of test individuals and those volumes predicted based on generated individual-specific segmentations, respectively.
- [660] arXiv:2504.13340 (replaced) [pdf, html, other]
-
Title: Putting the Segment Anything Model to the Test with 3D Knee MRI - A Comparison with State-of-the-Art Performance
Comments: Work accepted at BMVC 2024. Minor changes to the camera-ready version since acceptance include a corrected running header and the addition of an Acknowledgments section (including code availability)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Menisci are cartilaginous tissue found within the knee that contribute to joint lubrication and weight dispersal. Damage to menisci can lead to onset and progression of knee osteoarthritis (OA), a condition that is a leading cause of disability, and for which there are few effective therapies. Accurate automated segmentation of menisci would allow for earlier detection and treatment of meniscal abnormalities, as well as shedding more light on the role the menisci play in OA pathogenesis. Work in this area has mainly used variants of convolutional networks, but there has been no attempt to utilise recent large vision transformer segmentation models. The Segment Anything Model (SAM) is a so-called foundation segmentation model, which has been found useful across a range of different tasks due to the large volume of data used for training the model. In this study, SAM was adapted to perform fully-automated segmentation of menisci from 3D knee magnetic resonance images. A 3D U-Net was also trained as a baseline. It was found that, when fine-tuning only the decoder, SAM was unable to compete with 3D U-Net, achieving a Dice score of $0.81\pm0.03$, compared to $0.87\pm0.03$, on a held-out test set. When fine-tuning SAM end-to-end, a Dice score of $0.87\pm0.03$ was achieved. The performance of both the end-to-end trained SAM configuration and the 3D U-Net was comparable to the winning Dice score ($0.88\pm0.03$) in the IWOAI Knee MRI Segmentation Challenge 2019. Performance in terms of the Hausdorff Distance showed that both configurations of SAM were inferior to 3D U-Net in matching the meniscus morphology. Results demonstrated that, despite its generalisability, SAM was unable to outperform a basic 3D U-Net in meniscus segmentation, and may not be suitable for similar 3D medical image segmentation tasks also involving fine anatomical structures with low contrast and poorly-defined boundaries.
- [661] arXiv:2504.13433 (replaced) [pdf, html, other]
-
Title: A Recursive Block Pillar Structure in the Kolakoski Sequence K(1,3)
Comments: 12 pages, no figures. Undergraduate research. Includes full proofs and references
Subjects: Combinatorics (math.CO); Formal Languages and Automata Theory (cs.FL); Dynamical Systems (math.DS)
The Kolakoski sequence K(1,3) over {1, 3} is known to be structured, unlike K(1,2), with symbol frequency d approx. 0.397 linked to the Pisot number alpha (real root of x^3 - 2x^2 - 1 = 0). We reveal an explicit nested recursion defining block sequences B(n) and pillar sequences P(n) via B(n+1) = B(n) P(n) B(n) and P(n+1) = G(R(P(n)), 3), where G generates runs from vector R(P(n)). We prove B(n) are prefixes of K(1,3) converging to it, and B(n+1) = G(R(B(n)), 1), directly reflecting the Kolakoski self-encoding property. We derive recurrences for lengths |B(n)|, |P(n)| and symbol counts, confirming growth governed by alpha (limit |B(n+1)|/|B(n)| = alpha as n -> infinity). If block/pillar densities converge, they must equal d. This constructive framework provides an alternative perspective on K(1,3)'s regularity, consistent with known results from substitution dynamics.
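The self-encoding property mentioned above (the sequence of run lengths of K(1,3) is K(1,3) itself) can be checked numerically. The Python sketch below generates a prefix with the standard run-by-run construction, assuming the sequence starts with 1; it does not implement the block and pillar recursion of the paper, and the function names are hypothetical.

```python
def kolakoski(n, a=1, b=3):
    """First n terms of the Kolakoski sequence over {a, b}, starting with a."""
    seq = []
    symbols = (a, b)
    run_index = 0
    while len(seq) < n:
        sym = symbols[run_index % 2]
        # The run length is read from the sequence itself; if that term does not
        # exist yet, it is the first term of the run being written, i.e. sym.
        length = seq[run_index] if run_index < len(seq) else sym
        seq.extend([sym] * length)
        run_index += 1
    return seq[:n]


def run_lengths(seq):
    """Lengths of the maximal runs of equal symbols in seq."""
    lengths = []
    for i, s in enumerate(seq):
        if i > 0 and s == seq[i - 1]:
            lengths[-1] += 1
        else:
            lengths.append(1)
    return lengths
```

Dropping the last (possibly truncated) run, `run_lengths(kolakoski(n))` reproduces a prefix of the sequence. Counting symbols in a long prefix gives an empirical frequency that can be compared with the density $d$ quoted in the abstract.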
- [662] arXiv:2504.15603 (replaced) [pdf, html, other]
-
Title: Quantum Speedup for Sampling Random Spanning Trees
Subjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS)
We present a quantum algorithm for sampling random spanning trees from a weighted graph in $\widetilde{O}(\sqrt{mn})$ time, where $n$ and $m$ denote the number of vertices and edges, respectively. Our algorithm has sublinear runtime for dense graphs and achieves a quantum speedup over the best-known classical algorithm, which runs in $\widetilde{O}(m)$ time. The approach carefully combines, on one hand, a classical method based on "large-step" random walks for reduced mixing time and, on the other hand, quantum algorithmic techniques, including quantum graph sparsification and a sampling-without-replacement variant of Hamoudi's multiple-state preparation. We also establish a matching lower bound, proving the optimality of our algorithm up to polylogarithmic factors. These results highlight the potential of quantum computing in accelerating fundamental graph sampling problems.
- [663] arXiv:2504.16098 (replaced) [pdf, html, other]
-
Title: SeizureFormer: A Transformer Model for IEA-Based Seizure Risk Forecasting
Tianning Feng (1), Junting Ni (1), Ezequiel Gleichgerrcht (2), Wei Jin (1) ((1) Department of Computer Science, Emory University, Atlanta, GA, USA, (2) Department of Neurology, Emory University, Atlanta, GA, USA)
Comments: 9 pages, 2 figures. Submitted as an undergraduate honors thesis at Emory University
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
We present SeizureFormer, a Transformer-based model for long-term seizure risk forecasting using interictal epileptiform activity (IEA) surrogate biomarkers and long episode (LE) biomarkers from responsive neurostimulation (RNS) systems. Unlike raw scalp EEG-based models, SeizureFormer leverages structured, clinically relevant features and integrates CNN-based patch embedding, multi-head self-attention, and squeeze-and-excitation blocks to model both short-term dynamics and long-term seizure cycles. Tested across five patients and multiple prediction windows (1 to 14 days), SeizureFormer achieved state-of-the-art performance with mean ROC AUC of 79.44 percent and mean PR AUC of 76.29 percent. Compared to statistical, machine learning, and deep learning baselines, it demonstrates enhanced generalizability and seizure risk forecasting performance under class imbalance. This work supports future clinical integration of interpretable and robust seizure forecasting tools for personalized epilepsy management.