Computer Science
Showing new listings for Friday, 28 March 2025
- [1] arXiv:2503.20790 [pdf, html, other]
-
Title: Toward a Human-Centered AI-assisted Colonoscopy System in Australia
Comments: 4 pages, accepted by CHI '25 workshop Envisioning the Future of Interactive Health
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
While AI-assisted colonoscopy promises improved colorectal cancer screening, its success relies on effective integration into clinical practice, not just algorithmic accuracy. This paper, based on an Australian field study (observations and gastroenterologist interviews), highlights a critical disconnect: current development prioritizes machine learning model performance, overlooking essential aspects of user interface design, workflow integration, and overall user experience. Industry interactions reveal a similar emphasis on data and algorithms. To realize AI's full potential, the HCI community must champion user-centered design, ensuring these systems are usable, support endoscopist expertise, and enhance patient outcomes.
- [2] arXiv:2503.20791 [pdf, html, other]
-
Title: ECLAIR: Enhanced Clarification for Interactive Responses in an Enterprise AI Assistant
Comments: 3 pages, 1 figure
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in real-world, enterprise-level interactions, where context and domain-specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi-agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real-world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few-shot methods.
- [3] arXiv:2503.20793 [pdf, html, other]
-
Title: Semantic Web -- A Forgotten Wave of Artificial Intelligence?
Comments: 21 pages, 9 figures
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
The history of Artificial Intelligence is a narrative of waves - rising optimism followed by crashing disappointments. AI winters, such as the early 2000s, are often remembered as barren periods of innovation. This paper argues that such a perspective overlooks a crucial wave of AI that seems to be forgotten: the rise of the Semantic Web, which is based on knowledge representation, logic, and reasoning, and its interplay with intelligent Software Agents. Fast forward to today, and ChatGPT has reignited AI enthusiasm, built on deep learning and advanced neural models. However, before Large Language Models dominated the conversation, another ambitious vision emerged - one where AI-driven Software Agents autonomously served Web users based on a structured, machine-interpretable Web. The Semantic Web aimed to transform the World Wide Web into an ecosystem where AI could reason, understand, and act. Between 2000 and 2010, this vision sparked a significant research boom, only to fade into obscurity as AI's mainstream narrative shifted elsewhere. Today, as LLMs edge toward autonomous execution, we revisit this overlooked wave. By analyzing its academic impact through bibliometric data, we highlight the Semantic Web's role in AI history and its untapped potential for modern Software Agent development. Recognizing this forgotten chapter not only deepens our understanding of AI's cyclical evolution but also offers key insights for integrating emerging technologies.
- [4] arXiv:2503.20794 [pdf, html, other]
-
Title: Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text DeIdentification?
Comments: 14 pages, accepted at Text2Story Workshop at ECIR 2025
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
We systematically assess the performance of three leading API-based de-identification systems - Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o - against our de-identification systems on a ground truth dataset of 48 clinical documents annotated by medical experts. Our analysis, conducted at both entity-level and token-level, demonstrates that our solution, Healthcare NLP, achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%). Beyond accuracy, Healthcare NLP is also the most cost-effective solution, reducing processing costs by over 80% compared to Azure and GPT-4o. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice. Our results underscore a critical limitation: zero-shot commercial APIs fail to meet the accuracy, adaptability, and cost-efficiency required for regulatory-grade clinical de-identification. Healthcare NLP's superior performance, customization capabilities, and economic advantages position it as the more viable solution for healthcare organizations seeking compliance and scalability in clinical NLP workflows.
- [5] arXiv:2503.20796 [pdf, html, other]
-
Title: EXPLICATE: Enhancing Phishing Detection through Explainable AI and LLM-Powered Interpretability
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Sophisticated phishing attacks have emerged as a major cybersecurity threat, becoming more common and difficult to prevent. Though machine learning techniques have shown promise in detecting phishing attacks, they function mainly as "black boxes" without revealing their decision-making rationale. This lack of transparency erodes the trust of users and diminishes their effective threat response. We present EXPLICATE, a framework that enhances phishing detection through a three-component architecture: an ML-based classifier using domain-specific features, a dual-explanation layer combining LIME and SHAP for complementary feature-level insights, and an LLM enhancement using DeepSeek v3 to translate technical explanations into accessible natural language. Our experiments show that EXPLICATE attains 98.4% accuracy on all metrics, which is on par with existing deep learning techniques but with better explainability. The framework generates high-quality explanations with an accuracy of 94.2% and a consistency of 96.8% between the LLM output and the model prediction. We create EXPLICATE as a fully usable GUI application and a lightweight Chrome extension, showing its applicability in many deployment situations. The research shows that high detection performance can go hand-in-hand with meaningful explainability in security applications. Most importantly, it addresses the critical divide between automated AI and user trust in phishing detection systems.
- [6] arXiv:2503.20797 [pdf, html, other]
-
Title: "Whose Side Are You On?" Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort and the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving label-balanced demonstration selection, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM's classification.
- [7] arXiv:2503.20798 [pdf, html, other]
-
Title: Payload-Aware Intrusion Detection with CMAE and Large Language Models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Intrusion Detection Systems (IDS) are crucial for identifying malicious traffic, yet traditional signature-based methods struggle with zero-day attacks and high false positive rates. AI-driven packet-capture analysis offers a promising alternative. However, existing approaches rely heavily on flow-based or statistical features, limiting their ability to detect fine-grained attack patterns. This study proposes Xavier-CMAE, an enhanced Convolutional Multi-Head Attention Ensemble (CMAE) model that improves detection accuracy while reducing computational overhead. By replacing Word2Vec embeddings with a Hex2Int tokenizer and Xavier initialization, Xavier-CMAE eliminates pre-training, accelerates training, and achieves 99.971% accuracy with a 0.018% false positive rate, outperforming Word2Vec-based methods. Additionally, we introduce LLM-CMAE, which integrates pre-trained Large Language Model (LLM) tokenizers into CMAE. While LLMs enhance feature extraction, their computational cost hinders real-time detection. LLM-CMAE balances efficiency and performance, reaching 99.969% accuracy with a 0.019% false positive rate. This work advances AI-powered IDS by (1) introducing a payload-based detection framework, (2) enhancing efficiency with Xavier-CMAE, and (3) integrating LLM tokenizers for improved real-time detection.
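The abstract does not spell out how the Hex2Int tokenizer works. As a minimal sketch, assuming it maps each hex-encoded payload byte to its integer value (0 to 255) and pads or truncates to a fixed length, the idea might look like the following. The function name, the `PAD_ID` value, and the `max_len` parameter are hypothetical and not taken from the paper.

```python
from typing import List

PAD_ID = 256  # hypothetical padding id, chosen outside the 0-255 byte range


def hex2int_tokenize(payload_hex: str, max_len: int = 16) -> List[int]:
    """Map each two-character hex byte of a payload to its integer value.

    Assumption: this is only an illustration of byte-level tokenization
    that needs no pre-trained embedding vocabulary.
    """
    # Accept common separators found in packet dumps.
    cleaned = payload_hex.replace(" ", "").replace(":", "")
    if len(cleaned) % 2 != 0:
        raise ValueError("hex payload must contain an even number of digits")
    # One token per byte: "47" -> 71, "2f" -> 47, and so on.
    tokens = [int(cleaned[i:i + 2], 16) for i in range(0, len(cleaned), 2)]
    # Truncate long payloads and pad short ones to a fixed length.
    tokens = tokens[:max_len]
    return tokens + [PAD_ID] * (max_len - len(tokens))


if __name__ == "__main__":
    # The bytes of "GET /" followed by three padding tokens.
    print(hex2int_tokenize("47 45 54 20 2f", max_len=8))
```

Because the vocabulary is fixed by the byte range, an embedding layer over these ids can be randomly initialized (for example with Xavier initialization) rather than pre-trained, which is consistent with the abstract's claim of eliminating pre-training.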
- [8] arXiv:2503.20800 [pdf, html, other]
-
Title: Evidencing Unauthorized Training Data from AI Generated Content using Information Isotopes
Authors: Qi Tao, Yin Jinhua, Cai Dongqi, Xie Yueqi, Wang Huili, Hu Zhiyang, Yang Peiru, Nan Guoshun, Zhou Zhili, Wang Shangguang, Lyu Lingjuan, Huang Yongfeng, Lane Nicholas
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In light of scaling laws, many AI institutions are intensifying efforts to construct advanced AIs on extensive collections of high-quality human data. However, in a rush to stay competitive, some institutions may inadvertently or even deliberately include unauthorized data (like privacy- or intellectual property-sensitive content) for AI training, which infringes on the rights of data owners. Compounding this issue, these advanced AI services are typically built on opaque cloud platforms, which restricts access to internal information during AI training and inference, leaving only the generated outputs available for forensics. Thus, despite the introduction of legal frameworks by various countries to safeguard data rights, uncovering evidence of data misuse in modern opaque AI applications remains a significant challenge. In this paper, inspired by the ability of isotopes to trace elements within chemical reactions, we introduce the concept of information isotopes and elucidate their properties in tracing training data within opaque AI systems. Furthermore, we propose an information isotope tracing method designed to identify and provide evidence of unauthorized data usage by detecting the presence of target information isotopes in AI generations. We conduct experiments on ten AI models (including GPT-4o, Claude-3.5, and DeepSeek) and four benchmark datasets in critical domains (medical data, copyrighted books, and news). Results show that our method can distinguish training datasets from non-training datasets with 99% accuracy and significant evidence (p-value < 0.001) by examining a data entry equivalent in length to a research paper. The findings show the potential of our work as an inclusive tool for empowering individuals, including those without expertise in AI, to safeguard their data rights in the rapidly evolving era of AI advancements and applications.
- [9] arXiv:2503.20801 [pdf, html, other]
-
Title: SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment
Comments: 15 pages
Subjects: Computation and Language (cs.CL)
Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating a single type of structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single type of structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, it underestimates the impact of embedding distortion produced by noisy seed pairs on the alignment effect. To solve these problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we design a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of the structural heterogeneity of KGs and lays the foundation for the optimization of initial potential seed pairs. Then, we design the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
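The filtering step described above, a similarity threshold combined with a bidirectional nearest neighbor check, can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity over toy embeddings, and the function names, entity ids, and threshold value are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutual_nn_pairs(
    src: Dict[str, Vector], tgt: Dict[str, Vector], threshold: float
) -> List[Tuple[str, str]]:
    """Keep (s, t) only if s and t are each other's nearest neighbor
    and their similarity reaches the threshold."""
    # Nearest target entity for every source entity.
    best_tgt = {s: max(tgt, key=lambda t: cosine(sv, tgt[t])) for s, sv in src.items()}
    # Nearest source entity for every target entity.
    best_src = {t: max(src, key=lambda s: cosine(src[s], tv)) for t, tv in tgt.items()}
    pairs = []
    for s, t in best_tgt.items():
        if best_src[t] == s and cosine(src[s], tgt[t]) >= threshold:
            pairs.append((s, t))
    return pairs


if __name__ == "__main__":
    # Toy embeddings for entities from two hypothetical knowledge graphs.
    kg1 = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "a3": [1.0, 1.0]}
    kg2 = {"b1": [0.9, 0.1], "b2": [0.1, 0.9]}
    print(mutual_nn_pairs(kg1, kg2, threshold=0.9))
```

In this toy example `a3` picks `b1` as its nearest neighbor, but `b1` prefers `a1`, so the one-directional match is discarded. That is the kind of noisy candidate the bidirectional check is meant to filter out.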
- [10] arXiv:2503.20802 [pdf, html, other]
-
Title: CEFW: A Comprehensive Evaluation Framework for Watermark in Large Language Models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Text watermarking provides an effective solution for identifying synthetic text generated by large language models. However, existing techniques often focus on satisfying specific criteria while ignoring other key aspects, and they lack a unified evaluation. To fill this gap, we propose the Comprehensive Evaluation Framework for Watermark (CEFW), a unified framework that comprehensively evaluates watermarking methods across five key dimensions: ease of detection, fidelity of text quality, minimal embedding cost, robustness to adversarial attacks, and imperceptibility to prevent imitation or forgery. By assessing watermarks according to all these key criteria, CEFW offers a thorough evaluation of their practicality and effectiveness. Moreover, we introduce a simple and effective watermarking method called Balanced Watermark (BW), which guarantees robustness and imperceptibility by balancing the way watermark information is added. Extensive experiments show that BW outperforms existing methods in overall performance across all evaluation dimensions. We release our code to the community for future research at this https URL.
- [11] arXiv:2503.20803 [pdf, html, other]
-
Title: Leveraging VAE-Derived Latent Spaces for Enhanced Malware Detection with Machine Learning Classifiers
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
This paper assesses the performance of five machine learning classifiers: Decision Tree, Naive Bayes, LightGBM, Logistic Regression, and Random Forest using latent representations learned by a Variational Autoencoder from malware datasets. Results from the experiments conducted on different training-test splits with different random seeds reveal that all the models perform well in detecting malware with ensemble methods (LightGBM and Random Forest) performing slightly better than the rest. In addition, the use of latent features reduces the computational cost of the model and the need for extensive hyperparameter tuning for improved efficiency of the model for deployment. Statistical tests show that these improvements are significant, and thus, the practical relevance of integrating latent space representation with traditional classifiers for effective malware detection in cybersecurity is established.
- [12] arXiv:2503.20804 [pdf, html, other]
-
Title: AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Assessing the safety of autonomous driving policy is of great importance, and reinforcement learning (RL) has emerged as a powerful method for discovering critical vulnerabilities in driving policies. However, existing RL-based approaches often struggle to identify vulnerabilities that are both effective (meaning the autonomous vehicle is genuinely responsible for the accidents) and diverse (meaning they span various failure types). To address these challenges, we propose AED, a framework that uses large language models (LLMs) to automatically discover effective and diverse vulnerabilities in autonomous driving policies. We first utilize an LLM to automatically design reward functions for RL training. Then we let the LLM consider a diverse set of accident types and train adversarial policies for different accident types in parallel. Finally, we use preference-based learning to filter ineffective accidents and enhance the effectiveness of each vulnerability. Experiments across multiple simulated traffic scenarios and tested policies show that AED uncovers a broader range of vulnerabilities and achieves higher attack success rates compared with expert-designed rewards, thereby reducing the need for manual reward engineering and improving the diversity and effectiveness of vulnerability discovery.
- [13] arXiv:2503.20806 [pdf, html, other]
-
Title: SCVI: Bridging Social and Cyber Dimensions for Comprehensive Vulnerability Assessment
Authors: Shutonu Mitra, Tomas Neguyen, Qi Zhang, Hyungmin Kim, Hossein Salemi, Chen-Wei Chang, Fengxiu Zhang, Michin Hong, Chang-Tien Lu, Hemant Purohit, Jin-Hee Cho
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
The rise of cyber threats on social media platforms necessitates advanced metrics to assess and mitigate social cyber vulnerabilities. This paper presents the Social Cyber Vulnerability Index (SCVI), a novel framework integrating individual-level factors (e.g., awareness, behavioral traits, psychological attributes) and attack-level characteristics (e.g., frequency, consequence, sophistication) for comprehensive socio-cyber vulnerability assessment. SCVI is validated using survey data (iPoll) and textual data (Reddit scam reports), demonstrating adaptability across modalities while revealing demographic disparities and regional vulnerabilities. Comparative analyses with the Common Vulnerability Scoring System (CVSS) and the Social Vulnerability Index (SVI) show the superior ability of SCVI to capture nuanced socio-technical risks. Monte Carlo-based weight variability analysis confirms SCVI is robust and highlights its utility in identifying high-risk groups. By addressing gaps in traditional metrics, SCVI offers actionable insights for policymakers and practitioners, advancing inclusive strategies to mitigate emerging threats such as AI-powered phishing and deepfake scams.
- [14] arXiv:2503.20808 [pdf, html, other]
-
Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning
Journal-ref: Information Processing in Medical Imaging (IPMI) 2025
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the nature domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model and making it difficult to sustain comprehensive knowledge across all tasks; and (2) biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper), in which a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on their similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at this https URL.
- [15] arXiv:2503.20819 [pdf, html, other]
-
Title: Reflections on Diversity: A Real-time Virtual Mirror for Inclusive 3D Face Transformations
Subjects: Graphics (cs.GR); Image and Video Processing (eess.IV)
Real-time 3D face manipulation has significant applications in virtual reality, social media and human-computer interaction. This paper introduces a novel system, which we call Mirror of Diversity (MOD), that combines Generative Adversarial Networks (GANs) for texture manipulation and 3D Morphable Models (3DMMs) for facial geometry to achieve realistic face transformations that reflect various demographic characteristics, emphasizing the beauty of diversity and the universality of human features. As participants sit in front of a computer monitor with a camera positioned above, their facial characteristics are captured in real time, and they can further alter their digital face reconstruction with transformations reflecting different demographic characteristics, such as gender and ethnicity (e.g., a person from Africa, Asia, Europe). Another feature of our system, which we call Collective Face, generates an averaged face representation from multiple participants' facial data. A comprehensive evaluation protocol is implemented to assess the realism and demographic accuracy of the transformations. Qualitative feedback is gathered through participant questionnaires, which include comparisons of MOD transformations with similar filters on platforms like Snapchat and TikTok. Additionally, quantitative analysis is conducted using a pretrained Convolutional Neural Network that predicts gender and ethnicity, to validate the accuracy of demographic transformations.
- [16] arXiv:2503.20820 [pdf, html, other]
-
Title: Benchmarking Multi-Object Grasping
Authors: Tianze Chen, Ricardo Frumento, Giulia Pagnanelli, Gianmarco Cei, Villa Keth, Shahadding Gafarov, Jian Gong, Zihe Ye, Marco Baracca, Salvatore D'Avella, Matteo Bianchi, Yu Sun
Comments: This paper contains 11 pages and 5 figures. This paper is under review of a robotics journal
Subjects: Robotics (cs.RO)
In this work, we describe a multi-object grasping benchmark to evaluate the grasping and manipulation capabilities of robotic systems in both pile and surface scenarios. The benchmark introduces three robot multi-object grasping benchmarking protocols designed to challenge different aspects of robotic manipulation. These protocols are: 1) the Only-Pick-Once protocol, which assesses the robot's ability to efficiently pick multiple objects in a single attempt; 2) the Accurate pick-transferring protocol, which evaluates the robot's capacity to selectively grasp and transport a specific number of objects from a cluttered environment; and 3) the Pick-transferring-all protocol, which challenges the robot to clear an entire scene by sequentially grasping and transferring all available objects. These protocols are intended to be adopted by the broader robotics research community, providing a standardized method to assess and compare robotic systems' performance in multi-object grasping tasks. We establish baselines for these protocols using standard planning and perception algorithms on a Barrett hand, a Robotiq parallel jaw gripper, and the Pisa/IIT Softhand-2, which is a soft underactuated robotic hand. We also discuss the results in relation to human performance in similar tasks.
- [17] arXiv:2503.20821 [pdf, html, other]
-
Title: "Hello, is this Anna?": A First Look at Pig-Butchering Scams
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Pig-butchering scams, or Sha Zhu Pan, have emerged as a complex form of cyber-enabled financial fraud that combines elements of romance, investment fraud, and advanced social engineering tactics to systematically exploit victims. In this paper, we present the first qualitative analysis of pig-butchering scams, informed by in-depth semi-structured interviews with N=26 victims. We capture nuanced, first-hand accounts from victims across multiple regions, providing insight into the lifecycle of pig-butchering scams and the complex emotional and financial manipulation involved. We systematically analyze each phase of the scam, revealing that perpetrators employ tactics such as staged trust-building, fraudulent financial platforms, fabricated investment returns, and repeated high-pressure tactics, all designed to exploit victims' trust and financial resources over extended periods. Our findings reveal an organized scam lifecycle characterized by emotional manipulation, staged financial exploitation, and persistent re-engagement efforts that amplify victim losses. We also find complex psychological and financial impacts on victims, including heightened vulnerability to secondary scams. Finally, we propose actionable intervention points for social media and financial platforms to curb the prevalence of these scams and highlight the need for non-stigmatizing terminology to encourage victims to report and seek assistance.
- [18] arXiv:2503.20823 [pdf, html, other]
-
Title: Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Comments: Accepted at CVPR2025
Subjects: Cryptography and Security (cs.CR)
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses with regard to harmful inputs. However, despite the significance of safety alignment, research on the vulnerabilities remains largely underexplored. In this paper, we investigate the unexplored vulnerability of the safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution(OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs highly increases the uncertainty of the model to discern the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective in increasing the uncertainty of the model, thereby facilitating the bypass of the safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and o1 with high attack success rate, which previous attack approaches have consistently struggled to jailbreak. Code is available at this https URL.
- [19] arXiv:2503.20827 [pdf, html, other]
-
Title: Multimodal Image Matching based on Frequency-domain Information of Local Energy Response
Comments: 34 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Complicated nonlinear intensity differences, nonlinear local geometric distortions, noise, and rotation transformations are the main challenges in multimodal image matching. To solve these problems, we propose a method based on Frequency-domain Information of Local Energy Response called FILER. The core of FILER is the local energy response model based on frequency-domain information, which can overcome the effect of nonlinear intensity differences. To improve the robustness to local nonlinear geometric distortions and noise, we design a new edge structure enhanced feature detector and a convolutional feature weighted descriptor, respectively. In addition, FILER overcomes the sensitivity of the frequency-domain information to the rotation angle and achieves rotation invariance. Extensive experiments on multimodal image pairs show that FILER outperforms other state-of-the-art algorithms and has good robustness and universality.
- [20] arXiv:2503.20830 [pdf, html, other]
-
Title: MedSegNet10: A Publicly Accessible Network Repository for Split Federated Medical Image Segmentation
Comments: 20 pages, 14 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Machine Learning (ML) and Deep Learning (DL) have shown significant promise in healthcare, particularly in medical image segmentation, which is crucial for accurate disease diagnosis and treatment planning. Despite their potential, challenges such as data privacy concerns, limited annotated data, and inadequate training data persist. Decentralized learning approaches such as federated learning (FL), split learning (SL), and split federated learning (SplitFed/SFL) address these issues effectively. This paper introduces "MedSegNet10," a publicly accessible repository designed for medical image segmentation using split-federated learning. MedSegNet10 provides a collection of pre-trained neural network architectures optimized for various medical image types, including microscopic images of human blastocysts, dermatoscopic images of skin lesions, and endoscopic images of lesions, polyps, and ulcers, with applications extending beyond these examples. By leveraging SplitFed's benefits, MedSegNet10 allows collaborative training on privately stored, horizontally split data, ensuring privacy and integrity. This repository supports researchers, practitioners, trainees, and data scientists, aiming to advance medical image segmentation while maintaining patient data privacy. The repository is available at: this https URL (password upon request to the authors).
- [21] arXiv:2503.20831 [pdf, html, other]
-
Title: Advancing Vulnerability Classification with BERT: A Multi-Objective Learning ModelComments: 9 PagesSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
The rapid increase in cybersecurity vulnerabilities necessitates automated tools for analyzing and classifying vulnerability reports. This paper presents a novel Vulnerability Report Classifier that leverages the BERT (Bidirectional Encoder Representations from Transformers) model to perform multi-label classification of Common Vulnerabilities and Exposures (CVE) reports from the National Vulnerability Database (NVD). The classifier predicts both the severity (Low, Medium, High, Critical) and vulnerability types (e.g., Buffer Overflow, XSS) from textual descriptions. We introduce a custom training pipeline using a combined loss function (Cross-Entropy for severity and Binary Cross-Entropy with Logits for types) integrated into a Hugging Face Trainer subclass. Experiments on recent NVD data demonstrate promising results, with decreasing evaluation loss across epochs. The system is deployed via a REST API and a Streamlit UI, enabling real-time vulnerability analysis. This work contributes a scalable, open-source solution for cybersecurity practitioners to automate vulnerability triage.
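As a rough illustration of the combined loss described in this abstract, the sketch below adds a softmax cross-entropy term for the single severity label to a binary cross-entropy-with-logits term over the independent vulnerability-type labels. It is written in plain Python rather than against the Hugging Face Trainer, and the function names, the unweighted sum, and the `type_weight` parameter are assumptions made for illustration, not details taken from the paper.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example with a single correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def combined_loss(severity_logits, severity_target,
                  type_logits, type_targets, type_weight=1.0):
    """Severity loss plus (hypothetically weighted) type loss."""
    severity_term = cross_entropy(severity_logits, severity_target)
    type_term = bce_with_logits(type_logits, type_targets)
    return severity_term + type_weight * type_term


# Four severity classes, two type labels, all logits zero (uninformed model).
loss = combined_loss([0.0, 0.0, 0.0, 0.0], 2, [0.0, 0.0], [1.0, 0.0])
print(round(loss, 4))
```

With all logits at zero the severity term equals log 4 and the type term equals log 2, so the printed value is log 8; this gives a quick sanity check before plugging a similar computation into a real training loop.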
- [22] arXiv:2503.20833 [pdf, other]
-
Title: The Oxford Insights Government AI Readiness Index (GARI): An Analysis of its Data and Overcoming Obstacles, with a Case Study of IraqComments: 18 pages, 5 figuresSubjects: Computers and Society (cs.CY)
This research examines the "Government AI Readiness Index" (GARI) issued by Oxford, analyzing data on governmental preparedness for adopting artificial intelligence across different countries. It highlights the evaluation criteria used to assess readiness, including technological infrastructure, human resources, supportive policies, and the level of innovation.
The study specifically focuses on Iraq, exploring the challenges the Iraqi government faces in adopting and implementing AI technology. It discusses economic, social, and political barriers that hinder this transition and provides concrete recommendations to overcome these obstacles.
By analyzing the case of Iraq, the research aims to offer insight into improving collaboration between the public and private sectors to enhance the effective use of AI in governance and public administration. Additionally, the study emphasizes the importance of investing in education, training, and capacity building to develop a skilled workforce, enabling countries to harness AI's potential and improve government service efficiency.
- [23] arXiv:2503.20835 [pdf, html, other]
-
Title: Comprehensive Manuscript Assessment with Text Summarization Using 69707 articlesSubjects: Computation and Language (cs.CL)
Rapid and efficient assessment of the future impact of research articles is a significant concern for both authors and reviewers. The most common standard for measuring the impact of academic papers is the number of citations. In recent years, numerous efforts have been undertaken to predict citation counts within various citation windows. However, most of these studies focus solely on a specific academic field or require early citation counts for prediction, rendering them impractical for the early-stage evaluation of papers. In this work, we harness Scopus to curate a comprehensive, large-scale dataset of information from 69707 scientific articles sourced from 99 journals spanning multiple disciplines. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata. To summarize the semantic features, such as titles and abstracts, we employ a Transformer-based language model to encode semantic features and design a text fusion layer to capture shared information between titles and abstracts. We specifically focus on the following impact-based prediction tasks using information from scientific manuscripts at the pre-publication stage: (1) The impact of journals in which the manuscripts will be published. (2) The future impact of manuscripts themselves. Extensive experiments on our datasets demonstrate the superiority of our proposed model for impact-based prediction tasks. We also demonstrate its potential for generating manuscript feedback and improvement suggestions.
- [24] arXiv:2503.20836 [pdf, other]
-
Title: Named Entity Recognition in ContextJournal-ref: Second Workshop on Ancient Language Processing, Mar 2025, Albuquerque, United StatesSubjects: Computation and Language (cs.CL)
We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.
- [25] arXiv:2503.20839 [pdf, html, other]
-
Title: TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal LocomotionComments: This work has been submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025 for reviewSubjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged "Teacher". Results showed accelerated training by 2x compared to state-of-the-art baselines to achieve peak performance. OOD scenarios showed better generalization by 40 percent on average compared to existing methods. Additionally, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at this https URL.
- [26] arXiv:2503.20840 [pdf, html, other]
-
Title: CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process SupervisionSubjects: Software Engineering (cs.SE)
Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a novel framework for stepwise code generation that improves LLM tool invocation by leveraging the concise and easily verifiable nature of code. CodeTool incorporates two distinct process rewards: the On-the-spot Reward, which provides immediate feedback on the accuracy of each tool invocation, and the Latent Reward, which assesses the contribution of each step toward overall task completion. By maximizing the cumulative reward of the On-the-spot and Latent Rewards at each step, LLMs are guided to follow efficient and accurate reasoning paths. Extensive experiments on StableToolBench and RestBench-TMDB demonstrate the superiority of CodeTool over existing approaches.
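The step-selection idea in this abstract can be sketched as follows, assuming the cumulative reward is a plain sum of the two process rewards and that both rewards are already available as numbers. The `CandidateStep` class, its field names, and the example values are hypothetical; how CodeTool actually computes each reward is not specified here.

```python
from dataclasses import dataclass


@dataclass
class CandidateStep:
    code: str                 # candidate code for the next tool invocation
    on_the_spot: float        # hypothetical score: did this invocation work?
    latent: float             # hypothetical score: contribution to the task

    @property
    def cumulative(self) -> float:
        # Assumption: the two process rewards are simply added.
        return self.on_the_spot + self.latent


def select_step(candidates):
    """Return the candidate step with the highest cumulative reward."""
    return max(candidates, key=lambda step: step.cumulative)


candidates = [
    CandidateStep("search_movie('Dune')", on_the_spot=1.0, latent=0.2),
    CandidateStep("get_movie_cast(movie_id)", on_the_spot=1.0, latent=0.7),
    CandidateStep("get_movie_cast()", on_the_spot=0.0, latent=0.9),
]
best = select_step(candidates)
print(best.code)
```

In this toy example the third candidate has the highest latent score but fails on the spot, so the second candidate is chosen; a greedy choice per step is only one possible way to use the two rewards.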
- [27] arXiv:2503.20842 [pdf, other]
-
Title: Anti Robot SpeciesismSubjects: Robotics (cs.RO)
Humanoid robots are a form of embodied artificial intelligence (AI) that looks and acts more and more like humans. Powered by generative AI and advances in robotics, humanoid robots can speak and interact with humans rather naturally but are still easily recognizable as robots. But how will we treat humanoids when they seem indistinguishable from humans in appearance and mind? We find a tendency (called "anti-robot" speciesism) to deny such robots humanlike capabilities, driven by motivations to accord members of the human species preferential treatment. Six experiments show that robots are denied humanlike attributes, simply because they are not biological beings and because humans want to avoid feelings of cognitive dissonance when utilizing such robots for unsavory tasks. Thus, people do not rationally attribute capabilities to perfectly humanlike robots but deny them capabilities as it suits them.
- [28] arXiv:2503.20844 [pdf, html, other]
-
Title: Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial AttacksZongyuan Zhang, Tianyang Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui, Yue GaoComments: 9 pages, 6 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Robotics (cs.RO)
Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its real-world deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents as they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent's robustness through adversarial defense mechanisms.
- [29] arXiv:2503.20846 [pdf, other]
-
Title: Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road AheadComments: 23 pages + references + Appendix. PreprintSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees.
Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
- [30] arXiv:2503.20847 [pdf, html, other]
-
Title: The Data Sharing Paradox of Synthetic Data in HealthcareJim Achterberg, Bram van Dijk, Saif ul Islam, Hafiz Muhammad Waseem, Parisis Gallos, Gregory Epiphaniou, Carsten Maple, Marcel Haas, Marco SpruitComments: Accepted for publication at Medical Informatics Europe 2025 conference, GlasgowSubjects: Databases (cs.DB); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Synthetic data offers a promising solution to privacy concerns in healthcare by generating useful datasets in a privacy-aware manner. However, although synthetic data is typically developed with the intention of sharing said data, ambiguous reidentification risk assessments often prevent synthetic data from seeing the light of day. One of the main causes is that privacy metrics for synthetic data, which inform on reidentification risks, are not well-aligned with practical requirements and regulations regarding data sharing in healthcare. This article discusses the paradoxical situation where synthetic data is designed for data sharing but is often still restricted. We also discuss how the field should move forward to mitigate this issue.
- [31] arXiv:2503.20848 [pdf, html, other]
-
Title: The Backfiring Effect of Weak AI Safety RegulationComments: 28 pages, 8 figuresSubjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
Recent policy proposals aim to improve the safety of general-purpose AI, but there is little understanding of the efficacy of different regulatory approaches to AI safety. We present a strategic model that explores the interactions between the regulator, the general-purpose AI technology creators, and domain specialists--those who adapt the AI for specific applications. Our analysis examines how different regulatory measures, targeting different parts of the development chain, affect the outcome of the development process. In particular, we assume AI technology is described by two key attributes: safety and performance. The regulator first sets a minimum safety standard that applies to one or both players, with strict penalties for non-compliance. The general-purpose creator then develops the technology, establishing its initial safety and performance levels. Next, domain specialists refine the AI for their specific use cases, and the resulting revenue is distributed between the specialist and generalist through an ex-ante bargaining process. Our analysis of this game reveals two key insights: First, weak safety regulation imposed only on the domain specialists can backfire. While it might seem logical to regulate use cases (as opposed to the general-purpose technology), our analysis shows that weak regulations targeting domain specialists alone can unintentionally reduce safety. This effect persists across a wide range of settings. Second, in sharp contrast to the previous finding, we observe that stronger, well-placed regulation can in fact benefit all players subjected to it. When regulators impose appropriate safety standards on both AI creators and domain specialists, the regulation functions as a commitment mechanism, leading to safety and performance gains, surpassing what is achieved under no regulation or regulating one player only.
- [32] arXiv:2503.20849 [pdf, html, other]
-
Title: An Algebraic Approach to Weighted Answer-set ProgrammingSubjects: Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Symbolic Computation (cs.SC)
Logic programs, more specifically, Answer-set programs, can be annotated with probabilities on facts to express uncertainty. We address the problem of propagating weight annotations on facts (e.g., probabilities) of an ASP to its standard models, and from there to events (defined as sets of atoms) in a dataset over the program's domain. We propose a novel approach which is algebraic in the sense that it relies on an equivalence relation over the set of events. Uncertainty is then described as polynomial expressions over variables. We propagate the weight function in the space of models and events, rather than doing so within the syntax of the program. As evidence that our approach is sound, we show that certain facts behave as expected. Our approach allows us to investigate weight annotated programs and to determine how suitable a given one is for modeling a given dataset containing events.
- [33] arXiv:2503.20850 [pdf, html, other]
-
Title: Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language ModelsSubjects: Computation and Language (cs.CL)
Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general properties of language is unclear. We explore this with the English dative alternation (DO: "gave Y the X" vs. PO: "gave the X to Y"), using a controlled rearing paradigm wherein we iteratively train small LMs on systematically manipulated input. We focus on properties that affect the choice of alternant: length and animacy. Both properties are directly present in datives but also reflect more global tendencies for shorter elements to precede longer ones and animates to precede inanimates. First, by manipulating and ablating datives for these biases in the input, we show that direct evidence of length and animacy matters, but easy-first preferences persist even without such evidence. Then, using LMs trained on systematically perturbed datasets to manipulate global length effects (re-linearizing sentences globally while preserving dependency structure), we find that dative preferences can emerge from indirect evidence. We conclude that LMs' emergent syntactic preferences come from a mix of direct and indirect sources.
- [34] arXiv:2503.20851 [pdf, html, other]
-
Title: StepGrade: Grading Programming Assignments with Context-Aware LLMsComments: Accepted to the 15th IEEE Integrated STEM Education Conference (ISEC)Subjects: Software Engineering (cs.SE)
Grading programming assignments is a labor-intensive and time-consuming process that demands careful evaluation across multiple dimensions of the code. To overcome these challenges, automated grading systems are leveraged to enhance efficiency and reduce the workload on educators. Traditional automated grading systems often focus solely on correctness, failing to provide interpretable evaluations or actionable feedback for students. This study introduces StepGrade, which explores the use of Chain-of-Thought (CoT) prompting with Large Language Models (LLMs) as an innovative solution to address these challenges. Unlike regular prompting, which offers limited and surface-level outputs, CoT prompting allows the model to reason step-by-step through the interconnected grading criteria, i.e., functionality, code quality, and algorithmic efficiency, ensuring a more comprehensive and transparent evaluation. This interconnectedness necessitates the use of CoT to systematically address each criterion while considering their mutual influence. To empirically validate the efficiency of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels (easy, intermediate, and advanced). The approach is validated against expert human evaluations to assess its consistency, accuracy, and fairness. Results demonstrate that CoT prompting significantly outperforms regular prompting in both grading quality and interpretability. By reducing the time and effort required for manual grading, this research demonstrates the potential of GPT-4 with CoT prompting to revolutionize programming education through scalable and pedagogically effective automated grading systems.
- [35] arXiv:2503.20853 [pdf, html, other]
-
Title: Unified Multimodal Discrete DiffusionComments: Project Website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, performing a scaling analysis and demonstrating that UniDisc outperforms them in terms of both performance and inference-time compute, enhanced controllability, editability, inpainting, and flexible trade-off between inference time and generation quality. Code and additional visualizations are available at this https URL.
- [36] arXiv:2503.20868 [pdf, other]
-
Title: Advances in Semantic Patching for HPC-oriented Refactorings with CoccinelleSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
Currently, the most energy-efficient hardware platforms for floating point-intensive calculations (also known as High Performance Computing, or HPC) are graphical processing units (GPUs). However, porting existing scientific codes to GPUs can be far from trivial. This article summarizes our recent advances in enabling machine-assisted, HPC-oriented refactorings with reference to existing APIs and programming idioms available in C and C++. The tool we are extending and using for the purpose is called Coccinelle. An important workflow we aim to support is that of writing and maintaining tersely written application code, while deferring circumstantial, ad-hoc, performance-related changes to specific, separate rules called semantic patches. GPUs currently offer very limited debugging facilities. The approach we are developing aims at preserving intelligibility, longevity, and relatedly, debuggability of existing code on CPUs, while at the same time enabling HPC-oriented code evolutions such as introducing support for GPUs, in a scriptable and possibly parametric manner. This article sketches a number of self-contained use cases, including further HPC-oriented cases which are independent from GPUs.
- [37] arXiv:2503.20871 [pdf, html, other]
-
Title: VinaBench: Benchmark for Faithful and Consistent Visual NarrativesSilin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine BosselutComments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
- [38] arXiv:2503.20880 [pdf, html, other]
-
Title: BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational PathologyAmaya Gallagher-Syed, Henry Senior, Omnia Alwazzan, Elena Pontarini, Michele Bombardieri, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Luca Rossi, Gregory SlabaughComments: Accepted for publication at CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores, that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: this https URL.
- [39] arXiv:2503.20883 [pdf, html, other]
-
Title: Solving the Correlation Cluster LP in Sublinear TimeNairen Cao, Vincent Cohen-Addad, Shi Li, Euiwoong Lee, David Rasmussen Lolck, Alantha Newman, Mikkel Thorup, Lukas Vogl, Shuyi Yan, Hanwen ZhangSubjects: Data Structures and Algorithms (cs.DS)
Correlation Clustering is a fundamental and widely-studied problem in unsupervised learning and data mining. The input is a graph and the goal is to construct a clustering minimizing the number of inter-cluster edges plus the number of missing intra-cluster edges.
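To make the objective above concrete, the following sketch counts the disagreements of a given clustering: edges that cross clusters plus non-adjacent pairs placed in the same cluster. The function name and input representation (vertices numbered from 0, an edge list, and a list of cluster labels) are illustrative choices, and the quadratic pair enumeration is only meant to show the definition, not the fast algorithms discussed in the paper.

```python
from itertools import combinations


def correlation_clustering_cost(num_vertices, edges, cluster_of):
    """Count inter-cluster edges plus missing intra-cluster edges.

    edges: iterable of (u, v) pairs; cluster_of[v] is the cluster label of v.
    """
    edge_set = {frozenset(edge) for edge in edges}
    cost = 0
    for u, v in combinations(range(num_vertices), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        has_edge = frozenset((u, v)) in edge_set
        # A pair disagrees if it is an edge across clusters
        # or a non-edge inside one cluster.
        if has_edge != same_cluster:
            cost += 1
    return cost


# Path graph 0-1-2-3 split into clusters {0, 1} and {2, 3}.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(correlation_clustering_cost(4, path_edges, [0, 0, 1, 1]))  # prints 1
```

Only the edge (1, 2) crosses the two clusters, and no non-adjacent pair shares a cluster, so the cost is 1.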
CCL+24 introduced the Cluster LP for Correlation Clustering, which they argued captures the problem much more succinctly than previous linear programming formulations. However, the Cluster LP has exponential size, with a variable for every possible set of vertices in the input graph. Nevertheless, CCL+24 showed how to find a feasible solution for the Cluster LP in time $O(n^{\mathrm{poly}(1/\epsilon)})$ with objective value at most $(1+\epsilon)$ times the value of an optimal solution for the respective Correlation Clustering instance. Furthermore, they showed how to round a solution to the Cluster LP, yielding a $(1.437+\epsilon)$-approximation algorithm for the Correlation Clustering problem.
The main technical result of this paper is a new approach to find a feasible solution for the Cluster LP with objective value at most $(1+\epsilon)$ times the optimum in time $\widetilde{O}(2^{\mathrm{poly}(1/\epsilon)} n)$, where $n$ is the number of vertices in the graph. We also show how to implement the rounding within the same time bounds, thus achieving a fast $(1.437+\epsilon)$-approximation algorithm for the Correlation Clustering problem. This bridges the gap between state-of-the-art methods for approximating Correlation Clustering and the recent focus on fast algorithms.
- [40] arXiv:2503.20884 [pdf, html, other]
-
Title: Robust Federated Learning Against Poisoning Attacks: A GAN-Based Defense FrameworkSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) enables collaborative model training across decentralized devices without sharing raw data, but it remains vulnerable to poisoning attacks that compromise model integrity. Existing defenses often rely on external datasets or predefined heuristics (e.g. number of malicious clients), limiting their effectiveness and scalability. To address these limitations, we propose a privacy-preserving defense framework that leverages a Conditional Generative Adversarial Network (cGAN) to generate synthetic data at the server for authenticating client updates, eliminating the need for external datasets. Our framework is scalable, adaptive, and seamlessly integrates into FL workflows. Extensive experiments on benchmark datasets demonstrate its robust performance against a variety of poisoning attacks, achieving high True Positive Rate (TPR) and True Negative Rate (TNR) of malicious and benign clients, respectively, while maintaining model accuracy. The proposed framework offers a practical and effective solution for securing federated learning systems.
- [41] arXiv:2503.20888 [pdf, html, other]
-
Title: Coolight: Enhancing Nighttime Safety for Urban Student CommutersSubjects: Human-Computer Interaction (cs.HC)
Safety while walking alone at night is a key indicator of a citizen's well-being and a society's inclusiveness. However, this is not equally felt across all demographic groups, especially for university students living in urban areas. We present Coolight, a mobile application designed to reduce stress and anxiety for nighttime walking through an interactive live map, real-time community incident reports, location sharing, and a route planner optimized for user safety. Coolight's design was informed through interviews, questionnaires, and usability tests with university students and their friends and families in Toronto, Canada. This paper describes the concept, research, design approach, and evaluation results of a solution addressing safety concerns urban commuters face at night.
- [42] arXiv:2503.20897 [pdf, html, other]
-
Title: Feature Modulation for Semi-Supervised Domain Generalization without Domain LabelsVenuri Amarasinghe (1), Asini Jayakody (1), Isun Randila (1), Kalinga Bandara (1), Chamuditha Jayanga Galappaththige (2), Ranga Rodrigo (1) ((1) University of Moratuwa, (2) Queensland University of Technology)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most of the existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels, a privilege not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.
- [43] arXiv:2503.20903 [pdf, html, other]
-
Title: Assessing Generative Models for Structured Data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Synthetic tabular data generation has emerged as a promising method to address limited data availability and privacy concerns. With the sharp increase in the performance of large language models in recent years, researchers have been interested in applying these models to the generation of tabular data. However, little is known about the quality of the generated tabular data from large language models. The predominant method for assessing the quality of synthetic tabular data is the train-synthetic-test-real approach, where the artificial examples are compared to the original by how well machine learning models, trained separately on the real and synthetic sets, perform in some downstream tasks. This method does not directly measure how closely the distribution of generated data approximates that of the original. This paper introduces rigorous methods for directly assessing synthetic tabular data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data. Results from this study can inform future practice in synthetic data generation to improve data quality.
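As a rough illustration of what comparing inter-column dependencies can mean, the sketch below measures the largest gap between pairwise Pearson correlations in a real table and a synthetic one. The abstract does not name the paper's actual measures, so Pearson correlation, the function names, and the toy columns are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dependency_gap(real, synth):
    """Largest absolute difference between pairwise column correlations.

    real, synth: dicts mapping the same column names to lists of numbers.
    """
    cols = sorted(real)
    gap = 0.0
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            diff = abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
            gap = max(gap, diff)
    return gap


# Toy data: the synthetic table matches each marginal but flips the dependency.
real = {"age": [20, 30, 40, 50], "income": [2, 3, 4, 5]}
synth = {"age": [20, 30, 40, 50], "income": [5, 4, 3, 2]}
gap = dependency_gap(real, synth)
```

In this toy case both tables have identical per-column value sets, so a marginal-only check would pass while the dependency check reports the maximum possible gap.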
- [44] arXiv:2503.20913 [pdf, html, other]
-
Title: TransDiffSBDD: Causality-Aware Multi-Modal Structure-Based Drug Design
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Structure-based drug design (SBDD) is a critical task in drug discovery, requiring the generation of molecular information across two distinct modalities: discrete molecular graphs and continuous 3D coordinates. However, existing SBDD methods often overlook two key challenges: (1) the multi-modal nature of this task and (2) the causal relationship between these modalities, limiting their plausibility and performance. To address both challenges, we propose TransDiffSBDD, an integrated framework combining autoregressive transformers and diffusion models for SBDD. Specifically, the autoregressive transformer models discrete molecular information, while the diffusion model samples continuous distributions, effectively resolving the first challenge. To address the second challenge, we design a hybrid-modal sequence for protein-ligand complexes that explicitly respects the causality between modalities. Experiments on the CrossDocked2020 benchmark demonstrate that TransDiffSBDD outperforms existing baselines.
- [45] arXiv:2503.20914 [pdf, html, other]
-
Title: D4R -- Exploring and Querying Relational Graphs Using Natural Language and Large Language Models -- the Case of Historical Documents
Comments: 8 pages, 7 figures
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
D4R is a digital platform designed to assist non-technical users, particularly historians, in exploring textual documents through advanced graphical tools for text analysis and knowledge extraction. By leveraging a large language model, D4R translates natural language questions into Cypher queries, enabling the retrieval of data from a Neo4J database. A user-friendly graphical interface allows for intuitive interaction, enabling users to navigate and analyse complex relational data extracted from unstructured textual documents. Originally designed to bridge the gap between AI technologies and historical research, D4R's capabilities extend to various other domains. A demonstration video and a live software demo are available.
- [46] arXiv:2503.20916 [pdf, html, other]
-
Title: A Study of Perceived Safety for Soft Robotics in Caregiving Tasks
Cosima du Pasquier, Jennifer Grannen, Chuer Pan, Serin L. Huber, Aliyah Smith, Monroe Kennedy, Shuran Song, Dorsa Sadigh, Allison M. Okamura
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
In this project, we focus on human-robot interaction in caregiving scenarios like bathing, where physical contact is inevitable and necessary for proper task execution because force must be applied to the skin. Using finite element analysis, we designed a 3D-printed gripper combining positive and negative pressure for secure yet compliant handling. Preliminary tests showed it exerted a lower, more uniform pressure profile than a standard rigid gripper. In a user study, participants' trust in robots significantly increased after they experienced a brief bathing demonstration performed by a robotic arm equipped with the soft gripper. These results suggest that soft robotics can enhance perceived safety and acceptance in intimate caregiving scenarios.
- [47] arXiv:2503.20918 [pdf, html, other]
-
Title: Locally Optimal Solutions for Integer Programming Games
Subjects: Computer Science and Game Theory (cs.GT)
Integer programming games (IPGs) are n-person games with integer strategy spaces. These games model non-cooperative combinatorial decision-making and are used in domains such as cybersecurity and transportation. The prevalent solution concept for IPGs, Nash equilibrium, is difficult to compute, and even deciding whether such an equilibrium exists is known to be $\Sigma_2^p$-complete. In this work, we introduce a class of relaxed solution concepts for IPGs called locally optimal integer solutions (LOIS) that are simpler to obtain than pure Nash equilibria. We demonstrate that LOIS are not only faster and more readily scalable in large-scale games but also support desirable features such as equilibrium enumeration and selection. We also show that these solutions can model a broader class of problems including Stackelberg, Stackelberg-Nash, and generalized IPGs. Finally, we provide initial comparative results in a cybersecurity game called the Critical Node game, showing the performance gains of LOIS in comparison to the existing Nash equilibrium solution concept.
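To make the baseline concept concrete, here is a minimal brute-force sketch that enumerates pure Nash equilibria of a tiny game with integer strategies. It illustrates the Nash equilibrium concept only, not LOIS, whose definition the abstract does not give; the function name and the two-player quantity game are hypothetical.

```python
from itertools import product


def pure_nash_equilibria(strategy_sets, payoffs):
    """Enumerate pure Nash equilibria by checking every unilateral deviation.

    strategy_sets: one list of integer strategies per player.
    payoffs: one function per player, mapping a strategy profile (tuple)
             to that player's payoff.
    """
    equilibria = []
    for profile in product(*strategy_sets):
        stable = True
        for i, options in enumerate(strategy_sets):
            current = payoffs[i](profile)
            for s in options:
                deviation = profile[:i] + (s,) + profile[i + 1:]
                if payoffs[i](deviation) > current:
                    stable = False  # player i gains by switching to s
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria


# Hypothetical game: each player picks a quantity in {0, 1, 2} and earns
# quantity * (3 - total quantity).
strategies = [[0, 1, 2], [0, 1, 2]]
payoffs = [
    lambda p: p[0] * (3 - p[0] - p[1]),
    lambda p: p[1] * (3 - p[0] - p[1]),
]
equilibria = pure_nash_equilibria(strategies, payoffs)
```

The exhaustive loop visits every profile and every deviation, so its cost grows exponentially with the number of players, which is one reason cheaper relaxed solution concepts are attractive.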
- [48] arXiv:2503.20919 [pdf, html, other]
-
Title: GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.
- [49] arXiv:2503.20920 [pdf, html, other]
-
Title: Variants of thick-restart Lanczos for the Bethe-Salpeter eigenvalue problem
Subjects: Numerical Analysis (math.NA)
The non-Hermitian Bethe-Salpeter eigenvalue problem is a structured eigenproblem, with real eigenvalues coming in pairs $\{\lambda,-\lambda\}$ where the corresponding pair of eigenvectors are closely related, and furthermore the left eigenvectors can be trivially obtained from the right ones. We exploit these properties to devise three variants of structure-preserving Lanczos eigensolvers to compute a subset of eigenvalues (those of either smallest or largest magnitude) together with their corresponding right and left eigenvectors. For this to be effective in real applications, we need to incorporate a thick-restart technique in a way that the overall computation preserves the problem structure. The new methods are validated in an implementation within the SLEPc library using several test matrices, some of them coming from the Yambo materials science code.
- [50] arXiv:2503.20921 [pdf, html, other]
-
Title: The MINI mixed virtual element for the Stokes equation
Comments: 31 pages, 8 figures, 1 table
Subjects: Numerical Analysis (math.NA)
We present and discuss a generalization of the popular MINI mixed finite element for the 2D Stokes equation by means of conforming virtual elements on polygonal meshes. We prove optimal error estimates for both velocity and pressure. Theoretical results are confirmed by several numerical tests performed with different choices of polynomial accuracy and meshes.
- [51] arXiv:2503.20925 [pdf, html, other]
-
Title: Prototype Guided Backdoor Defense
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Deep learning models are susceptible to backdoor attacks, in which malicious attackers perturb a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss of a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: this https URL.
- [52] arXiv:2503.20929 [pdf, html, other]
-
Title: Global and Local Structure Learning for Sparse Tensor Completion
Subjects: Machine Learning (cs.LG)
How can we accurately complete tensors by learning relationships of dimensions along each mode? Tensor completion, a widely studied problem, is to predict missing entries in incomplete tensors. Tensor decomposition methods, fundamental tensor analysis tools, have been actively developed to solve tensor completion tasks. However, standard tensor decomposition models have not been designed to learn relationships of dimensions along each mode, which limits the accuracy of tensor completion. Also, previously developed tensor decomposition models have required prior knowledge of the relations within dimensions in order to model them, and this knowledge is expensive to obtain.
This paper proposes TGL (Tensor Decomposition Learning Global and Local Structures) to accurately predict missing entries in tensors. TGL reconstructs a tensor with factor matrices which learn local structures with a GNN without prior knowledge. Extensive experiments are conducted to evaluate TGL with baselines and datasets.
- [53] arXiv:2503.20930 [pdf, html, other]
-
Title: Centroidal Voronoi Refinement in the Geometric Refinement Transform: Symmetry, Stability, and Optimal Reconstruction
Subjects: Numerical Analysis (math.NA)
We extend the Geometric Refinement Transform (GRT) by introducing centroidal Voronoi tessellations (CVTs) into the refinement process, enhancing symmetry, reconstruction accuracy, and numerical stability. By applying Lloyd's algorithm at each refinement level, we minimize centroidal energy and generate Voronoi regions that better align with the function's underlying structure. This approach reduces geometric distortion, suppresses reconstruction error, and provides a natural framework for adaptive refinement. We analyze convergence properties, quantify the reduction in reconstruction error using Taylor-based estimates and Lipschitz continuous functions, and propose perturbation strategies to escape symmetry-preserving local minima. The resulting transform offers improved accuracy for applications in medical imaging, signal processing, and physics simulations, while preserving the theoretical completeness and stability guarantees of the original GRT framework.
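The iteration named in this abstract can be illustrated in one dimension, where Voronoi cells are intervals and centroids are midpoints. The sketch below assumes a uniform density on an interval and is not the GRT refinement itself; the function name and starting sites are illustrative.

```python
def lloyd_1d(sites, lo=0.0, hi=1.0, iterations=100):
    """Lloyd's algorithm on [lo, hi] with uniform density.

    Each step (1) builds Voronoi cells, whose boundaries are the midpoints
    between neighbouring sites, and (2) moves every site to the centroid of
    its cell, which for a uniform density is the cell's midpoint.
    """
    pts = sorted(sites)
    for _ in range(iterations):
        bounds = [lo] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [hi]
        pts = [(left + right) / 2 for left, right in zip(bounds, bounds[1:])]
    return pts


# Three unevenly placed sites relax toward the evenly spaced configuration.
result = lloyd_1d([0.1, 0.2, 0.9])
```

For three sites on the unit interval the fixed point is 1/6, 1/2, 5/6, the centroidal tessellation in which every site sits at the centre of its own cell.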
- [54] arXiv:2503.20932 [pdf, other]
-
Title: Reflex: Speeding Up SMPC Query Execution through Efficient and Flexible Intermediate Result Size Trimming
Subjects: Databases (cs.DB); Cryptography and Security (cs.CR)
There is growing interest in Secure Analytics, but fully oblivious query execution in Secure Multi-Party Computation (MPC) settings is often prohibitively expensive. Recent related works propose different approaches to trimming the size of intermediate results between query operators, resulting in significant speedups at the cost of some information leakage. In this work, we generalize these ideas into a method of flexible and efficient trimming of operator outputs that can be added to MPC operators easily. This allows for precisely controlling the security/performance trade-off on a per-operator and per-query basis. We demonstrate that our work is practical by porting a state-of-the-art trimming approach to it, resulting in a faster runtime and increased security. Our work lays down the foundation for a future MPC query planner that can pick different performance and security targets when composing physical query plans.
- [55] arXiv:2503.20934 [pdf, html, other]
-
Title: Leveraging LLMs, IDEs, and Semantic Embeddings for Automated Move Method Refactoring
Fraol Batole, Abhiram Bellur, Malinda Dilhara, Mohammed Raihan Ullah, Yaroslav Zharov, Timofey Bryksin, Kai Ishikawa, Haifeng Chen, Masaharu Morimoto, Shota Motoura, Takeo Hosomi, Tien N. Nguyen, Hridesh Rajan, Nikolaos Tsantalis, Danny Dig
Comments: 12 pages, 2 figures
Subjects: Software Engineering (cs.SE)
MOVEMETHOD is a hallmark refactoring. Despite a plethora of research tools that recommend which methods to move and where, these recommendations do not align with how expert developers perform MOVEMETHOD. Given the extensive training of Large Language Models and their reliance upon naturalness of code, they should expertly recommend which methods are misplaced in a given class and which classes are better hosts. Our formative study of 2016 LLM recommendations revealed that LLMs give expert suggestions, yet they are unreliable: up to 80% of the suggestions are hallucinations. We introduce the first fully LLM-powered assistant for MOVEMETHOD refactoring that automates its whole end-to-end lifecycle, from recommendation to execution. We designed novel solutions that automatically filter LLM hallucinations using static analysis from IDEs and a novel workflow that requires LLMs to be self-consistent, critique, and rank refactoring suggestions. As MOVEMETHOD refactoring requires global, project-level reasoning, we solved the limited context size of LLMs by employing refactoring-aware retrieval-augmented generation (RAG). Our approach, MM-assist, synergistically combines the strengths of the LLM, IDE, static analysis, and semantic relevance. In our thorough, multi-methodology empirical evaluation, we compare MM-assist with the previous state-of-the-art approaches. MM-assist significantly outperforms them: (i) on a benchmark widely used by other researchers, our Recall@1 and Recall@3 show a 1.7x improvement; (ii) on a corpus of 210 recent refactorings from open-source software, our Recall rates improve by at least 2.4x. Lastly, we conducted a user study with 30 experienced participants who used MM-assist to refactor their own code for one week. They rated 82.8% of MM-assist recommendations positively. This shows that MM-assist is both effective and useful.
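The Recall@1 and Recall@3 figures quoted above are ranking metrics. A common way to compute Recall@k, assuming one ground-truth answer per case, is sketched below; the paper's exact definition may differ, and the function name and class names are hypothetical.

```python
def recall_at_k(ranked_lists, truths, k):
    """Fraction of cases whose ground-truth item appears in the top k.

    ranked_lists: one ranked recommendation list per case, best first.
    truths: the single correct item for each case.
    """
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Hypothetical target-class recommendations for three misplaced methods,
# where the correct target is always "OrderService".
ranked = [
    ["OrderService", "Cart", "Invoice"],
    ["Cart", "OrderService", "Invoice"],
    ["Invoice", "Cart", "OrderService"],
]
truths = ["OrderService"] * 3
r1 = recall_at_k(ranked, truths, 1)
r3 = recall_at_k(ranked, truths, 3)
```

Recall@k can only grow as k increases, so reporting both Recall@1 and Recall@3 shows how often the correct answer is ranked first versus merely near the top.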
- [56] arXiv:2503.20936 [pdf, html, other]
-
Title: LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos
Comments: CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Physical agility is a necessary skill in competitive table tennis, but by no means sufficient. Champions excel in this fast-paced and highly dynamic environment by anticipating their opponent's intent, buying themselves the necessary time to react. In this work, we take one step towards designing such an anticipatory agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation. Among the works that forecast opponent actions, their approaches are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D and (2) an uncertainty-aware controller that anticipates opponent actions. We demonstrate in simulation that our policy improves the ball return rate against high-speed hits from 49.9% to 59.0% as compared to a baseline non-anticipatory policy.
- [57] arXiv:2503.20938 [pdf, other]
-
Title: ConicCurv: A curvature estimation algorithm for planar polygons
Subjects: Numerical Analysis (math.NA)
ConicCurv is a new derivative-free algorithm to estimate the curvature of a plane curve from a sample of data points. It is based on a known tangent estimator method grounded on classic results of Projective Geometry and Bézier rational conic curves. The curvature values estimated by ConicCurv are invariant to Euclidean changes of coordinates and reproduce the exact curvature values if the data are sampled from a conic.
We show that ConicCurv has convergence order $3$ and, if the sample points are uniformly arc-length distributed, the convergence order is $4$. The performance of ConicCurv is compared with some of the most frequently used algorithms to estimate curvatures, and its performance is illustrated in the calculation of the elastic energy of subdivision curves and the location of L-curve corners.
- [58] arXiv:2503.20939 [pdf, html, other]
-
Title: Hacia la interpretabilidad de la detección anticipada de riesgos de depresión utilizando grandes modelos de lenguaje
Comments: In Spanish language, In 30° Congreso Argentino de Ciencias de la Computación (CACIC 2024), La Plata, Argentina
Journal-ref: In Libro de Actas CACIC 2024, pp. 72-81
Subjects: Computation and Language (cs.CL)
Early Detection of Risks (EDR) on the Web involves identifying at-risk users as early as possible. Although Large Language Models (LLMs) have proven to solve various linguistic tasks efficiently, assessing their reasoning ability in specific domains is crucial. In this work, we propose a method for solving depression-related EDR using LLMs on Spanish texts, with responses that can be interpreted by humans. We define a reasoning criterion to analyze users through a specialist, apply in-context learning to the Gemini model, and evaluate its performance both quantitatively and qualitatively. The results show that accurate predictions can be obtained, supported by explanatory reasoning, providing a deeper understanding of the solution. Our approach offers new perspectives for addressing EDR problems by leveraging the power of LLMs.
- [59] arXiv:2503.20950 [pdf, html, other]
-
Title: DEMENTIA-PLAN: An Agent-Based Framework for Multi-Knowledge Graph Retrieval-Augmented Generation in Dementia Care
Comments: Accepted by AAAI 2025 Workshop on Knowledge Graphs for Personalized Public Health
Subjects: Artificial Intelligence (cs.AI)
Mild-stage dementia patients primarily experience two critical symptoms: severe memory loss and emotional instability. To address these challenges, we propose DEMENTIA-PLAN, an innovative retrieval-augmented generation framework that leverages large language models to enhance conversational support. Our model employs a multiple knowledge graph architecture, integrating various dimensional knowledge representations including daily routine graphs and life memory graphs. Through this multi-graph architecture, DEMENTIA-PLAN both addresses immediate care needs and facilitates deeper emotional resonance through personal memories, helping stabilize patient mood while providing reliable memory support. Our notable innovation is the self-reflection planning agent, which systematically coordinates knowledge retrieval and semantic integration across multiple knowledge graphs, while scoring retrieved content from daily routine and life memory graphs to dynamically adjust their retrieval weights for optimized response generation. DEMENTIA-PLAN represents a significant advancement in the clinical application of large language models for dementia care, bridging the gap between AI tools and caregiver interventions.
- [60] arXiv:2503.20952 [pdf, html, other]
-
Title: TS-Inverse: A Gradient Inversion Attack Tailored for Federated Time Series Forecasting Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated learning (FL) for time series forecasting (TSF) enables clients with privacy-sensitive time series (TS) data to collaboratively learn accurate forecasting models, for example, in energy load prediction. Unfortunately, privacy risks in FL persist, as servers can potentially reconstruct clients' training data through gradient inversion attacks (GIA). Although GIA is demonstrated for image classification tasks, little is known about time series regression tasks. In this paper, we first conduct an extensive empirical study on inverting TS data across 4 TSF models and 4 datasets, identifying the unique challenges of reconstructing both observations and targets of TS data. We then propose TS-Inverse, a novel GIA that improves the inversion of TS data by (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function that incorporates periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate a remarkable performance of TS-Inverse, achieving at least a 2x-10x improvement in terms of the sMAPE metric over existing GIA methods on TS data. Code repository: this https URL
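The sMAPE metric cited above has several variants in the literature. The sketch below uses one common form, the mean of 2|y - ŷ| / (|y| + |ŷ|), and treats a term as zero when both values are zero; whether the paper uses this exact variant or scales it to a percentage is an assumption.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, as a fraction in [0, 2]."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        denom = abs(y) + abs(y_hat)
        if denom > 0:
            total += 2 * abs(y - y_hat) / denom
        # When both values are zero the term is taken as 0 (a convention).
    return total / len(actual)


# Toy load values: a reconstruction that is off by 10 and 20 units.
error = smape([100, 200], [110, 180])
```

Lower values mean the reconstructed series is closer to the original, so in an attack setting a lower sMAPE corresponds to a more successful inversion.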
- [61] arXiv:2503.20953 [pdf, html, other]
-
Title: Clean & Clear: Feasibility of Safe LLM Clinical Guidance
Subjects: Computation and Language (cs.CL)
Background:
Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in Healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries.
Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines.
Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard.
Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.
- [62] arXiv:2503.20957 [pdf, html, other]
-
Title: Pellet-based 3D Printing of Soft Thermoplastic Elastomeric Membranes for Soft Robotic Applications
Subjects: Robotics (cs.RO)
Additive Manufacturing (AM) is a promising solution for handling the complexity of fabricating soft robots. However, the AM of hyperelastic materials is still challenging with limited material types. Within this work, pellet-based 3D printing of very soft thermoplastic elastomers (TPEs) was explored. Our results show that TPEs can have similar engineering stress and maximum strain as Ecoflex 00-10. These TPEs were used to 3D-print airtight thin membranes (0.2-1.2 mm), which could inflate up to a stretch of 1320%. Combining the membrane's large expansion and softness with the 3D printing of hollow structures simplified the design of a bending actuator that can bend 180 degrees and reach a blocked force of 238 times its weight. In addition, by 3D printing TPE pellets and rigid filaments, the soft membrane could grasp objects by enveloping an object or as a sensorized sucker, which relied on the TPE's softness to conform to the object or act as a seal. Furthermore, the membrane of the sucker was utilized as a tactile sensor to detect an object before adhesion. These results suggest the feasibility of 3D printing soft robots by using soft TPEs and membranes as an interesting class of materials and sensorized actuators, respectively.
- [63] arXiv:2503.20959 [pdf, html, other]
-
Title: Sociotechnical Effects of Machine Translation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side-effects and risks that are associated, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and output huge volumes of kgCO2 to boot. However, smaller models which still perform to a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models saves on the requirement to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations on data and MT use. Finally, we show how if done properly, using MT in crisis scenarios can save lives, and we provide a method of how this might be done.
- [64] arXiv:2503.20960 [pdf, html, other]
-
Title: Multi-Modal Framing Analysis of News
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
- [65] arXiv:2503.20967 [pdf, html, other]
-
Title: Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging
David Wong, Bin Wang, Gorkem Durak, Marouane Tliba, Akshay Chaudhari, Aladine Chetouani, Ahmet Enis Cetin, Cagdas Topel, Nicolo Gennaro, Camila Lopes Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H. Miller, Amir Borhani, Hatice Savas, Eric Hart, Drew Torigian, Jayaram K. Udupa, Elizabeth Krupinski, Ulas Bagci
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The demand for high-quality synthetic data for model training and augmentation has never been greater in medical imaging. However, current evaluations predominantly rely on computational metrics that fail to align with human expert recognition. This leads to synthetic images that may appear realistic numerically but lack clinical authenticity, posing significant challenges in ensuring the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that synergizes expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal leverages gaze patterns of radiologists as they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the generated images (by the most recent state-of-the-art AI algorithm) were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.
- [66] arXiv:2503.20968 [pdf, html, other]
-
Title: Reinforcement Learning for Efficient Toxicity Detection in Competitive Online Video Games
Subjects: Machine Learning (cs.LG)
Online platforms take proactive measures to detect and address undesirable behavior, aiming to focus these resource-intensive efforts where such behavior is most prevalent. This article considers the problem of efficient sampling for toxicity detection in competitive online video games. To make optimal monitoring decisions, video game service operators need estimates of the likelihood of toxic behavior. If no model is available for these predictions, one must be estimated in real time. To close this gap, we propose a contextual bandit algorithm that makes monitoring decisions based on a small set of variables that, according to domain expertise, are associated with toxic behavior. This algorithm balances exploration and exploitation to optimize long-term outcomes and is deliberately designed for easy deployment in production. Using data from the popular first-person action game Call of Duty: Modern Warfare III, we show that our algorithm consistently outperforms baseline algorithms that rely solely on players' past behavior. This finding has substantive implications for the nature of toxicity. It also illustrates how domain expertise can be harnessed to help video game service operators identify and mitigate toxicity, ultimately fostering a safer and more enjoyable gaming experience.
- [67] arXiv:2503.20975 [pdf, html, other]
-
Title: Competitive Multi-armed Bandit Games for Resource SharingComments: This paper has been accepted by IEEE TMCSubjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
In modern resource-sharing systems, multiple agents access limited resources with unknown stochastic conditions to perform tasks. When multiple agents access the same resource (arm) simultaneously, they compete for successful usage, leading to contention and reduced rewards. This motivates our study of competitive multi-armed bandit (CMAB) games. In this paper, we study a new N-player K-arm competitive MAB game, where non-myopic players (agents) compete with each other to form diverse private estimations of unknown arms over time. Their possible collisions on the same arms and the time-varying nature of arm rewards make the policy analysis more involved than existing studies for myopic players. We explicitly analyze the threshold-based structures of the social optimum and the existing selfish policy, showing that the latter causes a prolonged convergence time of $\Omega(\frac{K}{\eta^2}\ln({\frac{KN}{\delta}}))$, while the socially optimal policy with coordinated communication reduces it to $\mathcal{O}(\frac{K}{N\eta^2}\ln{(\frac{K}{\delta})})$. Based on the comparison, we prove that the competition among selfish players for the best arm can result in an infinite price of anarchy (PoA), indicating an arbitrarily large efficiency loss compared to the social optimum. We further prove that no informational (non-monetary) mechanism (including Bayesian persuasion) can reduce the infinite PoA, as the strategic misreporting by non-myopic players undermines such approaches. To address this, we propose a Combined Informational and Side-Payment (CISP) mechanism, which provides socially optimal arm recommendations with proper informational and monetary incentives to players according to their time-varying private beliefs. Our CISP mechanism keeps the social planner ex-post budget balanced and ensures truthful reporting from players, achieving the minimum PoA=1 and the same convergence time as the social optimum.
- [68] arXiv:2503.20976 [pdf, html, other]
-
Title: Generator Cost Coefficients Inference Attack via Exploitation of Locational Marginal Prices in Smart GridComments: Submitted to IEEE Smart Grid Communication Conference 2025Subjects: Cryptography and Security (cs.CR)
Real-time price signals and power generation levels (disaggregated or aggregated) are commonly made available to the public by Independent System Operators (ISOs) to promote efficiency and transparency. However, they may inadvertently reveal crucial private information about the power grid, such as the cost functions of generators. Adversaries can exploit these vulnerabilities for strategic bidding, potentially leading to financial losses for power market participants and consumers. In this paper, we prove the existence of a closed-form solution for recovering coefficients in cost functions when locational marginal prices (LMPs) and disaggregated power generation data are available. Additionally, we establish the convergence conditions for inferring the quadratic coefficients of cost functions when LMPs and aggregated generation data are given. Our theoretical analysis provides the conditions under which the algorithm is guaranteed to converge, and our experiments demonstrate the efficacy of this method on IEEE benchmark systems, including the 14-bus, 30-bus, and 118-bus systems.
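To make the leakage concrete, here is a minimal sketch under a textbook assumption that is not taken from the paper: a generator has a quadratic cost $C(p) = a p^2 + b p + c$ and, when it is marginal and unconstrained, the published price equals its marginal cost $2ap + b$. Given a few (output, price) pairs, $a$ and $b$ follow from an ordinary least-squares line fit; the constant $c$ cannot be recovered from prices alone. The function name and the sample numbers are hypothetical.

```python
def fit_marginal_cost(outputs, prices):
    """Least-squares fit of price = 2*a*p + b; returns (a, b).

    Assumes (hypothetically) that each observation comes from an
    unconstrained marginal generator with quadratic cost.
    """
    n = len(outputs)
    mean_p = sum(outputs) / n
    mean_l = sum(prices) / n
    sxx = sum((p - mean_p) ** 2 for p in outputs)
    sxy = sum((p - mean_p) * (l - mean_l) for p, l in zip(outputs, prices))
    slope = sxy / sxx
    return slope / 2.0, mean_l - slope * mean_p

# Synthetic observations generated with a = 0.02 and b = 15.0.
outputs = [50.0, 80.0, 120.0]
prices = [2 * 0.02 * p + 15.0 for p in outputs]
a, b = fit_marginal_cost(outputs, prices)
```

With noise-free data two observations already determine the line; the least-squares form is used only so that the same sketch tolerates noisy prices.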
- [69] arXiv:2503.20978 [pdf, html, other]
-
Title: ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and PredictionComments: Accepted to MM4SG Workshop at The Web Conference 2025Subjects: Computation and Language (cs.CL)
Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.
- [70] arXiv:2503.20981 [pdf, html, other]
-
Title: Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care SatisfactionXiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao LiSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group(CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.
- [71] arXiv:2503.20982 [pdf, html, other]
-
Title: Permutation polynomials over finite fields from low-degree rational functionsComments: 21 pagesSubjects: Cryptography and Security (cs.CR); Number Theory (math.NT)
This paper considers permutation polynomials over the finite field $F_{q^2}$ in even characteristic by utilizing low-degree permutation rational functions over $F_q$. As a result, we obtain two classes of permutation binomials and six classes of permutation pentanomials over $F_{q^2}$. Additionally, we show that the obtained binomials and pentanomials are quasi-multiplicative inequivalent to the known ones in the literature.
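For readers unfamiliar with the term, a polynomial over a finite field is a permutation polynomial when its evaluation map is a bijection on the field. The brute-force check below is a small sketch over $F_{16} = F_{4^2}$ (even characteristic, $q = 4$), built with the irreducible polynomial $x^4 + x + 1$. The helper names are hypothetical, and the examples are standard textbook monomials and a linearized binomial, not the classes constructed in the paper.

```python
MOD = 0b10011  # x^4 + x + 1, irreducible over GF(2)

def gf16_mul(a, b):
    """Multiply two elements of GF(16) represented as 4-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:  # reduce when the degree reaches 4
            a ^= MOD
    return r

def gf16_pow(a, e):
    """Raise a to the non-negative integer power e by repeated multiplication."""
    r = 1
    for _ in range(e):
        r = gf16_mul(r, a)
    return r

def is_permutation(terms):
    """terms is a list of (coefficient, exponent) pairs for sum c * x^e."""
    images = set()
    for x in range(16):
        y = 0
        for c, e in terms:
            y ^= gf16_mul(c, gf16_pow(x, e))  # addition in GF(16) is XOR
        images.add(y)
    return len(images) == 16

x7_permutes = is_permutation([(1, 7)])          # gcd(7, 15) = 1
x3_permutes = is_permutation([(1, 3)])          # gcd(3, 15) = 3
x4_plus_x = is_permutation([(1, 4), (1, 1)])    # vanishes on the subfield F_4
```

Exhaustive checking like this is only feasible for tiny fields, which is why structural constructions such as those in the paper are of interest.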
- [72] arXiv:2503.20985 [pdf, html, other]
-
Title: Deterministic Vertex Connectivity via Common-Neighborhood Clustering and PseudorandomnessSubjects: Data Structures and Algorithms (cs.DS)
We give a deterministic algorithm for computing a global minimum vertex cut in a vertex-weighted graph with $n$ vertices and $m$ edges in $\widehat O(mn)$ time. This breaks the long-standing $\widehat \Omega(n^{4})$-time barrier in dense graphs, achievable by trivially computing all-pairs maximum flows. Up to subpolynomial factors, we match the fastest randomized $\tilde O(mn)$-time algorithm by [Henzinger, Rao, and Gabow'00], and affirmatively answer the question by [Gabow'06] whether deterministic $O(mn)$-time algorithms exist even for unweighted graphs. Our algorithm works in directed graphs, too.
In unweighted undirected graphs, we present a faster deterministic $\widehat O(m\kappa)$-time algorithm where $\kappa\le n$ is the size of the global minimum vertex cut. For a moderate value of $\kappa$, this strictly improves upon all previous deterministic algorithms in unweighted graphs with running time $\widehat O(m(n+\kappa^{2}))$ [Even'75], $\widehat O(m(n+\kappa\sqrt{n}))$ [Gabow'06], and $\widehat O(m2^{O(\kappa^{2})})$ [Saranurak and Yingchareonthawornchai'22]. Recently, a linear-time algorithm has been shown by [Korhonen'24] for very small $\kappa$.
Our approach applies the common-neighborhood clustering, recently introduced by [Blikstad, Jiang, Mukhopadhyay, Yingchareonthawornchai'25], in novel ways, e.g., on top of weighted graphs and on top of vertex-expander decomposition. We also exploit pseudorandom objects often used in computational complexity communities, including crossing families based on dispersers from [Wigderson and Zuckerman'99; Ta-Shma, Umans and Zuckerman'01] and selectors based on linear lossless condensers [Guruswami, Umans and Vadhan'09; Cheraghchi'11]. To our knowledge, this is the first application of selectors in graph algorithms.
- [73] arXiv:2503.20986 [pdf, html, other]
-
Title: Musical Chairs: A new benchmark to evaluate AIComments: 16 pages, 3 figures, accepted at this https URLSubjects: Computers and Society (cs.CY); Theoretical Economics (econ.TH)
This paper presents a new contribution to the growing set of benchmarks used to prune potential AI designs. Much as one might evaluate a machine in terms of its performance at chess, this benchmark involves testing a machine in terms of its performance at a game called "Musical Chairs." At the time of writing, Claude, ChatGPT, and Qwen each failed this test, so the test could aid in their ongoing improvement. Furthermore, this paper sets a stage for future innovation in game theory and AI safety by providing an example of success with non-standard approaches to each: studying a game beyond the scope of previous game theoretic tools and mitigating a serious AI safety risk in a way that requires neither determination of values nor their enforcement.
- [74] arXiv:2503.20988 [pdf, html, other]
-
Title: Cross-Modal State-Space Graph Reasoning for Structured SummarizationSubjects: Computation and Language (cs.CL); Graphics (cs.GR)
The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a \textit{Cross-Modal State-Space Graph Reasoning} (\textbf{CSS-GR}) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.
- [75] arXiv:2503.20989 [pdf, html, other]
-
Title: Inferring fine-grained migration patterns across the United StatesSubjects: Computers and Society (cs.CY)
Fine-grained migration data illuminate important demographic, environmental, and health phenomena. However, migration datasets within the United States remain lacking: publicly available Census data are neither spatially nor temporally granular, and proprietary data have higher resolution but demographic and other biases. To address these limitations, we develop a scalable iterative-proportional-fitting based method which reconciles high-resolution but biased proprietary data with low-resolution but more reliable Census data. We apply this method to produce MIGRATE, a dataset of annual migration matrices from 2010 - 2019 which captures flows between 47.4 billion pairs of Census Block Groups -- about four thousand times more granular than publicly available data. These estimates are highly correlated with external ground-truth datasets, and improve accuracy and reduce bias relative to raw proprietary data. We publicly release MIGRATE estimates and provide a case study illustrating how they reveal granular patterns of migration in response to California wildfires.
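The core idea of iterative proportional fitting can be sketched in a few lines: start from a biased but detailed seed matrix and alternately rescale its rows and columns until they match trusted marginal totals. The version below is a generic two-dimensional illustration, not the paper's scalable method; the function name, the seed matrix, and the target totals are hypothetical.

```python
def ipf(seed, row_targets, col_targets, iters=100):
    """Rescale a positive seed matrix so its margins match the targets.

    Assumes sum(row_targets) == sum(col_targets) and strictly positive seeds.
    """
    m = [row[:] for row in seed]
    for _ in range(iters):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(m[i])
            m[i] = [v * target / s for v in m[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            for row in m:
                row[j] *= target / s
    return m

# Hypothetical flows between two areas: detailed seed, reliable margins.
seed = [[40.0, 10.0], [20.0, 30.0]]
fitted = ipf(seed, row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
row_sums = [sum(row) for row in fitted]
col_sums = [sum(row[j] for row in fitted) for j in range(2)]
```

The fitted matrix keeps the relative structure of the seed while agreeing with the margins, which is the sense in which the two data sources are reconciled.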
- [76] arXiv:2503.20990 [pdf, html, other]
-
Title: FinAudio: A Benchmark for Audio Large Language Models in Financial ApplicationsYupeng Cao, Haohang Li, Yangyang Yu, Shashidhar Reddy Javaji, Yueru He, Jimin Huang, Zining Zhu, Qianqian Xie, Xiao-yang Liu, Koduvayur Subbalakshmi, Meikang Qiu, Sophia Ananiadou, Jian-Yun NieSubjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is an absence of a benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce \textsc{FinAudio}, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Then, we curate two short and two long audio datasets, respectively, and develop a novel dataset for financial audio summarization, comprising the \textsc{FinAudio} benchmark. Finally, we evaluate seven prevalent AudioLLMs on \textsc{FinAudio}. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving AudioLLMs. All datasets and code will be released.
- [77] arXiv:2503.20991 [pdf, html, other]
-
Title: MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic EvidenceComments: Proceedings of the Winter Conference on Applications of Computer Vision (WACV) 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g. deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet - a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.
- [78] arXiv:2503.20992 [pdf, html, other]
-
Title: ReverBERT: A State Space Model for Efficient Text-Driven Speech Style TransferSubjects: Graphics (cs.GR); Computation and Language (cs.CL)
Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.
- [79] arXiv:2503.20994 [pdf, html, other]
-
Title: Deep Learning for Forensic Identification of SourceSubjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
We used contrastive neural networks to learn useful similarity scores between the 144 cartridge casings in the NBIDE dataset, under the common-but-unknown source paradigm. The common-but-unknown source problem is a problem archetype in forensics where the question is whether two objects share a common source (e.g. were two cartridge casings fired from the same firearm). Similarity scores are often used to interpret evidence under this paradigm. We directly compared our results to a state-of-the-art algorithm, Congruent Matching Cells (CMC). When trained on the E3 dataset of 2967 cartridge casings, contrastive learning achieved an ROC AUC of 0.892. The CMC algorithm achieved 0.867. We also conducted an ablation study where we varied the neural network architecture; specifically, the network's width or depth. The ablation study showed that contrastive network performance results are somewhat robust to the network architecture. This work was in part motivated by the use of similarity scores attained via contrastive learning for standard evidence interpretation methods such as score-based likelihood ratios.
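The ROC AUC figures quoted above can be read as the probability that a randomly chosen same-source pair receives a higher similarity score than a randomly chosen different-source pair, with ties counted as one half. The sketch below computes that quantity directly; the function name and the scores are made-up illustration values, not data from the paper.

```python
def roc_auc(same_source_scores, diff_source_scores):
    """AUC as the fraction of (same, different) pairs ranked correctly."""
    wins = 0.0
    for s in same_source_scores:
        for d in diff_source_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5  # ties count as half a win
    return wins / (len(same_source_scores) * len(diff_source_scores))

# Hypothetical similarity scores for casing pairs.
same = [0.9, 0.8, 0.4]
diff = [0.7, 0.3, 0.2]
auc = roc_auc(same, diff)  # 8 of the 9 pairs are ranked correctly
```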
- [80] arXiv:2503.20995 [pdf, html, other]
-
Title: Multi-head Reward Aggregation Guided by EntropySubjects: Computation and Language (cs.CL)
Aligning large language models (LLMs) with safety guidelines typically involves reinforcement learning from human feedback (RLHF), relying on human-generated preference annotations. However, assigning consistent overall quality ratings is challenging, prompting recent research to shift towards detailed evaluations based on multiple specific safety criteria. This paper uncovers a consistent observation: safety rules characterized by high rating entropy are generally less reliable in identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules exhibiting high rating entropy. Theoretically, we demonstrate that rules with elevated entropy naturally receive minimal weighting in the Bradley-Terry optimization framework, justifying our entropy-based penalization. Through extensive experiments on RewardBench safety tasks, our method significantly surpasses several competitive baselines, including random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Our proposed approach is training-free, broadly applicable to various datasets, and maintains interpretability, offering a practical and effective solution for multi-attribute reward modeling.
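A minimal sketch of entropy-guided downweighting follows. It assumes, purely for illustration, that each rule's weight is proportional to $\exp(-H/\tau)$, where $H$ is the Shannon entropy of that rule's rating distribution; the abstract does not give ENCORE's exact weighting formula, and the rule names, ratings, and `temperature` parameter are hypothetical.

```python
import math
from collections import Counter

def rating_entropy(ratings):
    """Shannon entropy (in nats) of the empirical rating distribution."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_weights(ratings_by_rule, temperature=1.0):
    """Assumed weighting: w_rule proportional to exp(-entropy / temperature)."""
    raw = {rule: math.exp(-rating_entropy(r) / temperature)
           for rule, r in ratings_by_rule.items()}
    total = sum(raw.values())
    return {rule: w / total for rule, w in raw.items()}

def aggregate(scores_by_rule, weights):
    """Combine per-rule reward heads into one scalar reward."""
    return sum(weights[rule] * s for rule, s in scores_by_rule.items())

# Hypothetical ratings collected for two safety rules.
ratings = {
    "rule_a": [5, 5, 5, 5, 1, 5, 5, 5],  # concentrated ratings, low entropy
    "rule_b": [1, 2, 3, 4, 5, 1, 3, 5],  # spread-out ratings, high entropy
}
weights = entropy_weights(ratings)
reward = aggregate({"rule_a": 4.0, "rule_b": 2.0}, weights)
```

Because the weights are computed directly from rating statistics, no additional training is involved, which matches the training-free property claimed above.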
- [81] arXiv:2503.20998 [pdf, html, other]
-
Title: CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View SynthesisComments: Accepted to CVPR 2025, Mistakenly submitted as a replacement for arXiv:2402.11057Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision using a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with varying sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.
- [82] arXiv:2503.20999 [pdf, html, other]
-
Title: Text-Driven Voice Conversion via Latent State-Space ModelingSubjects: Graphics (cs.GR); Sound (cs.SD)
Text-driven voice conversion allows customization of speaker characteristics and prosodic elements using textual descriptions. However, most existing methods rely heavily on direct text-to-speech training, limiting their flexibility in controlling nuanced style elements or timbral features. In this paper, we propose a novel \textbf{Latent State-Space} approach for text-driven voice conversion (\textbf{LSS-VC}). Our method treats each utterance as an evolving dynamical system in a continuous latent space. Drawing inspiration from mamba, which introduced a state-space model for efficient text-driven \emph{image} style transfer, we adapt a loosely related methodology for \emph{voice} style transformation. Specifically, we learn a voice latent manifold where style and content can be manipulated independently by textual style prompts. We propose an adaptive cross-modal fusion mechanism to inject style information into the voice latent representation, enabling interpretable and fine-grained control over speaker identity, speaking rate, and emphasis. Extensive experiments show that our approach significantly outperforms recent baselines in both subjective and objective quality metrics, while offering smoother transitions between styles, reduced artifacts, and more precise text-based style control.
- [83] arXiv:2503.21000 [pdf, html, other]
-
Title: Improving User Behavior Prediction: Leveraging Annotator Metadata in Supervised Machine Learning ModelsComments: Accepted at CSCW 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Supervised machine-learning models often underperform in predicting user behaviors from conversational text, hindered by poor crowdsourced label quality and low NLP task accuracy. We introduce the Metadata-Sensitive Weighted-Encoding Ensemble Model (MSWEEM), which integrates annotator meta-features like fatigue and speeding. First, our results show MSWEEM outperforms standard ensembles by 14% on held-out data and 12% on an alternative dataset. Second, we find that incorporating signals of annotator behavior, such as speed and fatigue, significantly boosts model performance. Third, we find that annotators with higher qualifications, such as Master's, deliver more consistent and faster annotations. Given the increasing uncertainty over annotation quality, our experiments show that understanding annotator patterns is crucial for enhancing model accuracy in user behavior prediction.
- [84] arXiv:2503.21003 [pdf, html, other]
-
Title: Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated ImagesComments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures - subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.
- [85] arXiv:2503.21004 [pdf, other]
-
Title: Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and ParametersMahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian WongSubjects: Computation and Language (cs.CL)
Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet our understanding of optimal management remains limited due to heterogeneous and inaccessible radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. Large language models (LLMs) offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports. This study evaluated the accuracy of LLMs in extracting PE-related concepts compared to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and Duke Health CTPE reports using multiple LLaMA models. Larger models (70B) outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts). Moderate temperature tuning (0.2-0.5) improved accuracy, while excessive in-context examples reduced performance. A dual-model review framework achieved >80-90% precision. LLMs demonstrate strong potential for automating PE registry abstraction, minimizing manual workload while preserving accuracy.
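For context on the agreement figures above, the sketch below computes Cohen's kappa between a human abstractor and a model on a binary concept. It assumes the reported kappa is Cohen's (the abstract does not say which variant), and the labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical PE-present (1) / PE-absent (0) labels for ten reports.
human = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
model = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(human, model)
```

Here the raters agree on 9 of 10 reports (0.90 observed) against 0.54 expected by chance, so kappa is 0.36 / 0.46, roughly 0.78, which shows why kappa sits below raw agreement.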
- [86] arXiv:2503.21007 [pdf, html, other]
-
Title: Bounds on Deep Neural Network Partial Derivatives with Respect to ParametersComments: 8 pagesSubjects: Systems and Control (eess.SY)
Deep neural networks (DNNs) have emerged as a powerful tool with a growing body of literature exploring Lyapunov-based approaches for real-time system identification and control. These methods depend on establishing bounds for the second partial derivatives of DNNs with respect to their parameters, a requirement often assumed but rarely addressed explicitly. This paper provides rigorous mathematical formulations of polynomial bounds on both the first and second partial derivatives of DNNs with respect to their parameters. We present lemmas that characterize these bounds for fully-connected DNNs, while accommodating various classes of activation functions, including sigmoidal and ReLU-like functions. Our analysis yields closed-form expressions that enable precise stability guarantees for Lyapunov-based deep neural networks (Lb-DNNs). Furthermore, we extend our results to bound the higher-order terms in first-order Taylor approximations of DNNs, providing important tools for convergence analysis in gradient-based learning algorithms. The resulting theoretical framework provides explicit, computable expressions for previously assumed bounds, thereby strengthening the mathematical foundation of neural network applications in safety-critical control systems.
- [87] arXiv:2503.21010 [pdf, html, other]
-
Title: Privacy in Immersive Extended Reality: Exploring User Perceptions, Concerns, and Coping StrategiesComments: 25 pages, 4 figures, 8 tables. the 2024 CHI Conference on Human Factors in Computing Systems (CHI'24)Subjects: Human-Computer Interaction (cs.HC)
Extended Reality (XR) technology is changing online interactions, but its granular data collection sensors may be more invasive to user privacy than web, mobile, and the Internet of Things technologies. Despite an increased interest in studying developers' concerns about XR device privacy, user perceptions have rarely been addressed. We surveyed 464 XR users to assess their awareness, concerns, and coping strategies around XR data in 18 scenarios. Our findings demonstrate that many factors, such as data types and sensitivity, affect users' perceptions of privacy in XR. However, users' limited awareness of XR sensors' granular data collection capabilities, such as involuntary body signals of emotional responses, restricted the range of privacy-protective strategies they used. Our results highlight a need to enhance users' awareness of data privacy threats in XR, design privacy-choice interfaces tailored to XR environments, and develop transparent XR data practices.
- [88] arXiv:2503.21011 [pdf, html, other]
-
Title: Can Large Language Models Predict Associations Among Human Attitudes?Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Prior work has shown that large language models (LLMs) can predict human attitudes based on other attitudes, but this work has largely focused on predictions from highly similar and interrelated attitudes. In contrast, human attitudes are often strongly associated even across disparate and dissimilar topics. Using a novel dataset of human responses toward diverse attitude statements, we found that a frontier language model (GPT-4o) was able to recreate the pairwise correlations among individual attitudes and to predict individuals' attitudes from one another. Crucially, in an advance over prior work, we tested GPT-4o's ability to predict in the absence of surface similarity between attitudes, finding that while surface similarity improves prediction accuracy, the model was still highly capable of generating meaningful social inferences between dissimilar attitudes. Altogether, our findings indicate that LLMs capture crucial aspects of the deeper, latent structure of human belief systems.
- [89] arXiv:2503.21013 [pdf, html, other]
-
Title: AllReduce Scheduling with Hierarchical Deep Reinforcement LearningSubjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
AllReduce is a distributed-computing technique used in many critical deep learning applications. Existing methods of AllReduce scheduling often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain-specific knowledge. In this work, we aim to alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that can generate AllReduce schedules for various network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically-structured DRL policies that work cooperatively to find optimal scheduling. We showcase the performance of our method compared to the baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based simulation environment simulating AllReduce scheduling on these network topologies.
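To show what an AllReduce schedule looks like in the simplest case, the sketch below simulates the classic ring schedule: a reduce-scatter phase followed by an all-gather phase, each taking n - 1 steps. This is a fixed, handcrafted schedule of the kind the abstract contrasts with, not the DRL-generated scheduling; the function name and data are hypothetical, and each worker is assumed to hold exactly n scalar chunks.

```python
def ring_allreduce(data):
    """Simulate ring AllReduce (sum) for n workers holding n chunks each."""
    n = len(data)
    buf = [row[:] for row in data]
    # Reduce-scatter: at step s, worker i sends chunk (i - s) mod n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [((i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] += value
    # All-gather: worker i now owns the fully reduced chunk (i + 1) mod n and
    # forwards completed chunks around the ring, overwriting stale copies.
    for s in range(n - 1):
        sends = [((i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] = value
    return buf

workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(workers)
```

All sends within a step are snapshotted before being applied, which mimics simultaneous communication on the ring.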
- [90] arXiv:2503.21016 [pdf, html, other]
-
Title: History-Independent Concurrent Hash TablesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
A history-independent data structure does not reveal the history of operations applied to it, only its current logical state, even if its internal state is examined. This paper studies history-independent concurrent dictionaries, in particular, hash tables, and establishes inherent bounds on their space requirements.
This paper shows that there is a lock-free history-independent concurrent hash table, in which each memory cell stores two elements and two bits, based on Robin Hood hashing. Our implementation is linearizable, and uses the shared memory primitive LL/SC. The expected amortized step complexity of the hash table is $O(c)$, where $c$ is an upper bound on the number of concurrent operations that access the same element, assuming the hash table is not overpopulated. We complement this positive result by showing that even if we have only two concurrent processes, no history-independent concurrent dictionary that supports sets of any size, with wait-free membership queries and obstruction-free insertions and deletions, can store only two elements of the set and a constant number of bits in each memory cell. This holds even if the step complexity of operations on the dictionary is unbounded.
- [91] arXiv:2503.21018 [pdf, other]
-
Title: Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse DatasetsSubjects: Machine Learning (cs.LG)
While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.
- [92] arXiv:2503.21022 [pdf, html, other]
-
Title: Reconstructing Gridded Data from Higher AutocorrelationsComments: 13 pages, 1 figureSubjects: Computer Vision and Pattern Recognition (cs.CV); Group Theory (math.GR); Data Analysis, Statistics and Probability (physics.data-an)
The higher-order autocorrelations of integer-valued or rational-valued gridded data sets appear naturally in X-ray crystallography, and have applications in computer vision systems, correlation tomography, correlation spectroscopy, and pattern recognition. In this paper, we consider the problem of reconstructing a gridded data set from its higher-order autocorrelations. We describe an explicit reconstruction algorithm, and prove that the autocorrelations up to order 3r + 3 are always sufficient to determine the data up to translation, where r is the dimension of the grid. We also provide examples of rational-valued gridded data sets which are not determined by their autocorrelations up to order 3r + 2.
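As a small illustration of the objects involved, the sketch below computes higher-order autocorrelations for the one-dimensional case (r = 1). It assumes the common definition, in which the order-k autocorrelation at shifts (t1, ..., t_{k-1}) is the sum over x of f(x) f(x + t1) ... f(x + t_{k-1}), with the data treated as zero outside its support; the paper's exact conventions may differ, and the function name `autocorrelation` is hypothetical. It does not implement the paper's reconstruction algorithm.

```python
from itertools import product


def autocorrelation(data, order):
    """Order-`order` autocorrelation of a finitely supported 1-D sequence.

    Returns a dict mapping each shift tuple (t1, ..., t_{order-1}) to
    sum_x data[x] * data[x + t1] * ... * data[x + t_{order-1}],
    keeping only the nonzero entries. Values outside the list count as zero.
    """
    n = len(data)
    shifts = range(-(n - 1), n)
    result = {}
    for ts in product(shifts, repeat=order - 1):
        total = 0
        for x in range(n):
            term = data[x]
            for t in ts:
                j = x + t
                if 0 <= j < n:
                    term *= data[j]
                else:
                    term = 0
                    break
            total += term
        if total:
            result[ts] = total
    return result


# Translating the data leaves every autocorrelation unchanged.
same_under_shift = autocorrelation([1, 2, 3, 0], 3) == autocorrelation([0, 1, 2, 3], 3)

# Order 2 cannot tell a sequence from its mirror image, but order 3 can.
order2_equal = autocorrelation([1, 2, 3], 2) == autocorrelation([3, 2, 1], 2)
order3_equal = autocorrelation([1, 2, 3], 3) == autocorrelation([3, 2, 1], 3)
```

The mirror-image comparison shows why low orders alone lose information, which is the kind of ambiguity that higher orders are needed to resolve.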
- [93] arXiv:2503.21023 [pdf, html, other]
-
Title: Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian FrameworkSubjects: Machine Learning (cs.LG)
Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior works. In this paper, we introduce a probabilistic extrapolation framework for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem (multi-fidelity, multi-scale Bayesian optimization) where {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity BO and random search baselines. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.
- [94] arXiv:2503.21025 [pdf, html, other]
-
Title: Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk ToolkitComments: 10 pages, 7 figures, includes workflow diagram, accuracy and WER comparisons, spectrograms, and model evaluationSubjects: Sound (cs.SD); Machine Learning (cs.LG)
Although speech recognition algorithms have developed quickly in recent years, achieving high transcription accuracy across diverse audio formats and acoustic environments remains a major challenge. This work explores how incorporating custom language models with the open-source Vosk Toolkit can improve speech-to-text accuracy in varied settings. Unlike many conventional systems limited to specific audio types, this approach supports multiple audio formats such as WAV, MP3, FLAC, and OGG by using Python modules for preprocessing and format conversion.
A Python-based transcription pipeline was developed to process input audio, perform speech recognition using Vosk's KaldiRecognizer, and export the output to a DOCX file. Results showed that custom models reduced word error rates, especially in domain-specific scenarios involving technical terminology, varied accents, or background noise. This work presents a cost-effective, offline solution for high-accuracy transcription and opens up future opportunities for automation and real-time applications.
- [95] arXiv:2503.21029 [pdf, html, other]
-
Title: Enhancing Korean Dependency Parsing with Morphosyntactic FeaturesSubjects: Computation and Language (cs.CL)
This paper introduces UniDive for Korean, an integrated framework that bridges Universal Dependencies (UD) and Universal Morphology (UniMorph) to enhance the representation and processing of Korean morphosyntax. Korean's rich inflectional morphology and flexible word order pose challenges for existing frameworks, which often treat morphology and syntax separately, leading to inconsistencies in linguistic analysis. UniDive unifies syntactic and morphological annotations by preserving syntactic dependencies while incorporating UniMorph-derived features, improving consistency in annotation. We construct an integrated dataset and apply it to dependency parsing, demonstrating that enriched morphosyntactic features enhance parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Our experiments, conducted with both encoder-only and decoder-only models, confirm that explicit morphological information contributes to more accurate syntactic analysis.
- [96] arXiv:2503.21033 [pdf, html, other]
-
Title: Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It provides a detailed mapping of current frameworks for distributed deep learning in multinode and multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in distributed capabilities of PyTorch and TensorFlow. We compare various multi-GPU setups for different dataset configurations, utilizing multiple HPC nodes independently and focusing on scalability, speedup, efficiency, and overhead. The analysis leverages HPC infrastructure with SLURM, Apptainer (Singularity) containers, CUDA, PyTorch, and shell scripts to support training workflows and automation. We achieved a sub-linear speedup when scaling the number of GPUs, with values of 1.6x for two and 1.9x for four.
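To make the reported speedups concrete, the short calculation below converts them into parallel efficiency, using the standard definition efficiency = speedup / number of GPUs. Only the two speedup figures come from the abstract; the helper name `parallel_efficiency` is illustrative.

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency: achieved speedup divided by the ideal (linear) speedup."""
    return speedup / workers


# Speedups quoted in the abstract, keyed by GPU count.
reported_speedup = {2: 1.6, 4: 1.9}

efficiency = {
    gpus: parallel_efficiency(speedup, gpus)
    for gpus, speedup in reported_speedup.items()
}

for gpus, value in efficiency.items():
    print(f"{gpus} GPUs: efficiency {value:.3f}")
```

Under this definition, the reported figures correspond to an efficiency of 0.8 on two GPUs and 0.475 on four, so each added GPU contributes less than the one before it.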
- [97] arXiv:2503.21036 [pdf, html, other]
-
Title: The Art of Tool Interface DesignSubjects: Artificial Intelligence (cs.AI)
We present an agentic framework, Thinker, which achieves state-of-the-art performance in challenging reasoning tasks for realistic customer service scenarios that involve complex business logic and human interactions via long horizons. On the $\tau$-bench retail dataset, Thinker achieves an 82.6% success rate with GPT-4o (version 2024-06-01) (baseline: 68.3%), and an 81.9% success rate with Llama-3.1 405B (baseline: 49.6%), without any fine-tuning. Thinker effectively closes the gap in reasoning capabilities between the base models by introducing proper structure.
The key features of the Thinker framework are: (1) State-Machine Augmented Generation (SMAG), which represents business logic as state machines that the LLM uses as tools. (2) Delegation of tasks from the main reasoning loop to LLM-powered tools. (3) Adaptive context management.
Our prompting-only solution achieves significant gains, while still maintaining a standard agentic architecture with a ReAct style reasoning loop. The key is to innovate on the tool interface design, as exemplified by SMAG and the LLM-powered tools.
- [98] arXiv:2503.21040 [pdf, html, other]
-
Title: Local Stability and Stabilization of Quadratic-Bilinear Systems using Petersen's LemmaSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Quadratic-bilinear (QB) systems arise in many areas of science and engineering. In this paper, we present a scalable approach for designing locally stabilizing state-feedback control laws and certifying the local stability of QB systems. Sufficient conditions are established for local stability and stabilization based on quadratic Lyapunov functions, which also provide ellipsoidal inner-estimates for the region of attraction and region of stabilizability of an equilibrium point. Our formulation exploits Petersen's Lemma to convert the problem of certifying the sign-definiteness of the Lyapunov condition into a line search over a single scalar parameter. The resulting linear matrix inequality (LMI) conditions scale quadratically with the state dimension for both stability analysis and control synthesis, thus enabling analysis and control of QB systems with hundreds of state variables without resorting to specialized implementations. We demonstrate the approach on three benchmark problems from the existing literature. In all cases, we find our formulation yields comparable approximations of stability domains as determined by other established tools that are otherwise restricted to systems with up to tens of state variables.
- [99] arXiv:2503.21042 [pdf, html, other]
-
Title: Dissipativity-Based Distributed Control and Communication Topology Co-Design for DC Microgrids with ZIP LoadsSubjects: Systems and Control (eess.SY)
This paper presents a novel dissipativity-based distributed droop-free control approach for voltage regulation, current sharing, and Constant Power Load (CPL) stability in DC microgrids (MGs). We describe the closed-loop DC MG as a networked system where DGs, lines, and nonlinear loads (including destabilizing CPLs) are interconnected via a static interconnection matrix. Each DG has a local controller and a distributed global controller, designed using dissipativity properties and sector-bounded techniques. For controller synthesis, we formulate a Linear Matrix Inequality (LMI) problem that simultaneously addresses voltage regulation, current sharing, and CPL stability guarantees. To support the feasibility of this problem, we propose a sector-bounded approach that characterizes CPL nonlinearities and integrates them into the dissipativity framework through S-procedure techniques. Our approach provides a unified framework for co-designing distributed controllers and communication topologies that ensure stability despite the presence of destabilizing CPL effects. The effectiveness of the proposed solution was verified by simulating an islanded DC MG under different scenarios, demonstrating superior performance compared to traditional control approaches when handling CPLs.
- [100] arXiv:2503.21044 [pdf, html, other]
-
Title: Exploring Interference between Concurrent Skin StretchesSubjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Proprioception is essential for coordinating human movements and enhancing the performance of assistive robotic devices. Skin stretch feedback, which closely aligns with natural proprioception mechanisms, presents a promising method for conveying proprioceptive information. To better understand the impact of interference on skin stretch perception, we conducted a user study with 30 participants that evaluated the effect of two simultaneous skin stretches on user perception. We observed that when participants experience simultaneous skin stretch stimuli, a masking effect occurs which deteriorates perception performance in the collocated skin stretch configurations. However, the perceived workload stays the same. These findings show that interference can affect the perception of skin stretch such that multi-channel skin stretch feedback designs should avoid locating modules in close proximity.
- [101] arXiv:2503.21047 [pdf, html, other]
-
Title: World Model Agents with Change-Based Intrinsic MotivationComments: Submitted to Northern Lights Deep Learning Conference 2025Subjects: Machine Learning (cs.LG)
Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3's returns in Crafter but the algorithm attains a suboptimal policy in Minigrid with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.
- [102] arXiv:2503.21048 [pdf, html, other]
-
Title: Integrated utilization of equations and small dataset in the Koopman operator: applications to forward and inverse ProblemsComments: 10 pages, 8 figuresSubjects: Machine Learning (cs.LG)
In recent years, there has been a growing interest in data-driven approaches in physics, such as extended dynamic mode decomposition (EDMD). The EDMD algorithm focuses on nonlinear time-evolution systems, and the constructed Koopman matrix yields the next-time prediction with only linear matrix-product operations. Data-driven approaches generally require a large dataset. However, if one has some prior knowledge, even ambiguous knowledge, one could achieve sufficient learning from only a small dataset by taking advantage of it. This paper proposes methods for incorporating ambiguous prior knowledge into the EDMD algorithm. The ambiguous prior knowledge in this paper corresponds to the underlying time-evolution equations with unknown parameters. First, we apply the proposed method to forward problems, i.e., prediction tasks. Second, we propose a scheme to apply the proposed method to inverse problems, i.e., parameter estimation tasks. We demonstrate learning from only a small dataset using two guiding examples, the Duffing and the van der Pol systems.
- [103] arXiv:2503.21049 [pdf, other]
-
Title: On the Hardness Hierarchy for the $O(n \sqrt{\log n})$ Complexity in the Word RAMComments: Accepted to STOC 2025Subjects: Data Structures and Algorithms (cs.DS)
In this work, we study the relative hardness of fundamental problems with state-of-the-art word RAM algorithms that take $O(n\sqrt{\log n})$ time for instances described in $\Theta(n)$ machine words ($\Theta(n\log n)$ bits). This complexity class, one of six hardness levels identified by Chan and Pătraşcu [SODA 2010], includes diverse problems from several domains: Counting Inversions, string processing problems (BWT Construction, LZ77 Factorization, Longest Common Substring, Batched Longest Previous Factor Queries, Batched Inverse Suffix Array Queries), and computational geometry tasks (Orthogonal Range Counting, Orthogonal Segment Intersection). We offer two main contributions:
We establish new links between the above string problems and Dictionary Matching, a classic task solvable using the Aho-Corasick automaton. We restrict Dictionary Matching to instances with $O(n)$ binary patterns of length $m = O(\log n)$ each, and we prove that, unless these instances can be solved in $o(n\sqrt{\log n})$ time, the aforementioned string problems cannot be solved faster either.
Via further reductions, we extend this hardness to Counting Inversions (a fundamental component in geometric algorithms) and thus to Orthogonal Range Counting and Orthogonal Segment Intersection. This hinges on String Nesting, a new problem which is equivalent to Dictionary Matching and can be reduced to Counting Inversions in three steps.
Together, our results unveil a single problem, with two equivalent formulations, that underlies the hardness of nearly all major problems currently occupying the $O(n\sqrt{\log n})$ level of hardness. These results drastically funnel further efforts to improve the complexity of near-linear problems. As an auxiliary outcome of our framework, we also prove that the alphabet in several central string problems can be efficiently reduced to binary.
- [104] arXiv:2503.21055 [pdf, html, other]
-
Title: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation LearningComments: 16 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions and has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals and achieve significant improvements on multiple tasks. We will make our source code and data publicly available soon.
- [105] arXiv:2503.21056 [pdf, html, other]
-
Title: Online Reasoning Video Segmentation with Just-in-Time Digital TwinsSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where, given an implicit query, an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different levels of reasoning chain complexity.
- [106] arXiv:2503.21057 [pdf, html, other]
-
Title: Validation and Calibration of Energy Models with Real Vehicle Data from Chassis Dynamometer ExperimentsJoy Carpio, Sulaiman Almatrudi, Nour Khoudari, Zhe Fu, Kenneth Butts, Jonathan Lee, Benjamin Seibold, Alexandre BayenSubjects: Systems and Control (eess.SY)
Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications, such as traffic flow modeling, optimization, and control, require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to physical energy behavior. This work builds upon a recently developed model reduction pipeline that derives physics-like energy models from high-fidelity Autonomie vehicle simulations. These reduced models preserve essential vehicle dynamics, enabling realistic fuel consumption estimation with minimal computational overhead. While the reduced models have demonstrated strong agreement with their Autonomie counterparts, previous validation efforts have been confined to simulation environments. This study extends the validation by comparing the reduced energy model's outputs against real-world vehicle data. Focusing on the MidSUV category, we tune the baseline Autonomie model to closely replicate the characteristics of a Toyota RAV4. We then assess the accuracy of the resulting reduced model in estimating fuel consumption under actual drive conditions. Our findings suggest that, when the reference Autonomie model is properly calibrated, the simplified model produced by the reduction pipeline can provide reliable, semi-principled fuel rate estimates suitable for large-scale transportation applications.
- [107] arXiv:2503.21059 [pdf, html, other]
-
Title: Uncertainty propagation in feed-forward neural network modelsComments: 21 pages, 13 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop new uncertainty propagation methods for feed-forward neural network architectures with leaky ReLU activation functions subject to random perturbations in the input vectors. In particular, we derive analytical expressions for the probability density function (PDF) of the neural network output and its statistical moments as a function of the input uncertainty and the parameters of the network, i.e., weights and biases. A key finding is that an appropriate linearization of the leaky ReLU activation function yields accurate statistical results even for large perturbations in the input vectors. This can be attributed to the way information propagates through the network. We also propose new analytically tractable Gaussian copula surrogate models to approximate the full joint PDF of the neural network output. To validate our theoretical results, we conduct Monte Carlo simulations and a thorough error analysis on a multi-layer neural network representing a nonlinear integro-differential operator between two polynomial function spaces. Our findings demonstrate excellent agreement between the theoretical predictions and Monte Carlo simulations.
- [108] arXiv:2503.21061 [pdf, html, other]
-
Title: Neural Architecture Search by Learning a Hierarchical Search SpaceSubjects: Computer Vision and Pattern Recognition (cs.CV)
Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search-related problems such as adversarial games. However, the performance of such an approach highly depends on the order of the nodes that are considered at each branching of the tree. If the first branches cannot distinguish between promising and deceiving configurations for the final task, the efficiency of the search is exponentially reduced. In Neural Architecture Search (NAS), as only the final architecture matters, the visiting order of the branching can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance of output vectors of architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, can yield promising solutions more efficiently than other approaches for NAS problems.
- [109] arXiv:2503.21062 [pdf, html, other]
-
Title: DBRAA: Sub-6 GHz and Millimeter Wave Dual-Band Reconfigurable Antenna Array for ISACSubjects: Information Theory (cs.IT)
This paper proposes a dual-band reconfigurable antenna array (DBRAA), enabling wireless capabilities in both sub-6 GHz (sub-6G) and millimeter wave (mmWave) bands using a single array. For the sub-6G band, we propose a reconfigurable antenna selection structure, where each sub-6G antenna is formed by multiplexing several mmWave antennas, with its position dynamically adjusted using PIN diodes. For the mmWave band, we develop a reconfigurable hybrid beamforming structure that connects radio frequency chains to the antennas via phase shifters and a reconfigurable switch network. We then investigate integrated sensing and communications (ISAC) in sub-6G and mmWave bands using the proposed DBRAA and formulate a dual-band ISAC beamforming design problem. This problem aims at maximizing the mmWave communication sum-rate subject to the constraints of sub-6G communication quality of service and sensing beamforming gain requirements. The dual-band ISAC beamforming design is decoupled into sub-6G beamforming design and mmWave beamforming design. For the sub-6G beamforming design, we develop a fast search-based joint beamforming and antenna selection algorithm. For the mmWave beamforming design, we develop an alternating direction method of multipliers-based reconfigurable hybrid beamforming algorithm. Simulation results demonstrate the effectiveness of the proposed methods.
- [110] arXiv:2503.21065 [pdf, html, other]
-
Title: Fuzzy-Logic-based model predictive control: A paradigm integrating optimal and common-sense decision makingComments: 50 Pages, 8 figures, 3 tablesSubjects: Robotics (cs.RO); Optimization and Control (math.OC)
This paper introduces a novel concept, fuzzy-logic-based model predictive control (FLMPC), along with a multi-robot control approach for exploring unknown environments and locating targets. Traditional model predictive control (MPC) methods rely on Bayesian theory to represent environmental knowledge and optimize a stochastic cost function, often leading to high computational costs and lack of effectiveness in locating all the targets. Our approach instead leverages FLMPC and extends it to a bi-level parent-child architecture for enhanced coordination and extended decision making horizon. Extracting high-level information from probability distributions and local observations, FLMPC simplifies the optimization problem and significantly extends its operational horizon compared to other MPC methods. We conducted extensive simulations in unknown 2-dimensional environments with randomly placed obstacles and humans. We compared the performance and computation time of FLMPC against MPC with a stochastic cost function, then evaluated the impact of integrating the high-level parent FLMPC layer. The results indicate that our approaches significantly improve both performance and computation time, enhancing coordination of robots and reducing the impact of uncertainty in large-scale search and rescue environments.
- [111] arXiv:2503.21067 [pdf, html, other]
-
Title: AskSport: Web Application for Sports Question-AnsweringEnzo B Onofre (1), Leonardo M P Moraes (2), Cristina D Aguiar (2) ((1) Faculty of Computing, Federal University of Uberlandia, Brazil, (2) Institute of Mathematics and Computer Sciences, University of Sao Paulo, Brazil)Comments: for accessing the application, see this https URLSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
This paper introduces AskSport, a question-answering web application about sports. It allows users to ask questions using natural language and retrieve the three most relevant answers, including related information and documents. The paper describes the characteristics and functionalities of the application, including use cases demonstrating its ability to return names and numerical values. AskSport and its implementation are available for public access on HuggingFace.
- [112] arXiv:2503.21069 [pdf, html, other]
-
Title: Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt ParsingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model's parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
- [113] arXiv:2503.21070 [pdf, html, other]
-
Title: Cubature Kalman Filter as a Robust State Estimator Against Model Uncertainty and Cyber Attacks in Power SystemsSubjects: Systems and Control (eess.SY)
It is known that conventional estimators such as the extended Kalman filter (EKF) and unscented Kalman filter (UKF) may provide favorable performance; however, they may not guarantee robustness against model uncertainty and cyber attacks. In this paper, we compare the performance of the cubature Kalman filter (CKF) to that of the conventional nonlinear estimator, the EKF, under the effect of model uncertainty and cyber attacks. We show that the CKF has better estimation accuracy than the EKF under some conditions. To verify our claim, we have tested the performance of various nonlinear estimators on the single-machine infinite-bus (SMIB) system under different scenarios. We show that (1) the CKF provides better estimation results than the EKF; (2) the CKF is able to detect different types of cyber attacks reliably, which is superior to the EKF.
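As a rough illustration of what distinguishes the CKF from the EKF, the sketch below generates the standard third-degree spherical-radial cubature points for a state mean `x` and covariance `P`. It is not taken from the paper; the function names `cholesky` and `cubature_points` are hypothetical, and a full filter would also propagate these points through the system and measurement models.

```python
import math

def cholesky(P):
    """Lower-triangular S with S * S^T == P for a symmetric positive-definite P."""
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(S[i][k] * S[j][k] for k in range(j))
            if i == j:
                S[i][j] = math.sqrt(P[i][i] - s)
            else:
                S[i][j] = (P[i][j] - s) / S[j][j]
    return S

def cubature_points(x, P):
    """Return the 2n cubature points and their common weight 1/(2n).

    Points are x +/- sqrt(n) * (j-th column of S), where S is the
    Cholesky factor of P. (Hypothetical helper for illustration.)
    """
    n = len(x)
    S = cholesky(P)
    scale = math.sqrt(n)
    points = []
    for sign in (1.0, -1.0):
        for j in range(n):
            points.append([x[i] + sign * scale * S[i][j] for i in range(n)])
    return points, 1.0 / (2 * n)
```

By construction, the equally weighted points reproduce the mean `x` and covariance `P` exactly, which is the property the filter relies on when it propagates them instead of linearizing the model.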
- [114] arXiv:2503.21071 [pdf, html, other]
-
Title: Purifying Approximate Differential Privacy with Randomized Post-processingSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a framework to convert $(\varepsilon, \delta)$-approximate Differential Privacy (DP) mechanisms into $(\varepsilon, 0)$-pure DP mechanisms, a process we call ``purification''. This algorithmic technique leverages randomized post-processing with calibrated noise to eliminate the $\delta$ parameter while preserving utility. By combining the tighter utility bounds and computational efficiency of approximate DP mechanisms with the stronger guarantees of pure DP, our approach achieves the best of both worlds. We illustrate the applicability of this framework in various settings, including Differentially Private Empirical Risk Minimization (DP-ERM), data-dependent DP mechanisms such as Propose-Test-Release (PTR), and query release tasks. To the best of our knowledge, this is the first work to provide a systematic method for transforming approximate DP into pure DP while maintaining competitive accuracy and computational efficiency.
- [115] arXiv:2503.21072 [pdf, html, other]
-
Title: HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR FusionComments: 2 figures, 5 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at this https URL.
- [116] arXiv:2503.21073 [pdf, html, other]
-
Title: Shared Global and Local Geometry of Language Model EmbeddingsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Researchers have recently suggested that models share common representations. In this work, we find that the token embeddings of language models exhibit common geometric structure. First, we find ``global'' similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension measure demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we empirically demonstrate that steering vectors from one language model can be transferred to another, despite the two models having different dimensions.
- [117] arXiv:2503.21074 [pdf, other]
-
Title: Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing SystemsComments: 106 pages total (main text: 42, 48 w/refs, 100 w/appendices). 21 figures, 4 tables in main; 106 figs, 8 tables total. Code and data at this URL: this https URL. Submitted as undergrad thesis at Duke Kunshan University; accepted for presentation at the 2025 Computer Applications and Quantitative Methods in Archaeology Conference, AthensSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally, and contrary to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, which recorded mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization's decline, providing a plausible transmission pathway. While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
- [118] arXiv:2503.21076 [pdf, html, other]
-
Title: KAC: Kolmogorov-Arnold Classifier for Continual LearningComments: CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at this https URL.
- [119] arXiv:2503.21078 [pdf, other]
-
Title: Sub-ODEs Simplify Taylor Series Algorithms for Ordinary Differential EquationsComments: 25 pagesSubjects: Numerical Analysis (math.NA); Mathematical Software (cs.MS)
A Taylor method for solving an ordinary differential equation initial-value problem $\dot x = f(t,x)$, $x(t_0) = x_0$, computes the Taylor series (TS) of the solution at the current point, truncated to some order, and then advances to the next point by summing the TS with a suitable step size.
A standard ODE method (e.g. Runge-Kutta) treats function $f$ as a black box, but a Taylor solver requires $f$ to be preprocessed into a code-list of elementary operations that it interprets as operations on (truncated) TS.
The payoff for this extra work includes arbitrary order, which typically enables much larger step sizes.
For a standard function, such as $\exp$, this means evaluating $v(t)=\exp(u(t))$, where $u(t),v(t)$ are TS.
The sub-ODE method applies the ODE $d v/d u=v$, obeyed by $v=\exp(u)$, to in-line this operation as $\dot v=v\dot u$.
This gives economy of implementation: each function that satisfies a simple ODE goes into the "Taylor library" with a few lines of code--not needing a separate recurrence relation, which is the typical approach.
Mathematically, however, the use of sub-ODEs generally transforms the original ODE into a differential-algebraic system, making it nontrivial to ensure a sound system of recurrences for Taylor coefficients.
We prove that, regardless of how many sub-ODEs are incorporated into $f$, this approach guarantees a sound system.
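As a minimal sketch of the sub-ODE idea for $\exp$ described above, the function below obtains the Taylor coefficients of $v=\exp(u)$ directly from $\dot v = v\dot u$: it multiplies the series for $v$ and $\dot u$ and integrates term by term. This is an illustration in Python under our own naming (`exp_taylor` is hypothetical), not the authors' Matlab solver.

```python
import math

def exp_taylor(u, order):
    """Taylor coefficients of v(t) = exp(u(t)) up to t**order.

    u holds the Taylor coefficients u[0], u[1], ... of u(t).
    Uses the sub-ODE v' = v * u': coefficient k of v' equals (k+1)*v[k+1].
    """
    u = list(u) + [0.0] * (order + 1 - len(u))        # pad u with zeros
    du = [(j + 1) * u[j + 1] for j in range(order)]   # coefficients of u'
    v = [math.exp(u[0])]                              # v(t0) = exp(u(t0))
    for k in range(order):
        # Coefficient k of the series product v * u' (Cauchy product).
        conv = sum(v[i] * du[k - i] for i in range(k + 1))
        v.append(conv / (k + 1))                      # integrate term by term
    return v
```

For $u(t)=t$, i.e. `exp_taylor([0.0, 1.0], 4)`, the result is the familiar series $1, 1, 1/2, 1/6, 1/24$.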
We introduce our sub-ODE-based Matlab ODE solver and show that its performance compares favorably with solvers from the Matlab ODE suite.
- [120] arXiv:2503.21080 [pdf, html, other]
-
Title: EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit DialoguesSubjects: Computation and Language (cs.CL)
While large language model (LLM)-based chatbots have been applied for effective engagement in credit dialogues, their capacity for dynamic emotional expression remains limited. Current agents primarily rely on passive empathy rather than affective reasoning. For instance, when faced with persistent client negativity, the agent should employ strategic emotional adaptation by expressing measured anger to discourage counterproductive behavior and guide the conversation toward resolution. This context-aware emotional modulation is essential for imitating the nuanced decision-making of human negotiators. This paper introduces an EQ-negotiator that combines emotion sensing from pre-trained language models (PLMs) with emotional reasoning based on Game Theory and Hidden Markov Models. It takes into account both the current and historical emotions of the client to better manage and address negative emotions during interactions. By fine-tuning pre-trained language models (PLMs) on public emotion datasets and validating them on the credit dialogue datasets, our approach enables LLM-based agents to effectively capture shifts in client emotions and dynamically adjust their response tone based on our emotion decision policies in real-world financial negotiations. This EQ-negotiator can also help credit agencies foster positive client relationships, enhancing satisfaction in credit services.
- [121] arXiv:2503.21082 [pdf, html, other]
-
Title: Can Video Diffusion Model Reconstruct 4D Geometry?Subjects: Computer Vision and Pattern Recognition (cs.CV)
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
- [122] arXiv:2503.21084 [pdf, other]
-
Title: Geographical hotspot prediction based on point cloud-voxel-community partition clusteringSubjects: Machine Learning (cs.LG)
Existing solutions to the hotspot prediction problem in the field of geographic information remain at a relatively preliminary stage. This study presents a novel approach for detecting and predicting geographical hotspots, utilizing point cloud-voxel-community partition clustering. By analyzing high-dimensional data, we represent spatial information through point clouds, which are then subdivided into multiple voxels to enhance analytical efficiency. Our method identifies spatial voxels with similar characteristics through community partitioning, thereby revealing underlying patterns in hotspot distributions. Experimental results indicate that when applied to a dataset of archaeological sites in Turkey, our approach achieves a 19.31% increase in processing speed, with an accuracy loss of merely 6%, outperforming traditional clustering methods. This method not only provides a fresh perspective for hotspot prediction but also serves as an effective tool for high-dimensional data analysis.
- [123] arXiv:2503.21086 [pdf, html, other]
-
Title: Less Noise, More Signal: DRR for Better Optimizations of SE TasksSubjects: Software Engineering (cs.SE)
SE analytics problems do not always need complex AI. Better and faster solutions can sometimes be obtained by matching the complexity of the problem to the complexity of the solution. This paper introduces the Dimensionality Reduction Ratio (DRR), a new metric for predicting when lightweight algorithms suffice. Analyzing SE optimization problems, from software configuration to process decisions and open-source project health, we show that DRR pinpoints "simple" tasks where costly methods like DEHB (a state-of-the-art evolutionary optimizer) are overkill. For high-DRR problems, simpler methods can be just as effective and run two orders of magnitude faster.
- [124] arXiv:2503.21087 [pdf, html, other]
-
Title: PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees (Technical Report)Comments: 23 pages, 19 figuresJournal-ref: SIGMOD 2025Subjects: Databases (cs.DB)
After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid modifications to database management systems. To address these challenges, we introduce two novel techniques, TAQA and BSAP. TAQA is a two-stage online AQP algorithm that achieves all three properties for arbitrary queries. However, it can be slower than exact queries if we use standard row-level sampling. BSAP resolves this by enabling block-level sampling with statistical guarantees in TAQA. We implement TAQA and BSAP in a prototype middleware system, PilotDB, that is compatible with all DBMSs supporting efficient block-level sampling. We evaluate PilotDB on PostgreSQL, SQL Server, and DuckDB over real-world benchmarks, demonstrating up to 126X speedups when running with a 5% guaranteed error.
- [125] arXiv:2503.21088 [pdf, html, other]
-
Title: ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model MergingHaoming Xu, Shuxun Wang, Yanqiu Zhao, Yi Zhong, Ziyan Jiang, Ningyuan Zhao, Shumin Deng, Huajun Chen, Ningyu ZhangComments: Work in progressSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at this https URL.
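For readers unfamiliar with TIES-Merging, the sketch below shows its three steps (trim, elect sign, disjoint merge) on flat lists of weights. It is a simplified illustration under our own assumptions, not the ZJUKLAB system: the function name `ties_merge` and the `density` and `lam` parameters are hypothetical, and real models would be merged tensor by tensor.

```python
def ties_merge(base, finetuned, density=0.2, lam=1.0):
    """Merge fine-tuned weight vectors into one, TIES-style.

    base      : list of floats, the shared initial weights
    finetuned : list of weight vectors (same length as base)
    density   : fraction of largest-magnitude task-vector entries to keep
    lam       : scaling applied to the merged task vector
    """
    n = len(base)
    # Task vectors: what each fine-tuned model changed relative to base.
    task_vectors = [[ft[i] - base[i] for i in range(n)] for ft in finetuned]

    # 1) Trim: keep only the top-`density` fraction of entries by magnitude.
    k = max(1, int(round(density * n)))
    trimmed = []
    for tv in task_vectors:
        keep = set(sorted(range(n), key=lambda i: abs(tv[i]), reverse=True)[:k])
        trimmed.append([tv[i] if i in keep else 0.0 for i in range(n)])

    merged = []
    for i in range(n):
        vals = [tv[i] for tv in trimmed]
        # 2) Elect sign: the sign carrying the larger total mass wins.
        sign = 1.0 if sum(vals) >= 0 else -1.0
        # 3) Disjoint merge: average only the entries agreeing with that sign.
        agree = [v for v in vals if v * sign > 0]
        delta = sum(agree) / len(agree) if agree else 0.0
        merged.append(base[i] + lam * delta)
    return merged
```

With `base = [0, 0, 0, 0]`, models `[1.0, 0.1, -2.0, 0.0]` and `[3.0, -0.05, 1.0, 0.0]`, and `density=0.5`, the small entries are trimmed, the conflicting third coordinate resolves to the negative sign, and the merge returns `[2.0, 0.0, -2.0, 0.0]`.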
- [126] arXiv:2503.21092 [pdf, html, other]
-
Title: FAIR-QR: Enhancing Fairness-aware Information Retrieval through Query RefinementComments: This is a preprint of our paper accepted at ECIR 2025Journal-ref: ECIR 2025, Part IV, LNCS 15575Subjects: Information Retrieval (cs.IR)
Information retrieval systems such as open web search and recommendation systems are ubiquitous and significantly impact how people receive and consume online information. Previous research has shown the importance of fairness in information retrieval systems to combat the issue of echo chambers and mitigate the rich-get-richer effect. Therefore, various fairness-aware information retrieval methods have been proposed. Score-based fairness-aware information retrieval algorithms, focusing on statistical parity, are interpretable but could be mathematically infeasible and lack generalizability. In contrast, learning-to-rank-based fairness-aware information retrieval algorithms using fairness-aware loss functions demonstrate strong performance but lack interpretability. In this study, we propose a novel and interpretable framework that recursively refines query keywords to retrieve documents from underrepresented groups and achieve group fairness. Documents retrieved using refined queries are then re-ranked to ensure relevance. Our method not only shows promising retrieval results regarding relevance and fairness but also preserves interpretability by showing the refined keywords used at each iteration.
- [127] arXiv:2503.21094 [pdf, html, other]
-
Title: GazeSwipe: Enhancing Mobile Touchscreen Reachability through Seamless Gaze and Finger-Swipe IntegrationSubjects: Human-Computer Interaction (cs.HC)
Smartphones with large screens provide users with increased display and interaction space but pose challenges in reaching certain areas with the thumb when using the device with one hand. To address this, we introduce GazeSwipe, a multimodal interaction technique that combines eye gaze with finger-swipe gestures, enabling intuitive and low-friction reach on mobile touchscreens. Specifically, we design a gaze estimation method that eliminates the need for explicit gaze calibration. Our approach also avoids the use of additional eye-tracking hardware by leveraging the smartphone's built-in front-facing camera. Considering the potential decrease in gaze accuracy without dedicated eye trackers, we use finger-swipe gestures to compensate for any inaccuracies in gaze estimation. Additionally, we introduce a user-unaware auto-calibration method that improves gaze accuracy during interaction. Through extensive experiments on smartphones and tablets, we compare our technique with various methods for touchscreen reachability and evaluate the performance of our auto-calibration strategy. The results demonstrate that our method achieves high success rates and is preferred by users. The findings also validate the effectiveness of the auto-calibration strategy.
- [128] arXiv:2503.21095 [pdf, html, other]
-
Title: Confidence Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART): A Data-driven Active Learning Framework for Accelerating Material Discovery under Resource ConstraintsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Accelerating the discovery and manufacturing of advanced materials with specific properties is a critical yet formidable challenge due to the vast search space, the high cost of experiments, and the time-intensive nature of material characterization. In recent years, active learning, where a surrogate machine learning (ML) model mimics the scientific discovery process of a human scientist, has emerged as a promising approach to address these challenges by guiding experimentation toward high-value outcomes with a limited budget. Among the diverse active learning philosophies, the concept of surprise (capturing the divergence between expected and observed outcomes) has demonstrated significant potential to drive experimental trials and refine predictive models. Scientific discovery often stems from surprise, making it a natural driver to guide the search process. Despite its promise, prior studies leveraging surprise metrics such as Shannon and Bayesian surprise lack mechanisms to account for prior confidence, leading to excessive exploration of uncertain regions that may not yield useful information. To address this, we propose the Confidence-Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART), a novel Bayesian active learning framework tailored for optimizing data-driven experimentation. At a high level, CA-SMART incorporates Confidence-Adjusted Surprise (CAS) to dynamically balance exploration and exploitation by amplifying surprises in regions where the model is more certain while discounting them in highly uncertain areas. We evaluated CA-SMART on two benchmark functions (Six-Hump Camelback and Griewank) and in predicting the fatigue strength of steel. The results demonstrate superior accuracy and efficiency compared to traditional surprise metrics, standard Bayesian Optimization (BO) acquisition functions, and conventional ML methods.
- [129] arXiv:2503.21096 [pdf, html, other]
-
Title: Cloud Resource Allocation with Convex OptimizationSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
We present a convex optimization framework for overcoming the limitations of Kubernetes Cluster Autoscaler by intelligently allocating diverse cloud resources while minimizing costs and fragmentation. Current Kubernetes scaling mechanisms are restricted to homogeneous scaling of existing node types, limiting cost-performance optimization possibilities. Our matrix-based model captures resource demands, costs, and capacity constraints in a unified mathematical framework. A key contribution is our logarithmic approximation to the indicator function, which enables dynamic node type selection while maintaining problem convexity. Our approach balances cost optimization with operational complexity through interior-point methods. Experiments with real-world Kubernetes workloads demonstrate reduced costs and improved resource utilization compared to conventional Cluster Autoscaler strategies that can only scale up or down existing node pools.
- [130] arXiv:2503.21098 [pdf, other]
-
Title: Alleviating LLM-based Generative Retrieval Hallucination in Alipay SearchYedan Shen, Kaixin Wu, Yuechen Ding, Jingyuan Wen, Hong Liu, Mingjie Zhong, Zhouhan Lin, Jia Xu, Linjian MoComments: 4 pagesSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Generative retrieval (GR) has revolutionized document retrieval with the advent of large language models (LLMs), and LLM-based GR is gradually being adopted by the industry. Despite its remarkable advantages and potential, LLM-based GR suffers from hallucination and generates documents that are irrelevant to the query in some instances, severely challenging its credibility in practical applications. We therefore propose an optimized GR framework designed to alleviate retrieval hallucination, which integrates knowledge distillation reasoning in model training and incorporates a decision agent to further improve retrieval precision. Specifically, we employ LLMs to assess and reason about GR-retrieved query-document (q-d) pairs, and then distill the reasoning data as transferred knowledge to the GR model. Moreover, we utilize a decision agent as post-processing to extend the GR-retrieved documents through a retrieval model and select the most relevant ones from multiple perspectives as the final generative retrieval result. Extensive offline experiments on real-world datasets and online A/B tests on Fund Search and Insurance Search in Alipay demonstrate our framework's superiority and effectiveness in improving search quality and conversion gains.
- [131] arXiv:2503.21099 [pdf, html, other]
-
Title: Learning Class Prototypes for Unified Sparse Supervised 3D Object DetectionComments: Accepted by CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one-object-per-scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% of the fully supervised detector's performance on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at this https URL.
- [132] arXiv:2503.21103 [pdf, html, other]
-
Title: Low Stein Discrepancy via Message-Passing Monte CarloComments: 8 pages, 2 figures, Accepted at the ICLR 2025 Workshop on Frontiers in Probabilistic InferenceSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Message-Passing Monte Carlo (MPMC) was recently introduced as a novel low-discrepancy sampling approach leveraging tools from geometric deep learning. While originally designed for generating uniform point sets, we extend this framework to sample from general multivariate probability distributions with known probability density function. Our proposed method, Stein-Message-Passing Monte Carlo (Stein-MPMC), minimizes a kernelized Stein discrepancy, ensuring improved sample quality. Finally, we show that Stein-MPMC outperforms competing methods, such as Stein Variational Gradient Descent and (greedy) Stein Points, by achieving a lower Stein discrepancy.
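To make the objective concrete, the sketch below evaluates a squared kernelized Stein discrepancy (V-statistic form) for a one-dimensional point set, using an RBF kernel and a user-supplied score function $\nabla \log p$. It only computes the discrepancy; it is not the Stein-MPMC model, which trains a message-passing network to minimize such a quantity. The names `ksd_squared`, `score`, and the bandwidth `h` are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def ksd_squared(xs, score, h=1.0):
    """Squared kernelized Stein discrepancy (V-statistic) in one dimension.

    xs    : list of sample points
    score : function returning d/dx log p(x) for the target density p
    h     : bandwidth of the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))
    """
    n = len(xs)
    total = 0.0
    for x in xs:
        for y in xs:
            d = x - y
            k = math.exp(-d * d / (2 * h * h))
            dk_dx = -d / (h * h) * k                  # derivative of k in x
            dk_dy = d / (h * h) * k                   # derivative of k in y
            d2k = (1 / (h * h) - d * d / h ** 4) * k  # mixed second derivative
            # Stein kernel: s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d2k/dxdy
            total += (score(x) * score(y) * k + score(x) * dk_dy
                      + score(y) * dk_dx + d2k)
    return total / (n * n)

# Target: standard normal, whose score function is -x.
score = lambda x: -x
# A well-spread point set (normal quantiles) versus the same set shifted by 2.
good = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
shifted = [x + 2.0 for x in good]
```

Comparing `ksd_squared(good, score)` with `ksd_squared(shifted, score)` shows the discrepancy penalizing the shifted set, which is the signal a sampler minimizing this quantity would follow.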
- [133] arXiv:2503.21104 [pdf, html, other]
-
Title: StyledStreets: Multi-style Street Simulator with Spatial and Temporal ConsistencyComments: 14 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present StyledStreets, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras.
Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Code will be publicly available upon publication.
- [134] arXiv:2503.21105 [pdf, html, other]
-
Title: AugWard: Augmentation-Aware Representation Learning for Accurate Graph ClassificationComments: Accepted to PAKDD 2025 (Oral Presentation)Subjects: Machine Learning (cs.LG)
How can we accurately classify graphs? Graph classification is a pivotal task in data mining with applications in social network analysis, web analysis, drug discovery, molecular property prediction, etc. Graph neural networks have achieved state-of-the-art performance in graph classification, but they consistently struggle with overfitting. To mitigate overfitting, researchers have introduced various representation learning methods utilizing graph augmentation. However, existing methods rely on simplistic use of graph augmentation, which loses augmentation-induced differences and limits the expressiveness of representations.
In this paper, we propose AugWard (Augmentation-Aware Training with Graph Distance and Consistency Regularization), a novel graph representation learning framework that carefully considers the diversity introduced by graph augmentation. AugWard applies augmentation-aware training to predict the graph distance between the augmented graph and its original one, aligning the representation difference directly with graph distance at both feature and structure levels. Furthermore, AugWard employs consistency regularization to encourage the classifier to handle richer representations. Experimental results show that AugWard achieves state-of-the-art performance in supervised graph classification, semi-supervised graph classification, and transfer learning.
- [135] arXiv:2503.21106 [pdf, html, other]
-
Title: Function Alignment: A New Theory for Mind and Intelligence, Part I: Foundations
Comments: 12 pages, 2 figures. Part I of a multi-part position paper on a new theory of mind
Subjects: Computation and Language (cs.CL)
This paper introduces function alignment, a novel theory of mind and intelligence that is both intuitively compelling and structurally grounded. It explicitly models how meaning, interpretation, and analogy emerge from interactions among layered representations, forming a coherent framework capable not only of modeling minds but also of serving as a blueprint for building them. One of the key theoretical insights derived from function alignment is bounded interpretability, which provides a unified explanation for previously fragmented ideas in cognitive science, such as bounded rationality, symbol grounding, and analogy-making. Beyond modeling, the function alignment framework bridges disciplines often kept apart, linking computational architecture, psychological theory, and even contemplative traditions such as Zen. Rather than building on any philosophical systems, it offers a structural foundation upon which multiple ways of understanding the mind may be reconstructed.
- [136] arXiv:2503.21109 [pdf, html, other]
-
Title: Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution
Authors: Yunquan Gao, Zhiguo Zhang, Praveen Kumar Donta, Chinmaya Kumar Dehury, Xiujun Wang, Dusit Niyato, Qiyang Zhang
Comments: 14 pages, 12 figures, 5 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving demand for mobile device support. However, existing mobile inference frameworks often rely on a single processor per model, limiting hardware utilization and causing suboptimal performance and energy efficiency. Expanding DNN accessibility on mobile platforms requires adaptive, resource-efficient solutions to meet rising computational needs without compromising functionality. Parallel inference of multiple DNNs on heterogeneous processors remains challenging. Some works partition DNN operations into subgraphs for parallel execution across processors, but these often create excessive subgraphs based only on hardware compatibility, increasing scheduling complexity and memory overhead.
To address this, we propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy for optimizing multi-DNN inference on mobile heterogeneous processors. ADMS constructs an optimal subgraph partitioning strategy offline, balancing hardware operation support and scheduling granularity, and uses a processor-state-aware algorithm to dynamically adjust workloads based on real-time conditions. This ensures efficient workload distribution and maximizes processor utilization. Experiments show ADMS reduces multi-DNN inference latency by 4.04 times compared to vanilla frameworks.
- [137] arXiv:2503.21114 [pdf, html, other]
-
Title: Measuring and Analyzing Subjective Uncertainty in Scientific Communications
Comments: Coming with Appendix and supplementary material
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Uncertainty of scientific findings is typically reported through statistical metrics such as $p$-values, confidence intervals, etc. The magnitude of this objective uncertainty is reflected in the language used by the authors to report their findings, primarily through expressions carrying uncertainty-inducing terms or phrases. This language uncertainty is a subjective concept and is highly dependent on the writing style of the authors. There is evidence that such subjective uncertainty influences the impact of science on public audiences. In this work, we turned our focus to scientists themselves, and measured and analyzed the subjective uncertainty and its impact within scientific communities across different disciplines. We showed that the level of this type of uncertainty varies significantly across different fields, years of publication, and geographical locations. We also studied the correlation between subjective uncertainty and several bibliographical metrics, such as number/gender of authors, centrality of the field's community, citation count, etc. The underlying patterns identified in this work are useful in the identification and documentation of linguistic norms in scientific communication in different communities/societies.
- [138] arXiv:2503.21115 [pdf, html, other]
-
Title: Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment
Subjects: Computation and Language (cs.CL)
The growing emphasis on energy efficiency and environmental sustainability in global supply chains introduces new challenges in the deployment of hyperconnected logistic hub networks. In current volatile, uncertain, complex, and ambiguous (VUCA) environments, dynamic risk assessment becomes essential to ensure successful hub deployment. However, traditional methods often struggle to effectively capture and analyze unstructured information. In this paper, we design a Large Language Model (LLM)-driven risk assessment pipeline integrated with multiple analytical tools to evaluate logistic hub deployment. This framework enables LLMs to systematically identify potential risks by analyzing unstructured data, such as geopolitical instability, financial trends, historical storm events, traffic conditions, and emerging risks from news sources. These data are processed through a suite of analytical tools, which are automatically called by LLMs to support a structured and data-driven decision-making process for logistic hub selection. In addition, we design prompts that instruct LLMs to leverage these tools for assessing the feasibility of hub selection by evaluating various risk types and levels. Through risk-based similarity analysis, LLMs cluster logistic hubs with comparable risk profiles, enabling a structured approach to risk assessment. In conclusion, the framework incorporates scalability with long-term memory and enhances decision-making through explanation and interpretation, enabling comprehensive risk assessments for logistic hub deployment in hyperconnected supply chain networks.
- [139] arXiv:2503.21122 [pdf, html, other]
-
Title: One Snapshot is All You Need: A Generalized Method for mmWave Signal Generation
Comments: IEEE INFOCOM 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Wireless sensing systems, particularly those using mmWave technology, offer distinct advantages over traditional vision-based approaches, such as enhanced privacy and effectiveness in poor lighting conditions. These systems, leveraging FMCW signals, have shown success in human-centric applications like localization, gesture recognition, and so on. However, comprehensive mmWave datasets for diverse applications are scarce, often constrained by pre-processed signatures (e.g., point clouds or RA heatmaps) and inconsistent annotation formats. To overcome these limitations, we propose mmGen, a novel and generalized framework tailored for full-scene mmWave signal generation. By constructing physical signal transmission models, mmGen synthesizes human-reflected and environment-reflected mmWave signals from the constructed 3D meshes. Additionally, we incorporate methods to account for material properties, antenna gains, and multipath reflections, enhancing the realism of the synthesized signals. We conduct extensive experiments using a prototype system with commercial mmWave devices and Kinect sensors. The results show that the average similarity of Range-Angle and micro-Doppler signatures between the synthesized and real-captured signals across three different environments exceeds 0.91 and 0.89, respectively, demonstrating the effectiveness and practical applicability of mmGen.
- [140] arXiv:2503.21124 [pdf, html, other]
-
Title: AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction
Comments: Accepted by ICME 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The integration of pathologic images and genomic data for survival analysis has gained increasing attention with advances in multimodal learning. However, current methods often ignore biological characteristics, such as heterogeneity and sparsity, both within and across modalities, ultimately limiting their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption, even under challenging scenarios with missing modalities. Initially, AdaMHF employs an experts expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. Extracted tokens undergo refinement via selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, allowing multi-grained interactions across modalities to be captured. Furthermore, we introduce a survival prediction benchmark designed to resolve scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets demonstrate that AdaMHF surpasses current state-of-the-art (SOTA) methods, showcasing exceptional performance in both complete and incomplete modality settings.
- [141] arXiv:2503.21125 [pdf, html, other]
-
Title: Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In multi-class unsupervised anomaly detection (MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue, in which decoders fail to capture normal patterns and naively reconstruct both normal and abnormal samples. To address this, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at this https URL.
- [142] arXiv:2503.21126 [pdf, html, other]
-
Title: Bandwidth-Efficient Two-Server ORAMs with O(1) Client Storage
Comments: 19 pages, 10 figures
Subjects: Cryptography and Security (cs.CR)
Oblivious RAM (ORAM) allows a client to securely retrieve elements from outsourced servers without leakage about the accessed elements or their virtual addresses. Two-server ORAM, designed for secure two-party RAM computation, stores data across two non-colluding servers. However, many two-server ORAM schemes suffer from excessive local storage or high bandwidth costs. To serve lightweight clients, it is crucial for ORAM to achieve concretely efficient bandwidth while maintaining O(1) local storage. Hence, this paper presents two new client-friendly two-server ORAM schemes that achieve practical logarithmic bandwidth under O(1) local storage, while incurring linear symmetric key computations. The core design features a hierarchical structure and a pairwise-area setting for the elements and their tags. Accordingly, we specify efficient read-only and write-only private information retrieval (PIR) algorithms in our schemes to ensure obliviousness in accessing two areas respectively, so as to avoid the necessity of costly shuffle techniques in previous works. We empirically evaluate our schemes against LO13 (TCC'13), AFN17 (PKC'17), and KM19 (PKC'19) in terms of both bandwidth and time cost. The results demonstrate that our schemes reduce bandwidth by approximately 2-4x compared to LO13, and by 16-64x compared to AFN17 and KM19. For a database of size 2^14 blocks, our schemes are over 64x faster than KM19, while achieving similar performance to LO13 and AFN17 in the WAN setting, with a latency of around 1 second.
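To make the role of private information retrieval across two non-colluding servers concrete, here is a minimal sketch of the classic XOR-based two-server PIR read. It is not the scheme from the paper: the function names are hypothetical, both servers are assumed to hold identical copies of the database, and the query here costs one bit per block, whereas the paper reports logarithmic bandwidth.

```python
import secrets

def make_queries(n, index):
    """Build two bit vectors that differ only at `index`.

    Each vector on its own is uniformly random, so a single server
    learns nothing about which block is requested.
    """
    q0 = [secrets.randbelow(2) for _ in range(n)]
    q1 = list(q0)
    q1[index] ^= 1
    return q0, q1

def answer(database, query):
    """Server side: XOR together every block whose query bit is 1."""
    acc = bytes(len(database[0]))
    for block, bit in zip(database, query):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, block))
    return acc

def read_block(db_copy0, db_copy1, index):
    """Client side: query both servers and XOR the two answers.

    Blocks selected by both queries cancel out, leaving only the
    block at `index`.
    """
    q0, q1 = make_queries(len(db_copy0), index)
    a0 = answer(db_copy0, q0)
    a1 = answer(db_copy1, q1)
    return bytes(x ^ y for x, y in zip(a0, a1))
```

The client keeps only the index and the two short answers, which illustrates why PIR-style access is attractive for lightweight clients; the schemes in the paper add a hierarchical structure and write support on top of this kind of primitive.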
- [143] arXiv:2503.21127 [pdf, html, other]
-
Title: Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework, Multi-Round Collaboration Detection (MRCD), is proposed to address these limitations. The MRCD framework combines the merits of LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. MRCD achieves SOTA results on two real-world datasets, Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.
- [144] arXiv:2503.21130 [pdf, html, other]
-
Title: VideoMix: Aggregating How-To Videos for Task-Oriented Learning
Comments: In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25) 2025
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.
- [145] arXiv:2503.21135 [pdf, html, other]
-
Title: MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
Comments: 6 pages, 6 figures and 3 tables
Subjects: Machine Learning (cs.LG)
With the advances in artificial intelligence, Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses the models but also significantly accelerates their performance. Existing quantization methods have gradually shifted the focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and relies on the simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages, quantitatively revealing the dynamics during sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these, MoQa identifies particular experts' and parameters' significance with optimal data-model distribution awareness and proposes a series of fine-grained mix-quantization strategies adaptive to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization and analyzes the impact of each stage analysis, showing novel insights for MoE quantization. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease in language modeling tasks and a 1.58%~8.91% accuracy improvement in zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.
- [146] arXiv:2503.21138 [pdf, html, other]
-
Title: A computational theory of evaluation for parameterisable subject
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Evaluation is critical to advancing decision making across domains, yet existing methodologies often struggle to balance theoretical rigor and practical scalability. To reduce the cost of experimental evaluation, we introduce a computational theory of evaluation for parameterisable subjects. We prove upper bounds on the generalized evaluation error and the generalized causal effect error of an evaluation metric on a subject. We also prove the efficiency and consistency of estimating the causal effect of a subject on a metric by prediction. To optimize evaluation models, we propose a meta-learner to handle heterogeneous evaluation subject spaces. Compared with other computational approaches, our (conditional) evaluation model reduces evaluation errors by 24.1%-99.0% across 12 scenes, including individual medicine, scientific simulation, business activities, and quantum trade. The evaluation time is reduced by 3-7 orders of magnitude compared with experiments or simulations.
- [147] arXiv:2503.21140 [pdf, html, other]
-
Title: Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
Journal-ref: Published in CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design an FGSA mining module based on a deformable attention mechanism. On the one hand, we mine fine-grained features by applying deformable attention heads over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform the SOTA method dramatically (+3.2% PCK@0.05). Code is available at this https URL.
- [148] arXiv:2503.21141 [pdf, html, other]
-
Title: Safe Human Robot Navigation in Warehouse Scenario
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
The integration of autonomous mobile robots (AMRs) in industrial environments, particularly warehouses, has revolutionized logistics and operational efficiency. However, ensuring the safety of human workers in dynamic, shared spaces remains a critical challenge. This work proposes a novel methodology that leverages control barrier functions (CBFs) to enhance safety in warehouse navigation. By integrating learning-based CBFs with the Open Robotics Middleware Framework (OpenRMF), the system achieves adaptive and safety-enhanced controls in multi-robot, multi-agent scenarios. Experiments conducted using various robot platforms demonstrate the efficacy of the proposed approach in avoiding static and dynamic obstacles, including human pedestrians. Our experiments evaluate different scenarios in which the number of robots, robot platforms, speed, and number of obstacles are varied, from which we achieve promising performance.
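As an illustration of how a control barrier function acts as a safety filter, the sketch below minimally adjusts a nominal velocity command so that a robot stays outside a circular keep-out zone. It rests on stated assumptions that do not come from the paper: single-integrator dynamics x' = u, a hand-written barrier h(x) = |x - o|^2 - r^2 rather than a learned one, a single obstacle, and a hypothetical function name. No OpenRMF integration is shown.

```python
def cbf_safe_velocity(pos, u_nom, obstacle, radius, alpha=1.0):
    """Return the velocity closest to u_nom that satisfies the CBF condition.

    With h(x) = |x - o|^2 - r^2 and x' = u, the time derivative is
    h' = 2 (x - o) . u, and the CBF condition is h' >= -alpha * h.
    For one linear constraint a . u >= b, the minimum-norm correction
    has a closed form (projection onto a half-space).
    """
    dx = [p - o for p, o in zip(pos, obstacle)]
    h = sum(d * d for d in dx) - radius ** 2
    a = [2.0 * d for d in dx]          # gradient of h with respect to x
    b = -alpha * h                     # required lower bound on a . u
    a_dot_u = sum(ai * ui for ai, ui in zip(a, u_nom))
    if a_dot_u >= b:
        return list(u_nom)             # nominal command is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("robot is at the obstacle centre; constraint infeasible")
    scale = (b - a_dot_u) / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]
```

For example, a robot at (2, 0) commanded straight toward an obstacle of radius 1 at the origin with velocity (-5, 0) has its approach speed reduced, while a command pointing away from the obstacle passes through unchanged.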
- [149] arXiv:2503.21144 [pdf, html, other]
-
Title: ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs, and that can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at a maximum resolution of 512 × 768 and up to 30 fps on a 4090 GPU, supporting interactive video chat in real time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.
- [150] arXiv:2503.21145 [pdf, html, other]
-
Title: How to Secure Existing C and C++ Software without Memory Safety
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
The most important security benefit of software memory safety is easy to state: for C and C++ software, attackers can exploit most bugs and vulnerabilities to gain full, unfettered control of software behavior, whereas this is not true for most bugs in memory-safe software.
Fortunately, this security benefit -- most bugs don't give attackers full control -- can be had for unmodified C/C++ software, without per-application effort.
This doesn't require trying to establish memory safety; instead, it is sufficient to eliminate most of the combinatorial ways in which software with corrupted memory can execute. To eliminate these interleavings, there already exist practical compiler and runtime mechanisms that incur little overhead and need no special hardware or platform support.
Each of the mechanisms described here is already in production use, at scale, on one or more platforms. By supporting their combined use in development toolchains, the security of all C and C++ software against remote code execution attacks can be rapidly, and dramatically, improved.
- [151] arXiv:2503.21150 [pdf, html, other]
-
Title: The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation
Comments: Accepted by CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS significantly by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively.
- [152] arXiv:2503.21154 [pdf, html, other]
-
Title: Federated Learning with Differential Privacy: An Utility-Enhanced Approach
Authors: Kanishka Ranaweera, Dinh C. Nguyen, Pubudu N. Pathirana, David Smith, Ming Ding, Thierry Rakotoarivelo, Aruna Seneviratne
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated learning has emerged as an attractive approach to protect data privacy by eliminating the need for sharing clients' data while reducing communication costs compared with centralized machine learning algorithms. However, recent studies have shown that federated learning alone does not guarantee privacy, as private data may still be inferred from the parameters uploaded to the central server. To avoid data leakage, adopting differential privacy (DP) in either the local optimization process or the local update aggregation process has emerged as a feasible way to achieve sample-level or user-level privacy guarantees, respectively, in federated learning models. However, compared to their non-private equivalents, these approaches suffer from poor utility. To improve the privacy-utility trade-off, we present a modification to these vanilla differentially private algorithms based on a Haar wavelet transformation step and a novel noise injection scheme that significantly lowers the asymptotic bound of the noise variance. We also present a holistic convergence analysis of our proposed algorithm, showing that our method yields better convergence performance than the vanilla DP algorithms. Numerical experiments on real-world datasets demonstrate that our method outperforms existing approaches in model utility while maintaining the same privacy guarantees.
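To show what a Haar wavelet transformation step followed by noise injection might look like mechanically, here is a minimal sketch of a one-level orthonormal Haar transform applied to a flattened model update. The noise step is a placeholder that adds plain i.i.d. Gaussian noise to the coefficients; it is not the paper's noise injection scheme, `sigma` is not calibrated to any privacy budget, and all function names are hypothetical.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length list."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x

def noisy_update(update, sigma, rng=random):
    """Transform, perturb the coefficients, and transform back.

    Placeholder noise: i.i.d. Gaussian with standard deviation sigma
    on every coefficient (an assumption for illustration only).
    """
    approx, detail = haar_forward(update)
    approx = [a + rng.gauss(0.0, sigma) for a in approx]
    detail = [d + rng.gauss(0.0, sigma) for d in detail]
    return haar_inverse(approx, detail)
```

Because this transform is orthonormal, it preserves the Euclidean norm of the update, so any gain over adding noise directly has to come from how noise is allocated across coefficients, which is the part this sketch deliberately leaves out.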
- [153] arXiv:2503.21155 [pdf, html, other]
-
Title: Embedding Domain-Specific Knowledge from LLMs into the Feature Engineering Pipeline
Comments: 9 pages, 4 figures, 5 tables
Subjects: Machine Learning (cs.LG)
Feature engineering is mandatory in the machine learning pipeline to obtain robust models. While evolutionary computation is well-known for its great results both in feature selection and feature construction, its methods are computationally expensive due to the large number of evaluations required to induce the final model. Part of the reason why these algorithms require a large number of evaluations is their lack of domain-specific knowledge, resulting in a lot of random guessing during evolution. In this work, we propose using Large Language Models (LLMs) as an initial feature construction step to add knowledge to the dataset. By doing so, our results show that the evolution can converge faster, saving us computational resources. The proposed approach only provides the names of the features in the dataset and the target objective to the LLM, making it usable even when working with datasets containing private data. While consistent improvements to test performance were only observed for one-third of the datasets (CSS, PM, and IM10), possibly due to problems being easily explored by LLMs, this approach only decreased the model performance in 1/77 test cases. Additionally, this work introduces the M6GP feature engineering algorithm to symbolic regression, showing it can improve the results of the random forest regressor and produce competitive results with its predecessor, M3GP.
- [154] arXiv:2503.21156 [pdf, html, other]
-
Title: A Theoretical Analysis of Analogy-Based Evolutionary Transfer Optimization
Subjects: Neural and Evolutionary Computing (cs.NE)
Evolutionary transfer optimization (ETO) has been gaining popularity in research over the years due to its outstanding knowledge transfer ability to address various challenges in optimization. However, a pressing issue in this field is that the invention of new ETO algorithms has far outpaced the development of fundamental theories needed to clearly understand the key factors contributing to the success of these algorithms for effective generalization. In response to this challenge, this study aims to establish theoretical foundations for analogy-based ETO, specifically to support various algorithms that frequently reference a key concept known as similarity. First, we introduce analogical reasoning and link its subprocesses to three key issues in ETO. Then, we develop theories for analogy-based knowledge transfer, rooted in the principles that underlie the subprocesses. Afterwards, we present two theorems related to the performance gain of analogy-based knowledge transfer, namely unconditionally nonnegative performance gain and conditionally positive performance gain, to theoretically demonstrate the effectiveness of various analogy-based ETO methods. Last but not least, we offer a novel insight into analogy-based ETO that interprets its conditional superiority over traditional evolutionary optimization through the lens of the no free lunch theorem for optimization.
- [155] arXiv:2503.21157 [pdf, other]
-
Title: Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?Comments: 11 pages, 8 figuresSubjects: Machine Learning (cs.LG)
This article surveys evaluation models that automatically detect hallucinations in Retrieval-Augmented Generation (RAG) and presents a comprehensive benchmark of their performance across six RAG applications. The methods in our study are LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). These approaches are all reference-free, requiring no ground-truth answers/labels to catch incorrect LLM responses. Our study reveals that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision/recall.
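As a rough illustration of how precision and recall could be computed for such a detector, the sketch below assumes each evaluation model emits a trustworthiness score per response and that low scores are flagged as hallucinations. The score values, the threshold, and the function name `precision_recall` are hypothetical and are not taken from the benchmark.

```python
def precision_recall(scores, is_incorrect, threshold):
    """Flag a response as a hallucination when its score falls below the threshold.

    scores: per-response trust scores (assumed: higher means more trustworthy).
    is_incorrect: ground-truth booleans used only for evaluation, not detection.
    """
    flagged = [s < threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_incorrect))
    fp = sum(f and not y for f, y in zip(flagged, is_incorrect))
    fn = sum((not f) and y for f, y in zip(flagged, is_incorrect))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores and labels for five responses.
scores = [0.9, 0.2, 0.4, 0.8, 0.1]
is_incorrect = [False, True, False, True, True]
p, r = precision_recall(scores, is_incorrect, threshold=0.5)
print(p, r)
```

The labels appear only on the evaluation side, which matches the reference-free setting: the detector itself never sees a ground-truth answer.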
- [156] arXiv:2503.21158 [pdf, html, other]
-
Title: Integrating Travel Behavior Forecasting and Generative Modeling for Predicting Future Urban Mobility and Spatial TransformationsEugene Denteh, Andrews Danyo, Joshua Kofi Asamoah, Blessing Agyei Kyem, Twitchell Addai, Armstrong AboahSubjects: Computer Vision and Pattern Recognition (cs.CV)
Transportation planning plays a critical role in shaping urban development, economic mobility, and infrastructure sustainability. However, traditional planning methods often struggle to accurately predict long-term urban growth and transportation demands. This can result in existing infrastructure being demolished to make room for current transportation planning demands. This study integrates a Temporal Fusion Transformer to predict travel patterns from demographic data with a Generative Adversarial Network to predict future urban settings through satellite imagery. The framework achieved an R-squared score of 0.76 in travel behavior prediction and generated high-fidelity satellite images with a Structural Similarity Index of 0.81. The results demonstrate that integrating predictive analytics and spatial visualization can significantly improve the decision-making process, fostering more sustainable and efficient urban development. This research highlights the importance of data-driven methodologies in modern transportation planning and presents a step toward optimizing infrastructure placement, capacity, and long-term viability.
- [157] arXiv:2503.21159 [pdf, html, other]
-
Title: Multi-Objective Optimization for Privacy-Utility Balance in Differentially Private Federated LearningKanishka Ranaweera, David Smith, Pubudu N. Pathirana, Ming Ding, Thierry Rakotoarivelo, Aruna SeneviratneSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it a promising approach for privacy-preserving machine learning. However, ensuring differential privacy (DP) in FL presents challenges due to the trade-off between model utility and privacy protection. Clipping gradients before aggregation is a common strategy to limit privacy loss, but selecting an optimal clipping norm is non-trivial, as excessively high values compromise privacy, while overly restrictive clipping degrades model performance. In this work, we propose an adaptive clipping mechanism that dynamically adjusts the clipping norm using a multi-objective optimization framework. By integrating privacy and utility considerations into the optimization objective, our approach balances privacy preservation with model accuracy. We theoretically analyze the convergence properties of our method and demonstrate its effectiveness through extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 datasets. Our results show that adaptive clipping consistently outperforms fixed-clipping baselines, achieving improved accuracy under the same privacy constraints. This work highlights the potential of dynamic clipping strategies to enhance privacy-utility trade-offs in differentially private federated learning.
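For readers unfamiliar with the clipping step, the following minimal sketch shows the standard clip-then-add-Gaussian-noise aggregation used in differentially private federated averaging. It does not implement the paper's multi-objective selection of the clipping norm: `clip_norm` is simply passed in, and the names `clip_update` and `dp_average`, the noise scaling, and the example values are assumptions made for illustration.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a client update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Average clipped updates and add Gaussian noise to each coordinate.

    Assumed noise scale: noise_multiplier * clip_norm / number_of_clients.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm / n
    return [
        sum(u[i] for u in clipped) / n + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]


# Hypothetical client updates: the first exceeds the norm bound, the second does not.
updates = [[3.0, 4.0], [0.3, 0.4]]
clipped = [clip_update(u, 1.0) for u in updates]
noisy_mean = dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
print(clipped, noisy_mean)
```

The trade-off in the abstract is visible here: a larger `clip_norm` preserves more of each update but also raises the noise scale needed for the same privacy level, while a smaller one distorts large updates.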
- [158] arXiv:2503.21160 [pdf, other]
-
Title: A Data Balancing and Ensemble Learning Approach for Credit Card Fraud DetectionSubjects: Machine Learning (cs.LG)
This research introduces an innovative method for identifying credit card fraud by combining the SMOTE-KMEANS technique with an ensemble machine learning model. The proposed model was benchmarked against traditional models such as logistic regression, decision trees, random forests, and support vector machines. Performance was evaluated using metrics, including accuracy, recall, and area under the curve (AUC). The results demonstrated that the proposed model achieved superior performance, with an AUC of 0.96 when combined with the SMOTE-KMEANS algorithm. This indicates a significant improvement in detecting fraudulent transactions while maintaining high precision and recall. The study also explores the application of different oversampling techniques to enhance the performance of various classifiers. The findings suggest that the proposed method is robust and effective for classification tasks on balanced datasets. Future research directions include further optimization of the SMOTE-KMEANS approach and its integration into existing fraud detection systems to enhance financial security and consumer protection.
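The oversampling idea can be sketched with the interpolation step that SMOTE-style methods use to create synthetic minority samples. This sketch covers only that step: it does not include the k-means clustering stage or the ensemble model from the paper, and the function name `smote_sample` and the example points are hypothetical.

```python
import random


def smote_sample(point, neighbor, rng):
    """Create a synthetic sample on the line segment between a minority point
    and one of its minority-class neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


# Hypothetical minority-class points with two features each.
rng = random.Random(42)
fraud_a = [100.0, 2.0]
fraud_b = [140.0, 6.0]
synthetic = smote_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Each synthetic point lies between two real minority samples, so the minority class grows without duplicating existing records exactly.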
- [159] arXiv:2503.21162 [pdf, html, other]
-
Title: Network Density Analysis of Health Seeking Behavior in Metro Manila: A Retrospective Analysis on COVID-19 Google Trends DataComments: Pre-print of a conference submission accepted to ICMHI 2025. 12 pages, 2 figuresSubjects: Computers and Society (cs.CY); Information Retrieval (cs.IR)
This study examined the temporal aspect of COVID-19-related health-seeking behavior in Metro Manila, National Capital Region, Philippines through a network density analysis of Google Trends data. A total of 15 keywords across five categories (English symptoms, Filipino symptoms, face wearing, quarantine, and new normal) were examined using both 15-day and 30-day rolling windows from March 2020 to March 2021. The methodology involved constructing network graphs using distance correlation coefficients at varying thresholds (0.4, 0.5, 0.6, and 0.8) and analyzing the time-series data of network density and clustering coefficients. Results revealed three key findings: (1) an inverse relationship between the threshold values and network metrics, indicating that higher thresholds provide more meaningful keyword relationships; (2) exceptionally high network connectivity during the initial pandemic months followed by gradual decline; and (3) distinct patterns in keyword relationships, transitioning from policy-focused searches to more symptom-specific queries as the pandemic progressed. The 30-day window analysis showed more stable but less search activity than the 15-day windows, suggesting stronger correlations in immediate search behaviors. These insights are helpful for health communication because they emphasize the need for strategic and conscientious information dissemination from the government or the private sector based on networked search behavior (e.g., prioritizing information on select symptoms rather than an overview of what the coronavirus is).
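The thresholding procedure can be sketched as follows. The example assumes a symmetric matrix of pairwise correlation coefficients has already been computed for one time window; the matrix values are made up, and the study's distance correlation is not recomputed here. The function name `network_density` is hypothetical.

```python
def network_density(corr, threshold):
    """Density of the undirected graph whose edges are keyword pairs with
    correlation at or above the threshold: 2E / (n * (n - 1))."""
    n = len(corr)
    edges = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if corr[i][j] >= threshold
    )
    return 2 * edges / (n * (n - 1))


# Hypothetical correlation matrix for four keywords in one rolling window.
corr = [
    [1.00, 0.85, 0.55, 0.30],
    [0.85, 1.00, 0.62, 0.45],
    [0.55, 0.62, 1.00, 0.70],
    [0.30, 0.45, 0.70, 1.00],
]
densities = {t: network_density(corr, t) for t in (0.4, 0.5, 0.6, 0.8)}
print(densities)
```

Raising the threshold can only remove edges, so density never increases with it, which is consistent with the inverse relationship reported in finding (1).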
- [160] arXiv:2503.21164 [pdf, other]
-
Title: Adversarial Wear and Tear: Exploiting Natural Damage for Generating Physical-World Adversarial ExamplesComments: 11 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The presence of adversarial examples in the physical world poses significant challenges to the deployment of Deep Neural Networks in safety-critical applications such as autonomous driving. Most existing methods for crafting physical-world adversarial examples are ad-hoc, relying on temporary modifications like shadows, laser beams, or stickers that are tailored to specific scenarios. In this paper, we introduce a new class of physical-world adversarial examples, AdvWT, which draws inspiration from the naturally occurring phenomenon of 'wear and tear', an inherent property of physical objects. Unlike manually crafted perturbations, 'wear and tear' emerges organically over time due to environmental degradation, as seen in the gradual deterioration of outdoor signboards. To achieve this, AdvWT follows a two-step approach. First, a GAN-based, unsupervised image-to-image translation network is employed to model these naturally occurring damages, particularly in the context of outdoor signboards. The translation network encodes the characteristics of damaged signs into a latent 'damage style code'. In the second step, we introduce adversarial perturbations into the style code, strategically optimizing its transformation process. This manipulation subtly alters the damage style representation, guiding the network to generate adversarial images where the appearance of damages remains perceptually realistic, while simultaneously ensuring their effectiveness in misleading neural networks. Through comprehensive experiments on two traffic sign datasets, we show that AdvWT effectively misleads DNNs in both digital and physical domains. AdvWT achieves an effective attack success rate, greater robustness, and a more natural appearance compared to existing physical-world adversarial examples. Additionally, integrating AdvWT into training enhances a model's generalizability to real-world damaged signs.
- [161] arXiv:2503.21165 [pdf, html, other]
-
Title: Extending Silicon Lifetime: A Review of Design Techniques for Reliable Integrated CircuitsComments: This work is under review by ACMSubjects: Systems and Control (eess.SY); Hardware Architecture (cs.AR)
Reliability has become an increasing concern in modern computing. Integrated circuits (ICs) are the backbone of modern computing devices across industries, including artificial intelligence (AI), consumer electronics, healthcare, automotive, industrial, and aerospace. Moore's Law has driven the semiconductor IC industry toward smaller dimensions, improved performance, and greater energy efficiency. However, as transistors shrink to atomic scales, aging-related degradation mechanisms such as Bias Temperature Instability (BTI), Hot Carrier Injection (HCI), Time-Dependent Dielectric Breakdown (TDDB), Electromigration (EM), and stochastic aging-induced variations have become major reliability threats. From an application perspective, workloads like AI training and autonomous driving require continuous and sustainable operation to minimize recovery costs and enhance safety. Additionally, the high cost of chip replacement and reproduction underscores the need for extended lifespans. These factors highlight the urgency of designing more reliable ICs. This survey addresses the critical aging issues in ICs, focusing on fundamental degradation mechanisms and mitigation strategies. It provides a comprehensive overview of aging impact and the methods to counter it, starting with the root causes of aging and summarizing key monitoring techniques at both circuit and system levels. A detailed analysis of circuit-level mitigation strategies highlights the distinct aging characteristics of digital, analog, and SRAM circuits, emphasizing the need for tailored solutions. The survey also explores emerging software approaches in design automation, aging characterization, and mitigation, which are transforming traditional reliability optimization. Finally, it outlines the challenges and future directions for improving aging management and ensuring the long-term reliability of ICs across diverse applications.
- [162] arXiv:2503.21166 [pdf, html, other]
-
Title: Unveiling the Potential of Superexpressive Networks in Implicit Neural RepresentationsComments: Accepted at ICLR 2025 Workshop on Neural Network Weights as a New Data ModalitySubjects: Machine Learning (cs.LG)
In this study, we examine the potential of one of the "superexpressive" networks in the context of learning neural functions for representing complex signals and performing machine learning downstream tasks. Our focus is on evaluating their performance on computer vision and scientific machine learning tasks including signal representation/inverse problems and solutions of partial differential equations. Through an empirical investigation in various benchmark tasks, we demonstrate that superexpressive networks, as proposed by [Zhang et al. NeurIPS, 2022], which employ a specialized network structure characterized by having an additional dimension, namely width, depth, and "height", can surpass recent implicit neural representations that use highly-specialized nonlinear activation functions.
- [163] arXiv:2503.21168 [pdf, html, other]
-
Title: TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human GroupsComments: 6 pages, 3 figures. Submitted as a conference paper in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Robot navigation in densely populated environments presents significant challenges, particularly regarding the interplay between individual and group dynamics. Current navigation models predominantly address interactions with individual pedestrians while failing to account for human groups that naturally form in real-world settings. Conversely, the limited models implementing group-aware navigation typically prioritize group dynamics at the expense of individual interactions, both of which are essential for socially appropriate navigation. This research extends an existing simulation framework to incorporate both individual pedestrians and human groups. We present Tangent Action for Group Avoidance (TAGA), a modular reactive mechanism that can be integrated with existing navigation frameworks to enhance their group-awareness capabilities. TAGA dynamically modifies robot trajectories using tangent action-based avoidance strategies while preserving the underlying model's capacity to navigate around individuals. Additionally, we introduce Group Collision Rate (GCR), a novel metric to quantitatively assess how effectively robots maintain group integrity during navigation. Through comprehensive simulation-based benchmarking, we demonstrate that integrating TAGA with state-of-the-art navigation models (ORCA, Social Force, DS-RNN, and AG-RL) reduces group intrusions by 45.7-78.6% while maintaining comparable success rates and navigation efficiency. Future work will focus on real-world implementation and validation of this approach.
- [164] arXiv:2503.21169 [pdf, html, other]
-
Title: VADMamba: Exploring State Space Models for Fast Video Anomaly DetectionComments: Accepted by ICME 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at this https URL.
- [165] arXiv:2503.21172 [pdf, html, other]
-
Title: Model as a Game: On Numerical and Spatial Consistency for Generative GamesComments: Technical ReportSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advances in generative models have significantly impacted game generation. However, despite producing high-quality graphics and adequately receiving player input, existing models often fail to maintain fundamental game properties such as numerical and spatial consistency. Numerical consistency ensures gameplay mechanics correctly reflect score changes and other quantitative elements, while spatial consistency prevents jarring scene transitions, providing seamless player experiences. In this paper, we revisit the paradigm of generative games to explore what truly constitutes a Model as a Game (MaaG) with a well-developed mechanism. We begin with an empirical study on "Traveler", a 2D game created by an LLM featuring minimalist rules yet challenging generative models in maintaining consistency. Based on the DiT architecture, we design two specialized modules: (1) a numerical module that integrates a LogicNet to determine event triggers, with calculations processed externally as conditions for image generation; and (2) a spatial module that maintains a map of explored areas, retrieving location-specific information during generation and linking new observations to ensure continuity. Experiments across three games demonstrate that our integrated modules significantly enhance performance on consistency metrics compared to baselines, while incurring minimal time overhead during inference.
- [166] arXiv:2503.21178 [pdf, html, other]
-
Title: Integrating Large Language Models For Monte Carlo Simulation of Chemical Reaction NetworksSadikshya Gyawali, Ashwini Mandal, Manish Dahal, Manish Awale, Sanjay Rijal, Shital Adhikari, Vaghawan OjhaComments: Accepted at the MadeAI 2025 ConferenceSubjects: Artificial Intelligence (cs.AI)
Chemical reaction networks are an important method for modeling and exploring complex biological processes, biochemical interactions, and the behavior of different dynamics in systems biology. However, formulating such reaction kinetics takes considerable time. In this paper, we leverage the efficiency of modern large language models to automate the stochastic Monte Carlo simulation of chemical reaction networks and enable simulation from reaction descriptions provided in natural language. We also integrate this process into the widely used simulation tool COPASI to make it more convenient for modelers and researchers. In this work, we show the efficacy and limitations of modern large language models in parsing and creating reaction kinetics for modeling complex chemical reaction processes.
- [167] arXiv:2503.21182 [pdf, html, other]
-
Title: Optimal Transportation for the Far-field Reflector ProblemSubjects: Numerical Analysis (math.NA)
The inverse reflector problem aims to design a freeform reflecting surface that can direct the light from a specified source to produce the desired illumination in the target area, which is significant in the field of geometrical non-imaging optics. Mathematically, it can be formulated as an optimization problem, which is exactly the optimal transportation problem (OT) when the target is in the far field. The gradient of OT is governed by the generalized Monge-Ampère equation that models the far-field reflector system. Based on the gradient, this work presents a Sobolev gradient descent method implemented within a finite element framework to solve the corresponding OT. Convergence of the method is established and numerical examples are provided to demonstrate the effectiveness of the method.
- [168] arXiv:2503.21187 [pdf, other]
-
Title: DGSUnet: An Improved Unet Model with DINO-Guided SAM2 for Multi-Scale Feature CollaborationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Anything Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes. It provides a technical pathway for efficient deployment of visual image segmentation, demonstrating significant application value in a wide range of downstream tasks and specialized fields within image segmentation. Project page: this https URL
- [169] arXiv:2503.21188 [pdf, html, other]
-
Title: Are We Solving a Well-Defined Problem? A Task-Centric Perspective on Recommendation TasksComments: Work in progressSubjects: Information Retrieval (cs.IR)
Recommender systems (RecSys) leverage user interaction history to predict and suggest relevant items, shaping user experiences across various domains. While many studies adopt a general problem definition, i.e., to recommend preferred items to users based on past interactions, such abstraction often lacks the domain-specific nuances necessary for practical deployment. However, models are frequently evaluated using datasets from online recommender platforms, which inherently reflect these specificities. In this paper, we analyze RecSys task formulations, emphasizing key components such as input-output structures, temporal dynamics, and candidate item selection. All these factors directly impact offline evaluation. We further examine the complexities of user-item interactions, including decision-making costs, multi-step engagements, and unobservable interactions, which may influence model design and loss functions. Additionally, we explore the balance between task specificity and model generalizability, highlighting how well-defined task formulations serve as the foundation for robust evaluation and effective solution development. By clarifying task definitions and their implications, this work provides a structured perspective on RecSys research. The goal is to help researchers better navigate the field, particularly in understanding specificities of the RecSys tasks and ensuring fair and meaningful evaluations.
- [170] arXiv:2503.21189 [pdf, other]
-
Title: An NLP-Driven Approach Using Twitter Data for Tailored K-pop Artist RecommendationsComments: International Conference on Emotion Sensibility (ICES), 2023Subjects: Human-Computer Interaction (cs.HC)
The global rise of K-pop and the digital revolution have paved the way for new dimensions in artist recommendations. With platforms like Twitter serving as a hub for fans to interact, share and discuss K-pop, a vast amount of data is generated that can be analyzed to understand listener preferences. However, current recommendation systems often overlook K-pop's inherent diversity, treating it as a singular entity. This paper presents an innovative method that utilizes Natural Language Processing to analyze tweet content and discern individual listening habits and preferences. The mass of Twitter data is methodically categorized using fan clusters, facilitating granular and personalized artist recommendations. Our approach marries the advanced GPT-4 model with large-scale social media data, offering potential enhancements in accuracy for K-pop recommendation systems and promising an elevated, personalized fan experience. In conclusion, acknowledging the heterogeneity within fanbases and capitalizing on readily available social media data marks a significant stride towards advancing personalized music recommendation systems.
- [171] arXiv:2503.21190 [pdf, html, other]
-
Title: Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question AnsweringSubjects: Computer Vision and Pattern Recognition (cs.CV)
Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques, often overlooking the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.
- [172] arXiv:2503.21191 [pdf, other]
-
Title: Designing a User Interface for Generative Design in Augmented Reality: A Step Towards More Visualization and Feed-ForwardingComments: Proceedings of HCI Korea 2024Subjects: Human-Computer Interaction (cs.HC)
Generative design, an AI-assisted technology for optimizing design through algorithmic processes, is propelling advancements across numerous fields. As the use of immersive environments such as Augmented Reality (AR) continues to rise, integrating generative design into such platforms presents a potent opportunity for innovation. However, a vital challenge that impedes this integration is the current absence of an efficient and user-friendly interface for designers to operate within these environments effectively. To bridge this gap, we introduce a novel UI system for generative design software in AR, which automates the process of generating the potential design constraints based on the users' inputs. This system allows users to construct a virtual environment, edit objects and constraints, and export the final data in CSV format. The interface enhances the user's design experience by enabling more intuitive interactions and providing immediate visual feedback. Deriving from participatory design principles, this research proposes a significant leap forward in the realms of generative design and immersive environments.
- [173] arXiv:2503.21193 [pdf, html, other]
-
Title: UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary LearningSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.
- [174] arXiv:2503.21195 [pdf, other]
-
Title: Toward a Healthier Social Media Experience: Designing 'Inspiration' and 'Reality' Modes to Enhance Digital Well-Being for Generation ZComments: KOSES Autumn Conference 2024Subjects: Social and Information Networks (cs.SI)
This study presents a dual-mode interface design concept for social media platforms aimed at reducing social comparison in health-related content among Korean MZ (Millennials and Gen-Z) users. The proposed "Inspiration" and "Reality" modes allow users to toggle between curated, idealized posts and more realistic, candid content. This approach aims to alleviate negative psychological effects, such as decreased self-esteem and body dissatisfaction. The pre-study outlines the design framework and discusses potential implications for user satisfaction, perceived authenticity, and mental well-being.
- [175] arXiv:2503.21197 [pdf, html, other]
-
Title: WVSC: Wireless Video Semantic Communication with Multi-frame CompensationSubjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Existing wireless video transmission schemes conduct video coding directly at the pixel level while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling video coding at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute for the motion vectors of each frame in common video coding methods. At the receiver, multi-frame compensation (MFC) is proposed to produce the compensated current semantic frame with a multi-frame fusion attention module. With both the reference frame transmission and MFC, the bandwidth efficiency improves while video transmission performance remains satisfactory. Experimental results verify a PSNR gain for WVSC of about 1 dB over other DL-based methods (e.g., DVSC) and about 2 dB over traditional schemes.
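Since the reported gains are stated in PSNR, a short sketch of the standard PSNR definition may help in reading them. The function below follows the usual formula, 10 * log10(MAX^2 / MSE), over flattened pixel values; the 8-bit peak value of 255 and the example pixel lists are assumptions, and nothing here reproduces the paper's evaluation pipeline.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened 8-bit pixel values for an original and a decoded frame.
original = [0, 255, 128, 64]
decoded = [0, 250, 130, 64]
print(psnr(original, decoded))
```

Because the scale is logarithmic, a 1 dB gain corresponds to roughly a 21% reduction in mean squared error, and a 2 dB gain to roughly a 37% reduction.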
- [176] arXiv:2503.21200 [pdf, html, other]
-
Title: Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent CooperationSubjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Learning a cooperative multi-agent policy from offline multi-task data that can generalize to unseen tasks with varying numbers of agents and targets is an attractive problem in many scenarios. Although aggregating general behavior patterns among multiple tasks as skills to improve policy transfer is a promising approach, two primary challenges hinder the further advancement of skill learning in offline multi-task MARL. First, extracting general cooperative behaviors from various action sequences as common skills fails to bring cooperative temporal knowledge into them. Second, existing works only involve common skills and cannot adaptively choose independent knowledge as task-specific skills in each task for fine-grained action execution. To tackle these challenges, we propose Hierarchical and Separate Skill Discovery (HiSSD), a novel approach for generalizable offline multi-task MARL through skill learning. HiSSD leverages a hierarchical framework that jointly learns common and task-specific skills. The common skills learn cooperative temporal knowledge and enable in-sample exploitation for offline multi-task MARL. The task-specific skills represent the priors of each task and achieve task-guided fine-grained action execution. To verify the advancement of our method, we conduct experiments on multi-agent MuJoCo and SMAC benchmarks. After training the policy using HiSSD on offline multi-task data, the empirical results show that HiSSD assigns effective cooperative behaviors and obtains superior performance in unseen tasks.
- [177] arXiv:2503.21202 [pdf, html, other]
-
Title: System-wide Instrument Transformer Calibration and Line Parameter Estimation Using PMU Data
Subjects: Systems and Control (eess.SY)
Uncalibrated instrument transformers (ITs) can degrade the performance of downstream applications that rely on the voltage and current measurements that ITs provide. It is also well-known that phasor measurement unit (PMU)-based system-wide IT calibration and line parameter estimation (LPE) are interdependent problems. In this paper, we present a statistical framework for solving the simultaneous LPE and IT calibration (SLIC) problem using synchrophasor data. The proposed approach not only avoids the need for a perfect IT by judiciously placing a revenue quality meter (which is an expensive but non-perfect IT), but also accounts for the variations typically occurring in the line parameters. The results obtained using the IEEE 118-bus system as well as actual power system data demonstrate the high accuracy, robustness, and practical utility of the proposed approach.
- [178] arXiv:2503.21204 [pdf, html, other]
-
Title: Dimensional optimization of single-DOF planar rigid link-flapping mechanisms for high lift and low power
Subjects: Robotics (cs.RO)
Rigid link flapping mechanisms remain the most practical choice for flapping wing micro-aerial vehicles (MAVs) to carry useful payloads and onboard batteries for free flight due to their long-term durability and reliability. However, to achieve high agility and maneuverability, like insects, MAVs with these mechanisms require significant weight reduction. One approach involves using single-DOF planar rigid linkages, which are rarely optimized dimensionally for high lift and low power so that smaller motors and batteries could be used. We integrated a mechanism simulator based on a quasistatic nonlinear finite element method with an unsteady vortex lattice method-based aerodynamic analysis tool within an optimization routine. We optimized three different mechanism topologies from the literature. As a result, power savings of up to 42% were observed in some cases, due to increased amplitude and higher lift coefficients resulting from optimized asymmetric sweeping velocity profiles. We also conducted an uncertainty analysis that revealed the need for high manufacturing tolerances to ensure reliable mechanism performance. The presented unified computational tool also facilitates the optimal selection of MAV components based on the payload and flight time requirements.
- [179] arXiv:2503.21206 [pdf, html, other]
-
Title: PilotANN: Memory-Bounded GPU Acceleration for Vector Search
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Approximate Nearest Neighbor Search (ANNS) has become fundamental to modern deep learning applications, having gained particular prominence through its integration into recent generative models that work with increasingly complex datasets and higher vector dimensions. Existing CPU-only solutions, even the most efficient graph-based ones, struggle to meet these growing computational demands, while GPU-only solutions face memory constraints. As a solution, we propose PilotANN, a hybrid CPU-GPU system for graph-based ANNS that utilizes both the CPU's abundant RAM and the GPU's parallel processing capabilities. Our approach decomposes the graph traversal process of top-$k$ search into three stages: GPU-accelerated subgraph traversal using SVD-reduced vectors, CPU refinement, and precise search using complete vectors. Furthermore, we introduce fast entry selection to improve search starting points while maximizing GPU utilization. Experimental results demonstrate that PilotANN achieves $3.9 - 5.4 \times$ speedup in throughput on 100-million scale datasets, and is able to handle datasets up to $12 \times$ larger than the GPU memory. We offer a complete open-source implementation at this https URL.
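The idea of searching first on SVD-reduced vectors and then refining with complete vectors can be illustrated with a minimal NumPy sketch. This is not the PilotANN implementation: brute-force scans stand in for the graph traversal on both devices, the stages are collapsed into two, and the names `fit_svd_projection`, `two_stage_search`, `reduced_dim`, and `n_candidates` are hypothetical.

```python
import numpy as np

def fit_svd_projection(data, reduced_dim):
    """Return (mean, basis) so that (x - mean) @ basis keeps the top singular directions."""
    mean = data.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:reduced_dim].T

def two_stage_search(query, data, reduced, mean, basis, k, n_candidates):
    # Stage 1 (stand-in for the GPU stage): rank all points in the reduced space.
    q_red = (query - mean) @ basis
    coarse = np.linalg.norm(reduced - q_red, axis=1)
    candidates = np.argpartition(coarse, n_candidates - 1)[:n_candidates]
    # Stage 2 (stand-in for the CPU stage): re-rank the candidates with full vectors.
    exact = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(exact)[:k]]

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 64)).astype(np.float32)
mean, basis = fit_svd_projection(data, reduced_dim=16)
reduced = (data - mean) @ basis  # compact vectors that would live in GPU memory

# Query close to point 123, so that point should be the nearest neighbor.
query = data[123] + 0.01 * rng.normal(size=64).astype(np.float32)
top = two_stage_search(query, data, reduced, mean, basis, k=5, n_candidates=200)
```

Only the reduced vectors are needed for the first stage, which is why a scheme of this shape can serve datasets larger than the memory of the device that runs it.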
- [180] arXiv:2503.21208 [pdf, html, other]
-
Title: An improved EfficientNetV2 for garbage classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents an enhanced waste classification framework based on EfficientNetV2 to address challenges in data acquisition cost, generalization, and real-time performance. We propose a Channel-Efficient Attention (CE-Attention) module that mitigates feature loss during global pooling without introducing dimensional scaling, effectively enhancing critical feature extraction. Additionally, a lightweight multi-scale spatial feature extraction module (SAFM) is developed by integrating depthwise separable convolutions, significantly reducing model complexity. Comprehensive data augmentation strategies are further employed to improve generalization. Experiments on the Huawei Cloud waste classification dataset demonstrate that our method achieves a classification accuracy of 95.4%, surpassing the baseline by 3.2% and outperforming mainstream models. The results validate the effectiveness of our approach in balancing accuracy and efficiency for practical waste classification scenarios.
- [181] arXiv:2503.21210 [pdf, html, other]
-
Title: FakeReasoning: Towards Generalizable Forgery Detection and Reasoning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task), leveraging vision-language models (VLMs) to provide accurate detection through structured and reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models, with 10 types of forgery reasoning annotations, enabling comprehensive evaluation of FDR-Task. Additionally, we propose FakeReasoning, a forgery detection and reasoning framework with two key components. First, Forgery-Aligned Contrastive Learning enhances VLMs' understanding of forgery-related semantics through both cross-modal and intra-modal contrastive learning between images and forgery attribute reasoning. Second, a Classification Probability Mapper bridges the optimization gap between forgery detection and language modeling by mapping the output logits of VLMs to calibrated binary classification probabilities. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.
- [182] arXiv:2503.21213 [pdf, html, other]
-
Title: Resource-Efficient Federated Fine-Tuning Large Language Models for Heterogeneous Data
Subjects: Machine Learning (cs.LG)
Fine-tuning large language models (LLMs) via federated learning, i.e., FedLLM, has been proposed to adapt LLMs for various downstream applications in a privacy-preserving way. To reduce the fine-tuning costs on resource-constrained devices, FedLoRA is proposed to fine-tune only a small subset of model parameters by integrating low-rank adaptation (LoRA) into FedLLM. However, apart from resource constraints, there is still another critical challenge, i.e., data heterogeneity, severely hindering the implementation of FedLoRA in practical applications. Herein, inspired by the previous group-based federated learning paradigm, we propose a hierarchical FedLoRA framework, termed HierFedLoRA, to address these challenges. Specifically, HierFedLoRA partitions all devices into multiple near-IID groups and adjusts the intra-group aggregation frequency for each group to eliminate the negative effects of non-IID data. Meanwhile, to reduce the computation and communication cost, HierFedLoRA dynamically assigns diverse and suitable fine-tuning depth (i.e., the number of continuous fine-tuning layers from the output) for each group. HierFedLoRA explores jointly optimizing aggregation frequency and depth upon their coupled relationship to better enhance the performance of FedLoRA. Extensive experiments are conducted on a physical platform with 80 commercial devices. The results show that HierFedLoRA improves the final model accuracy by 1.6% to 4.2%, speeding up the fine-tuning process by at least 2.1$\times$, compared to the strong baselines.
- [183] arXiv:2503.21214 [pdf, html, other]
-
Title: VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation
Alan Dao (Gia Tuan Dao), Norapat Buppodom
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
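The slicing step described above can be sketched in a few lines of NumPy. The abstract does not say how slices are formatted before they reach the image encoder, so the montage layout below is only one hypothetical option; `slice_voxels` and `tile_slices` are illustrative names, and the grid is assumed to be indexed as (x, y, z) with integer labels.

```python
import numpy as np

def slice_voxels(voxels, axis=2):
    """Split a 3D grid into a list of 2D slices along `axis`."""
    moved = np.moveaxis(voxels, axis, 0)
    return [moved[i] for i in range(moved.shape[0])]

def tile_slices(slices, columns):
    """Arrange 2D slices row-major into one 2D montage, padding with zeros."""
    h, w = slices[0].shape
    rows = -(-len(slices) // columns)  # ceiling division
    canvas = np.zeros((rows * h, columns * w), dtype=slices[0].dtype)
    for i, s in enumerate(slices):
        r, c = divmod(i, columns)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = s
    return canvas

grid = np.zeros((4, 4, 3), dtype=np.uint8)  # assumed (x, y, z) layout
grid[1, 2, 0] = 7  # a voxel labelled 7 in the z = 0 slice
grid[3, 0, 2] = 9  # a voxel labelled 9 in the z = 2 slice

slices = slice_voxels(grid, axis=2)       # three 4x4 slices, ordered by z
montage = tile_slices(slices, columns=2)  # one 8x8 image holding all slices
```

Keeping the slices in a fixed order along the axis is what lets a 2D model recover the third coordinate from the position of a slice in the sequence or montage.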
- [184] arXiv:2503.21219 [pdf, html, other]
-
Title: GenFusion: Closing the Loop between Reconstruction and Generation via Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse view and masked input, validates the effectiveness of our approach.
- [185] arXiv:2503.21222 [pdf, html, other]
-
Title: A Quantum Constraint Generation Framework for Binary Linear Programs
Subjects: Data Structures and Algorithms (cs.DS); Quantum Physics (quant-ph)
We propose a new approach to utilize quantum computers for binary linear programming (BLP), which can be extended to general integer linear programs (ILP). Quantum optimization algorithms, hybrid or quantum-only, are currently general purpose, standalone solvers for ILP. However, to consider them practically useful, we expect them to outperform the current state-of-the-art classical solvers. That expectation is unfair to quantum algorithms: in classical ILP solvers, after many decades of evolution, many different algorithms work together as a robust machine to get the best result. This is the approach we would like to follow now with our quantum 'solver' solutions. In this study we wrap any suitable quantum optimization algorithm into a quantum-informed classical constraint generation framework. First we relax our problem by dropping all constraints and encode it into an Ising Hamiltonian for the quantum optimization subroutine. Then, by sampling from the solution state of the subroutine, we obtain information about constraint violations in the initial problem, from which we decide which coupling terms we need to introduce to the Hamiltonian. The coupling terms correspond to the constraints of the initial binary linear program. Then we optimize over the new Hamiltonian again, until we reach a feasible solution, or other stopping conditions hold. Since one can decide how many constraints they add to the Hamiltonian in a single step, our algorithm is at least as efficient as the (hybrid) quantum optimization algorithm it wraps. We support our claim with results on small scale minimum cost exact cover problem instances.
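The loop described above (relax, sample, detect violated constraints, add them, repeat) can be sketched classically. In this sketch an exhaustive search over bit strings stands in for the quantum optimization subroutine, and quadratic penalties stand in for the coupling terms; the function names, the `penalty` weight, the `per_step` parameter, and the tiny exact cover instance are all assumptions made for illustration.

```python
from itertools import product

def solve_relaxation(costs, rows, rhs, active, penalty):
    """Brute-force stand-in for the quantum subroutine: minimize the cost plus
    quadratic penalties for the currently active equality constraints."""
    def energy(x):
        value = sum(c * xi for c, xi in zip(costs, x))
        for i in active:
            lhs = sum(a * xi for a, xi in zip(rows[i], x))
            value += penalty * (lhs - rhs[i]) ** 2
        return value
    return min(product((0, 1), repeat=len(costs)), key=energy)

def constraint_generation(costs, rows, rhs, penalty, per_step=1):
    """Start with no constraints and add violated ones until the solution is feasible."""
    active = set()
    while True:
        x = solve_relaxation(costs, rows, rhs, active, penalty)
        gaps = {i: abs(sum(a * xi for a, xi in zip(rows[i], x)) - rhs[i])
                for i in range(len(rows))}
        violated = sorted((i for i in gaps if gaps[i] > 0), key=lambda i: -gaps[i])
        if not violated:
            return x, active
        new = [i for i in violated if i not in active][:per_step]
        if not new:
            raise ValueError("penalty too small or problem infeasible")
        active.update(new)

# Exact cover of elements {0, 1, 2} by subsets S0={0,1}, S1={2}, S2={0}, S3={1,2}, S4={0,1,2}.
costs = [3, 1, 1, 1, 4]
rows = [[1, 0, 1, 0, 1],   # element 0 must be covered exactly once
        [1, 0, 0, 1, 1],   # element 1
        [0, 1, 0, 1, 1]]   # element 2
rhs = [1, 1, 1]
solution, used = constraint_generation(costs, rows, rhs, penalty=10)
```

On this instance the loop stops after adding two of the three constraints, because the third happens to be satisfied already. Any returned solution is feasible for all constraints and minimizes an energy that equals the plain cost on feasible points, so it is also optimal for the original problem.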
- [186] arXiv:2503.21223 [pdf, html, other]
-
Title: Rethinking Graph Structure Learning in the Era of LLMs
Comments: 17 pages, 8 figures
Subjects: Machine Learning (cs.LG)
Recently, the emergence of large language models (LLMs) has prompted researchers to explore the integration of language descriptions into graphs, aiming to enhance model encoding capabilities from a data-centric perspective. This graph representation is called text-attributed graphs (TAGs). A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the necessity of developing a new GSL paradigm. Despite clear motivations, it remains challenging: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, considering the massive number of parameters in LLMs? (2) How can we design an efficient model architecture that enables seamless integration of LLMs for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating improved graph structure. Extensive experiments on 10 TAG datasets demonstrate that LLaTA offers flexibility (it can be incorporated with any backbone), scalability (it outperforms other LLM-based GSL methods in terms of running efficiency), and effectiveness (it achieves SOTA performance).
- [187] arXiv:2503.21224 [pdf, html, other]
-
Title: Efficient Learning for Entropy-regularized Markov Decision Processes via Multilevel Monte Carlo
Comments: 46 pages, 6 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
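For orientation, the fixed-point iteration that underlies entropy-regularized MDPs can be sketched on a toy problem. The sketch below is not the MLMC method of the abstract: it assumes finite state and action spaces, a known transition matrix instead of a generative model, and the counting measure on actions as the reference measure. The names `soft_bellman`, `fixed_point`, `tau`, and the two-state example are hypothetical.

```python
import math

def soft_bellman(values, rewards, transitions, gamma, tau):
    """Apply V(s) <- tau * log sum_a exp(Q(s, a) / tau) once, with
    Q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    updated = []
    for s in range(len(values)):
        q = [rewards[s][a] + gamma * sum(p * v for p, v in zip(transitions[s][a], values))
             for a in range(len(rewards[s]))]
        top = max(q)  # subtract the max for a numerically stable log-sum-exp
        updated.append(top + tau * math.log(sum(math.exp((qa - top) / tau) for qa in q)))
    return updated

def fixed_point(rewards, transitions, gamma, tau, iterations=500):
    """Iterate the soft Bellman operator, a gamma-contraction, from zero."""
    values = [0.0] * len(rewards)
    for _ in range(iterations):
        values = soft_bellman(values, rewards, transitions, gamma, tau)
    return values

rewards = [[1.0, 0.0], [0.0, 2.0]]            # rewards[s][a]
transitions = [[[0.9, 0.1], [0.2, 0.8]],      # transitions[s][a][s']
               [[0.5, 0.5], [0.1, 0.9]]]
soft_values = fixed_point(rewards, transitions, gamma=0.9, tau=0.5)
near_greedy = fixed_point(rewards, transitions, gamma=0.9, tau=1e-3)
```

In the setting of the abstract, the inner expectation over next states has to be estimated from samples. Because the logarithm is nonlinear, plugging a sample average into an operator of this form generally gives a biased estimate, which is the kind of bias that a randomized multilevel construction is designed to remove.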
- [188] arXiv:2503.21225 [pdf, html, other]
-
Title: SEAGET: Seasonal and Active hours guided Graph Enhanced Transformer for the next POI recommendation
Comments: This paper has been accepted to Array (Q1, SCI, IF=2.7)
Subjects: Social and Information Networks (cs.SI)
One of the most important challenges for improving personalized services in industries like tourism is predicting users' near-future movements based on prior behavior and current circumstances. Next POI (Point of Interest) recommendation is essential for helping users and service providers by providing personalized recommendations. The difficulty of this task, however, stems from the requirement to take into consideration several variables at once, such as user preferences, time contexts, and geographic locations. POI selection is also greatly influenced by elements like a POI's operational status during desired visit times, desirability for visiting during particular seasons, and its dynamic popularity over time. In recent studies, POI popularity is mostly determined by check-in frequency, ignoring visitor volumes, operational constraints, and temporal dynamics. These restrictions result in recommendations that are less than ideal and do not take into account actual circumstances. We propose the Seasonal and Active hours-guided Graph-Enhanced Transformer (SEAGET) model as a solution to these problems. By integrating variations in the seasons, operational status, and temporal dynamics into a graph-enhanced transformer framework, SEAGET capitalizes on redefined POI popularity. This design yields more accurate and context-aware next POI predictions, with potential applications for optimizing tourist experiences and enhancing location-based services in the tourism industry.
- [189] arXiv:2503.21226 [pdf, html, other]
-
Title: Frequency-Aware Gaussian Splatting Decomposition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
3D Gaussian Splatting (3D-GS) has revolutionized novel view synthesis with its efficient, explicit representation. However, it lacks frequency interpretability, making it difficult to separate low-frequency structures from fine details. We introduce a frequency-decomposed 3D-GS framework that groups 3D Gaussians that correspond to subbands in the Laplacian pyramids of the input images. Our approach enforces coherence within each subband (i.e., group of 3D Gaussians) through dedicated regularization, ensuring well-separated frequency components. We extend color values to both positive and negative ranges, allowing higher-frequency layers to add or subtract residual details. To stabilize optimization, we employ a progressive training scheme that refines details in a coarse-to-fine manner. Beyond interpretability, this frequency-aware design unlocks a range of practical benefits. Explicit frequency separation enables advanced 3D editing and stylization, allowing precise manipulation of specific frequency bands. It also supports dynamic level-of-detail control for progressive rendering, streaming, foveated rendering and fast geometry interaction. Through extensive experiments, we demonstrate that our method provides improved control and flexibility for emerging applications in scene editing and interactive rendering. Our code will be made publicly available.
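A small image-space sketch shows why the subbands of a Laplacian pyramid need signed values, which is the motivation for letting higher-frequency layers add or subtract residual detail. This example works on a single 2D array and says nothing about how the Gaussians themselves are grouped. It uses 2x2 box averaging and nearest-neighbor upsampling in place of the Gaussian filtering of a classical Laplacian pyramid, and all function names are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2x2 blocks (assumes even height and width)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbor repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return [band_0, ..., band_{levels-1}, low_pass]; bands are signed residuals."""
    bands, current = [], img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))  # detail lost by downsampling
        current = smaller
    bands.append(current)  # coarsest low-frequency layer
    return bands

def reconstruct(bands):
    """Rebuild the image coarse-to-fine by adding each residual band back."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        current = upsample(current) + band
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))
pyramid = laplacian_pyramid(image, levels=3)
restored = reconstruct(pyramid)
```

Dropping the finest bands before calling `reconstruct` gives a coarser image, which is the image-space analogue of the level-of-detail control mentioned in the abstract.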
- [190] arXiv:2503.21227 [pdf, html, other]
-
Title: LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
Comments: Preprint
Subjects: Computation and Language (cs.CL)
Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continual learning, two major challenges remain: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continual Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture has substantially improved model performance on the Coin benchmark while maintaining a reasonable parameter count.
- [191] arXiv:2503.21232 [pdf, html, other]
-
Title: Knowledge Graphs as World Models for Semantic Material-Aware Obstacle Handling in Autonomous Vehicles
Subjects: Artificial Intelligence (cs.AI)
The inability of autonomous vehicles (AVs) to infer the material properties of obstacles limits their decision-making capacity. While AVs rely on sensor systems such as cameras, LiDAR, and radar to detect obstacles, this study suggests combining sensors with a knowledge graph (KG)-based world model to improve AVs' comprehension of physical material qualities. Beyond sensor data, AVs can infer qualities such as malleability, density, and elasticity using a semantic KG that depicts the relationships between obstacles and their attributes. Using the CARLA autonomous driving simulator, we evaluated AV performance with and without KG integration. The findings demonstrate that the KG-based method improves obstacle management, which allows AVs to use material qualities to make better decisions about when to change lanes or apply emergency braking. For example, the KG-integrated AV changed lanes for hard impediments like traffic cones and successfully avoided collisions with flexible items such as plastic bags by passing over them. Compared to the control system, the KG framework demonstrated improved responsiveness to obstacles by resolving conflicting sensor data, triggering emergency stops in 13.3% more cases. In addition, our method exhibits a 6.6% higher success rate in lane-changing maneuvers in experimental scenarios, particularly for larger, high-impact obstacles. While we focus particularly on autonomous driving, our work demonstrates the potential of KG-based world models to improve decision-making in embodied AI systems and scale to other domains, including robotics, healthcare, and environmental simulation.
- [192] arXiv:2503.21234 [pdf, html, other]
-
Title: Continuous Data Assimilation for the Navier-Stokes Equations with Nonlinear Slip Boundary Conditions
Subjects: Numerical Analysis (math.NA)
This paper focuses on continuous data assimilation (CDA) for the Navier-Stokes equations with nonlinear slip boundary conditions. CDA methods are typically employed to recover the original system when initial data or viscosity coefficients are unknown, by incorporating a feedback control term generated by observational data over a time period. In this study, based on a regularized form derived from the variational inequalities of the Navier-Stokes equations with nonlinear slip boundary conditions, we first investigate the classical CDA problem when initial data is absent. After establishing the existence, uniqueness and regularity of the solution, we prove its exponential convergence in time. Additionally, we extend the CDA to address the problem of missing viscosity coefficients and analyze its convergence order. Furthermore, utilizing the predictive capabilities of partial evolutionary tensor neural networks (pETNNs) for time-dependent problems, we propose a novel CDA by replacing observational data with predictions obtained by pETNNs. Compared with the classical CDA, the new one achieves similar approximation accuracy at much lower computational cost. Numerical experiments are presented, which not only validate the theoretical results but also demonstrate the efficiency of the CDA.
- [193] arXiv:2503.21235 [pdf, html, other]
-
Title: A Theoretical Framework for Distribution-Aware Dataset Search
Journal-ref: PODS 2025
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings.
Let $\mathcal{P}=\{P_1,...,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$. We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in\theta$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $\omega(P_j,v)\in \theta$, where $\omega(P_j,v)$ is the inner-product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2\delta$, where $\varepsilon\in(0,1)$ and $\delta$ is the error of synopses.
- [194] arXiv:2503.21236 [pdf, other]
-
Title: Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing
Comments: Accepted by TMM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data poisoning attacks against deep hashing (PADHASH). Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.
- [195] arXiv:2503.21237 [pdf, html, other]
-
Title: Bias-Aware Agent: Enhancing Fairness in AI-Driven Knowledge Retrieval
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Advancements in retrieving accessible information have evolved faster in the last few years than in the decades since the internet's creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user's ability to find the best information among the billions of links and sources at everybody's fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of this, the rise of AI agents has introduced another aspect to information retrieval, namely dynamic information retrieval, which enables the integration of real-time data such as weather forecasts and financial data with the knowledge base to curate context-aware knowledge. However, despite these advancements, the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging an agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.
- [196] arXiv:2503.21240 [pdf, html, other]
-
Title: The Promise and Pitfalls of WebAssembly: Perspectives from the Industry
Comments: Accepted by FSE'25 Industry Track
Subjects: Software Engineering (cs.SE)
As JavaScript has been criticized for performance and security issues in web applications, WebAssembly (Wasm) was proposed in 2017 and is regarded as a complement to JavaScript. Due to its advantages like compact size, native-like speed, and portability, Wasm binaries are gradually being used as the compilation target for industrial projects written in other high-level programming languages and are responsible for computation-intensive tasks in browsers, e.g., 3D graphics rendering and video decoding. Intuitively, characterizing in-the-wild Wasm binaries from different perspectives, like their metadata, relation to the source programming language, existence of security threats, and practical purpose, is a prerequisite for delving deeper into the Wasm ecosystem and is beneficial to its roadmap selection. However, currently, there is no work that conducts a large-scale measurement study on in-the-wild Wasm binaries. To fill this gap, we collect what is, to the best of our knowledge, the largest-ever dataset and characterize its status quo from industry perspectives. According to the different roles of people engaging in the community, i.e., web developers, Wasm maintainers, and researchers, we reorganized our findings into suggestions and best practices for each of them. We believe this work can shed light on the future direction of the web and Wasm.
- [197] arXiv:2503.21241 [pdf, html, other]
-
Title: Feature-Enhanced Machine Learning for All-Cause Mortality Prediction in Healthcare Data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Accurate patient mortality prediction enables effective risk stratification, leading to personalized treatment plans and improved patient outcomes. However, predicting mortality in healthcare remains a significant challenge, with existing studies often focusing on specific diseases or limited predictor sets. This study evaluates machine learning models for all-cause in-hospital mortality prediction using the MIMIC-III database, employing a comprehensive feature engineering approach. Guided by clinical expertise and literature, we extracted key features such as vital signs (e.g., heart rate, blood pressure), laboratory results (e.g., creatinine, glucose), and demographic information. The Random Forest model achieved the highest performance with an AUC of 0.94, significantly outperforming other machine learning and deep learning approaches. This demonstrates Random Forest's robustness in handling high-dimensional, noisy clinical data and its potential for developing effective clinical decision support tools. Our findings highlight the importance of careful feature engineering for accurate mortality prediction. We conclude by discussing implications for clinical adoption and propose future directions, including enhancing model robustness and tailoring prediction models for specific diseases.
- [198] arXiv:2503.21244 [pdf, html, other]
-
Title: Improving $(α, f)$-Byzantine Resilience in Federated Learning via layerwise aggregation and cosine distance
Comments: Submitted to Knowledge-Based Systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rapid development of artificial intelligence systems has amplified societal concerns regarding their usage, necessitating regulatory frameworks that encompass data privacy. Federated Learning (FL) is posed as a potential solution to data privacy challenges in distributed machine learning by enabling collaborative model training without data sharing. However, FL systems remain vulnerable to Byzantine attacks, where malicious nodes contribute corrupted model updates. While Byzantine-resilient operators have emerged as widely adopted robust aggregation algorithms to mitigate these attacks, their efficacy diminishes significantly in high-dimensional parameter spaces, sometimes leading to poorly performing models. This paper introduces Layerwise Cosine Aggregation, a novel aggregation scheme designed to enhance the robustness of these rules in such high-dimensional settings while preserving computational efficiency. A theoretical analysis is presented, demonstrating the superior robustness of the proposed Layerwise Cosine Aggregation compared to the original robust aggregation operators. Empirical evaluation across diverse image classification datasets, under varying data distributions and Byzantine attack scenarios, consistently demonstrates the improved performance of Layerwise Cosine Aggregation, achieving up to a 16% increase in model accuracy.
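As a rough illustration of the two ingredients named in the abstract (aggregating each layer separately and comparing updates by cosine distance), the following Python sketch may help. It is not the paper's algorithm: the selection rule (keep the `n_keep` updates closest to the coordinate-wise median, then average them) and all function names are assumptions made for this example only.

```python
import math
from statistics import median


def cosine_distance(u, v):
    """Return 1 - cos(u, v); defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def aggregate_layer(layer_updates, n_keep):
    """Hypothetical robust rule for one layer (an assumption, not the paper's).

    Keeps the n_keep updates with the smallest cosine distance to the
    coordinate-wise median of all updates, then averages the kept updates.
    """
    reference = [median(column) for column in zip(*layer_updates)]
    ranked = sorted(layer_updates, key=lambda u: cosine_distance(u, reference))
    kept = ranked[:n_keep]
    return [sum(column) / len(kept) for column in zip(*kept)]


def layerwise_aggregate(client_updates, n_keep):
    """Apply aggregate_layer independently to every layer.

    client_updates: list of dicts mapping layer name -> flat list of floats.
    """
    layer_names = client_updates[0].keys()
    return {
        name: aggregate_layer([client[name] for client in client_updates], n_keep)
        for name in layer_names
    }


if __name__ == "__main__":
    updates = [
        {"w": [1.0, 0.0]},
        {"w": [1.0, 0.1]},
        {"w": [0.9, 0.0]},
        {"w": [-10.0, -10.0]},  # simulated Byzantine update
    ]
    print(layerwise_aggregate(updates, n_keep=3))
```

Working per layer keeps each distance computation in a lower-dimensional space than the full flattened parameter vector, which is the setting the abstract identifies as problematic for existing operators.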
- [199] arXiv:2503.21246 [pdf, html, other]
-
Title: DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Comments: 11 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations, as most models rely on U-Net, which underperforms compared to MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The project page is available at this https URL.
- [200] arXiv:2503.21248 [pdf, html, other]
-
Title: ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
- [201] arXiv:2503.21249 [pdf, html, other]
-
Title: Distributed Nonlinear Transform Source-Channel Coding for Wireless Correlated Image Transmission
Subjects: Information Theory (cs.IT)
This paper investigates distributed joint source-channel coding (JSCC) for correlated image semantic transmission over wireless channels. In this setup, correlated images at different transmitters are separately encoded and transmitted through dedicated channels for joint recovery at the receiver. We propose a novel distributed nonlinear transform source-channel coding (D-NTSCC) framework. Unlike existing learning-based approaches that implicitly learn source correlation in a purely data-driven manner, our method explicitly models the source correlation through joint distribution. Specifically, the correlated images are separately encoded into latent representations via an encoding transform function, followed by a JSCC encoder to produce channel input symbols. A learned joint entropy model is introduced to determine the transmission rates, which more accurately approximates the joint distribution of the latent representations and captures source dependencies, thereby improving rate-distortion performance. At the receiver, a JSCC decoder and a decoding transform function reconstruct the images from the received signals, each serving as side information for recovering the other image. Therein, a transformation module is designed to align the latent representations for maximal correlation learning. Furthermore, a loss function is derived to jointly optimize encoding, decoding, and the joint entropy model, ensuring that the learned joint entropy model approximates the true joint distribution. Experiments on multi-view datasets show that D-NTSCC outperforms state-of-the-art distributed schemes, demonstrating its effectiveness in exploiting source correlation.
- [202] arXiv:2503.21250 [pdf, other]
-
Title: Orange Quality Grading with Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Orange grading is a crucial step in the fruit industry, as it helps to sort oranges according to different criteria such as size, quality, ripeness, and health condition, ensuring safety for human consumption and better price allocation and client satisfaction. Automated grading enables faster processing, precision, and reduced human labor. In this paper, we implement a deep learning-based solution for orange grading via machine vision. Unlike typical grading systems that analyze fruits from a single view, we capture multiview images of each single orange in order to enable a richer representation. Afterwards, we compose the acquired images into one collage. This enables the analysis of the whole orange skin. We train a convolutional neural network (CNN) on the composed images to grade the oranges into three classes, namely good, bad, and undefined. We also evaluate the performance with two different CNNs (ResNet-18 and SqueezeNet). We show experimentally that multi-view grading is superior to single view grading.
- [203] arXiv:2503.21251 [pdf, html, other]
-
Title: Dual-Splitting Conformal Prediction for Multi-Step Time Series Forecasting
Comments: 28 pages, 13 figures, 3 tables. Submitted to Applied Soft Computing. With Editor. This is the first public release of the work
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Time series forecasting is crucial for applications like resource scheduling and risk management, where multi-step predictions provide a comprehensive view of future trends. Uncertainty Quantification (UQ) is a mainstream approach for addressing forecasting uncertainties, with Conformal Prediction (CP) gaining attention due to its model-agnostic nature and statistical guarantees. However, most variants of CP are designed for single-step predictions and face challenges in multi-step scenarios, such as reliance on real-time data and limited scalability. This highlights the need for CP methods specifically tailored to multi-step forecasting. We propose the Dual-Splitting Conformal Prediction (DSCP) method, a novel CP approach designed to capture inherent dependencies within time-series data for multi-step forecasting. Experimental results on real-world datasets from four different domains demonstrate that the proposed DSCP significantly outperforms existing CP variants in terms of the Winkler Score, achieving a performance improvement of up to 23.59% compared to state-of-the-art methods. Furthermore, we deployed the DSCP approach for renewable energy generation and IT load forecasting in power management of a real-world trajectory-based application, achieving an 11.25% reduction in carbon emissions through predictive optimization of data center operations and controls.
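For readers unfamiliar with the evaluation metric, the Winkler score of a prediction interval $[l, u]$ at miscoverage level $\alpha$ is the interval width plus a penalty of $2/\alpha$ times the distance by which the observation falls outside the interval. The Python sketch below pairs it with a plain single-step split-conformal half-width. This is standard split conformal prediction, not the DSCP method described in the abstract, and the function names are chosen for this example.

```python
import math


def conformal_halfwidth(calib_residuals, alpha):
    """Split-conformal quantile of absolute residuals for 1 - alpha coverage.

    Uses the usual finite-sample rank ceil((n + 1) * (1 - alpha)).
    Returns infinity when the calibration set is too small for this alpha.
    """
    n = len(calib_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(abs(r) for r in calib_residuals)[rank - 1]


def winkler_score(lower, upper, y, alpha):
    """Winkler (interval) score: width plus a 2/alpha penalty for misses."""
    width = upper - lower
    if y < lower:
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        return width + (2.0 / alpha) * (y - upper)
    return width


if __name__ == "__main__":
    residuals = [-3, 1, -2, 5, 4, -6, 7, 8, 9]  # made-up calibration residuals
    q = conformal_halfwidth(residuals, alpha=0.5)
    point_forecast = 20.0
    print(winkler_score(point_forecast - q, point_forecast + q, y=27.0, alpha=0.5))
```

Lower Winkler scores are better: the metric rewards narrow intervals but charges heavily for observations that land outside them.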
- [204] arXiv:2503.21254 [pdf, html, other]
-
Title: Vision-to-Music Generation: A Survey
Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without a comprehensive discussion of vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types: symbolic music and audio music. We then summarize the existing methodologies for vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at this https URL.
- [205] arXiv:2503.21257 [pdf, html, other]
-
Title: OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
With the rapid development of embodied intelligence, leveraging large-scale human data for high-level imitation learning on humanoid robots has become a focal point of interest in both academia and industry. However, applying humanoid robots to precision operation domains remains challenging due to the complexities they face in perception and control processes, the long-standing physical differences in morphology and actuation mechanisms between humanoid robots and humans, and the lack of task-relevant features obtained from egocentric vision. To address the issue of covariate shift in imitation learning, this paper proposes an imitation learning algorithm tailored for humanoid robots. By focusing on the primary task objectives, filtering out background information, and incorporating channel feature fusion with spatial attention mechanisms, the proposed algorithm suppresses environmental disturbances and utilizes a dynamic weight update strategy to significantly improve the success rate of humanoid robots in accomplishing target tasks. Experimental results demonstrate that the proposed method exhibits robustness and scalability across various typical task scenarios, providing new ideas and approaches for autonomous learning and control in humanoid robots. The project will be open-sourced on GitHub.
- [206] arXiv:2503.21258 [pdf, html, other]
-
Title: Learn by Reasoning: Analogical Weight Generation for Few-Shot Class-Incremental Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.
- [207] arXiv:2503.21259 [pdf, html, other]
-
Title: Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: this https URL
- [208] arXiv:2503.21261 [pdf, html, other]
-
Title: HOT: Hadamard-based Optimized Training
Comments: Accepted in CVPR 2025
Subjects: Machine Learning (cs.LG)
It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
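To make the term "Hadamard quantization" concrete, the following Python sketch applies an orthonormal fast Walsh-Hadamard transform, quantizes uniformly in the transformed domain, and transforms back. It is a generic illustration under simplifying assumptions (a flat vector, symmetric per-vector scaling, hypothetical function names), not the HOT implementation or its selection of backward paths.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform.

    len(x) must be a power of two. With the 1/sqrt(n) scaling used here,
    the transform is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize(x, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels, then dequantize."""
    max_abs = max(abs(v) for v in x)
    if max_abs == 0.0:
        return list(x)
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    return [round(v / step) * step for v in x]


def hadamard_quantize(x, bits):
    """Quantize in the Hadamard domain and map the result back."""
    return fwht(quantize(fwht(x), bits))


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 0.5]
    print(hadamard_quantize(x, bits=8))
```

The transform spreads the energy of a single large entry across all coefficients, which is the usual motivation for quantizing in the Hadamard domain rather than on the raw values.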
- [209] arXiv:2503.21262 [pdf, html, other]
-
Title: vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
- [210] arXiv:2503.21263 [pdf, other]
-
Title: Cultivating Game Sense for Yourself: Making VLMs Gaming Experts
Subjects: Computation and Language (cs.CL)
Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating VLM to a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.
- [211] arXiv:2503.21267 [pdf, other]
-
Title: Tackling paper mills requires us to prevent future contamination and clean up the past -- the case of the journal Bioengineered
Comments: 16 pages, 2 figures, 1 table, 1 supplementary file (online, link given at the end of the manuscript)
Subjects: Digital Libraries (cs.DL)
Introduction: The Taylor & Francis journal Bioengineered has been targeted by paper mills. The goal of this study is to identify problematic articles published in Bioengineered during the period when the journal was affected by paper mills (2021-2022) and compare this to the number of problematic articles we can identify in the years prior (2019-2020).
Methods: Dimensions was used to search for articles that contained the terms mouse OR mice OR rat OR rats in title or abstract, published in Bioengineered between January 1st 2010 and December 31st 2024. All articles were assessed by eye and by using software to detect inappropriate image duplication and manipulation. An article was classified as problematic if it contained inappropriate image duplication or manipulation or had been previously retracted. Problematic articles were reported on PubPeer by the authors if they had not been reported previously. All included articles were assessed for post-publication editorial decisions.
Results: We have excluded all articles published in 2024 from further analysis, as these were all retraction notices. We assessed the remaining 878 articles, of which 226 (25.7%) were identified as problematic. Of the problematic articles, 35 had been previously retracted. One article was retracted, which was later nullified. One article received a correction. None of the included articles received an expression of concern or the Taylor & Francis under investigation pop-up.
Conclusions: Taylor & Francis' lack of editorial action has left the scientific community vulnerable to reading and citing hundreds of problematic articles published in Bioengineered. To uphold scientific integrity, Taylor & Francis should use the findings of this study as a starting point to systematically identify all compromised articles in Bioengineered and take appropriate editorial action.
- [212] arXiv:2503.21268 [pdf, html, other]
-
Title: ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
Ming Yan, Xincheng Lin, Yuhua Luo, Shuqi Fan, Yudi Dai, Qixin Zhong, Lincai Zhong, Yuexin Ma, Lan Xu, Chenglu Wen, Siqi Shen, Cheng Wang
Comments: CVPR2025, project in this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. Studies on capturing climbing motion, an off-ground motion, are sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D labeled datasets. To address the insufficiency of climbing motion datasets, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB, LiDAR frames, and IMU measurements, including the challenging climbing motions of 22 skilled climbing coaches across 12 different rock walls. Capturing the climbing motions is challenging as it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and source code are released publicly at this http URL.
- [213] arXiv:2503.21269 [pdf, html, other]
-
Title: Delving Deep into Semantic Relation Distillation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relation Knowledge Distillation (SeRKD), which reimagines knowledge distillation through a semantics-relation lens across samples. By leveraging semantic components, i.e., superpixels, SeRKD enables a more comprehensive and context-aware transfer of knowledge, integrating superpixel-based semantic extraction with relation-based knowledge distillation for sophisticated model compression and distillation. In particular, the proposed method is naturally relevant in the domain of Vision Transformers (ViTs), where visual tokens serve as fundamental units of representation. Experimental evaluations on benchmark datasets demonstrate the superiority of SeRKD over existing methods, underscoring its efficacy in enhancing model performance and generalization capabilities.
- [214] arXiv:2503.21272 [pdf, html, other]
-
Title: Reinforced Model Merging
Subjects: Artificial Intelligence (cs.AI)
The success of large language models has garnered widespread attention for model merging techniques, especially training-free methods which combine model capabilities within the parameter space. However, two challenges remain: (1) uniform treatment of all parameters leads to performance degradation; (2) search-based algorithms are often inefficient. In this paper, we present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. These components interact to execute layer-wise merging actions, aiming to search for the optimal merging architecture. Notably, RMM operates without any gradient computations on the original models, rendering it feasible for edge devices. Furthermore, by utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times. Extensive experiments demonstrate that RMM achieves state-of-the-art performance across various vision and NLP datasets and effectively overcomes the limitations of the existing baseline methods. Our code is available at this https URL.
- [215] arXiv:2503.21274 [pdf, other]
-
Title: Interactive Databases for the Life Sciences
Comments: 18 pages, 2 figures
Subjects: Databases (cs.DB); Quantitative Methods (q-bio.QM)
In the past few decades, the life sciences have experienced an unprecedented accumulation of data, ranging from genomic sequences and proteomic profiles to heavy-content imaging, clinical assays, and commercial biological products for research. Traditional static databases have been invaluable in providing standardized and structured information. However, they fall short when it comes to facilitating exploratory data interrogation, real-time querying, multidimensional comparison, and dynamic visualization. Interactive databases that aim to support user-driven data queries and visualization offer promising new avenues for making the best use of the vast and heterogeneous data streams collected in biological research. This article discusses the potential of interactive databases, highlighting the importance of implementing this model in the life sciences, while reviewing the state of the art in database design, the technical choices behind modern data management systems, and emerging needs in multidisciplinary research. Special attention is given to data interrogation strategies, user interface design, and comparative analysis capabilities, along with challenges such as data standardization and scalability in data-heavy applications. Conceptual features for developing interactive databases across diverse life science domains are then presented in the use case of cell line selection for in vitro research to bridge the gap between research data generation, actionable biological insight, subsequent meaningful experimental design, and clinical relevance.
- [216] arXiv:2503.21277 [pdf, html, other]
-
Title: Zero-Shot Visual Concept Blending Without Text Guidance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a novel, zero-shot image generation technique called "Visual Concept Blending" that provides fine-grained control over which features from multiple reference images are transferred to a source image. If only a single reference image is available, it is difficult to isolate which specific elements should be transferred. However, using multiple reference images, the proposed approach distinguishes between common and unique features by selectively incorporating them into a generated output. By operating within a partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding space (from IP-Adapter), our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations without requiring additional training or text prompts. We demonstrate its effectiveness across a diverse range of tasks, including style transfer, form metamorphosis, and conceptual transformations, showing how subtle or abstract attributes (e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly combined into a new image. In a user study, participants accurately recognized which features were intended to be transferred. Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable for creative fields such as art, design, and content creation, where combining specific visual qualities from multiple inspirations is crucial.
- [217] arXiv:2503.21279 [pdf, other]
-
Title: Asynchronous BFT Consensus Made Wireless
Comments: Accepted to IEEE ICDCS 2025, 11 pages, 13 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)
Asynchronous Byzantine fault-tolerant (BFT) consensus protocols, known for their robustness in unpredictable environments without relying on timing assumptions, are becoming increasingly vital for wireless applications. While these protocols have proven effective in wired networks, their adaptation to wireless environments presents significant challenges. Asynchronous BFT consensus, characterized by its N parallel consensus components (e.g., asynchronous Byzantine agreement, reliable broadcast), suffers from high message complexity, leading to network congestion and inefficiency, especially in resource-constrained wireless networks. Asynchronous Byzantine agreement (ABA) protocols, a foundational component of asynchronous BFT, require careful balancing of message complexity and cryptographic overhead to achieve efficient implementation in wireless settings. Additionally, the absence of dedicated testbeds for asynchronous wireless BFT consensus protocols hinders development and performance evaluation. To address these challenges, we propose a consensus batching protocol (ConsensusBatcher), which supports both vertical and horizontal batching of multiple parallel consensus components. We leverage ConsensusBatcher to adapt three asynchronous BFT consensus protocols (HoneyBadgerBFT, BEAT, and Dumbo) from wired networks to resource-constrained wireless networks. To evaluate the performance of ConsensusBatcher-enabled consensus protocols in wireless environments, we develop and open-source a testbed for deployment and performance assessment of these protocols. Using this testbed, we demonstrate that ConsensusBatcher-based consensus reduces latency by 48% to 59% and increases throughput by 48% to 62% compared to baseline consensus protocols.
- [218] arXiv:2503.21281 [pdf, html, other]
-
Title: Output-Feedback Boundary Control of Thermally and Flow-Induced Vibrations in Slender Timoshenko BeamsSubjects: Robotics (cs.RO)
This work is motivated by the engineering challenge of suppressing vibrations in turbine blades of aero engines, which often operate under extreme thermal conditions and high-Mach aerodynamic environments that give rise to complex vibration phenomena, commonly referred to as thermally-induced and flow-induced vibrations. Using Hamilton's variational principle, the system is modeled as a rotating slender Timoshenko beam under thermal and aerodynamic loads, described by a mixed hyperbolic-parabolic PDE system where instabilities occur both within the PDE domain and at the uncontrolled boundary, and the two types of PDEs are cascaded in the domain. For such a system, we present the state-feedback control design based on the PDE backstepping method. Recognizing that the distributed temperature gradients and structural vibrations in the Timoshenko beam are typically unmeasurable in practice, we design a state observer for the mixed hyperbolic-parabolic PDE system. Based on this observer, an output-feedback controller is then built to regulate the overall system using only available boundary measurements. In the closed-loop system, the state of the uncontrolled boundary, i.e., the furthest state from the control input, is proved to converge exponentially to zero, and all signals are proved to be uniformly ultimately bounded. The proposed control design is validated on an aero-engine flexible blade under extreme thermal and aerodynamic conditions.
- [219] arXiv:2503.21284 [pdf, html, other]
-
Title: Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image CompressionComments: Accepted to IEEE Transactions on Multimedia 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at this https URL.
- [220] arXiv:2503.21288 [pdf, html, other]
-
Title: Haptic bilateral teleoperation system for free-hand dental proceduresComments: 12 pages, 12 figuresSubjects: Robotics (cs.RO)
Free-hand dental procedures are typically repetitive, time-consuming and require high precision and manual dexterity. Dental robots can play a key role in improving procedural accuracy and safety, enhancing patient comfort, and reducing operator workload. However, robotic solutions for free-hand procedures remain limited or completely lacking, and their acceptance is still low. To address this gap, we develop a haptic bilateral teleoperation system (HBTS) for free-hand dental procedures. The system includes a dedicated mechanical end-effector, compatible with standard clinical tools, and equipped with an endoscopic camera for improved visibility of the intervention site. By ensuring motion and force correspondence between the operator's actions and the robot's movements, monitored through visual feedback, we enhance the operator's sensory awareness and motor accuracy. Furthermore, recognizing the need to ensure procedural safety, we limit interaction forces by scaling the motion references provided to the admittance controller based solely on measured contact forces. This ensures effective force limitation in all contact states without requiring prior knowledge of the environment. The proposed HBTS is validated in a dental scaling procedure using a dental phantom. The results show that the system improves the naturalness, safety, and accuracy of teleoperation, highlighting its potential to enhance free-hand dental procedures.
- [221] arXiv:2503.21289 [pdf, other]
-
Title: Declarative Traffic Engineering for Low-Latency and Reliable NetworkingJacopo Massa, Stefano Forti, Federica Paganelli, Patrizio Dazzi, Antonio Brogi, Alexander Clemm, Toerless EckertSubjects: Networking and Internet Architecture (cs.NI)
Cloud-Edge applications like industrial control systems and connected vehicles demand stringent end-to-end latency guarantees. Among existing data plane candidate solutions for bounded latency networking, the guaranteed Latency-Based Forwarding (gLBF) approach ensures punctual delivery of traffic flows by managing per-hop delays to meet specific latency targets, while not requiring that per-flow states are maintained at each hop. However, as a forwarding plane mechanism, gLBF does not define the control mechanisms for determining feasible forwarding paths and per-hop latency budgets for packets to fulfil end-to-end latency objectives. In this work, we propose such a control mechanism implemented in Prolog that complies with gLBF specifications, called declarative gLBF (dgLBF). The declarative nature of Prolog allows our prototype to be concise (~120 lines of code) and easy to extend. We show how the core dgLBF implementation is extended to add reliability mechanisms, path protection, and fate-sharing avoidance to enhance fault tolerance and robustness. Finally, we evaluate the system's performance through simulative experiments under different network topologies and with increasing traffic load to simulate saturated network conditions, scaling up to 6000 flows. Our results show a quasi-linear degradation in placement times and system resilience under heavy traffic.
- [222] arXiv:2503.21291 [pdf, other]
-
Title: An analysis of higher-order kinematics formalisms for an innovative surgical parallel robotCalin Vaida, Iosif Birlescu, Bogdan Gherman, Daniel Condurache, Damien Chablat (LS2N, LS2N - équipe RoMas), Doina PislaJournal-ref: Mechanism and Machine Theory, 2025, 209, pp.105986.Subjects: Robotics (cs.RO)
The paper presents a novel modular hybrid parallel robot for pancreatic surgery and its higher-order kinematics derived based on various formalisms. The classical vector, homogeneous transformation matrices and dual quaternion approaches are studied for the kinematic functions using both classical differentiation and multidual algebra. The algorithms for inverse kinematics for all three studied formalisms are presented for both differentiation and multidual algebra approaches. Furthermore, these algorithms are compared based on numerical stability, execution times and number and type of mathematical functions and operators contained in each algorithm. A statistical analysis shows that there is significant improvement in execution time for the algorithms implemented using multidual algebra, while the numerical stability is appropriate for all algorithms derived based on differentiation and multidual algebra. While the implementation of the kinematic algorithms using multidual algebra shows positive results when benchmarked on a standard PC, further work is required to evaluate the multidual algorithms on hardware/software used for the modular parallel robot command and control.
- [223] arXiv:2503.21293 [pdf, html, other]
-
Title: Lidar-only Odometry based on Multiple Scan-to-Scan Alignments over a Moving WindowSubjects: Robotics (cs.RO)
Lidar-only odometry considers the pose estimation of a mobile robot based on the accumulation of motion increments extracted from consecutive lidar scans. Many existing approaches to the problem use a scan-to-map registration, which neglects the accumulation of errors within the maintained map due to drift. Other methods use a refinement step that jointly optimizes the local map on a feature basis. We propose a solution that avoids this by using multiple independent scan-to-scan Iterative Closest Points (ICP) registrations to previous scans in order to derive constraints for a pose graph. The optimization of the pose graph then not only yields an accurate estimate for the latest pose, but also enables the refinement of previous scans in the optimization window. By avoiding the need to recompute the scan-to-scan alignments, the computational load is minimized. Extensive evaluation on the public KITTI and MulRan datasets as well as on a custom automotive lidar dataset is carried out. Results show that the proposed approach achieves state-of-the-art estimation accuracy, while alleviating the mentioned issues.
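The pose-graph idea in this abstract can be illustrated with a deliberately simplified sketch. It assumes 1D poses, assumes the relative displacements that ICP would produce are already given as numbers, and anchors the first pose of the window at zero; the function names `solve_linear` and `optimize_window` are hypothetical and this is not the authors' implementation.

```python
from typing import Dict, List, Tuple


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def optimize_window(num_poses: int,
                    constraints: Dict[Tuple[int, int], float]) -> List[float]:
    """Least-squares estimate of 1D poses in a window.

    constraints[(i, j)] is the measured displacement x_j - x_i, standing in
    for one scan-to-scan alignment. Pose 0 is fixed at 0 as the anchor.
    """
    n = num_poses - 1  # unknowns are x_1 .. x_{num_poses - 1}
    H = [[0.0] * n for _ in range(n)]  # normal-equation matrix J^T J
    g = [0.0] * n                      # right-hand side J^T z
    for (i, j), z in constraints.items():
        terms = ((i, -1.0), (j, 1.0))  # Jacobian of the residual (x_j - x_i) - z
        for a, sa in terms:
            if a == 0:
                continue  # anchored pose has no unknown
            g[a - 1] += sa * z
            for b_, sb in terms:
                if b_ == 0:
                    continue
                H[a - 1][b_ - 1] += sa * sb
    return [0.0] + solve_linear(H, g)


# Two consecutive alignments plus one alignment that skips a scan.
# The three measurements disagree by 0.3, and the optimizer spreads that error.
poses = optimize_window(3, {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.3})
print(poses)
```

Because every pose in the window is an unknown, adding a constraint to an older scan also adjusts the earlier poses, which is the refinement effect the abstract describes. The alignments themselves are computed once and reused as fixed measurements.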
- [224] arXiv:2503.21295 [pdf, html, other]
-
Title: R-PRM: Reasoning-Driven Process Reward ModelingComments: The project is available at this https URLSubjects: Computation and Language (cs.CL)
Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.
- [225] arXiv:2503.21297 [pdf, html, other]
-
Title: MLDSE: Scaling Design Space Exploration Infrastructure for Multi-Level HardwareSubjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
To efficiently support large-scale NNs, multi-level hardware, leveraging advanced integration and interconnection technologies, has emerged as a promising solution to counter the slowdown of Moore's law. However, the vast design space of such hardware, coupled with the complexity of their spatial hierarchies and organizations, introduces significant challenges for design space exploration (DSE). Existing DSE tools, which rely on predefined hardware templates to explore parameters for specific architectures, fall short in exploring diverse organizations, spatial hierarchies, and architectural polymorphisms inherent in multi-level hardware. To address these limitations, we present Multi-Level Design Space Explorer (MLDSE), a novel infrastructure for domain-specific DSE of multi-level hardware. MLDSE introduces three key innovations from three basic perspectives of DSE: 1) Modeling: MLDSE introduces a hardware intermediate representation (IR) that can recursively model diverse multi-level hardware with composable elements at various granularities; 2) Mapping: MLDSE provides a comprehensive spatiotemporal mapping IR and mapping primitives, facilitating the mapping strategy exploration on multi-level hardware, especially synchronization and cross-level communication; 3) Simulation: MLDSE supports universal simulator generation based on a task-level event-driven simulation mechanism. It features a hardware-consistent scheduling algorithm that can handle general task-level resource contention. Through experiments on LLM workloads, we demonstrate MLDSE's unique capability to perform three-tier DSE spanning architecture, hardware parameter, and mapping.
- [226] arXiv:2503.21300 [pdf, other]
-
Title: Optimizing Resource Allocation and Scheduling towards FRMCS and GSM-R networks coexistence in Railway SystemsMohamed Aziz Aboud (LIGM, ETS), Nawel Zangar, Rami Langar (LIGM), Marion Berbineau (COSYS-LEOST), Jerome MadecJournal-ref: Global Information Infrastructure Symposium, GIIS'25, IEEE Xplore, 2025Subjects: Networking and Internet Architecture (cs.NI)
The current railway communication system used in Europe for high-speed trains (HST) is GSM-R, a communication system based on 2G infrastructure. This system is meant to be replaced by 2030 by a new system based on 5G NR infrastructure, the Future Railway Mobile Communication System (FRMCS). In the coming years, both systems will probably coexist in the same frequency band, since the migration from GSM-R to FRMCS is planned to proceed progressively until the GSM-R system is completely shut down, mainly due to safety and budget constraints. In this paper, we study the resource allocation for the FRMCS system sharing the same frequency band as the already deployed GSM-R system. We formulate the resource allocation problem as an integer linear problem (ILP), known to be NP-hard. To solve it in a reasonable time, we propose a scheduling algorithm, called Intelligent Traffic Scheduling Preemptor (ITSP), that allocates resources for the different FRMCS traffic types considered (critical traffic and performance traffic) in the same frequency band with the GSM-R system. Our algorithm is Channel Quality Indicator (CQI)-aware and uses the preemption mechanism in 5G NR standards to optimize the resource allocation for the FRMCS system without impacting the existing GSM-R resource allocation in the context of the white space concept.
- [227] arXiv:2503.21305 [pdf, html, other]
-
Title: DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited DataSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Backdoor attacks are among the most effective, practical, and stealthy attacks in deep learning. In this paper, we consider a practical scenario where a developer obtains a deep model from a third party and uses it as part of a safety-critical system. The developer wants to inspect the model for potential backdoors prior to system deployment. We find that most existing detection techniques make assumptions that are not applicable to this scenario. In this paper, we present a novel framework for detecting backdoors under realistic restrictions. We generate candidate triggers by deductively searching over the space of possible triggers. We construct and optimize a smoothed version of Attack Success Rate as our search objective. Starting from a broad class of template attacks and just using the forward pass of a deep model, we reverse engineer the backdoor attack. We conduct extensive evaluation on a wide range of attacks, models, and datasets, with our technique performing almost perfectly across these settings.
- [228] arXiv:2503.21307 [pdf, html, other]
-
Title: InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token CompressionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
- [229] arXiv:2503.21309 [pdf, html, other]
-
Title: FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image RetrievalSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at this https URL.
- [230] arXiv:2503.21313 [pdf, html, other]
-
Title: HORT: Monocular Hand-held Objects Reconstruction with TransformersComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming to generate explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
- [231] arXiv:2503.21315 [pdf, html, other]
-
Title: Tricking Retrievers with Influential Tokens: An Efficient Black-Box Corpus Poisoning AttackComments: Accepted to NAACL 2025 Main TrackSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Retrieval-augmented generation (RAG) systems enhance large language models by incorporating external knowledge, addressing issues like outdated internal knowledge and hallucination. However, their reliance on external knowledge bases makes them vulnerable to corpus poisoning attacks, where adversarial passages can be injected to manipulate retrieval results. Existing methods for crafting such passages, such as random token replacement or training inversion models, are often slow and computationally expensive, requiring either access to retriever's gradients or large computational resources. To address these limitations, we propose Dynamic Importance-Guided Genetic Algorithm (DIGA), an efficient black-box method that leverages two key properties of retrievers: insensitivity to token order and bias towards influential tokens. By focusing on these characteristics, DIGA dynamically adjusts its genetic operations to generate effective adversarial passages with significantly reduced time and memory usage. Our experimental evaluation shows that DIGA achieves superior efficiency and scalability compared to existing methods, while maintaining comparable or better attack success rates across multiple datasets.
- [232] arXiv:2503.21317 [pdf, other]
-
Title: Surface guided analysis of breast changes during post-operative radiotherapy by using a functional map frameworkPierre Galmiche (ICube), Hyewon Seo (ICube), Yvan Pin, Philippe Meyer (ICANS, ICube), Georges Noël (ICANS, ICube), Michel de Mathelin (ICube)Subjects: Computational Geometry (cs.CG)
The treatment of breast cancer using radiotherapy involves uncertainties regarding breast positioning. As studies progress, more is known about the expected breast positioning errors, which are taken into account in the Planning Target Volume (PTV) in the form of the margin around the clinical target volume. However, little is known about the non-rigid deformations of the breast in the course of radiotherapy, which are a non-negligible factor in the treatment. Purpose: Taking into account such inter-fractional breast deformations would help develop promising future directions, such as patient-specific adjustable irradiation planning. Methods: In this study, we develop a geometric approach to analyze inter-fractional breast deformation throughout the radiotherapy treatment. Our data consists of 3D surface scans of patients acquired during radiotherapy sessions using a handheld scanner. We adapt the functional map framework to compute inter- and intra-patient non-rigid correspondences, which are then used to analyze intra-patient changes and inter-patient variability. Results: The qualitative shape collection analysis highlights deformations in the contralateral breast and armpit areas, along with positioning shifts in the head or abdominal regions. We also perform extrinsic analysis, where we align surface acquisitions of the treated breast with the CT-derived skin surface to assess displacements and volume changes in the treated area. On average, displacements within the treated breast exhibit amplitudes of 1-2 mm across sessions, with higher values observed at the time of the 25th irradiation session. Volume changes, inferred from surface variations, reached up to 10%, with values ranging between 2% and 5% over the course of treatment.
Conclusions: We propose a comprehensive workflow for analyzing and modeling breast deformations during radiotherapy using surface acquisitions, incorporating a novel inter-collection shape matching approach to model shape variability within a shared space across multiple patient shape collections. We validate our method using 3D surface data acquired from patients during External Beam Radiotherapy (EBRT) sessions, demonstrating its effectiveness. The clinical trial data used in this paper is registered under the ClinicalTrials.gov ID NCT03801850.
- [233] arXiv:2503.21318 [pdf, html, other]
-
Title: Explicit error bounds and guaranteed convergence of the Koopman-Hill projection stability method for linear time-periodic dynamicsComments: preprint, 34 pages, 10 figuresSubjects: Numerical Analysis (math.NA); Dynamical Systems (math.DS)
The Koopman-Hill projection method is used to approximate the fundamental solution matrix of linear time-periodic ordinary differential equations, possibly stemming from linearization around a periodic solution of a nonlinear dynamical system. By expressing both the true fundamental solution and its approximation as series, we derive an upper bound for the approximation error that decays exponentially with the size of the Hill matrix. Exponential decay of the Fourier coefficients of the system dynamics is key to guarantee convergence. The paper also analyzes a subharmonic formulation that improves the convergence rate. Two numerical examples, including a Duffing oscillator, illustrate the theoretical findings.
- [234] arXiv:2503.21322 [pdf, html, other]
-
Title: HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge RepresentationHaoran Luo, Haihong E, Guanting Chen, Yandan Zheng, Xiaobao Wu, Yikai Guo, Qika Lin, Yu Feng, Zemin Kuang, Meina Song, Yifan Zhu, Luu Anh TuanComments: PreprintSubjects: Artificial Intelligence (cs.AI)
While standard Retrieval-Augmented Generation (RAG) is based on chunks, GraphRAG structures knowledge as graphs to leverage the relations among entities. However, previous GraphRAG methods are limited to binary relations: one edge in the graph connects only two entities, so it cannot adequately model the n-ary relations among more than two entities that are common in reality. To address this limitation, we propose HyperGraphRAG, a novel hypergraph-based RAG method that represents n-ary relational facts via hyperedges, modeling the complicated n-ary relations in the real world. To retrieve and generate over hypergraphs, we introduce a complete pipeline with a hypergraph construction method, a hypergraph retrieval strategy, and a hypergraph-guided generation mechanism. Experiments across medicine, agriculture, computer science, and law demonstrate that HyperGraphRAG outperforms standard RAG and GraphRAG in accuracy and generation quality.
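The difference between a binary edge and a hyperedge can be made concrete with a small sketch. It assumes entities are already extracted as strings and ranks facts by simple entity overlap; the class `HypergraphStore`, its methods, and the example facts are hypothetical illustrations, not the HyperGraphRAG pipeline.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


class HypergraphStore:
    """Stores n-ary facts as hyperedges: one edge links any number of entities."""

    def __init__(self) -> None:
        # Each hyperedge keeps the fact text and the full set of entities it connects.
        self.hyperedges: List[Tuple[str, FrozenSet[str]]] = []
        # Inverted index from entity to the ids of hyperedges that contain it.
        self.entity_index: Dict[str, Set[int]] = defaultdict(set)

    def add_fact(self, text: str, entities: Iterable[str]) -> int:
        edge_id = len(self.hyperedges)
        members = frozenset(entities)
        self.hyperedges.append((text, members))
        for entity in members:
            self.entity_index[entity].add(edge_id)
        return edge_id

    def retrieve(self, query_entities: Iterable[str], top_k: int = 3) -> List[str]:
        """Rank facts by how many query entities their hyperedge contains."""
        scores: Dict[int, int] = defaultdict(int)
        for entity in set(query_entities):
            for edge_id in self.entity_index.get(entity, ()):
                scores[edge_id] += 1
        ranked = sorted(scores, key=lambda i: (-scores[i], i))
        return [self.hyperedges[i][0] for i in ranked[:top_k]]


store = HypergraphStore()
# A ternary fact stays in one hyperedge instead of being split into three pairs.
store.add_fact("Alice met Bob in Paris", ["Alice", "Bob", "Paris"])
store.add_fact("Alice works at Acme", ["Alice", "Acme"])
print(store.retrieve(["Bob", "Paris"]))
```

With binary edges, the first fact would be split into pairs such as (Alice, Bob) and (Bob, Paris), and the link showing that all three belong to one event would be lost. Keeping the whole entity set on a single hyperedge preserves it.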
- [235] arXiv:2503.21323 [pdf, other]
-
Title: DuckSegmentation: A segmentation model based on the AnYue Hemp Duck DatasetSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The modernization of smart farming is a way to improve both agricultural production efficiency and the agricultural production environment. Although many large models have achieved high accuracy in object recognition and segmentation, they cannot readily be put into use in the farming industry due to their poor interpretability and computational cost. In this paper, we build the AnYue Shelduck Dataset, which contains a total of 1951 Shelduck samples, with target detection and segmentation annotations produced with the help of professional annotators. Based on the AnYue Shelduck Dataset, this paper describes DuckProcessing, an efficient and powerful module for duck identification based on real shelduck farms. First, a YOLOv8 module is used to detect the ducks, reaching a Precision of 98.10%, a Recall of 96.53% and an F1 score of 0.95 on the test set. Next, the DuckSegmentation segmentation model reaches 96.43% mIoU. Finally, DuckSegmentation is used as the teacher model and Deeplabv3 r50 as the student model; through knowledge distillation, the final student model achieves 94.49% mIoU on the test set. The method provides a new way of thinking for practical hemp duck smart farming.
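The teacher-student step can be sketched with the classic temperature-softened distillation loss. The abstract does not state which loss the authors use, so the formulation below (KL divergence between softened teacher and student distributions, scaled by the squared temperature) is an assumption, and the function names and logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Per-pixel class logits for one pixel (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(distillation_loss(student, teacher))
```

In a segmentation setting this term would typically be averaged over all pixels and combined with the ordinary supervised loss on the ground-truth masks.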
- [236] arXiv:2503.21329 [pdf, html, other]
-
Title: When is a Bottom-Up Deterministic Tree Translation Top-Down Deterministic?Subjects: Formal Languages and Automata Theory (cs.FL)
We consider two natural subclasses of deterministic top-down tree-to-tree transducers, namely, linear and uniform-copying transducers. For both classes we show that it is decidable whether the translation of a transducer with look-ahead can be realized by a transducer from the same class without look-ahead. The transducers constructed in this way may still make use of inspection, i.e., have an additional tree automaton restricting the domain. We provide a second procedure which decides whether inspection can be removed. The procedure relies on a precise abstract interpretation of inspection requirements and a dedicated earliest normal form for linear as well as uniform-copying transducers which can be constructed in polynomial time. As a consequence, equivalence of these transducers can be decided in polynomial time. Applying these results to deterministic bottom-up tree transducers, we obtain that it is decidable whether or not their translations can be realized by deterministic linear or uniform-copying top-down transducers without look-ahead (but with inspection) -- or without both look-ahead and inspection. Look-ahead removal has been known to be a notoriously difficult problem. To the best of our knowledge, this paper is the first to present look-ahead removal for natural and known subclasses of top-down tree transducers.
- [237] arXiv:2503.21330 [pdf, html, other]
-
Title: Large Language Models for Traffic and Transportation Research: Methodologies, State of the Art, and Future OpportunitiesYimo Yan, Yejia Liao, Guanhao Xu, Ruili Yao, Huiying Fan, Jingran Sun, Xia Wang, Jonathan Sprinkle, Ziyan An, Meiyi Ma, Xi Cheng, Tong Liu, Zemian Ke, Bo Zou, Matthew Barth, Yong-Hong KuoSubjects: Computational Engineering, Finance, and Science (cs.CE)
The rapid rise of Large Language Models (LLMs) is transforming traffic and transportation research, with significant advancements emerging between the years 2023 and 2025 -- a period marked by the inception and swift growth of adopting and adapting LLMs for various traffic and transportation applications. However, despite these significant advancements, a systematic review and synthesis of the existing studies remain lacking. To address this gap, this paper provides a comprehensive review of the methodologies and applications of LLMs in traffic and transportation, highlighting their ability to process unstructured textual data to advance transportation research. We explore key applications, including autonomous driving, travel behavior prediction, and general transportation-related queries, alongside methodologies such as zero- or few-shot learning, prompt engineering, and fine-tuning. Our analysis identifies critical research gaps. From the methodological perspective, many research gaps can be addressed by integrating LLMs with existing tools and refining LLM architectures. From the application perspective, we identify numerous opportunities for LLMs to tackle a variety of traffic and transportation challenges, building upon existing research. By synthesizing these findings, this review not only clarifies the current state of LLM adoption and adaptation in traffic and transportation but also proposes future research directions, paving the way for smarter and more sustainable transportation systems.
- [238] arXiv:2503.21332 [pdf, html, other]
-
Title: ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Summarization refinement faces challenges when extended to multiple dimensions. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting that reflective reasoning and simultaneously addressing multiple feedback are crucial to mitigating trade-offs between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our findings emphasize that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.
- [239] arXiv:2503.21335 [pdf, html, other]
-
Title: A Low-Power Streaming Speech Enhancement Accelerator For Edge Devices
Journal-ref: in IEEE Open Journal of Circuits and Systems, vol. 5, pp. 128-140, 2024
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution through the co-design of model compression and target application, which reduces model size by 93.9% using the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employ softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array with configurable SRAM addressing, thereby minimizing hardware complexity and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.
- [240] arXiv:2503.21337 [pdf, html, other]
-
Title: A 71.2-$\mu$W Speech Recognition Accelerator with Recurrent Spiking Neural Network
Journal-ref: in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, no. 7, pp. 3203-3213, July 2024
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper introduces a 71.2-$\mu$W speech recognition accelerator designed for real-time applications on edge devices, emphasizing an ultra-low-power design. Through algorithm and hardware co-optimization, we propose a compact recurrent spiking neural network with two recurrent layers, one fully connected layer, and a low time step (1 or 2). The 2.79-MB model undergoes pruning and 4-bit fixed-point quantization, shrinking it by 96.42% to 0.1 MB. On the hardware front, we take advantage of mixed-level pruning, zero-skipping and merged spike techniques, reducing complexity by 90.49% to 13.86 MMAC/S. The parallel time-step execution addresses inter-time-step data dependencies and enables weight buffer power savings through weight sharing. Capitalizing on the sparse spike activity, an input broadcasting scheme eliminates zero computations, further saving power. Implemented on the TSMC 28-nm process, the design operates in real time at 100 kHz, consuming 71.2 $\mu$W, surpassing state-of-the-art designs. At 500 MHz, it has 28.41 TOPS/W and 1903.11 GOPS/mm$^2$ in energy and area efficiency, respectively.
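As a quick sanity check of the compression figures quoted in this abstract, the arithmetic can be reproduced in a few lines. This is only a sketch: the helper name `compressed_size` is a hypothetical name chosen here, and the two inputs are the figures stated in the abstract.

```python
# Figures quoted in the abstract.
original_mb = 2.79   # model size before compression, in MB
reduction = 0.9642   # fraction removed by pruning and 4-bit quantization


def compressed_size(size_mb: float, fraction_removed: float) -> float:
    """Return the size that remains after removing the given fraction."""
    return size_mb * (1.0 - fraction_removed)


remaining_mb = compressed_size(original_mb, reduction)
print(f"{remaining_mb:.4f} MB")  # roughly 0.1 MB, consistent with the abstract
```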
- [241] arXiv:2503.21338 [pdf, html, other]
-
Title: UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation
Comments: Accepted to IEEE Robotics and Automation Letters (RA-L)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at this https URL.
- [242] arXiv:2503.21346 [pdf, html, other]
-
Title: Scalable Expectation Estimation with Subtractive Mixture Models
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Many Monte Carlo (MC) and importance sampling (IS) methods use mixture models (MMs) for their simplicity and ability to capture multimodal distributions. Recently, subtractive mixture models (SMMs), i.e. MMs with negative coefficients, have shown greater expressiveness and success in generative modeling. However, their negative parameters complicate sampling, requiring costly auto-regressive techniques or accept-reject algorithms that do not scale in high dimensions. In this work, we use the difference representation of SMMs to construct an unbiased IS estimator ($\Delta\text{Ex}$) that removes the need to sample from the SMM, enabling high-dimensional expectation estimation with SMMs. In our experiments, we show that $\Delta\text{Ex}$ can achieve comparable estimation quality to auto-regressive sampling while being considerably faster in MC estimation. Moreover, we conduct initial experiments with $\Delta\text{Ex}$ using hand-crafted proposals, gaining first insights into how to construct safe proposals for $\Delta\text{Ex}$.
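The idea of working with the difference representation of a signed mixture can be illustrated on a toy case. The sketch below is not the paper's $\Delta\text{Ex}$ estimator. It assumes a hypothetical one-dimensional density p(x) = 2 N(x; 0, 2^2) - N(x; 0, 1), which is non-negative and integrates to 1, and estimates an expectation under p as a weighted difference of two ordinary Monte Carlo averages, so no sample is ever drawn from p itself. The function name and the chosen components are illustrative assumptions.

```python
import random


def signed_mixture_expectation(f, n, rng):
    """Estimate E_p[f] for the toy density p(x) = 2*N(x; 0, 2^2) - N(x; 0, 1).

    The positive and negative components are sampled separately, and the
    two sample means are combined with the mixture's signed weights.
    """
    pos = sum(f(rng.gauss(0.0, 2.0)) for _ in range(n)) / n  # E under N(0, 4)
    neg = sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n  # E under N(0, 1)
    return 2.0 * pos - 1.0 * neg


rng = random.Random(0)
estimate = signed_mixture_expectation(lambda x: x * x, 100_000, rng)
print(estimate)  # should land near the exact value 2*4 - 1*1 = 7
```

The estimate is unbiased because expectation is linear in the density, but the variance of a difference of averages can be large when the two terms nearly cancel, which is one reason proposal design matters in this setting.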
- [243] arXiv:2503.21347 [pdf, html, other]
-
Title: Residual Learning Inspired Crossover Operator and Strategy Enhancements for Evolutionary Multitasking
Comments: 9 pages, 4 figures
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
In evolutionary multitasking, strategies such as crossover operators and skill factor assignment are critical for effective knowledge transfer. Existing improvements to crossover operators primarily focus on low-dimensional variable combinations, such as arithmetic crossover or partially mapped crossover, which are insufficient for modeling complex high-dimensional this http URL, static or semi-dynamic crossover strategies fail to adapt to the dynamic dependencies among tasks. In addition, current Multifactorial Evolutionary Algorithm frameworks often rely on fixed skill factor assignment strategies, lacking flexibility. To address these limitations, this paper proposes the Multifactorial Evolutionary Algorithm-Residual Learning (MFEA-RL) method based on residual learning. The method employs a Very Deep Super-Resolution (VDSR) model to generate high-dimensional residual representations of individuals, enhancing the modeling of complex relationships within dimensions. A ResNet-based mechanism dynamically assigns skill factors to improve task adaptability, while a random mapping mechanism efficiently performs crossover operations and mitigates the risk of negative transfer. Theoretical analysis and experimental results show that MFEA-RL outperforms state-of-the-art multitasking algorithms. It excels in both convergence and adaptability on standard evolutionary multitasking benchmarks, including CEC2017-MTSO and WCCI2020-MTSO. Additionally, its effectiveness is validated through a real-world application scenario.
- [244] arXiv:2503.21349 [pdf, html, other]
-
Title: Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records
Comments: 4 pages, 2 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
We investigate the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks. Using a German cardiology report dataset and the i2b2 Smoking Challenge dataset, we demonstrate that fine-tuning small LLMs locally on limited training data can improve performance, achieving results comparable to those of larger models. Our experiments show that fine-tuning improves performance on both tasks, with notable gains observed with as few as 200-300 training examples. Overall, the study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and efficiently extracting structured data from unstructured medical text.
- [245] arXiv:2503.21350 [pdf, html, other]
-
Title: A Data-Driven Method for INS/DVL Alignment
Subjects: Robotics (cs.RO); Software Engineering (cs.SE)
Autonomous underwater vehicles (AUVs) are sophisticated robotic platforms crucial for a wide range of applications. The accuracy of AUV navigation systems is critical to their success. Fusing inertial sensors with Doppler velocity logs (DVLs) is a promising solution for long-range underwater navigation. However, the effectiveness of this fusion depends heavily on an accurate alignment between the inertial sensors and the DVL. While current alignment methods show promise, there remains significant room for improvement in terms of accuracy, convergence time, and alignment trajectory efficiency. In this research, we propose an end-to-end deep learning framework for the alignment process. By leveraging deep-learning capabilities, such as noise reduction and the capture of nonlinearities in the data, we show, using simulated data, that our proposed approach both enhances alignment accuracy and reduces convergence time beyond current model-based methods.
- [246] arXiv:2503.21352 [pdf, other]
-
Title: Using large language models to produce literature reviews: Usages and systematic biases of microphysics parametrizations in 2699 publications
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parametrizations in the Weather Research and Forecasting (WRF) model, with the goal of learning how they were used and what their systematic biases were when simulating precipitation. The database was constructed of publications identified from Web of Science and Scopus searches. The large language model GPT-4 Turbo was used to extract information about model configurations and performance from the text of 2699 publications. Our results reveal the landscape of how nine of the most popular microphysics parameterizations have been used around the world: Lin, Ferrier, WRF Single-Moment, Goddard Cumulus Ensemble, Morrison, Thompson, and WRF Double-Moment. More studies used one-moment parameterizations before 2020 and two-moment parameterizations after 2020. Seven out of nine parameterizations tended to overestimate precipitation. However, systematic biases of parameterizations differed in various regions. Simulations using the Lin, Ferrier, and Goddard parameterizations tended to underestimate precipitation over almost all locations, whereas the remaining six parameterizations tended to overestimate, particularly over China, southeast Asia, western United States, and central Africa. This method could be used by other researchers to help understand how the increasingly massive body of scientific literature can be harnessed through the power of artificial intelligence to solve their research problems.
- [247] arXiv:2503.21356 [pdf, html, other]
-
Title: Investigating the Duality of Interpretability and Explainability in Machine Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rapid evolution of machine learning (ML) has led to the widespread adoption of complex "black box" models, such as deep neural networks and ensemble methods. These models exhibit exceptional predictive performance, making them invaluable for critical decision-making across diverse domains within society. However, their inherently opaque nature raises concerns about transparency and interpretability, making them untrustworthy decision support systems. To alleviate such a barrier to high-stakes adoption, the research community has focused on developing methods to explain black box models as a means to address the challenges they pose. Efforts are focused on explaining these models instead of developing ones that are inherently interpretable. Designing inherently interpretable models from the outset, however, can pave the path towards responsible and beneficial applications in the field of ML. In this position paper, we clarify the chasm between explaining black boxes and adopting inherently interpretable models. We emphasize the imperative need for model interpretability and, with the aim of attaining better (i.e., more effective or efficient w.r.t. predictive performance) and trustworthy predictors, provide an experimental evaluation of the latest hybrid learning methods that integrate symbolic knowledge into neural network predictors. We demonstrate how interpretable hybrid models could potentially supplant black box ones in different domains.
- [248] arXiv:2503.21360 [pdf, html, other]
-
Title: From User Preferences to Optimization Constraints Using Large Language Models
Authors: Manuela Sanguinetti, Alessandra Perniciano, Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Maurizio Atzori
Subjects: Computation and Language (cs.CL)
This work explores using Large Language Models (LLMs) to translate user preferences into energy optimization constraints for home appliances. We describe a task where natural language user utterances are converted into formal constraints for smart appliances, within the broader context of a renewable energy community (REC) and in the Italian scenario. We evaluate the effectiveness of various LLMs currently available for Italian in translating these preferences, resorting to classical zero-shot, one-shot, and few-shot learning settings and using a pilot dataset of Italian user requests paired with the corresponding formal constraint representations. Our contributions include establishing a baseline performance for this task, publicly releasing the dataset and code for further research, and providing insights on observed best practices and limitations of LLMs in this particular domain.
- [249] arXiv:2503.21361 [pdf, other]
-
Title: Computing adjoint mismatch of linear maps
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
This paper considers the problem of detecting adjoint mismatch for two linear maps. To clarify, this means that we aim to calculate the operator norm for the difference of two linear maps, where for one we only have a black-box implementation for the evaluation of the map, and for the other we only have a black-box for the evaluation of the adjoint map. We give two stochastic algorithms for which we prove the almost sure convergence to the operator norm. The algorithm is a random search method for a generalization of the Rayleigh quotient and uses optimal step sizes. Additionally, a convergence analysis is done for the corresponding singular vector and the respective eigenvalue equation.
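The black-box setting in this abstract can be made concrete with the identity that the inner product of (A - V)x with y equals the inner product of Ax with y minus the inner product of x with V*y, which needs only forward evaluations of A and adjoint evaluations of V. The sketch below is not the paper's stochastic algorithm. It only evaluates this bilinear form on small dense matrices, and the helper names (`matvec`, `mismatch_form`) and the perturbed entry are hypothetical choices made for illustration.

```python
import random


def matvec(M, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def transpose(M):
    return [list(col) for col in zip(*M)]


def mismatch_form(apply_A, apply_V_adjoint, x, y):
    """Evaluate <(A - V)x, y> using only x -> Ax and y -> V*y."""
    return dot(apply_A(x), y) - dot(x, apply_V_adjoint(y))


rng = random.Random(1)
A = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # a 3x4 map
V = [row[:] for row in A]
V[0][0] += 0.5  # deliberately introduce a mismatch in one entry

x = [rng.uniform(-1, 1) for _ in range(4)]
y = [rng.uniform(-1, 1) for _ in range(3)]
At, Vt = transpose(A), transpose(V)

# Exact adjoint pair: the form vanishes up to rounding error.
matched = mismatch_form(lambda v: matvec(A, v), lambda w: matvec(At, w), x, y)
# Mismatched pair: the form equals -0.5 * x[0] * y[0] for this perturbation.
mismatched = mismatch_form(lambda v: matvec(A, v), lambda w: matvec(Vt, w), x, y)
print(matched, mismatched)
```

For unit vectors x and y, the absolute value of this form never exceeds the operator norm of A - V, and its supremum over unit vectors equals that norm, so any search over x and y yields a lower bound on the mismatch.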
- [250] arXiv:2503.21364 [pdf, html, other]
-
Title: LandMarkSystem Technical Report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative this http URL comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction this http URL facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository at this https URL.
- [251] arXiv:2503.21365 [pdf, html, other]
-
Title: CA+: Cognition Augmented Counselor Agent Framework for Long-term Dynamic Client Engagement
Subjects: Human-Computer Interaction (cs.HC)
Current AI counseling systems struggle with maintaining effective long-term client engagement. Through formative research with counselors and a systematic literature review, we identified five key design considerations for AI counseling interactions. Based on these insights, we propose CA+, a Cognition Augmented counselor framework enhancing contextual understanding through three components:
(1) Therapy Strategies Module: Implements hierarchical Goals-Session-Action planning with bidirectional adaptation based on client feedback; (2) Communication Form Module: Orchestrates parallel guidance and empathy pathways for balanced therapeutic progress and emotional resonance; (3) Information Management: Utilizes client profile and therapeutic knowledge databases for dynamic, context-aware interventions.
A three-day longitudinal study with 24 clients demonstrates CA+'s significant improvements in client engagement, perceived empathy, and overall satisfaction compared to a baseline system. In addition, two licensed counselors confirm its high professionalism. Our research demonstrates the potential for enhancing LLM engagement in psychological counseling dialogues through cognitive theory, which may inspire further innovations in computational interaction in the future.
- [252] arXiv:2503.21367 [pdf, html, other]
-
Title: Multimodal surface defect detection from wooden logs for sawing optimization
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a novel, high-quality, and less demanding method for detecting knots on the surface of wooden logs using multimodal data fusion. Knots are a primary factor affecting the quality of sawn timber, making their detection fundamental to any timber grading or cutting optimization system. While X-ray computed tomography provides accurate knot locations and internal structures, it is often too slow or expensive for practical use. An attractive alternative is to use fast and cost-effective log surface measurements, such as laser scanners or RGB cameras, to detect surface knots and estimate the internal structure of wood. However, due to the small size of knots and noise caused by factors such as bark and other natural variations, detection accuracy often remains low when only one measurement modality is used. In this paper, we demonstrate that by using a data fusion pipeline consisting of separate streams for RGB and point cloud data, combined by a late fusion module, higher knot detection accuracy can be achieved compared to using either modality alone. We further propose a simple yet efficient sawing angle optimization method that utilizes surface knot detections and cross-correlation to minimize the amount of unwanted arris knots, demonstrating its benefits over randomized sawing angles.
- [253] arXiv:2503.21377 [pdf, html, other]
-
Title: Unsupervised Real-World Denoising: Sparsity is All You Need
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Supervised training for real-world denoising presents challenges due to the difficulty of collecting large datasets of paired noisy and clean images. Recent methods have attempted to address this by utilizing unpaired datasets of clean and noisy images. Some approaches leverage such unpaired data to train denoisers in a supervised manner by generating synthetic clean-noisy pairs. However, these methods often fall short due to the distribution gap between synthetic and real noisy images. To mitigate this issue, we propose a solution based on input sparsification, specifically using random input masking. Our method, which we refer to as Mask, Inpaint and Denoise (MID), trains a denoiser to simultaneously denoise and inpaint synthetic clean-noisy pairs. On one hand, input sparsification reduces the gap between synthetic and real noisy images. On the other hand, an inpainter trained in a supervised manner can still accurately reconstruct sparse inputs by predicting missing clean pixels using the remaining unmasked pixels. Our approach begins with a synthetic Gaussian noise sampler and iteratively refines it using a noise dataset derived from the denoiser's predictions. The noise dataset is created by subtracting predicted pseudo-clean images from real noisy images at each iteration. The core intuition is that improving the denoiser results in a more accurate noise dataset and, consequently, a better noise sampler. We validate our method through extensive experiments on real-world noisy image datasets, demonstrating competitive performance compared to existing unsupervised denoising methods.
- [254] arXiv:2503.21378 [pdf, html, other]
-
Title: Retrieving Time-Series Differences Using Natural Language Queries
Subjects: Computation and Language (cs.CL)
Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
- [255] arXiv:2503.21380 [pdf, html, other]
-
Title: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Authors: Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, Ji-Rong Wen
Comments: Technical Report on Slow Thinking with LLMs: Evaluation Benchmark
Subjects: Computation and Language (cs.CL)
In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: this https URL.
- [256] arXiv:2503.21383 [pdf, other]
-
Title: Controlling Large Language Model with Latent Actions
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent action enables greater semantic diversity in text generation. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs by RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.
- [257] arXiv:2503.21392 [pdf, html, other]
-
Title: HybridoNet-Adapt: A Domain-Adapted Framework for Accurate Lithium-Ion Battery RUL Prediction
Subjects: Artificial Intelligence (cs.AI)
Accurate prediction of the remaining useful life (RUL) in lithium-ion battery (LIB) health management systems is crucial for ensuring reliability and safety. Current methods typically assume that training and testing data share the same distribution, overlooking the benefits of incorporating diverse data sources to enhance model performance. To address this limitation, we introduce a data-independent RUL prediction framework along with its domain adaptation (DA) approach, which leverages heterogeneous data sources for improved target predictions. Our approach integrates comprehensive data preprocessing, including feature extraction, denoising, and normalization, with a data-independent prediction model that combines Long Short-Term Memory (LSTM), Multihead Attention, and a Neural Ordinary Differential Equation (NODE) block, termed HybridoNet. The domain-adapted version, HybridoNet-Adapt, is trained using a novel technique inspired by the Domain-Adversarial Neural Network (DANN) framework, a regression ensemble method, and Maximum Mean Discrepancy (MMD) to learn domain-invariant features from labeled cycling data in the source and target domains. Experimental results demonstrate that our approach outperforms state-of-the-art techniques, providing reliable RUL predictions for real-world applications.
- [258] arXiv:2503.21393 [pdf, html, other]
-
Title: An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) have become prominent in language translation, including for low-resource languages. There has been limited study of the quality of translations generated by systems including Gemini, GPT and Google Translate. In this study, we address this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts that have been well translated by experts and use LLMs to generate their translations to English, and then we provide a comparison with selected expert (human) translations. Our findings suggest that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts. The sentiment analysis revealed that GPT-4o and GPT-3.5 are better at preserving the sentiments for the Bhagavad Gita (Sanskrit-English) translations when compared to Google Translate. We observed a similar trend for the case of Tamas (Hindi-English) and Maha P (Telugu-English) translations. GPT-4o performs similarly to GPT-3.5 in terms of sentiment preservation for the three languages. We found that LLMs are generally better than Google Translate at capturing sentiments in translation.
- [259] arXiv:2503.21394 [pdf, html, other]
-
Title: Composable Prompting Workspaces for Creative Writing: Exploration and Iteration Using Dynamic Widgets
Comments: 11 pages, 9 figures, 2 tables, ACM CHI 2025 LBW
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface objects. We propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets. Users generate widgets through system suggestions, prompting, or manually to capture task-relevant facets that affect the generated text. In a comparative study with a baseline (conversational UI), 18 participants worked on two writing tasks, creating diverse prompting environments with custom widgets and spatial layouts. They reported having more control over the generated text and preferred our system over the baseline. Our design significantly outperformed the baseline on the Creativity Support Index, and participants felt the results were worth the effort. This work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase both the flexibility and efficiency of prompting.
- [260] arXiv:2503.21397 [pdf, other]
-
Title: ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth NetworksComments: CVPR2025Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at this https URL.
- [261] arXiv:2503.21400 [pdf, html, other]
-
Title: Lattice Based Crypto breaks in a Superposition of SpacetimesSubjects: Computational Complexity (cs.CC); Cryptography and Security (cs.CR)
We explore the computational implications of a superposition of spacetimes, a phenomenon hypothesized in quantum gravity theories. This was initiated by Shmueli (2024) where the author introduced the complexity class $\mathbf{BQP^{OI}}$ consisting of promise problems decidable by quantum polynomial time algorithms with access to an oracle for computing order interference. In this work, it was shown that the Graph Isomorphism problem and the Gap Closest Vector Problem (with approximation factor $\mathcal{O}(n^{3/2})$) are in $\mathbf{BQP^{OI}}$. We extend this result by showing that the entire complexity class $\mathbf{SZK}$ (Statistical Zero Knowledge) is contained within $\mathbf{BQP^{OI}}$. This immediately implies that the security of numerous lattice based cryptography schemes will be compromised in a computational model based on superposition of spacetimes, since these often rely on the hardness of the Learning with Errors problem, which is in $\mathbf{SZK}$.
- [262] arXiv:2503.21401 [pdf, html, other]
-
Title: AcL: Action Learner for Fault-Tolerant Quadruped Locomotion ControlTianyu Xu (1), Yaoyu Cheng (2), Pinxi Shen (2), Lin Zhao (1) (1)Electrical, Computer Engineering, National University of Singapore, Singapore, (2)Mechanical Engineering, National University of Singapore, SingaporeSubjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Quadrupedal robots can learn versatile locomotion skills but remain vulnerable when one or more joints lose power. In contrast, dogs and cats can adopt limping gaits when injured, demonstrating their remarkable ability to adapt to physical conditions. Inspired by such adaptability, this paper presents Action Learner (AcL), a novel teacher-student reinforcement learning framework that enables quadrupeds to autonomously adapt their gait for stable walking under multiple joint faults. Unlike conventional teacher-student approaches that enforce strict imitation, AcL leverages teacher policies to generate style rewards, guiding the student policy without requiring precise replication. We train multiple teacher policies, each corresponding to a different fault condition, and subsequently distill them into a single student policy with an encoder-decoder architecture. While prior works primarily address single-joint faults, AcL enables quadrupeds to walk with up to four faulty joints across one or two legs, autonomously switching between different limping gaits when faults occur. We validate AcL on a real Go2 quadruped robot under single- and double-joint faults, demonstrating fault-tolerant, stable walking, smooth gait transitions between normal and lame gaits, and robustness against external disturbances.
- [263] arXiv:2503.21406 [pdf, html, other]
-
Title: Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill LearningComments: IEEE International Conference on Robotics and Automation (ICRA) 2025Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
- [264] arXiv:2503.21407 [pdf, html, other]
-
Title: Age of Information in Short Packet Multi-Connectivity LinksSubjects: Information Theory (cs.IT)
In this paper, we investigate multi-connectivity (MC) schemes in the context of status update systems with short payloads. As a performance metric, we use the age of information (AoI). Due to short payloads, transmission errors must be taken into account. In addition to the well-known schemes of packet duplication, message splitting, and multiplexing, we propose a codeword splitting scheme, where each status update is jointly encoded across multiple channels. We derive closed-form expressions of the average AoI for the different schemes and optimize their corresponding parameters, such as blocklengths, message splits, and the cyclic schedule for the multiplexing scheme. Analytical comparisons and numerical evaluations show that the codeword splitting scheme achieves the lowest average AoI when joint encoding and decoding are possible. In scenarios where joint encoding is not feasible, whether message splitting or multiplexing results in a lower average AoI depends on the specific parameters.
- [265] arXiv:2503.21408 [pdf, html, other]
-
Title: VALLR: Visual ASR Language Model for Lip ReadingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes, where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly, often faltering on visually similar phonemes, or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), reaching a SOTA WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
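To illustrate the first stage described above, the following is a minimal sketch of greedy CTC decoding, which turns per-frame scores into a compact phoneme sequence by collapsing repeats and dropping blanks. The abstract only states that a CTC head is used; the greedy decoder, the tiny phoneme inventory, and the one-hot frame scores here are assumptions chosen for illustration, not the authors' implementation.

```python
def ctc_greedy_decode(frame_logits, blank=0):
    """Greedy CTC decoding: take the best symbol per frame, merge
    consecutive repeats, then drop the blank symbol."""
    best_per_frame = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    decoded = []
    previous = None
    for symbol in best_per_frame:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; a blank between repeats keeps both copies.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded


# Hypothetical phoneme inventory; index 0 is the CTC blank.
phonemes = ["<blank>", "HH", "AH", "L", "OW"]
# Hypothetical per-frame winners, expanded to one-hot "logits".
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 4]
frame_logits = [
    [1.0 if i == s else 0.0 for i in range(len(phonemes))] for s in frame_ids
]
decoded = [phonemes[i] for i in ctc_greedy_decode(frame_logits)]
print(decoded)  # ['HH', 'AH', 'L', 'L', 'OW']
```

The resulting phoneme list is the kind of compact intermediate sequence that the abstract says is passed to the language model in the second stage.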
- [266] arXiv:2503.21409 [pdf, html, other]
-
Title: Efficient Algorithms for Minimizing the Kirchhoff Index via Adding EdgesComments: Accepted by IEEE Transactions on Knowledge and Data EngineeringSubjects: Discrete Mathematics (cs.DM)
The Kirchhoff index, which is the sum of the resistance distance between every pair of nodes in a network, is a key metric for gauging network performance, where lower values signify enhanced performance. In this paper, we study the problem of minimizing the Kirchhoff index by adding edges. We first provide a greedy algorithm for solving this problem and give an analysis of its quality based on the bounds of the submodularity ratio and the curvature. Then, we introduce a gradient-based greedy algorithm as a new paradigm to solve this problem. To reduce the computation cost, we leverage geometric properties, convex hull approximation, and approximation of the projected coordinate of each point. To further improve this algorithm, we use pre-pruning and fast update techniques, making it particularly suitable for large networks. Our proposed algorithms have nearly-linear time complexity. We provide extensive experiments on ten real networks to evaluate the quality of our algorithms. The results demonstrate that our proposed algorithms outperform the state-of-the-art methods in terms of efficiency and effectiveness. Moreover, our algorithms are scalable to large graphs with over 5 million nodes and 12 million edges.
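As a concrete reading of the definition above, the sketch below computes the Kirchhoff index of a small connected graph through the standard identity Kf = n * trace(L^+), where L^+ is the pseudoinverse of the graph Laplacian, and then runs a naive greedy edge addition. This brute-force version recomputes the index for every candidate edge and is only meant to show the objective; it is not the gradient-based or nearly-linear-time algorithm of the paper, and the function names are hypothetical.

```python
import numpy as np


def laplacian(n, edges):
    """Unweighted graph Laplacian for nodes 0..n-1."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1
        L[v, v] += 1
        L[u, v] -= 1
        L[v, u] -= 1
    return L


def kirchhoff_index(n, edges):
    """Sum of resistance distances over all node pairs (connected graph)."""
    return n * np.trace(np.linalg.pinv(laplacian(n, edges)))


def greedy_add_edges(n, edges, k):
    """Naive greedy: k times, add the missing edge that lowers the index most."""
    edges = list(edges)
    present = {frozenset(e) for e in edges}
    for _ in range(k):
        candidates = [
            (u, v)
            for u in range(n)
            for v in range(u + 1, n)
            if frozenset((u, v)) not in present
        ]
        if not candidates:
            break
        best = min(candidates, key=lambda e: kirchhoff_index(n, edges + [e]))
        edges.append(best)
        present.add(frozenset(best))
    return edges


path = [(0, 1), (1, 2), (2, 3)]          # path graph on 4 nodes
print(kirchhoff_index(4, path))          # approximately 10
augmented = greedy_add_edges(4, path, 1)
print(augmented[-1])                     # (0, 3): closing the path into a cycle
print(kirchhoff_index(4, augmented))     # approximately 5
```

On the 4-node path, the resistance distances coincide with hop distances and sum to 10; closing the path into a 4-cycle halves the index, which is why the greedy step picks edge (0, 3).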
- [267] arXiv:2503.21410 [pdf, html, other]
-
Title: Diffusion Image PriorSubjects: Computer Vision and Pattern Recognition (cs.CV)
Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP) [16], since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge from corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, waterdrop removal, denoising and super-resolution with state-of-the-art results.
- [268] arXiv:2503.21411 [pdf, html, other]
-
Title: Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and RoadmapSubjects: Artificial Intelligence (cs.AI)
Modern transportation systems face pressing challenges due to increasing demand, dynamic environments, and heterogeneous information integration. The rapid evolution of Large Language Models (LLMs) offers transformative potential to address these challenges. Extensive knowledge and high-level capabilities derived from pretraining let LLMs evolve from their default role as text generators into versatile, knowledge-driven task solvers for intelligent transportation systems. This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation into four synergetic dimensions: information processors, knowledge encoders, component generators, and decision facilitators. Through a unified taxonomy, we systematically elucidate how LLMs bridge fragmented data pipelines, enhance predictive analytics, simulate human-like reasoning, and enable closed-loop interactions across sensing, learning, modeling, and managing tasks in transportation systems. For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization, highlighting how emergent capabilities of LLMs such as in-context learning and step-by-step reasoning can enhance the operation and management of transportation systems. We further curate practical guidance, including available resources and computational guidelines, to support real-world deployment. By identifying challenges in existing LLM-based solutions, this survey charts a roadmap for advancing LLM-driven transportation research, positioning LLMs as central actors in the next generation of cyber-physical-social mobility ecosystems. Online resources can be found on the project page: this https URL.
- [269] arXiv:2503.21412 [pdf, html, other]
-
Title: Federated Intelligence: When Large AI Models Meet Federated Fine-Tuning and Collaborative Reasoning at the Network EdgeComments: 8 pages, 6 figuresJournal-ref: IEEE Internet of Things Magazine, 2025Subjects: Artificial Intelligence (cs.AI)
Large artificial intelligence (AI) models exhibit remarkable capabilities in various application scenarios, but deploying them at the network edge poses significant challenges due to issues such as data privacy, computational resources, and latency. In this paper, we explore federated fine-tuning and collaborative reasoning techniques to facilitate the implementation of large AI models in resource-constrained wireless networks. Firstly, promising applications of large AI models within specific domains are discussed. Subsequently, federated fine-tuning methods are proposed to adapt large AI models to specific tasks or environments at the network edge, effectively addressing the challenges associated with communication overhead and enhancing communication efficiency. These methodologies follow clustered, hierarchical, and asynchronous paradigms to effectively tackle privacy issues and eliminate data silos. Furthermore, to enhance operational efficiency and reduce latency, efficient frameworks for model collaborative reasoning are developed, which include decentralized horizontal collaboration, cloud-edge-end vertical collaboration, and multi-access collaboration. Next, simulation results demonstrate the effectiveness of our proposed methods in reducing the fine-tuning loss of large AI models across various downstream tasks. Finally, several open challenges and research opportunities are outlined.
- [270] arXiv:2503.21415 [pdf, other]
-
Title: Workshop Scientific HPC in the pre-Exascale era (part of ITADATA 2024) ProceedingsNicola Bena, Claudia Diamantini, Michela Natilli, Luigi Romano, Giovanni Stilo, Valentina Pansanella, Claudio A. Ardagna, Anna Monreale, Roberto Trasarti, Valentina Cesare, Gianluca Mittone, Emanuele De Rubeis, Alberto VecchiatoSubjects: Databases (cs.DB); Machine Learning (cs.LG)
The proceedings of the Workshop Scientific HPC in the pre-Exascale era (SHPC), held in Pisa, Italy, on September 18, 2024, are part of the 3rd Italian Conference on Big Data and Data Science (ITADATA2024) proceedings (arXiv: 2503.14937).
The main objective of the SHPC workshop was to discuss how the current most critical questions in HPC emerge in astrophysics, cosmology, and other scientific contexts and experiments. In particular, the SHPC workshop focused on:
$\bullet$ Scientific (mainly in astrophysical and medical fields) applications toward (pre-)Exascale computing
$\bullet$ Performance portability
$\bullet$ Green computing
$\bullet$ Machine learning
$\bullet$ Big Data management
$\bullet$ Programming on heterogeneous architectures
$\bullet$ Programming on accelerators
$\bullet$ I/O techniques
- [271] arXiv:2503.21419 [pdf, html, other]
-
Title: Neuroplasticity in Artificial Intelligence -- An Overview and Inspirations on Drop In \& Out LearningSubjects: Artificial Intelligence (cs.AI)
Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concepts of ``dropin'' for neurogenesis and revisiting ``dropout'' and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity, combining the two, for future large NNs in ``life-long learning'' settings following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.
- [272] arXiv:2503.21423 [pdf, other]
-
Title: Resilience and Volatility in Academic Publishing, The Case of the University of Maribor 2004-2023Subjects: Digital Libraries (cs.DL); Social and Information Networks (cs.SI)
This article investigates the dynamics of academic publishing resilience and volatility at Slovenia's University of Maribor (UM) from 2004 to 2023. This period was marked by significant economic pressures and policy shifts, including changes to higher education legislation and university funding. Using UM's employment data and OpenAlex publication records, the study examines the relationship between employed researcher numbers and unique authors publishing under the UM affiliation. Despite a substantial decrease in researcher employment during the 2009-2013 economic recession and austerity phase, the number of unique authors publishing with UM affiliation surprisingly increased. This growth was driven by factors such as a shift towards project-based funding, contributions from an expanding doctoral student cohort, and increased international collaborations. Analysis of author turnover reveals a notable contrast: high short-term volatility (annual churn rates of ~40-50%) versus significant mid-term stability (5-year churn rates of ~8-10%). Survival analysis confirms this trend, showing high initial attrition among publishing authors but long-term persistence for a core group. Furthermore, co-authorship network analysis indicates the UM research network has become more resilient over time. A critical finding is a fundamental shift in network structure around 2016, transitioning from disassortative to assortative mixing, signaling profound changes in collaboration dynamics. The findings carry implications for research policy and university management, highlighting the necessity of balancing short-term performance indicators with the long-term stability and resilience essential for a thriving research community.
- [273] arXiv:2503.21424 [pdf, html, other]
-
Title: Scaling Automated Database System TestingSubjects: Software Engineering (cs.SE); Databases (cs.DB)
Recently, various automated testing approaches have been proposed that use specialized test oracles to find hundreds of logic bugs in mature, widely-used Database Management Systems (DBMSs). These test oracles require database and query generators, which must account for the often significant differences between the SQL dialects of these systems. Since it can take weeks to implement such generators, many DBMS developers are unlikely to invest the time to adopt such automated testing approaches. In short, existing approaches fail to scale to the plethora of DBMSs. In this work, we present both a vision and a platform, SQLancer++, to apply test oracles to any SQL-based DBMS that supports a subset of common SQL features. Our technical core contribution is a novel architecture for an adaptive SQL statement generator. This adaptive SQL generator generates SQL statements with various features, some of which might not be supported by the given DBMS, and then learns through interaction with the DBMS which of these features it understands. Thus, over time, the generator will generate mostly valid SQL statements. We evaluated SQLancer++ across 17 DBMSs and discovered a total of 195 unique, previously unknown bugs, of which 180 were fixed after we reported them. While SQLancer++ is the first major step towards scaling automated DBMS testing, various follow-up challenges remain.
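The feedback loop described above can be sketched as follows. This is a hypothetical illustration, not SQLancer++'s actual architecture or API: the class name, the success-rate threshold, the feature labels, and the mock DBMS are all assumptions. The idea shown is only that the generator tries features, records whether the resulting statements were accepted, and stops using features that the DBMS keeps rejecting.

```python
import random


class AdaptiveFeatureSelector:
    """Hypothetical sketch: learn from accept/reject feedback which
    SQL features a DBMS appears to support."""

    def __init__(self, features, min_trials=20, min_success_rate=0.1, seed=0):
        # feature -> [successes, trials]
        self.stats = {f: [0, 0] for f in features}
        self.min_trials = min_trials
        self.min_success_rate = min_success_rate
        self.rng = random.Random(seed)

    def enabled(self, feature):
        successes, trials = self.stats[feature]
        if trials < self.min_trials:
            return True  # not enough evidence yet, keep exploring
        return successes / trials >= self.min_success_rate

    def enabled_features(self):
        return [f for f in self.stats if self.enabled(f)]

    def pick(self, k=2):
        """Choose up to k currently enabled features for the next statement."""
        pool = self.enabled_features()
        return self.rng.sample(pool, min(k, len(pool)))

    def report(self, used, ok):
        """Record whether a statement using these features was accepted."""
        for f in used:
            self.stats[f][1] += 1
            if ok:
                self.stats[f][0] += 1


def mock_dbms_accepts(used, supported):
    # Stand-in for executing a statement: it succeeds only if every
    # feature it uses is supported by this (mock) dialect.
    return all(f in supported for f in used)


features = ["JOIN", "CTE", "WINDOW", "GROUP BY", "FULL OUTER JOIN", "LATERAL"]
supported = {"JOIN", "CTE", "WINDOW", "GROUP BY"}  # assumed dialect
selector = AdaptiveFeatureSelector(features)
for _ in range(2000):
    used = selector.pick(2)
    selector.report(used, mock_dbms_accepts(used, supported))
learned = set(selector.enabled_features())
print(sorted(learned))
```

A failure here is blamed on every feature in the statement, so supported features also accumulate some failures early on; the low threshold keeps them enabled, while features that never succeed are dropped once they have been tried often enough.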
- [274] arXiv:2503.21425 [pdf, html, other]
-
Title: STAMICS: Splat, Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAMSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments. Current SLAM approaches predominantly rely on geometric cues for mapping and localization, but they often fail to ensure semantic consistency, particularly in dynamic or densely populated scenes. To address this limitation, we introduce STAMICS, a novel method that integrates semantic information with 3D Gaussian representations to enhance both localization and mapping accuracy. STAMICS consists of three key components: a 3D Gaussian-based scene representation for high-fidelity reconstruction, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system that allows for the classification of unseen objects. Extensive experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming state-of-the-art methods while reducing reconstruction errors. Code will be publicly available.
- [275] arXiv:2503.21426 [pdf, html, other]
-
Title: AdvSGM: Differentially Private Graph Learning via Adversarial Skip-gram ModelComments: Accepted by ICDE 2025Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
The skip-gram model (SGM), which employs a neural network to generate node vectors, serves as the basis for numerous popular graph embedding techniques. However, since the training datasets contain sensitive linkage information, the parameters of a released SGM may encode private information and pose significant privacy risks. Differential privacy (DP) is a rigorous standard for protecting individual privacy in data analysis. Nevertheless, when applying differential privacy to skip-gram in graphs, it becomes highly challenging due to the complex link relationships, which potentially result in high sensitivity and necessitate substantial noise injection. To tackle this challenge, we present AdvSGM, a differentially private skip-gram for graphs via adversarial training. Our core idea is to leverage adversarial training to privatize skip-gram while improving its utility. Towards this end, we develop a novel adversarial training module by devising two optimizable noise terms that correspond to the parameters of a skip-gram. By fine-tuning the weights between modules within AdvSGM, we can achieve differentially private gradient updates without additional noise injection. Extensive experimental results on six real-world graph datasets show that AdvSGM preserves high data utility across different downstream tasks.
- [276] arXiv:2503.21431 [pdf, html, other]
-
Title: Nearest Neighbour Equilibrium ClusteringComments: Currently being considered for publication by IEEESubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A novel and intuitive nearest neighbours based clustering algorithm is introduced, in which a cluster is defined in terms of an equilibrium condition which balances its size and cohesiveness. The formulation of the equilibrium condition allows for a quantification of the strength of alignment of each point to a cluster, with these cluster alignment strengths leading naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from this https URL.
- [277] arXiv:2503.21433 [pdf, other]
-
Title: On Learning-Based Traffic Monitoring With a Swarm of DronesComments: Extended version of the paper accepted for presentation at the 23rd IEEE European Control Conference (ECC 2025), Thessaloniki, GreeceSubjects: Systems and Control (eess.SY)
Efficient traffic monitoring is crucial for managing urban transportation networks, especially under congested and dynamically changing traffic conditions. Drones offer a scalable and cost-effective alternative to fixed sensor networks. However, deploying fleets of low-cost drones for traffic monitoring poses challenges in adaptability, scalability, and real-time operation. To address these issues, we propose a learning-based framework for decentralized traffic monitoring with drone swarms, targeting the uneven and unpredictable distribution of monitoring needs across urban areas. Our approach introduces a semi-decentralized reinforcement learning model, which trains a single Q-function using the collective experience of the swarm. This model supports full scalability, flexible deployment, and, when hardware allows, the online adaptation of each drone's action-selection mechanism. We first train and evaluate the model in a synthetic traffic environment, followed by a case study using real traffic data from Shenzhen, China, to validate its performance and demonstrate its potential for real-world applications in complex urban monitoring tasks.
- [278] arXiv:2503.21435 [pdf, html, other]
-
Title: Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language ModelsSubjects: Artificial Intelligence (cs.AI)
Graph Neural Networks (GNNs), as the dominant paradigm for graph-structured learning, have long faced dual challenges of exponentially escalating computational complexity and inadequate cross-scenario generalization capability. With the rapid advancement of multimodal learning, Vision-Language Models (VLMs) have demonstrated exceptional cross-modal relational reasoning capabilities and generalization capacities, thereby opening up novel pathways for overcoming the inherent limitations of conventional graph learning paradigms. However, current research predominantly concentrates on investigating the single-graph reasoning capabilities of VLMs, which fundamentally fails to address the critical requirement for coordinated reasoning across multiple heterogeneous graph data in real-world application scenarios. To address these limitations, we propose the first multi-graph joint reasoning benchmark for VLMs. Our benchmark encompasses four graph categories: knowledge graphs, flowcharts, mind maps, and route maps, with each graph group accompanied by three progressively challenging instruction-response pairs. Leveraging this benchmark, we conducted comprehensive capability assessments of state-of-the-art VLMs and performed fine-tuning on open-source models. This study not only addresses the underexplored evaluation gap in multi-graph reasoning for VLMs but also empirically validates their generalization superiority in graph-structured learning.
- [279] arXiv:2503.21436 [pdf, html, other]
-
Title: Stochastic Engrams for Efficient Continual Learning with Binarized Neural NetworksSubjects: Machine Learning (cs.LG)
The ability to learn continuously in artificial neural networks (ANNs) is often limited by catastrophic forgetting, a phenomenon in which new knowledge becomes dominant. By taking mechanisms of memory encoding in neuroscience (a.k.a. engrams) as inspiration, we propose a novel approach that integrates stochastically-activated engrams as a gating mechanism for metaplastic binarized neural networks (mBNNs). This method leverages the computational efficiency of mBNNs combined with the robustness of probabilistic memory traces to mitigate forgetting and maintain the model's reliability. Previously validated metaplastic optimization techniques have been incorporated to enhance synaptic stability further. Compared to baseline binarized models and benchmark fully connected continual learning approaches, our method is the only strategy capable of reaching average accuracies over 20% in class-incremental scenarios and achieving comparable domain-incremental results to full precision state-of-the-art methods. Furthermore, we achieve a significant reduction in peak GPU and RAM usage, under 5% and 20%, respectively. Our findings demonstrate (A) an improved stability vs. plasticity trade-off, (B) a reduced memory intensiveness, and (C) an enhanced performance in binarized architectures. By uniting principles of neuroscience and efficient computing, we offer new insights into the design of scalable and robust deep learning systems.
- [280] arXiv:2503.21438 [pdf, html, other]
-
Title: Dual-Task Learning for Dead Tree Detection and Segmentation with Hybrid Self-Attention U-Nets in Aerial ImageryComments: 11 pages, 4 figures, 4 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Mapping standing dead trees is critical for assessing forest health, monitoring biodiversity, and mitigating wildfire risks, for which aerial imagery has proven useful. However, dense canopy structures, spectral overlaps between living and dead vegetation, and over-segmentation errors limit the reliability of existing methods. This study introduces a hybrid postprocessing framework that refines deep learning-based tree segmentation by integrating watershed algorithms with adaptive filtering, enhancing boundary delineation, and reducing false positives in complex forest environments. Tested on high-resolution aerial imagery from boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%, demonstrating robust performance in densely vegetated regions. By balancing detection accuracy and over-segmentation artifacts, the method enabled the precise identification of individual dead trees, which is critical for ecological monitoring. The framework's computational efficiency supports scalable applications, such as wall-to-wall tree mortality mapping over large geographic regions using aerial or satellite imagery. These capabilities directly benefit wildfire risk assessment (identifying fuel accumulations), carbon stock estimation (tracking emissions from decaying biomass), and precision forestry (targeting salvage loggings). By bridging advanced remote sensing techniques with practical forest management needs, this work advances tools for large-scale ecological conservation and climate resilience planning.
- [281] arXiv:2503.21439 [pdf, other]
-
Title: Improved Runtime Analysis of a Multi-Valued Compact Genetic Algorithm on Two Generalized OneMax ProblemsComments: To appear at GECCO 2025Subjects: Neural and Evolutionary Computing (cs.NE)
Recent research in the runtime analysis of estimation of distribution algorithms (EDAs) has focused on univariate EDAs for multi-valued decision variables. In particular, the runtime of the multi-valued cGA (r-cGA) and UMDA on multi-valued functions has been a significant area of study. Adak and Witt (PPSN 2024) and Hamano et al. (ECJ 2024) independently performed a first runtime analysis of the r-cGA on the r-valued OneMax function (r-OneMax). Adak and Witt also introduced a different r-valued OneMax function called G-OneMax. However, for that function, only empirical results were provided so far due to the increased complexity of its runtime analysis, since r-OneMax involves categorical values of two types only, while G-OneMax encompasses all possible values.
In this paper, we present the first theoretical runtime analysis of the r-cGA on the G-OneMax function. We demonstrate that the runtime is O(nr^3 log^2 n log r) with high probability. Additionally, we refine the previously established runtime analysis of the r-cGA on r-OneMax, improving the previous bound to O(nr log n log r), which improves the state of the art by an asymptotic factor of log n and is tight for the binary case. Moreover, we include the case of frequency borders for the first time.
- [282] arXiv:2503.21440 [pdf, html, other]
-
Title: On the Maiorana-McFarland Class Extensions
Subjects: Cryptography and Security (cs.CR); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
The closure $\mathcal{M}_{m}^{\#}$ and the extension $\widehat{\mathcal{M}}_{m}$ of the Maiorana--McFarland class $\mathcal{M}_{m}$ in $m = 2n$ variables relative to the extended-affine equivalence and the bent function construction $f \oplus \mathrm{Ind}_{U}$ are considered, where $U$ is an affine subspace of $\mathbb{F}_{2}^{m}$ of dimension $m/2$. We obtain an explicit formula for $|\widehat{\mathcal{M}}_{m}|$ and an upper bound for $|\widehat{\mathcal{M}}_{m}^{\#}|$. Asymptotically tight bounds for $|\mathcal{M}_{m}^{\#}|$ are proved as well, for instance, $|\mathcal{M}_{8}^{\#}| \approx 2^{77.865}$. Metric properties of $\mathcal{M}_{m}$ and $\mathcal{M}_{m}^{\#}$ are also investigated. We find the number of all closest bent functions to the set $\mathcal{M}_{m}$ and provide an upper bound of the same number for $\mathcal{M}_{m}^{\#}$. The average number $E(\mathcal{M}_{m})$ of $m/2$-dimensional affine subspaces of $\mathbb{F}_{2}^{m}$ such that a function from $\mathcal{M}_{m}$ is affine on each of them is calculated. We obtain that similarly defined $E(\mathcal{M}_{m}^{\#})$ satisfies $E(\mathcal{M}_{m}^{\#}) < E(\mathcal{M}_{m})$ and $E(\mathcal{M}_{m}^{\#}) = E(\mathcal{M}_{m}) - o(1)$.
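For readers unfamiliar with the class, the standard Maiorana-McFarland construction takes f(x, y) = <x, pi(y)> XOR g(y), where pi is a permutation of F_2^n and g is an arbitrary Boolean function on F_2^n. The sketch below builds one such function for m = 4 and checks bentness by brute-force Walsh transform. The particular permutation and g are arbitrary choices for illustration, not taken from the paper, and the helper names are hypothetical.

```python
from itertools import product

def dot(a, b):
    # Inner product over F_2 of two equal-length bit tuples.
    return sum(u & v for u, v in zip(a, b)) % 2

def mm_function(pi, g, n):
    # f(x, y) = <x, pi(y)> XOR g(y), with z = (x, y) a tuple of 2n bits.
    def f(z):
        x, y = z[:n], z[n:]
        return dot(x, pi[y]) ^ g[y]
    return f

def is_bent(f, m):
    # A function in m variables is bent iff every Walsh coefficient
    # has absolute value 2^(m/2).
    points = list(product((0, 1), repeat=m))
    target = 2 ** (m // 2)
    for a in points:
        w = sum((-1) ** (f(z) ^ dot(a, z)) for z in points)
        if abs(w) != target:
            return False
    return True

n = 2
vectors = list(product((0, 1), repeat=n))
# Illustrative permutation: cyclic shift of the lexicographic ordering.
pi = {y: vectors[(i + 1) % len(vectors)] for i, y in enumerate(vectors)}
# Illustrative g: the product of the two bits of y.
g = {y: y[0] & y[1] for y in vectors}
f = mm_function(pi, g, n)
```

Here `is_bent(f, 2 * n)` holds for any permutation and any g, which is what makes the class easy to enumerate; a linear function such as `lambda z: z[0]` fails the check.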
- [283] arXiv:2503.21441 [pdf, html, other]
-
Title: A Tolerant Independent Set Tester
Comments: To appear in STOC 2025
Subjects: Data Structures and Algorithms (cs.DS)
We give nearly optimal bounds on the sample complexity of $(\widetilde{\Omega}(\epsilon),\epsilon)$-tolerant testing of the $\rho$-independent set property in the dense graph setting. In particular, we give an algorithm that inspects a random subgraph on $\widetilde{O}(\rho^3/\epsilon^2)$ vertices and, for some constant $c,$ distinguishes between graphs that have an induced subgraph of size $\rho n$ with fewer than $\frac{\epsilon}{c \log^4(1/\epsilon)} n^2$ edges from graphs for which every induced subgraph of size $\rho n$ has at least $\epsilon n^2$ edges. Our sample complexity bound matches, up to logarithmic factors, the recent upper bound by Blais and Seth (2023) for the non-tolerant testing problem, which is known to be optimal for that problem based on a lower bound by Feige, Langberg and Schechtman (2004).
Our main technique is a new graph container lemma for sparse subgraphs instead of independent sets. We also show that our new lemma can be used to generalize one of the classic applications of the container method, that of counting independent sets in regular graphs, to counting sparse subgraphs in regular graphs.
- [284] arXiv:2503.21442 [pdf, html, other]
-
Title: RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically-correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 30 fps, offering users flexible control over rain intensity -- from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and large-scale driving scenarios, delivering more photorealistic and physically-accurate rain effects compared to state-of-the-art methods. Project page can be found at this https URL
- [285] arXiv:2503.21444 [pdf, html, other]
-
Title: Automated Analysis of Pricings in SaaS-based Information Systems
Comments: 16 pages, accepted in CAISE'25
Subjects: Software Engineering (cs.SE)
Software as a Service (SaaS) pricing models, encompassing features, usage limits, plans, and add-ons, have grown exponentially in complexity, evolving from offering tens to thousands of configuration options. This rapid expansion poses significant challenges for the development and operation of SaaS-based Information Systems (IS), as manual management of such configurations becomes time-consuming, error-prone, and ultimately unsustainable. The emerging paradigm of Pricing-driven DevOps aims to address these issues by automating pricing management tasks, such as transforming human-oriented pricings into machine-oriented (iPricing) or finding the optimal subscription that matches the requirements of a certain user, ultimately reducing human intervention. This paper advances the field by proposing seven analysis operations that partially or fully support these pricing management tasks, thus serving as a foundation for defining new, more specialized operations. To achieve this, we mapped iPricings into Constraint Satisfaction Optimization Problems (CSOP), an approach successfully used in similar domains, enabling us to implement and apply these operations to uncover latent, yet non-trivial insights from complex pricing models. The proposed approach has been implemented in a reference framework using MiniZinc, and tested with over 150 pricing models, identifying errors in 35 pricings of the benchmark. Results demonstrate its effectiveness in identifying errors and its potential to streamline Pricing-driven DevOps.
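To make the "optimal subscription" task concrete, here is a minimal sketch that treats a toy pricing as a constrained optimization problem and solves it by brute-force enumeration. This is a stand-in for illustration only: the paper reports a MiniZinc-based implementation, and the plan names, add-ons, prices, and the `cheapest_subscription` helper below are all hypothetical.

```python
from itertools import combinations

# Hypothetical pricing: plans and add-ons with prices, features, and usage limits.
PLANS = {
    "basic": {"price": 10, "features": {"chat"}, "limits": {"seats": 5}},
    "pro": {"price": 30, "features": {"chat", "sso"}, "limits": {"seats": 20}},
}
ADDONS = {
    "extra_seats": {"price": 8, "features": set(), "limits": {"seats": 10}},
    "sso_pack": {"price": 25, "features": {"sso"}, "limits": {}},
}

def cheapest_subscription(required_features, required_limits):
    """Return (price, plan, add-ons) of the cheapest configuration that
    satisfies the requirements, or None if no configuration does."""
    best = None
    addon_names = sorted(ADDONS)
    for plan_name, plan in PLANS.items():
        for k in range(len(addon_names) + 1):
            for chosen in combinations(addon_names, k):
                features = set(plan["features"])
                limits = dict(plan["limits"])
                price = plan["price"]
                for name in chosen:
                    addon = ADDONS[name]
                    features |= addon["features"]
                    price += addon["price"]
                    # Add-on limits extend the plan's limits additively.
                    for key, extra in addon["limits"].items():
                        limits[key] = limits.get(key, 0) + extra
                ok = required_features <= features and all(
                    limits.get(key, 0) >= need
                    for key, need in required_limits.items()
                )
                if ok and (best is None or price < best[0]):
                    best = (price, plan_name, chosen)
    return best
```

Enumeration is exponential in the number of add-ons, which is one reason a constraint solver is a more natural fit for pricings with thousands of configuration options.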
- [286] arXiv:2503.21448 [pdf, html, other]
-
Title: HORIZON: a Classification and Comparison Framework for Pricing-driven Feature Toggling
Comments: 15 pages, submitted to ICWE'25
Subjects: Software Engineering (cs.SE)
Software as a Service (SaaS) has seen rapid growth in recent years, thanks to its ability to adapt to diverse user needs through subscription-based models. However, as pricing models enhance the customization of subscriptions, managing the associated constraints within a system's codebase becomes increasingly challenging. In response, Pricing-driven Development and Operation has emerged to integrate pricing considerations across the software lifecycle. Among its most challenging objectives is regulating feature access according to users' subscriptions -- a process that requires managing a multitude of conditions throughout the system's codebase. Feature toggles have traditionally been employed to manage dynamic system behavior, but their application to pricing-driven constraints presents unique challenges. When used to enforce subscription-based restrictions, toggles must adapt -- among other factors -- to individual users' use of features, ensuring that subscription limits are not exceeded. Despite the increasing significance of this problem, current industrial solutions lack explicit support for pricing-driven feature toggling, and existing academic contributions remain constrained to specific architectures. This paper contributes to filling this gap by introducing HORIZON, a classification and comparison framework for feature toggling tools tailored to pricing-driven environments. Its utility is demonstrated by categorizing the solutions identified in the literature as promising for such environments, revealing both their strengths and limitations, and thereby pinpointing critical avenues for improvement. In doing so, HORIZON not only provides a comprehensive view of the current landscape but also lays the groundwork for a focused research agenda, guiding the development of more robust and adaptable solutions for streamlining SaaS development and operations driven by pricings.
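The difference between an ordinary toggle and a pricing-driven one can be sketched in a few lines: the toggle's outcome depends on both the plan's feature set and the user's consumption so far. The `Subscription` class and the two helper functions are hypothetical and are not taken from any tool discussed in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Subscription:
    features: set                 # features included in the user's plan
    usage_limits: dict            # feature -> maximum uses (absent = unlimited)
    usage: dict = field(default_factory=dict)  # feature -> uses so far

def is_enabled(sub, feature):
    # The toggle is off if the plan lacks the feature or the limit is used up.
    if feature not in sub.features:
        return False
    limit = sub.usage_limits.get(feature)
    return limit is None or sub.usage.get(feature, 0) < limit

def use_feature(sub, feature):
    # Record one use if the toggle is on; report whether access was granted.
    if not is_enabled(sub, feature):
        return False
    sub.usage[feature] = sub.usage.get(feature, 0) + 1
    return True
```

With a limit of two exports, the first two calls to `use_feature` succeed and the third is refused, even though the toggle's static configuration never changed.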
- [287] arXiv:2503.21449 [pdf, html, other]
-
Title: Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. However, there is still a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have recently been applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at this https URL.
- [288] arXiv:2503.21450 [pdf, html, other]
-
Title: CMADiff: Cross-Modal Aligned Diffusion for Controllable Protein Generation
Changjian Zhou, Yuexi Qiu, Tongtong Ling, Jiafeng Li, Shuanghe Liu, Xiangjing Wang, Jia Song, Wensheng Xiang
Subjects: Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
AI-assisted protein design has emerged as a critical tool for advancing biotechnology, as deep generative models have demonstrated their reliability in this domain. However, most existing models primarily utilize protein sequence or structural data for training, neglecting the physicochemical properties of proteins. They are also poorly suited to controlling the generation of proteins under intuitive conditions. To address these limitations, we propose CMADiff, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text-based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) to integrate physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process, which is guided by BioAligner, a contrastive learning-based module that aligns text descriptions with protein features, enabling text-driven control over protein sequence generation. Validated by a series of evaluations including AlphaFold3, the experimental results indicate that CMADiff outperforms protein sequence generation benchmarks and holds strong potential for future applications. The implementation and code are available at this https URL.
- [289] arXiv:2503.21452 [pdf, html, other]
-
Title: Numerical solution of locally loaded Volterra integral equations
Comments: 7 pages, 2 figures
Subjects: Numerical Analysis (math.NA); Mathematical Physics (math-ph)
Volterra integral equations with local and nonlocal loads represent a novel class of integral equations that has attracted considerable attention in recent years. These equations are a generalisation of the classic Volterra integral equations, which were first introduced by Vito Volterra in the late 19th century. The loaded Volterra integral equations are characterised by the presence of a load, which complicates their theoretical and numerical study. Sometimes these equations are called equations with a ``frozen'' argument. The present work is devoted to the study of Volterra equations with locally loaded integral operators. Existence and uniqueness theorems are proved. Among the main contributions is a collocation method for the approximate solution of such equations based on piecewise linear approximation. To confirm the convergence of the method, a number of numerical results for solving model problems are provided.
- [290] arXiv:2503.21453 [pdf, other]
-
Title: OCEP: An Ontology-Based Complex Event Processing Framework for Healthcare Decision Support in Big Data Analytics
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The exponential expansion of real-time data streams across multiple domains requires the development of effective event detection, correlation, and decision-making systems. However, classic Complex Event Processing (CEP) systems struggle with semantic heterogeneity, data interoperability, and knowledge-driven event reasoning in Big Data environments. To solve these challenges, this research work presents an Ontology-based Complex Event Processing (OCEP) framework, which utilizes semantic reasoning and Big Data Analytics to improve event-driven decision support. The proposed OCEP architecture utilizes ontologies to support reasoning over event streams. It ensures compatibility with different data sources and enables context-based event discovery. The Resource Description Framework (RDF) organizes event data, and SPARQL queries enable rapid event reasoning and retrieval. The approach is implemented within the Hadoop environment, which consists of the Hadoop Distributed File System (HDFS) for scalable storage and Apache Kafka for real-time CEP-based event execution. We perform a real-time healthcare analysis and case study to validate the model, utilizing IoT sensor data for illness monitoring and emergency responses. The OCEP framework successfully integrates several event streams, leading to improved early disease detection and aiding doctors in decision-making. The results show that OCEP achieves an event detection accuracy of 85%. This research work utilizes OCEP to solve the problems of semantic interoperability and correlation of complex events in Big Data analytics. The proposed architecture presents an intelligent, scalable and knowledge-driven event processing framework for healthcare-based decision support.
- [291] arXiv:2503.21455 [pdf, html, other]
-
Title: Code Review Comprehension: Reviewing Strategies Seen Through Code Comprehension Theories
Subjects: Software Engineering (cs.SE)
Despite the popularity and importance of modern code review, the understanding of the cognitive processes that enable reviewers to analyze code and provide meaningful feedback is lacking. To address this gap, we observed and interviewed ten experienced reviewers while they performed 25 code reviews from their review queue. Since comprehending code changes is essential to perform code review and the primary challenge for reviewers, we focused our analysis on this cognitive process. Using Letovsky's model of code comprehension, we performed a theory-driven thematic analysis to investigate how reviewers apply code comprehension to navigate changes and provide feedback. Our findings confirm that code comprehension is fundamental to code review. We extend Letovsky's model to propose the Code Review Comprehension Model and demonstrate that code review, like code comprehension, relies on opportunistic strategies. These strategies typically begin with a context-building phase, followed by code inspection involving code reading, testing, and discussion management. To interpret and evaluate the proposed change, reviewers construct a mental model of the change as an extension of their understanding of the overall software system and contrast mental representations of expected and ideal solutions against the actual implementation. Based on our findings, we discuss how review tools and practices can better support reviewers in employing their strategies and in forming understanding. Data and material: this https URL
- [292] arXiv:2503.21457 [pdf, html, other]
-
Title: FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, Linlin Shen
Comments: Accepted by CVPR2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at this https URL.
- [293] arXiv:2503.21458 [pdf, html, other]
-
Title: DATA-WA: Demand-based Adaptive Task Assignment with Dynamic Worker Availability Windows
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
With the rapid advancement of mobile networks and the widespread use of mobile devices, spatial crowdsourcing, which involves assigning location-based tasks to mobile workers, has gained significant attention. However, most existing research focuses on task assignment at the current moment, overlooking the fluctuating demand and supply between tasks and workers over time. To address this issue, we introduce an adaptive task assignment problem, which aims to maximize the number of assigned tasks by dynamically adjusting task assignments in response to changing demand and supply. We develop a spatial crowdsourcing framework, namely demand-based adaptive task assignment with dynamic worker availability windows, which consists of two components: task demand prediction and task assignment. In the first component, we construct a graph adjacency matrix representing the demand dependency relationships in different regions and employ a multivariate time series learning approach to predict future task demands. In the task assignment component, we assign tasks to workers based on these predictions, worker availability windows, and the current task assignments, where each worker has an availability window that indicates the time periods they are available for task assignments. To reduce the search space of task assignments and improve efficiency, we propose a worker dependency separation approach based on graph partition and a task value function with reinforcement learning. Experiments on real data demonstrate that our proposals are both effective and efficient.
- [294] arXiv:2503.21459 [pdf, other]
-
Title: RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Comments: Accepted at CVPR 2025; Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.
- [295] arXiv:2503.21460 [pdf, html, other]
-
Title: Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang
Comments: 329 papers surveyed, resources are at this https URL
Subjects: Computation and Language (cs.CL)
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at this https URL.
- [296] arXiv:2503.21463 [pdf, html, other]
-
Title: Unveiling Latent Information in Transaction Hashes: Hypergraph Learning for Ethereum Ponzi Scheme Detection
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
With the widespread adoption of Ethereum, financial frauds such as Ponzi schemes have become increasingly rampant in the blockchain ecosystem, posing significant threats to the security of account assets. Existing Ethereum fraud detection methods typically model account transactions as graphs, but this approach primarily focuses on binary transactional relationships between accounts, failing to adequately capture the complex multi-party interaction patterns inherent in Ethereum. To address this, we propose a hypergraph modeling method for Ponzi scheme detection in Ethereum, called HyperDet. Specifically, we treat transaction hashes as hyperedges that connect all the relevant accounts involved in a transaction. Additionally, we design a two-step hypergraph sampling strategy to significantly reduce computational complexity. Furthermore, we introduce a dual-channel detection module, including the hypergraph detection channel and the hyper-homo graph detection channel, to be compatible with existing detection methods. Experimental results show that, compared to traditional homogeneous graph-based methods, the hyper-homo graph detection channel achieves significant performance improvements, demonstrating the superiority of hypergraphs in Ponzi scheme detection. This research offers innovations for modeling complex relationships in blockchain data.
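The modeling idea of treating each transaction hash as a hyperedge over all accounts it touches can be sketched as follows. The input format (pairs of transaction hash and account) and both function names are assumptions for illustration; the abstract does not describe HyperDet's actual data pipeline or sampling strategy.

```python
from collections import defaultdict

def build_hypergraph(records):
    """Group accounts by transaction hash: each hash becomes one hyperedge.

    records: iterable of (tx_hash, account) pairs, an assumed input format.
    """
    hyperedges = defaultdict(set)
    for tx_hash, account in records:
        hyperedges[tx_hash].add(account)
    return dict(hyperedges)

def incidence_matrix(hyperedges):
    # Rows are accounts, columns are hyperedges; entry 1 means membership.
    accounts = sorted({a for members in hyperedges.values() for a in members})
    hashes = sorted(hyperedges)
    matrix = [
        [1 if acc in hyperedges[h] else 0 for h in hashes]
        for acc in accounts
    ]
    return accounts, hashes, matrix
```

Unlike an ordinary edge list, a hyperedge such as `{"A", "B", "C"}` keeps the fact that three accounts took part in the same transaction instead of splitting it into pairwise links.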
- [297] arXiv:2503.21464 [pdf, html, other]
-
Title: Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
In this work, we propose a metric called Number of Thoughts (NofT) to determine the difficulty of tasks pre-prompting and support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments and the datasets used are available on our GitHub page: this https URL.
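Threshold-based routing of the kind described here can be sketched in a few lines. The threshold values, the tier labels, and the `route_by_thoughts` function are hypothetical; the abstract does not state which thresholds the authors use or how the number of thoughts is estimated before prompting.

```python
def route_by_thoughts(num_thoughts, thresholds=(3, 8),
                      models=("small", "medium", "large")):
    """Pick a model tier from an estimated number of thoughts.

    thresholds are inclusive upper bounds for each tier except the last;
    the values here are placeholders, not taken from the paper.
    """
    for bound, model in zip(thresholds, models):
        if num_thoughts <= bound:
            return model
    # Anything above the last threshold goes to the largest model.
    return models[-1]
```

For example, a prompt estimated at 5 thoughts would be sent to the middle tier, while one estimated at 20 would go to the largest model.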
- [298] arXiv:2503.21465 [pdf, html, other]
-
Title: Retinal Fundus Multi-Disease Image Classification using Hybrid CNN-Transformer-Ensemble Architectures
Comments: 17 pages, 3 figures, 7 tables. Conference paper presented at the International Health Informatics Conference (IHIC 2023)
Journal-ref: In: Proceedings of the International Health Informatics Conference (IHIC 2023). Lecture Notes in Networks and Systems, vol. 1113, Springer, Singapore, pp. 103-120 (2025)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Our research is motivated by the urgent global issue of a large population affected by retinal diseases, which are evenly distributed but underserved by specialized medical expertise, particularly in non-urban areas. Our primary objective is to bridge this healthcare gap by developing a comprehensive diagnostic system capable of accurately predicting retinal diseases solely from fundus images. However, we faced significant challenges due to limited, diverse datasets and imbalanced class distributions. To overcome these issues, we have devised innovative strategies. Our research introduces novel approaches, utilizing hybrid models combining deeper Convolutional Neural Networks (CNNs), Transformer encoders, and ensemble architectures sequentially and in parallel to classify retinal fundus images into 20 disease labels. Our overarching goal is to assess these advanced models' potential in practical applications, with a strong focus on enhancing retinal disease diagnosis accuracy across a broader spectrum of conditions. Importantly, our efforts have surpassed baseline model results, with the C-Tran ensemble model emerging as the leader, achieving a remarkable model score of 0.9166, surpassing the baseline score of 0.9. Additionally, experiments with the IEViT model showcased equally promising outcomes with improved computational efficiency. We've also demonstrated the effectiveness of dynamic patch extraction and the integration of domain knowledge in computer vision tasks. In summary, our research strives to contribute significantly to retinal disease diagnosis, addressing the critical need for accessible healthcare solutions in underserved regions while aiming for comprehensive and accurate disease prediction.
- [299] arXiv:2503.21468 [pdf, html, other]
-
Title: Improvement Graph Convolution Collaborative Filtering with Weighted addition input
Subjects: Information Retrieval (cs.IR)
Graph Neural Networks have been extensively applied in the field of machine learning to find features of graphs, and recommendation systems are no exception. The ratings of users on considered items can be represented by graphs, which serve as input for many efficient models that learn the characteristics of the users and the items. From these insights, relevant items are recommended to users. However, users' decisions on the items have varying degrees of effect on different users, and this information should be learned so that it is not lost in the process of information mining.
In this publication, we propose building an additional graph showing the recommended weight of an item to a target user to improve the accuracy of GNN models. Although the users' friendships were not recorded, their correlation was still evident through the commonalities in consumption behavior. We build a model, WiGCN (Weighted input GCN), to describe this and experiment on well-known datasets. We draw conclusions after comparing our results with state-of-the-art models such as GCMC, NGCF and LightGCN. The source code is also available at this https URL.
- [300] arXiv:2503.21471 [pdf, html, other]
-
Title: CombiGCN: An effective GCN model for Recommender System
Subjects: Information Retrieval (cs.IR)
Graph Neural Networks (GNNs) have opened up a promising line of research for collaborative filtering (CF). The key strength of GNNs lies in injecting the collaborative signal into user and item embeddings, which then carry information about user-item interactions. However, there are still aspects of CF models that GNNs could handle better. The collaborative signal is extracted through an implicit feedback matrix via the message-passing architecture of GNNs, and it only helps to update an embedding based on the embeddings of neighboring items (or users). By identifying the similarity weight of users through their interaction history, a key concept of CF, we build a user-user weighted connection graph based on this similarity weight.
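One simple way to derive a user-user weighted graph from interaction history is sketched below, using Jaccard similarity over the sets of items each user interacted with. Jaccard is an assumption chosen for illustration; the abstract does not say which similarity weight CombiGCN uses, and the function names are hypothetical.

```python
def jaccard(a, b):
    # Size of the intersection over size of the union; 0.0 for two empty sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_user_graph(interactions):
    """Build weighted edges between users who share at least one item.

    interactions: dict mapping a user id to the set of item ids they
    interacted with (an assumed input format).
    """
    users = sorted(interactions)
    edges = {}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            w = jaccard(interactions[u], interactions[v])
            if w > 0:
                edges[(u, v)] = w
    return edges
```

Users with no items in common get no edge, so the resulting graph stays sparse when interaction histories rarely overlap.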
In this study, we propose a recommendation framework, CombiGCN, in which item embeddings are only linearly propagated on the user-item interaction graph, while user embeddings are propagated simultaneously on both the user-user weighted connection graph and the user-item interaction graph with Light Graph Convolution (LGC) and combined in a simpler way, using the weighted sum of the embeddings for each layer. We also conducted experiments comparing CombiGCN with several state-of-the-art models on three real-world datasets.
- [301] arXiv:2503.21474 [pdf, html, other]
-
Title: The Procedural Content Generation Benchmark: An Open-source Testbed for Generative Challenges in Games
Ahmed Khalifa, Roberto Gallotta, Matthew Barthet, Antonios Liapis, Julian Togelius, Georgios N. Yannakakis
Comments: 12 pages, 4 figures, 2 tables, published at FDG2025
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper introduces the Procedural Content Generation Benchmark for evaluating generative algorithms on different game content creation tasks. The benchmark comes with 12 game-related problems with multiple variants on each problem. Problems vary from creating levels of different kinds to creating rule sets for simple arcade games. Each problem has its own content representation, control parameters, and evaluation metrics for quality, diversity, and controllability. This benchmark is intended as a first step towards a standardized way of comparing generative algorithms. We use the benchmark to score three baseline algorithms: a random generator, an evolution strategy, and a genetic algorithm. Results show that some problems are easier to solve than others and reveal the impact the chosen objective has on the quality, diversity, and controllability of the generated artifacts.
- [302] arXiv:2503.21476 [pdf, html, other]
-
Title: Robust DNN Partitioning and Resource Allocation Under Uncertain Inference TimeSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
In edge intelligence systems, deep neural network (DNN) partitioning and data offloading can provide real-time task inference for resource-constrained mobile devices. However, the inference time of DNNs is typically uncertain and cannot be precisely determined in advance, presenting significant challenges in ensuring timely task processing within deadlines. To address the uncertain inference time, we propose a robust optimization scheme to minimize the total energy consumption of mobile devices while meeting probabilistic task deadlines. The scheme only requires the mean and variance information of the inference time, without any prediction methods or distribution functions. The problem is formulated as a mixed-integer nonlinear programming (MINLP) problem that involves jointly optimizing the DNN model partitioning and the allocation of local CPU/GPU frequencies and uplink bandwidth. To tackle the problem, we first decompose the original problem into two subproblems: resource allocation and DNN model partitioning. Subsequently, the two subproblems with probability constraints are equivalently transformed into deterministic optimization problems using the chance-constrained programming (CCP) method. Finally, the convex optimization technique and the penalty convex-concave procedure (PCCP) technique are employed to obtain the optimal solution of the resource allocation subproblem and a stationary point of the DNN model partitioning subproblem, respectively. The proposed algorithm leverages real-world data from popular hardware platforms and is evaluated on widely used DNN models. Extensive simulations show that our proposed algorithm effectively addresses the inference time uncertainty with probabilistic deadline guarantees while minimizing the energy consumption of mobile devices.
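To make the idea of a deadline guarantee from mean and variance alone more concrete, here is a minimal sketch based on Cantelli's (one-sided Chebyshev) inequality. This is a well-known distribution-free bound used here as an assumption; the abstract does not state which deterministic transformation the authors apply, and the function names are hypothetical.

```python
import math


def robust_deadline(mean, variance, eps):
    """Deadline D such that P(T > D) <= eps for every distribution of the
    inference time T with the given mean and variance (Cantelli's inequality).

    Illustrative only: the paper's exact chance-constraint transformation
    is not given in the abstract.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    return mean + math.sqrt(variance * (1.0 - eps) / eps)


def meets_probabilistic_deadline(mean, variance, deadline, eps):
    """Deterministic sufficient condition for P(T <= deadline) >= 1 - eps."""
    return robust_deadline(mean, variance, eps) <= deadline


# Mean 10 ms, variance 4 ms^2, at most 20% chance of a miss: D is about 14 ms.
print(robust_deadline(10.0, 4.0, 0.2))
```

The probabilistic constraint becomes a plain inequality in the mean and standard deviation, which is what lets such a problem be handed to a deterministic solver.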
- [303] arXiv:2503.21477 [pdf, html, other]
-
Title: Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction MethodComments: This work has been submitted to the IEEE TIM for possible publicationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Trajectory prediction, as a critical component of autonomous driving systems, has attracted the attention of many researchers. Existing prediction algorithms focus on extracting more detailed scene features or selecting more reasonable trajectory destinations. However, in the face of dynamic and evolving future movements of the target vehicle, these algorithms cannot provide a fine-grained and continuous description of future behaviors and lane constraints, which degrades the prediction accuracy. To address this challenge, we present BLNet, a novel dual-stream architecture that synergistically integrates behavioral intention recognition and lane constraint modeling through parallel attention mechanisms. The framework generates fine-grained behavior state queries (capturing spatial-temporal movement patterns) and lane queries (encoding lane topology constraints), supervised by two auxiliary losses, respectively. Subsequently, a two-stage decoder first produces trajectory proposals, then performs point-level refinement by jointly incorporating both the continuity of passed lanes and future motion features. Extensive experiments on two large datasets, nuScenes and Argoverse, show that our network exhibits significant performance gains over existing direct regression and goal-based algorithms.
- [304] arXiv:2503.21480 [pdf, html, other]
-
Title: OmniVox: Zero-Shot Emotion Recognition with Omni-LLMsComments: Submitted to COLM 2025. PreprintSubjects: Computation and Language (cs.CL)
The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text only and text and audio. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.
- [305] arXiv:2503.21483 [pdf, html, other]
-
Title: BOLT: Boost Large Vision-Language Model Without Training for Long-form Video UnderstandingComments: Accepted to CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and on the MLVU benchmark from 58.9% to 63.4%. Our code is available at this https URL.
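The following sketch shows one way inverse transform sampling over query-frame similarity scores could work. It is not the authors' released code: the function name `select_frames`, the use of evenly spaced quantiles instead of random draws, and the raw-score normalization are all assumptions made for illustration.

```python
from bisect import bisect_left


def select_frames(similarities, k):
    """Pick up to k frame indices by inverse transform sampling.

    similarities: non-negative query-frame similarity scores, one per frame.
    Evenly spaced quantiles are pushed through the inverse CDF of the
    normalized scores, so selections cluster where similarity is high.
    Duplicate picks are merged, so fewer than k indices may be returned.
    """
    total = sum(similarities)
    if k <= 0 or total <= 0:
        raise ValueError("need k > 0 and at least one positive similarity")
    cdf, running = [], 0.0
    for score in similarities:
        running += score / total
        cdf.append(running)
    chosen = set()
    for i in range(k):
        quantile = (i + 0.5) / k
        # First frame whose cumulative mass reaches the quantile.
        chosen.add(min(bisect_left(cdf, quantile), len(cdf) - 1))
    return sorted(chosen)


# Only frames 2 and 3 match the query, so both picks land there.
print(select_frames([0.0, 0.0, 1.0, 1.0, 0.0, 0.0], 2))  # [2, 3]
```

With constant scores this reduces to uniform sampling, which makes the contrast with the uniform baseline easy to see.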
- [306] arXiv:2503.21486 [pdf, html, other]
-
Title: Invert2Restore: Zero-Shot Degradation-Blind Image RestorationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Two of the main challenges of image restoration in real-world scenarios are the accurate characterization of an image prior and the precise modeling of the image degradation operator. Pre-trained diffusion models have been very successfully used as image priors in zero-shot image restoration methods. However, how to best handle the degradation operator is still an open problem. In real-world data, methods that rely on specific parametric assumptions about the degradation model often face limitations in their applicability. To address this, we introduce Invert2Restore, a zero-shot, training-free method that operates in both fully blind and partially blind settings -- requiring no prior knowledge of the degradation model or only partial knowledge of its parametric form without known parameters. Despite this, Invert2Restore achieves high-fidelity results and generalizes well across various types of image degradation. It leverages a pre-trained diffusion model as a deterministic mapping between normal samples and undistorted image samples. The key insight is that the input noise mapped by a diffusion model to a degraded image lies in a low-probability density region of the standard normal distribution. Thus, we can restore the degraded image by carefully guiding its input noise toward a higher-density region. We experimentally validate Invert2Restore across several image restoration tasks, demonstrating that it achieves state-of-the-art performance in scenarios where the degradation operator is either unknown or partially known.
- [307] arXiv:2503.21487 [pdf, html, other]
-
Title: On Tensor-based Polynomial Hamiltonian SystemsSubjects: Systems and Control (eess.SY)
It is known that a linear system with a system matrix A constitutes a Hamiltonian system with a quadratic Hamiltonian if and only if A is a Hamiltonian matrix. This provides a straightforward method to verify whether a linear system is Hamiltonian or whether a given Hamiltonian function corresponds to a linear system. These techniques fundamentally rely on the properties of Hamiltonian matrices. Building on recent advances in tensor algebra, this paper generalizes such results to a broad class of polynomial systems. As the systems of interest can be naturally represented in tensor forms, we name them tensor-based polynomial systems. Our main contribution is that we formally define Hamiltonian cubical tensors and characterize their properties. Crucially, we demonstrate that a tensor-based polynomial system is a Hamiltonian system with a polynomial Hamiltonian if and only if all associated system tensors are Hamiltonian cubical tensors, a direct parallel to the linear case. Additionally, we establish a computationally tractable stability criterion for tensor-based polynomial Hamiltonian systems. Finally, we validate all theoretical results through numerical examples and provide a further intuitive discussion.
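For the linear case mentioned at the start of the abstract, the standard test is that a 2n x 2n matrix A is Hamiltonian exactly when JA is symmetric, with J = [[0, I], [-I, 0]]. A minimal sketch of that check follows; the function name and tolerance are illustrative choices, and the tensor generalization introduced in the paper is not covered.

```python
def is_hamiltonian_matrix(a, tol=1e-12):
    """Return True if the 2n x 2n matrix a satisfies (J a)^T = J a,
    where J = [[0, I], [-I, 0]] is the standard symplectic matrix."""
    size = len(a)
    if size % 2 or any(len(row) != size for row in a):
        raise ValueError("expected a square matrix of even dimension")
    n = size // 2
    # J a: the lower block rows of a, followed by the negated upper block rows.
    ja = [list(a[n + i]) for i in range(n)] + [[-x for x in a[i]] for i in range(n)]
    return all(
        abs(ja[i][j] - ja[j][i]) <= tol for i in range(size) for j in range(size)
    )


# Harmonic oscillator x' = p, p' = -x is Hamiltonian with H = (x^2 + p^2) / 2.
print(is_hamiltonian_matrix([[0.0, 1.0], [-1.0, 0.0]]))  # True
# The identity map x' = x, p' = p expands phase-space volume, so it is not.
print(is_hamiltonian_matrix([[1.0, 0.0], [0.0, 1.0]]))   # False
```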
- [308] arXiv:2503.21489 [pdf, html, other]
-
Title: Shape Modeling of Longitudinal Medical Images: From Diffeomorphic Metric Mapping to Deep LearningSubjects: Computer Vision and Pattern Recognition (cs.CV)
Living biological tissue is a complex system, constantly growing and changing in response to external and internal stimuli. These processes lead to remarkable and intricate changes in shape. Modeling and understanding both natural and pathological (or abnormal) changes in the shape of anatomical structures is highly relevant, with applications in diagnostic, prognostic, and therapeutic healthcare. Nevertheless, modeling the longitudinal shape change of biological tissue is a non-trivial task due to its inherent nonlinear nature. In this review, we highlight several existing methodologies and tools for modeling longitudinal shape change (i.e., spatiotemporal shape modeling). These methods range from diffeomorphic metric mapping to deep-learning-based approaches (e.g., autoencoders, generative networks, and recurrent neural networks). We discuss the synergistic combinations of existing technologies and potential directions for future research, underscoring key deficiencies in the current research landscape.
- [309] arXiv:2503.21491 [pdf, html, other]
-
Title: Data-Driven Contact-Aware Control Method for Real-Time Deformable Tool Manipulation: A Case Study in the Environmental SwabbingComments: Submitted for Journal ReviewSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Deformable Object Manipulation (DOM) remains a critical challenge in robotics due to the complexities of developing suitable model-based control strategies. Deformable Tool Manipulation (DTM) further complicates this task by introducing additional uncertainties between the robot and its environment. While humans effortlessly manipulate deformable tools using touch and experience, robotic systems struggle to maintain stability and precision. To address these challenges, we present a novel State-Adaptive Koopman LQR (SA-KLQR) control framework for real-time deformable tool manipulation, demonstrated through a case study in environmental swab sampling for food safety. This method leverages Koopman operator-based control to linearize nonlinear dynamics while adapting to state-dependent variations in tool deformation and contact forces. A tactile-based feedback system dynamically estimates and regulates the swab tool's angle, contact pressure, and surface coverage, ensuring compliance with food safety standards. Additionally, a sensor-embedded contact pad monitors force distribution to mitigate tool pivoting and deformation, improving stability during dynamic interactions. Experimental results validate the SA-KLQR approach, demonstrating accurate contact angle estimation, robust trajectory tracking, and reliable force regulation. The proposed framework enhances precision, adaptability, and real-time control in deformable tool manipulation, bridging the gap between data-driven learning and optimal control in robotic interaction tasks.
- [310] arXiv:2503.21495 [pdf, html, other]
-
Title: Adaptive Resampling with Bootstrap for Noisy Multi-Objective Optimization ProblemsComments: 14 pages. 5 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The challenge of noisy multi-objective optimization lies in the constant trade-off between exploring new decision points and improving the precision of known points through resampling. This decision should take into account both the variability of the objective functions and the current estimate of a point in relation to the Pareto front. Since the amount and distribution of noise are generally unknown, it is desirable for a decision function to be highly adaptive to the properties of the optimization problem. This paper presents a resampling decision function that incorporates the stochastic nature of the optimization problem by using bootstrapping and the probability of dominance. The distribution-free estimation of the probability of dominance is achieved using bootstrap estimates of the means. To make the procedure applicable even with very few observations, we transfer the distribution observed at other decision points. The efficiency of this resampling approach is demonstrated by applying it in the NSGA-II algorithm with a sequential resampling procedure under multiple noise variations.
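A minimal sketch of a bootstrap estimate of the probability of dominance between two decision points is given below, assuming minimization. The function names, the number of bootstrap replicates, and the fixed seed are illustrative assumptions; the transfer of distributions from other decision points mentioned in the abstract is not modeled.

```python
import random


def dominates(a, b):
    """Pareto dominance for minimization: no worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def bootstrap_mean(samples, rng):
    """Mean objective vector of one bootstrap resample (drawn with replacement)."""
    draw = [rng.choice(samples) for _ in samples]
    return [sum(column) / len(draw) for column in zip(*draw)]


def probability_of_dominance(samples_a, samples_b, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates in which the mean of A dominates the mean of B."""
    rng = random.Random(seed)
    hits = sum(
        dominates(bootstrap_mean(samples_a, rng), bootstrap_mean(samples_b, rng))
        for _ in range(n_boot)
    )
    return hits / n_boot


# Noisy evaluations of two decision points on two objectives.
point_a = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
point_b = [(5.0, 5.2), (4.8, 5.1), (5.1, 4.9)]
print(probability_of_dominance(point_a, point_b))  # 1.0, the samples never overlap
```

An estimate near 0 or 1 suggests further resampling of that point is unlikely to change its status, while values in between mark points where more evaluations may be worthwhile.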
- [311] arXiv:2503.21496 [pdf, html, other]
-
Title: Advancing CAN Network Security through RBM-Based Synthetic Attack Data Generation for Intrusion Detection SystemsComments: 11 pages, 10 figures, 7 tablesSubjects: Cryptography and Security (cs.CR)
The rapid development of network technologies and industrial intelligence has augmented the connectivity and intelligence within the automotive industry. Notably, in the Internet of Vehicles (IoV), the Controller Area Network (CAN), which is crucial for the communication of electronic control units but lacks inbuilt security measures, has become extremely vulnerable to severe cybersecurity threats. Meanwhile, the efficacy of Intrusion Detection Systems (IDS) is hampered by the scarcity of sufficient attack data for robust model training. To overcome this limitation, we introduce a novel methodology leveraging the Restricted Boltzmann Machine (RBM) to generate synthetic CAN attack data, thereby producing training datasets with a more balanced sample distribution. Specifically, we design a CAN Data Processing Module for transforming raw CAN data into an RBM-trainable format, and a Negative Sample Generation Module to generate data reflecting the distribution of CAN data frames denoting network intrusions. Experimental results show the generated data significantly improves IDS performance, with CANet accuracy rising from 0.6477 to 0.9725 and EfficientNet from 0.1067 to 0.1555. Code is available at this https URL.
- [312] arXiv:2503.21497 [pdf, html, other]
-
Title: Behavioral response to mobile phone evacuation alertsErick Elejalde, Timur Naushirvanov, Kyriaki Kalimeri, Elisa Omodei, Márton Karsai, Loreto Bravo, Leo FerresSubjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
This study examines behavioral responses to mobile phone evacuation alerts during the February 2024 wildfires in Valparaíso, Chile. Using anonymized mobile network data from 580,000 devices, we analyze population movement following emergency SMS notifications. Results reveal three key patterns: (1) initial alerts trigger immediate evacuation responses with connectivity dropping by 80% within 1.5 hours, while subsequent messages show diminishing effects; (2) substantial evacuation also occurs in non-warned areas, indicating potential transportation congestion; (3) socioeconomic disparities exist in evacuation timing, with high-income areas evacuating faster and showing less differentiation between warned and non-warned locations. Statistical modeling demonstrates socioeconomic variations in both evacuation decision rates and recovery patterns. These findings inform emergency communication strategies for climate-driven disasters, highlighting the need for targeted alerts, socioeconomically calibrated messaging, and staged evacuation procedures to enhance public safety during crises.
- [313] arXiv:2503.21498 [pdf, html, other]
-
Title: Distributed Forgetting-factor Regret-based Online Optimization over Undirected Connected NetworksComments: 11 pages,6 figuresSubjects: Systems and Control (eess.SY)
The evaluation of final-iteration tracking performance is a formidable obstacle in distributed online optimization algorithms. To address this issue, this paper proposes a novel evaluation metric named distributed forgetting-factor regret (DFFR). It incorporates a weight into the loss function at each iteration, which progressively reduces the weights of historical loss functions while enabling dynamic weight allocation across the optimization horizon. Furthermore, we develop two distributed online optimization algorithms based on DFFR over undirected connected networks: the Distributed Online Gradient-free Algorithm for bandit-feedback problems and the Distributed Online Projection-free Algorithm for high-dimensional problems. Through theoretical analysis, we derive the upper bounds of DFFR for both algorithms and further prove that under mild conditions, DFFR either converges to zero or maintains a tight upper bound as iterations approach infinity. Experimental simulations demonstrate the effectiveness of the algorithms and the superior performance of DFFR.
- [314] arXiv:2503.21500 [pdf, html, other]
-
Title: OpenHuEval: Evaluating Large Language Model on Hungarian SpecificsHaote Yang, Xingjian Wei, Jiang Wu, Noémi Ligeti-Nagy, Jiaxing Sun, Yinfan Wang, Zijian Győző Yang, Junyuan Gao, Jingchao Wang, Bowen Jiang, Shasha Wang, Nanjun Yu, Zihao Zhang, Shixin Hong, Hongwei Liu, Wei Li, Songyang Zhang, Dahua Lin, Lijun Wu, Gábor Prószéky, Conghui HeSubjects: Computation and Language (cs.CL)
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at this https URL.
- [315] arXiv:2503.21502 [pdf, html, other]
-
Title: ALADIN-$β$: A Distributed Optimization Algorithm for Solving MPCC ProblemsSubjects: Systems and Control (eess.SY)
Mathematical Programs with Complementarity Constraints (MPCCs) are critical in various real-world applications but notoriously challenging due to non-smoothness and degeneracy from complementarity constraints. The $\ell_1$-Exact Penalty-Barrier enhanced IPOPT improves performance and robustness by introducing additional inequality constraints and decision variables. However, this comes at the cost of increased computational complexity due to the higher dimensionality and additional constraints introduced in the centralized formulation. To mitigate this, we propose a distributed structure-splitting reformulation that decomposes these inequality constraints and auxiliary variables into independent sub-problems. Furthermore, we introduce Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN)-$\beta$, a novel approach that integrates the $\ell_1$-Exact Penalty-Barrier method with ALADIN to efficiently solve the distributed reformulation. Numerical experiments demonstrate that even without a globalization strategy, the proposed distributed approach achieves fast convergence while maintaining high precision.
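For orientation, the generic textbook form of a mathematical program with complementarity constraints can be written as below. The symbols $f$, $g$, $h$, $G$, and $H$ are placeholder names assumed here for illustration; the abstract does not give the paper's own notation or its penalty-barrier reformulation.

```latex
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & g(x) \le 0, \qquad h(x) = 0, \\
& 0 \le G(x) \;\perp\; H(x) \ge 0,
\end{aligned}
```

Here $0 \le G(x) \perp H(x) \ge 0$ means $G(x) \ge 0$, $H(x) \ge 0$, and $G_i(x)\,H_i(x) = 0$ for every component $i$. Standard constraint qualifications fail at every feasible point of this formulation, which is one source of the degeneracy the abstract refers to.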
- [316] arXiv:2503.21503 [pdf, html, other]
-
Title: Distributed observer-based leak detection in pipe flow with nonlinear frictionComments: 4 pages, 3 figures, article was presented at IFAC CMWRS2022 (this https URL) in the "Extended Abstract" category and is not available anywhere elseSubjects: Systems and Control (eess.SY)
The problem of leak detection in a pipeline with nonlinear friction is considered. A distributed observer-based method is proposed which applies a linearised, distributed adaptive observer design to the nonlinear model. The methodology is tested in simulations for two different operating points.
- [317] arXiv:2503.21504 [pdf, html, other]
-
Title: Keyword-Oriented Multimodal Modeling for Euphemism IdentificationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.
- [318] arXiv:2503.21505 [pdf, other]
-
Title: Fine-Grained Evaluation of Large Vision-Language Models in Autonomous DrivingYue Li, Meng Tian, Zhenyu Lin, Jiangtong Zhu, Dechang Zhu, Haiqiang Liu, Zining Wang, Yueyi Zhang, Zhiwei Xiong, Xinhai ZhaoSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
- [319] arXiv:2503.21507 [pdf, html, other]
-
Title: F-INR: Functional Tensor Decomposition for Implicit Neural RepresentationsComments: 26 pages, 33 figures, 12 tablesSubjects: Machine Learning (cs.LG)
Implicit Neural Representation (INR) has emerged as a powerful tool for encoding discrete signals into continuous, differentiable functions using neural networks. However, these models often rely on monolithic architectures to represent high-dimensional data, leading to prohibitive computational costs as dimensionality grows. We propose F-INR, a framework that reformulates INR learning through functional tensor decomposition, breaking down high-dimensional tasks into lightweight, axis-specific sub-networks. Each sub-network learns a low-dimensional data component (e.g., spatial or temporal). Then, we combine these components via tensor operations, reducing forward pass complexity while improving accuracy through specialized learning. F-INR is modular and, therefore, architecture-agnostic, compatible with MLPs, SIREN, WIRE, or other state-of-the-art INR architectures. It is also decomposition-agnostic, supporting CP, TT, and Tucker modes with user-defined rank for speed-accuracy control. In our experiments, F-INR trains $100\times$ faster than existing approaches on video tasks while achieving higher fidelity (+3.4 dB PSNR). Similar gains hold for image compression, physics simulations, and 3D geometry reconstruction. Through this, F-INR offers a new scalable, flexible solution for high-dimensional signal modeling.
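To illustrate how axis-specific components can be combined in a CP-style decomposition, the sketch below replaces the sub-networks with tiny hand-written functions. The names `net_x`, `net_y`, `net_t`, and `f_inr` and the rank of 2 are hypothetical stand-ins, not the F-INR implementation; in practice each component would be a small trainable network.

```python
import math


def cp_combine(factors):
    """CP-style combination: the sum over rank r of the product of the
    r-th feature from every axis."""
    rank = len(factors[0])
    return sum(math.prod(f[r] for f in factors) for r in range(rank))


# Stand-ins for axis-specific sub-networks: one coordinate in, rank-2 features out.
def net_x(x):
    return [x, 1.0]


def net_y(y):
    return [y, y * y]


def net_t(t):
    return [1.0, t]


def f_inr(x, y, t):
    """Approximate a 3D signal as sum_r net_x(x)[r] * net_y(y)[r] * net_t(t)[r]."""
    return cp_combine([net_x(x), net_y(y), net_t(t)])


print(f_inr(2.0, 3.0, 4.0))  # 2*3*1 + 1*9*4 = 42.0
```

Each sub-network only needs to be evaluated once per coordinate value along its own axis, and the full grid is then assembled through products. That is the intuition behind the reduced forward-pass cost the abstract claims.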
- [320] arXiv:2503.21510 [pdf, html, other]
-
Title: Uncertainty-aware Bayesian machine learning modelling of land cover classificationComments: 31 pages, 10 figuresSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Land cover classification involves the production of land cover maps, which determine the type of land through remote sensing imagery. In recent years, such classification has been performed by machine learning classification models, which can give highly accurate predictions on land cover per pixel using large quantities of input training data. However, such models do not currently take account of input measurement uncertainty, which is vital for traceability in metrology. In this work, we propose a Bayesian classification framework using generative modelling to take account of input measurement uncertainty. We take the specific case of Bayesian quadratic discriminant analysis, and apply it to land cover datasets from Copernicus Sentinel-2 in 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. We find that such Bayesian models are more trustworthy, in the sense that they are more interpretable, explicitly model the input measurement uncertainty, and maintain predictive performance of class probability outputs across datasets of different years and sizes, whilst also being computationally efficient.
- [321] arXiv:2503.21513 [pdf, html, other]
-
Title: Datasets for Depression Modeling in Social Media: An OverviewAna-Maria Bucur, Andreea-Codrina Moldovan, Krutika Parvatikar, Marcos Zampieri, Ashiqur R. KhudaBukhsh, Liviu P. DinuComments: Accepted to CLPsych Workshop, NAACL 2025Subjects: Computation and Language (cs.CL)
Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. Depression is also one of the most extensively researched psychological conditions, and recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.
- [322] arXiv:2503.21522 [pdf, html, other]
-
Title: MONO2REST: Identifying and Exposing Microservices: a Reusable RESTification ApproachSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The microservices architectural style has become the de facto standard for large-scale cloud applications, offering numerous benefits in scalability, maintainability, and deployment flexibility. Many organizations are pursuing the migration of legacy monolithic systems to a microservices architecture. However, this process is challenging, risky, time-intensive, and prone to failure, and several organizations lack the necessary financial resources, time, or expertise to set up this migration process. So, rather than trying to migrate a legacy system where migration is risky or not feasible, we suggest exposing it as a microservice application without having to migrate it. In this paper, we present a reusable, automated, two-phase approach that combines evolutionary algorithms with machine learning techniques. In the first phase, we identify microservices at the method level using a multi-objective genetic algorithm that considers both structural and semantic dependencies between methods. In the second phase, we generate REST APIs for each identified microservice using a classification algorithm to assign HTTP methods and endpoints. We evaluated our approach with a case study on the Spring PetClinic application, which has both monolithic and microservices implementations that serve as ground truth for comparison. Results demonstrate that our approach successfully aligns identified microservices with those in the reference microservices implementation, highlighting its effectiveness in service identification and API generation.
- [323] arXiv:2503.21525 [pdf, html, other]
-
Title: ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View StereoYuxi Hu, Jun Zhang, Zhe Zhang, Rafael Weilharter, Yuchen Rao, Kuangyi Chen, Runze Yuan, Friedrich FraundorferSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.
- [324] arXiv:2503.21529 [pdf, html, other]
-
Title: Physics-Informed Neural Network-Based Control for Grid-Forming Converter's Stability Under Overload Conditions
Subjects: Systems and Control (eess.SY)
Grid-forming converters (GFCs) are pivotal in maintaining frequency and voltage stability in modern distribution systems. However, a critical challenge arises when these converters encounter sudden power demands that exceed their rated capacity. Although GFCs are designed to manage DC source saturation and limit excessive AC currents, their ability to ensure sufficient power delivery under such constraints remains a significant concern. Existing studies often overlook this limitation, potentially compromising system stability during high-demand scenarios. This paper proposes a control strategy based on a physics-informed neural network (PINN) to improve GFC performance under overloaded conditions, effectively preventing switch failures and mitigating DC source saturation. The proposed approach outperforms conventional methods by maintaining stable voltage and frequency, even under significant load increases where traditional droop control alone proves inadequate. The post-disturbance operating point of GFCs remains unchanged using PINN-based control. The peak voltage deviation observed during the transient is reduced to 42.85%. Furthermore, the proposed method ensures that the rate of change of frequency (ROCOF) and the rate of change of voltage (ROCOV) remain within acceptable limits, significantly improving system resilience in inertia-less power networks.
- [325] arXiv:2503.21530 [pdf, html, other]
-
Title: Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.
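The character-level metrics named in this abstract can be illustrated with a small sketch. The code below is a simplified chrF-style score (character n-gram precision and recall averaged over n, combined as an F-beta score). It is not the implementation used in the paper or in any particular toolkit; the function names, the `max_n=3` default, `beta=2.0`, and the whitespace handling are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams; spaces are dropped (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF-style score in [0, 1] (hypothetical helper, not a library API)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An exact match scores 1.0 and strings with no shared characters score 0.0; a near miss such as `simple_chrf("kitab", "kitaab")` falls strictly in between, which is why character-level metrics suit transliteration better than word-level matching.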
- [326] arXiv:2503.21536 [pdf, html, other]
-
Title: Exploring the Energy Landscape of RBMs: Reciprocal Space Insights into Bosons, Hierarchical Learning and Symmetry Breaking
Comments: 19pp, 8figs, research article
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Deep generative models have become ubiquitous due to their ability to learn and sample from complex distributions. Despite the proliferation of various frameworks, the relationships among these models remain largely unexplored, a gap that hinders the development of a unified theory of AI learning. We address two central challenges: clarifying the connections between different deep generative models and deepening our understanding of their learning mechanisms. We focus on Restricted Boltzmann Machines (RBMs), known for their universal approximation capabilities for discrete distributions. By introducing a reciprocal space formulation, we reveal a connection between RBMs, diffusion processes, and coupled Bosons. We show that at initialization, the RBM operates at a saddle point, where the local curvature is determined by the singular values, whose distribution follows the Marcenko-Pastur law and exhibits rotational symmetry. During training, this rotational symmetry is broken due to hierarchical learning, where different degrees of freedom progressively capture features at multiple levels of abstraction. This leads to a symmetry breaking in the energy landscape, reminiscent of Landau theory. This symmetry breaking in the energy landscape is characterized by the singular values and the weight matrix eigenvector matrix. We derive the corresponding free energy in a mean-field approximation. We show that in the limit of infinite size RBM, the reciprocal variables are Gaussian distributed. Our findings indicate that in this regime, there will be some modes for which the diffusion process will not converge to the Boltzmann distribution. To illustrate our results, we trained replicas of RBMs with different hidden layer sizes using the MNIST dataset. Our findings bridge the gap between disparate generative frameworks and also shed light on the processes underpinning learning in generative models.
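For reference, the Marchenko-Pastur law mentioned in this abstract has a standard closed form. The block below states it for a generic random matrix with i.i.d. zero-mean entries; the symbols $n$, $m$, $q$, and $\sigma$ are notation chosen here, and the abstract does not say how the RBM weight matrix is normalized, so the mapping to the paper's setting is an assumption.

```latex
% W is an n x m matrix with i.i.d. zero-mean entries of variance \sigma^2,
% and q = m/n \in (0, 1] is held fixed as n, m \to \infty.
% The eigenvalues \lambda of \tfrac{1}{n} W^{\top} W then have limiting density
\[
  \rho(\lambda) \;=\;
  \frac{\sqrt{(\lambda_{+} - \lambda)\,(\lambda - \lambda_{-})}}
       {2\pi \sigma^{2} q \,\lambda},
  \qquad
  \lambda_{\pm} = \sigma^{2}\bigl(1 \pm \sqrt{q}\bigr)^{2},
  \qquad
  \lambda \in [\lambda_{-}, \lambda_{+}] .
\]
% The singular values s = \sqrt{\lambda} of W/\sqrt{n} therefore concentrate on
\[
  s \in \bigl[\sigma(1 - \sqrt{q}),\; \sigma(1 + \sqrt{q})\bigr].
\]
```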
- [327] arXiv:2503.21540 [pdf, html, other]
-
Title: Combining Artificial Users and Psychotherapist Assessment to Evaluate Large Language Model-based Mental Health Chatbots
Authors: Florian Onur Kuhlmeier, Leon Hanschmann, Melina Rabe, Stefan Luettke, Eva-Lotta Brakemeier, Alexander Maedche
Subjects: Human-Computer Interaction (cs.HC)
Large Language Models (LLMs) promise to overcome limitations of rule-based mental health chatbots through more natural conversations. However, evaluating LLM-based mental health chatbots presents a significant challenge: Their probabilistic nature requires comprehensive testing to ensure therapeutic quality, yet conducting such evaluations with people with depression would impose an additional burden on vulnerable people and risk exposing them to potentially harmful content. Our paper presents an evaluation approach for LLM-based mental health chatbots that combines dialogue generation with artificial users and dialogue evaluation by psychotherapists. We developed artificial users based on patient vignettes, systematically varying characteristics such as depression severity, personality traits, and attitudes toward chatbots, and let them interact with an LLM-based behavioral activation chatbot. Ten psychotherapists evaluated 48 randomly selected dialogues using standardized rating scales to assess the quality of behavioral activation and its therapeutic capabilities. We found that while artificial users showed moderate authenticity, they enabled comprehensive testing across different users. In addition, the chatbot demonstrated promising capabilities in delivering behavioral activation and maintaining safety. Furthermore, we identified deficits, such as ensuring the appropriateness of the activity plan, which reveal necessary improvements for the chatbot. Our framework provides an effective method for evaluating LLM-based mental health chatbots while protecting vulnerable people during the evaluation process. Future research should improve the authenticity of artificial users and develop LLM-augmented evaluation tools to make psychotherapist evaluation more efficient, and thus further advance the evaluation of LLM-based mental health chatbots.
- [328] arXiv:2503.21541 [pdf, html, other]
-
Title: LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found on this https URL
- [329] arXiv:2503.21544 [pdf, html, other]
-
Title: SWI: Speaking with Intent in Large Language Models
Comments: 24 pages. Code: this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Intent, typically clearly formulated and planned, functions as a cognitive framework for reasoning and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and communication. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on mathematical reasoning benchmarks consistently demonstrate the superiority of Speaking with Intent over Baseline (i.e., generation without explicit intent). Moreover, SWI outperforms answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve and maintains competitive performance with the strong method ARR (Analyzing, Retrieving, and Reasoning). Additionally, the effectiveness and generalizability of SWI are solidified on reasoning-intensive question answering (QA) and text summarization benchmarks, where SWI brings consistent improvement to the Baseline generation. In text summarization, SWI-generated summaries exhibit greater accuracy, conciseness, and factual correctness, with fewer hallucinations. Furthermore, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. This proof-of-concept study creates a novel avenue for enhancing LLMs' reasoning abilities with cognitive notions.
- [330] arXiv:2503.21548 [pdf, html, other]
-
Title: Combining Graph Attention Networks and Distributed Optimization for Multi-Robot Mixed-Integer Convex Programming
Comments: submitted to CDC 2025
Subjects: Systems and Control (eess.SY)
In this paper, we develop a fast mixed-integer convex programming (MICP) framework for multi-robot navigation by combining graph attention networks and distributed optimization. We formulate a mixed-integer optimization problem for receding horizon motion planning of a multi-robot system, taking into account the surrounding obstacles. To address the resulting multi-agent MICP problem in real time, we propose a framework that utilizes heterogeneous graph attention networks to learn the latent mapping from problem parameters to optimal binary solutions. Furthermore, we apply a distributed proximal alternating direction method of multipliers algorithm for solving the convex continuous optimization problem. We demonstrate the effectiveness of our proposed framework through experiments conducted on a robotic testbed.
- [331] arXiv:2503.21552 [pdf, html, other]
-
Title: Real-time Tracking System with partially coupled sources
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
We consider a pull-based real-time tracking system consisting of multiple partially coupled sources and a sink. The sink monitors the sources in real-time and can request one source for an update at each time instant. The sources send updates over an unreliable wireless channel. The sources are partially coupled, and updates about one source can provide partial knowledge about other sources. We study the problem of minimizing the sum of an average distortion function and a transmission cost. Since the controller is at the sink side, the controller (sink) has only partial knowledge about the source states, and thus, we model the problem as a partially observable Markov decision process (POMDP) and then cast it as a belief-MDP problem. Using the relative value iteration algorithm, we solve the problem and propose a control policy. Simulation results show the proposed policy's effectiveness and superiority compared to a baseline policy.
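The relative value iteration step mentioned in this abstract can be sketched on a toy problem. The code below runs relative value iteration for an average-cost MDP with two fully observed states ("fresh" and "stale" estimate) and two actions (idle, request an update). This is a deliberate simplification: the paper works with a belief-MDP over partially observed, coupled sources, and every number here (distortion 1, transmission cost 0.3, success probability 0.8, drift probability 0.3) is a made-up assumption.

```python
def relative_value_iteration(costs, trans, ref_state=0, tol=1e-9, max_iter=10_000):
    """Average-cost relative value iteration (illustrative helper).

    costs[s][a]     : stage cost of action a in state s
    trans[s][a][s2] : probability of moving from s to s2 under action a
    Returns (estimated average cost, relative values, greedy policy).
    """
    n = len(costs)
    h = [0.0] * n
    gain, q = 0.0, []
    for _ in range(max_iter):
        # One Bellman backup for every state-action pair.
        q = [[costs[s][a] + sum(p * h[s2] for s2, p in enumerate(trans[s][a]))
              for a in range(len(costs[s]))] for s in range(n)]
        backed_up = [min(row) for row in q]
        gain = backed_up[ref_state]           # average-cost estimate
        new_h = [v - gain for v in backed_up]  # keep h[ref_state] pinned at 0
        converged = max(abs(x - y) for x, y in zip(new_h, h)) < tol
        h = new_h
        if converged:
            break
    policy = [min(range(len(row)), key=row.__getitem__) for row in q]
    return gain, h, policy


# Toy model (all values hypothetical). State 0: fresh estimate, state 1: stale.
# Action 0: stay idle, action 1: request an update over an unreliable channel.
COSTS = [[0.0, 0.3],   # state 0: no distortion; requesting costs 0.3
         [1.0, 1.3]]   # state 1: distortion 1; requesting adds 0.3
TRANS = [[[0.7, 0.3], [0.94, 0.06]],
         [[0.0, 1.0], [0.80, 0.20]]]

gain, h, policy = relative_value_iteration(COSTS, TRANS)
```

In this toy instance the resulting policy requests an update only when the estimate is stale, which is the kind of threshold behavior a transmission cost tends to induce.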
- [332] arXiv:2503.21555 [pdf, html, other]
-
Title: SyncSDE: A Probabilistic Framework for Diffusion Synchronization
Comments: Accepted to CVPR2025
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often fail when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused - modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.
- [333] arXiv:2503.21557 [pdf, other]
-
Title: debug-gym: A Text-Based Environment for Interactive Debugging
Authors: Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, Marc-Alexandre Côté
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.
- [334] arXiv:2503.21558 [pdf, html, other]
-
Title: A Local Perspective-based Model for Overlapping Community Detection
Comments: 10 pages, 3 figures, 3 tables
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Community detection, which identifies densely connected node clusters with sparse between-group links, is vital for analyzing network structure and function in real-world systems. Most existing community detection methods based on GCNs primarily focus on node-level information while overlooking community-level features, leading to performance limitations on large-scale networks. To address this issue, we propose LQ-GCN, an overlapping community detection model from a local community perspective. LQ-GCN employs a Bernoulli-Poisson model to construct a community affiliation matrix and form an end-to-end detection framework. By adopting local modularity as the objective function, the model incorporates local community information to enhance the quality and accuracy of clustering results. Additionally, the conventional GCNs architecture is optimized to improve the model capability in identifying overlapping communities in large-scale networks. Experimental results demonstrate that LQ-GCN achieves up to a 33% improvement in Normalized Mutual Information (NMI) and a 26.3% improvement in Recall compared to baseline models across multiple real-world benchmark datasets.
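The Bernoulli-Poisson model named in this abstract links a non-negative community affiliation matrix to edge probabilities. The sketch below shows that link in its commonly used form, where nodes u and v are connected with probability 1 - exp(-F_u · F_v). The affiliation values, the threshold of 0.5, and the function names are assumptions for illustration; the abstract does not describe how LQ-GCN produces or thresholds its affiliations.

```python
import math


def edge_probability(f_u, f_v):
    """Bernoulli-Poisson link probability: 1 - exp(-<F_u, F_v>)."""
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 - math.exp(-dot)


def communities_of(f_u, threshold=0.5):
    """Communities whose affiliation strength exceeds a hypothetical threshold."""
    return [c for c, strength in enumerate(f_u) if strength > threshold]


# Hypothetical affiliation matrix: 3 nodes, 2 communities, non-negative entries.
F = [
    [2.0, 0.0],  # node 0 belongs to community 0 only
    [1.5, 1.5],  # node 1 overlaps both communities
    [0.0, 2.0],  # node 2 belongs to community 1 only
]

p_shared = edge_probability(F[0], F[1])    # nodes share community 0
p_disjoint = edge_probability(F[0], F[2])  # no shared community
```

Because a node's row can be large in several columns at once, overlapping membership falls out of the representation directly: node 1 above is assigned to both communities.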
- [335] arXiv:2503.21562 [pdf, html, other]
-
Title: uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images
Comments: Accepted to WACV-2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, in particular by allocating perspective images to the most suitable latitude coordinate so that both domains can be exploited seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor and an extra 1D-Convolution layer to condition each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the FoV input. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows for the first time a single end-to-end model for both domains. Extensive experiments on the real-world datasets LSUN, Matterport3D, PanoContext, and Stanford 2D-3D demonstrate the contribution of our approach. Code is available at this https URL.
- [336] arXiv:2503.21563 [pdf, html, other]
-
Title: Consistent Multigroup Low-Rank Approximation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider the problem of consistent low-rank approximation for multigroup data: we ask for a sequence of $k$ basis vectors such that projecting the data onto their spanned subspace treats all groups as equally as possible, by minimizing the maximum error among the groups. Additionally, we require that the sequence of basis vectors satisfies the natural consistency property: when looking for the best $k$ vectors, the first $d<k$ vectors are the best possible solution to the problem of finding $d$ basis vectors. Thus, this multigroup low-rank approximation method naturally generalizes SVD and reduces to SVD for data with a single group. We give an iterative algorithm for this task that sequentially adds to the basis the vector that gives the best rank-1 projection according to the min-max criterion, and then projects the data onto the orthogonal complement of that vector. For finding the best rank-1 projection, we use primal-dual approaches or semidefinite programming. We analyze the theoretical properties of the algorithms and demonstrate empirically that the proposed methods compare favorably to existing methods for multigroup (or fair) PCA.
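The min-max criterion for a single rank-1 step can be made concrete with a small sketch. The code below searches over unit vectors in two dimensions for the direction that minimizes the worst per-group projection error. It swaps in a brute-force angle grid for the primal-dual and semidefinite programming solvers named in the abstract, and it assumes the per-group error is the mean squared residual, which the abstract does not specify; all names and data are hypothetical.

```python
import math


def group_error(points, v):
    """Mean squared residual after projecting 2-D points onto unit vector v."""
    total = 0.0
    for x, y in points:
        proj = x * v[0] + y * v[1]
        total += (x - proj * v[0]) ** 2 + (y - proj * v[1]) ** 2
    return total / len(points)


def minmax_rank1(groups, steps=1800):
    """Grid search over directions for the smallest worst-group error."""
    best_v, best_err = None, float("inf")
    for k in range(steps):
        theta = math.pi * k / steps  # directions v and -v are equivalent
        v = (math.cos(theta), math.sin(theta))
        worst = max(group_error(g, v) for g in groups)
        if worst < best_err:
            best_v, best_err = v, worst
    return best_v, best_err


# Hypothetical data: group A varies along x, group B varies along y.
GROUP_A = [(2.0, 0.0), (-2.0, 0.0)]
GROUP_B = [(0.0, 1.0), (0.0, -1.0)]

v_fair, err_fair = minmax_rank1([GROUP_A, GROUP_B])
v_single, err_single = minmax_rank1([GROUP_A])
```

With both groups, the chosen direction tilts away from the x-axis so that neither group bears all the error (the worst-group error is about 0.8 here). With group A alone, the search recovers the x-axis with zero error, matching the single-group reduction to the top singular direction.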
- [337] arXiv:2503.21564 [pdf, html, other]
-
Title: Cooking Task Planning using LLM and Verified by Graph Network
Subjects: Robotics (cs.RO)
Cooking tasks remain a challenging problem for robotics due to their complexity. Videos of people cooking are a valuable source of information for such tasks, but they introduce a lot of variability in terms of how to translate this data to a robotic environment. This research aims to streamline this process, focusing on the task plan generation step, by using a Large Language Model (LLM)-based Task and Motion Planning (TAMP) framework to autonomously generate cooking task plans from videos with subtitles, and execute them. Conventional LLM-based task planning methods are not well-suited for interpreting the cooking video data due to uncertainty in the videos and the risk of hallucination in their output. To address both of these problems, we explore using LLMs in combination with Functional Object-Oriented Networks (FOON), to validate the plan and provide feedback in case of failure. This combination can generate task sequences with manipulation motions that are logically correct and executable by a robot. We compare the execution of the generated plans for 5 cooking recipes from our approach against the plans generated by a few-shot LLM-only approach for a dual-arm robot setup. The robot successfully executed 4 of the plans generated by our approach, whereas only 1 of the plans generated solely by the LLM could be executed.
- [338] arXiv:2503.21566 [pdf, other]
-
Title: Bearing fault diagnosis based on multi-scale spectral images and convolutional neural network
Comments: 12 pages, 10 figures and 8 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
To address the challenges of low diagnostic accuracy in traditional bearing fault diagnosis methods, this paper proposes a novel fault diagnosis approach based on multi-scale spectrum feature images and deep learning. Firstly, the vibration signals are preprocessed through mean removal and then converted to multi-length spectra with the fast Fourier transform (FFT). Secondly, a novel feature called the multi-scale spectral image (MSSI) is constructed by a multi-length spectrum paving scheme. Finally, a deep learning framework, a convolutional neural network (CNN), is formulated to diagnose the bearing faults. Two experimental cases are utilized to verify the effectiveness of the proposed method. Experimental results demonstrate that the proposed method significantly improves the accuracy of fault diagnosis.
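The first preprocessing step described above (mean removal, then spectra at several lengths) can be sketched as follows. The code uses a naive discrete Fourier transform in place of an FFT so that it needs no external libraries, and it stops before the paving step, whose layout the abstract does not describe. The segment lengths and the synthetic test signal are assumptions for illustration.

```python
import cmath
import math


def magnitude_spectrum(x):
    """One-sided DFT magnitudes of a real sequence (naive O(n^2) DFT for clarity)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def multi_length_spectra(signal, lengths=(16, 32, 64)):
    """Remove the mean, then compute spectra of the first n samples for each n."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    return [magnitude_spectrum(centered[:n]) for n in lengths]


# Synthetic stand-in for a vibration signal: constant offset plus one tone
# at 0.25 cycles per sample (hypothetical values).
SIGNAL = [5.0 + math.sin(2 * math.pi * 0.25 * t) for t in range(64)]
spectra = multi_length_spectra(SIGNAL)
peak_bins = [s.index(max(s)) for s in spectra]
```

The same tone appears at bin n/4 in each spectrum, so longer segments resolve it on a finer frequency grid. Stacking such spectra is presumably what gives the resulting image its multi-scale character, though the exact arrangement is the paper's own.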
- [339] arXiv:2503.21571 [pdf, html, other]
-
Title: Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting
Comments: Main paper (6 pages). Accepted for publication by ICME 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online at this https URL.
- [340] arXiv:2503.21579 [pdf, html, other]
-
Title: Fusion of Graph Neural Networks via Optimal Transport
Subjects: Machine Learning (cs.LG)
In this paper, we explore the idea of combining GCNs into one model. To that end, we align the weights of different models layer-wise using optimal transport (OT). We present and evaluate three types of transportation costs and show that the studied fusion method consistently outperforms the performance of vanilla averaging. Finally, we present results suggesting that model fusion using OT is harder in the case of GCNs than MLPs and that incorporating the graph structure into the process does not improve the performance of the method.
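The difference between vanilla averaging and OT-based alignment can be shown on a single weight matrix. When both layers have the same number of neurons and uniform mass, an optimal transport plan can be taken to be a permutation, so the sketch below finds the best neuron matching by brute force and then averages the matched rows. The squared Euclidean cost is an assumption (the abstract mentions three transport costs without naming them), and a full method would also have to permute the next layer's inputs consistently.

```python
from itertools import permutations


def align_and_average(w_a, w_b):
    """Match rows of w_b to rows of w_a, then average matched pairs.

    Brute force over permutations, so only suitable for tiny layers.
    Returns (matching, fused), where matching[i] is the row of w_b paired
    with row i of w_a.
    """
    n = len(w_a)

    def cost(i, j):
        # Squared Euclidean distance between neuron weight vectors (assumed cost).
        return sum((a - b) ** 2 for a, b in zip(w_a[i], w_b[j]))

    best = min(permutations(range(n)),
               key=lambda p: sum(cost(i, p[i]) for i in range(n)))
    fused = [[(a + b) / 2 for a, b in zip(w_a[i], w_b[best[i]])]
             for i in range(n)]
    return list(best), fused


# Hypothetical weights: model B holds the same neurons as model A, reordered.
W_A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_B = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

matching, fused = align_and_average(W_A, W_B)
naive = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(W_A, W_B)]
```

Here the aligned average reproduces the shared neurons exactly, while the naive row-by-row average blends unrelated neurons together.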
- [341] arXiv:2503.21581 [pdf, html, other]
-
Title: AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges, rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
- [342] arXiv:2503.21582 [pdf, html, other]
-
Title: Time hierarchies for sublogarithmic-space quantum computation
Subjects: Computational Complexity (cs.CC)
We present new results on the landscape of problems that can be solved by quantum Turing machines (QTMs) employing severely limited amounts of memory. In this context, we demonstrate two infinite time hierarchies of complexity classes within the "small space" regime: For all $i\geq 0$, there is a language that can be recognized by a constant-space machine in $2^{O(n^{1/2^i})}$ time, but not by any sublogarithmic-space QTM in $2^{O(n^{1/2^{i+1}})}$ time. For quantum machines operating within $o(\log \log n)$ space, there exists another hierarchy, each level of which corresponds to an expected runtime of $2^{O((\log n)^i)}$ for a different positive integer $i$. We also improve a quantum advantage result, demonstrating a language that can be recognized by a polynomial-time constant-space QTM, but not by any classical machine using $o(\log \log n)$ space, regardless of the time budget. The implications of our findings for quantum space-time tradeoffs are discussed.
- [343] arXiv:2503.21588 [pdf, html, other]
-
Title: Generalizable Implicit Neural Representations via Parameterized Latent Dynamics for Baroclinic Ocean Forecasting
Authors: Guang Zhao, Xihaier Luo, Seungjun Lee, Yihui Ren, Shinjae Yoo, Luke Van Roekel, Balu Nadiga, Sri Hari Krishna Narayanan, Yixuan Sun, Wei Xu
Subjects: Machine Learning (cs.LG)
Mesoscale ocean dynamics play a critical role in climate systems, governing heat transport, hurricane genesis, and drought patterns. However, simulating these processes at high resolution remains computationally prohibitive due to their nonlinear, multiscale nature and vast spatiotemporal domains. Implicit neural representations (INRs) reduce the computational costs as resolution-independent surrogates but fail in many-query scenarios (inverse modeling) requiring rapid evaluations across diverse parameters. We present PINROD, a novel framework combining dynamics-aware implicit neural representations with parameterized neural ordinary differential equations to address these limitations. By integrating parametric dependencies into latent dynamics, our method efficiently captures nonlinear oceanic behavior across varying boundary conditions and physical parameters. Experiments on ocean mesoscale activity data show superior accuracy over existing baselines and improved computational efficiency compared to standard numerical simulations.
- [344] arXiv:2503.21591 [pdf, html, other]
-
Title: Dataset and Analysis of Long-Term Skill Acquisition in Robot-Assisted Minimally Invasive Surgery
Comments: 12 pages, 8 figures
Subjects: Robotics (cs.RO)
Objective: We aim to investigate long-term robotic surgical skill acquisition among surgical residents and the effects of training intervals and fatigue on performance. Methods: For six months, surgical residents participated in three training sessions once a month, surrounding a single 26-hour hospital shift. In each shift, they participated in training sessions scheduled before, during, and after the shift. In each training session, they performed three dry-lab training tasks: Ring Tower Transfer, Knot-Tying, and Suturing. We collected a comprehensive dataset, including videos synchronized with kinematic data, activity tracking, and scans of the suturing pads. Results: We collected a dataset of 972 trials performed by 18 residents of different surgical specializations. Participants demonstrated consistent performance improvement across all tasks. In addition, we found variations in between-shift learning and forgetting across metrics and tasks, and hints for possible effects of fatigue. Conclusion: The findings from our first analysis shed light on the long-term learning processes of robotic surgical skills with extended intervals and varying levels of fatigue. Significance: This study lays the groundwork for future research aimed at optimizing training protocols and enhancing AI applications in surgery, ultimately contributing to improved patient outcomes. The dataset will be made available upon acceptance of our journal submission.
- [345] arXiv:2503.21592 [pdf, html, other]
-
Title: Critical Iterative Denoising: A Discrete Generative Model Applied to Graphs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the time dependencies in the noising process of these models lead to error accumulation and propagation during the backward process. This issue, particularly pronounced in mask diffusion, is a known limitation in sequence modeling and, as we demonstrate, also impacts discrete diffusion models for graphs.
To address this problem, we propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. Additionally, we enhance our model by incorporating a Critic, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
- [346] arXiv:2503.21594 [pdf, html, other]
-
Title: AUTOBargeSim: MATLAB(R) toolbox for the design and analysis of the guidance and control system for autonomous inland vessels
Authors: Abhishek Dhyani, Amirreza Haqshenas Mojaveri, Chengqian Zhang, Dhanika Mahipala, Hoang Anh Tran, Yan-Yun Zhang, Zhongbi Luo, Vasso Reppa
Subjects: Systems and Control (eess.SY)
This paper introduces AUTOBargeSim, a simulation toolbox for autonomous inland vessel guidance and control system design. AUTOBargeSim is developed using MATLAB and provides an easy-to-use introduction to various aspects of autonomous inland navigation, including mapping, modelling, control design, and collision avoidance, through examples and extensively documented code. Applying modular design principles in the simulator structure allows it to be easily modified according to the user's requirements. Furthermore, a GUI facilitates simple and quick execution. Key performance indices for evaluating the performance of the controller and collision avoidance method in confined space are also provided. The current version of AUTOBargeSim attempts to improve reproducibility in the design and simulation of marine systems while serving as a foundation for simulating and evaluating vessel behaviour considering operational, system, and environmental constraints.
- [347] arXiv:2503.21595 [pdf, html, other]
-
Title: FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
- [348] arXiv:2503.21598 [pdf, other]
-
Title: Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt ProcessingComments: 22 pages; 26 figuresSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation. Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.
- [349] arXiv:2503.21601 [pdf, other]
-
Title: A Deep Reinforcement Learning-based Approach for Adaptive Handover ProtocolsSubjects: Networking and Internet Architecture (cs.NI)
The use of higher frequencies in mobile communication systems leads to smaller cell sizes, resulting in the deployment of more base stations and an increase in handovers to support user mobility. This can lead to frequent radio link failures and reduced data rates. In this work, we propose a handover optimization method using proximal policy optimization (PPO) to develop an adaptive handover protocol. Our PPO-based agent, implemented in the base stations, is highly adaptive to varying user equipment speeds and outperforms the 3GPP-standardized 5G NR handover procedure in terms of average data rate and radio link failure rate. Additionally, our simulation environment is carefully designed to ensure high accuracy, realistic user movements, and fair benchmarking against the 3GPP handover method.
- [350] arXiv:2503.21602 [pdf, html, other]
-
Title: GenEdit: Compounding Operators and Continuous Improvement to Tackle Text-to-SQL in the EnterpriseSubjects: Artificial Intelligence (cs.AI)
Recent advancements in Text-to-SQL, driven by large language models, are democratizing data access. Despite these advancements, enterprise deployments remain challenging due to the need to capture business-specific knowledge, handle complex queries, and meet expectations of continuous improvements. To address these issues, we designed and implemented GenEdit: our Text-to-SQL generation system that improves with user feedback. GenEdit builds and maintains a company-specific knowledge set, employs a pipeline of operators decomposing SQL generation, and uses feedback to update its knowledge set to improve future SQL generations.
We describe GenEdit's architecture, which consists of two core modules: (i) decomposed SQL generation; and (ii) knowledge set edits based on user feedback. For generation, GenEdit leverages compounding operators to improve knowledge retrieval and to create a plan as chain-of-thought steps that guides generation. GenEdit first retrieves relevant examples in an initial retrieval stage where original SQL queries are decomposed into sub-statements, clauses or sub-queries. It then also retrieves instructions and schema elements. Using the retrieved contextual information, GenEdit then generates a step-by-step plan in natural language on how to produce the query. Finally, GenEdit uses the plan to generate SQL, minimizing the need for model reasoning, which enhances complex SQL generation. If necessary, GenEdit regenerates the query based on syntactic and semantic errors. The knowledge set edits are recommended through an interactive copilot, allowing users to iterate on their feedback and to regenerate SQL queries as needed. Each generation uses staged edits which update the generation prompt. Once the feedback is submitted, it gets merged after passing regression testing and obtaining an approval, improving future generations.
- [351] arXiv:2503.21613 [pdf, other]
-
Title: Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approachComments: 22 pages, 6 figuresSubjects: Computation and Language (cs.CL)
We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without recourse to the original text. Employing a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation is performed with an LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary via a cross-model assessment, where all participating LLMs evaluate not only their own outputs but also those produced by others. This methodology enables the identification of potential biases, such as the proclivity for models to favor their own summarization style over others. In addition, alignment between the human-crafted and LLM-generated summaries is quantified using ROUGE and BERTScore metrics, assessing the depth of grammatical and semantic correspondence. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both strengths and limitations inherent in relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of LLM internal encodings of factual information and the dynamics of cross-model evaluation, with implications for the development of more robust natural language generative systems.
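To make the overlap-based part of such an evaluation concrete, the following is a minimal sketch of a ROUGE-1 F1 score. It is not the authors' evaluation code: the function name `rouge1_f1` is hypothetical, and lowercasing plus whitespace tokenization is a simplifying assumption (standard ROUGE toolkits typically add stemming and more careful tokenization).

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A score of this kind captures only lexical overlap, which is presumably why the abstract pairs ROUGE with the embedding-based BERTScore.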
- [352] arXiv:2503.21614 [pdf, html, other]
-
Title: A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng
Comments: Survey, 32 pages, Large Reasoning Models, Efficient Reasoning for Language, Multimodality, and BeyondSubjects: Computation and Language (cs.CL)
Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.
- [353] arXiv:2503.21615 [pdf, html, other]
-
Title: A Measure Based Generalizable Approach to UnderstandabilityComments: 6 pagesSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Successful agent-human partnerships require that any agent generated information is understandable to the human, and that the human can easily steer the agent towards a goal. Such effective communication requires the agent to develop a finer-level notion of what is understandable to the human. State-of-the-art agents, including LLMs, lack this detailed notion of understandability because they only capture average human sensibilities from the training data, and therefore afford limited steerability (e.g., requiring non-trivial prompt engineering).
In this paper, instead of relying only on data, we argue for developing generalizable, domain-agnostic measures of understandability that can be used as directives for these agents. Existing research on understandability measures is fragmented; we survey various such efforts across domains and lay a cognitive-science-rooted groundwork for more coherent and domain-agnostic research in the future.
- [354] arXiv:2503.21616 [pdf, html, other]
-
Title: Audio-driven Gesture Generation via Deviation Feature in the Latent SpaceComments: 6 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
- [355] arXiv:2503.21617 [pdf, html, other]
-
Title: Leveraging Language Models for Analyzing Longitudinal Experiential Data in EducationSubjects: Machine Learning (cs.LG)
We propose a novel approach to leveraging pre-trained language models (LMs) for early forecasting of academic trajectories in STEM students using high-dimensional longitudinal experiential data. This data, which captures students' study-related activities, behaviors, and psychological states, offers valuable insights for forecasting-based interventions. Key challenges in handling such data include high rates of missing values, limited dataset size due to costly data collection, and complex temporal variability across modalities. Our approach addresses these issues through a comprehensive data enrichment process, integrating strategies for managing missing values, augmenting data, and embedding task-specific instructions and contextual cues to enhance the models' capacity for learning temporal patterns. Through extensive experiments on a curated student learning dataset, we evaluate both encoder-decoder and decoder-only LMs. While our findings show that LMs effectively integrate data across modalities and exhibit resilience to missing data, they primarily rely on high-level statistical patterns rather than demonstrating a deeper understanding of temporal dynamics. Furthermore, their ability to interpret explicit temporal information remains limited. This work advances educational data science by highlighting both the potential and limitations of LMs in modeling student trajectories for early intervention based on longitudinal experiential data.
- [356] arXiv:2503.21618 [pdf, html, other]
-
Title: A shifted Laplace rational filter for large-scale eigenvalue problemsSubjects: Numerical Analysis (math.NA)
We present a rational filter for computing all eigenvalues of a symmetric definite eigenvalue problem lying in an interval on the real axis. The linear systems arising from the filter embedded in the subspace iteration framework are solved via a preconditioned Krylov method.
The choice of the poles of the filter is based on two criteria. On the one hand, the filter should enhance the eigenvalues in the interval of interest, which suggests that the poles should be chosen close to or in the interval. On the other hand, the choice of poles has an important impact on the convergence speed of the iterative method. For the solution of problems arising from vibrations, the two criteria contradict each other, since fast convergence of the eigensolver requires poles to be in or close to the interval, whereas the iterative linear system solver becomes cheaper when the poles lie further away from the eigenvalues. In the paper, we propose a selection of poles inspired by the shifted Laplace preconditioner for the Helmholtz equation.
We show numerical experiments from finite element models of vibrations. We compare the shifted Laplace rational filter with rational filters based on quadrature rules for contour integration.
- [357] arXiv:2503.21620 [pdf, html, other]
-
Title: UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement LearningSubjects: Artificial Intelligence (cs.AI)
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e., Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
- [358] arXiv:2503.21622 [pdf, other]
-
Title: The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly DetectionComments: paper under review; dataset first released for the VAND3.0 challenge @ CVPR 2025 this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point. This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results. We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images. It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects. We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts. We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (this https URL). All image data is available at this https URL.
- [359] arXiv:2503.21623 [pdf, html, other]
-
Title: RIS-Measurements for Codebook DesignComments: 6 pages, presented during WiMOB 2024 conferenceSubjects: Networking and Internet Architecture (cs.NI)
Reconfigurable Intelligent Surfaces (RIS) have gained significant attention for some time. Because each reflecting element of the boards can be steered individually, they are expected to impact the propagation environment significantly. In this work, we concentrate on the practical verification of this concept. We present the results of detailed measurements of the reflection characteristics of the RIS boards, which were deliberately conducted in a real environment. Various potential impacting factors have been considered (azimuth and elevation angle, polarization, number of RIS boards, and distance). The measurement results form the basis for a conceptual analysis of the practical possibility of creating a codebook (consisting of RIS patterns, i.e., codewords) for some applications.
- [360] arXiv:2503.21626 [pdf, html, other]
-
Title: Inverse Lax-Wendroff boundary treatment for solving conservation laws with finite difference HWENO methodsSubjects: Numerical Analysis (math.NA)
This paper presents a novel inverse Lax-Wendroff (ILW) boundary treatment for finite difference Hermite weighted essentially non-oscillatory (HWENO) schemes to solve hyperbolic conservation laws on arbitrary geometries. The complex geometric domain is divided by a uniform Cartesian grid, which makes the boundary treatment challenging. The proposed ILW boundary treatment provides high-order approximations of both solution values and spatial derivatives at ghost points outside the computational domain. Distinct from existing ILW approaches, our boundary treatment constructs the extrapolation through a least squares formulation, coupled with the spatial derivatives at the boundary obtained via the ILW procedure. Theoretical analysis indicates that, compared with other ILW methods, the proposed method requires fewer terms obtained through the relatively complicated ILW procedure and thus improves computational efficiency while preserving accuracy and stability. The effectiveness and robustness of the method are validated through numerical experiments.
- [361] arXiv:2503.21627 [pdf, html, other]
-
Title: Provable Reduction in Communication Rounds for Non-Smooth Convex Federated LearningSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Multiple local steps are key to communication-efficient federated learning. However, theoretical guarantees for such algorithms, without data heterogeneity-bounding assumptions, have been lacking in general non-smooth convex problems. Leveraging projection-efficient optimization methods, we propose FedMLS, a federated learning algorithm with provable improvements from multiple local steps. FedMLS attains an $\epsilon$-suboptimal solution in $\mathcal{O}(1/\epsilon)$ communication rounds, requiring a total of $\mathcal{O}(1/\epsilon^2)$ stochastic subgradient oracle calls.
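As a rough worked illustration of the two stated rates, the following plugs in a target accuracy of $\epsilon = 10^{-2}$. The constants hidden in the $\mathcal{O}(\cdot)$ notation are not given in the abstract and are ignored here, and the per-round figure is simply the quotient of the two stated bounds, not a claim about how FedMLS actually schedules its local steps.

```latex
\[
\epsilon = 10^{-2}: \qquad
R = \mathcal{O}(1/\epsilon) \sim 10^{2} \ \text{communication rounds}, \qquad
N = \mathcal{O}(1/\epsilon^{2}) \sim 10^{4} \ \text{subgradient oracle calls},
\]
\[
\frac{N}{R} \sim \frac{1/\epsilon^{2}}{1/\epsilon} = \frac{1}{\epsilon} = 10^{2}
\ \text{oracle calls per round on average}.
\]
```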
- [362] arXiv:2503.21629 [pdf, html, other]
-
Title: ClusterSC: Advancing Synthetic Control with Donor SelectionComments: 35 pages, 11 figures, to be published in Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AIStats) 2025Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool. SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data. Because individual-level datasets contain a greater number of observed units, this shift introduces the curse of dimensionality to SC. To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups. ClusterSC incorporates a clustering step to select only the relevant donors for the target. We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets. The results indicate that ClusterSC consistently outperforms classical SC approaches.
- [363] arXiv:2503.21633 [pdf, other]
-
Title: Static and Repeated Cooperative Games for the Optimization of the AoI in IoT NetworksComments: 6 pages, 7 figures, submitted to MedComNet 2025Subjects: Networking and Internet Architecture (cs.NI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Wireless sensing and the internet of things (IoT) are nowadays pervasive in 5G and beyond networks, and they are expected to play a crucial role in 6G. However, a centralized optimization of a distributed system is not always possible and cost-efficient. In this paper, we analyze a setting in which two sensors collaboratively update a common server seeking to minimize the age of information (AoI) of the latest sample of a common physical process. We consider a distributed and uncoordinated setting where each sensor lacks information about whether the other decides to update the server. This strategic setting is modeled through game theory (GT) and two games are defined: i) a static game of complete information with an incentive mechanism for cooperation, and ii) a repeated game over a finite horizon where the static game is played at each stage. We perform a mathematical analysis of the static game finding three Nash Equilibria (NEs) in pure strategies and one in mixed strategies. A numerical simulation of the repeated game is also presented and novel and valuable insight into the setting is given thanks to the definition of a new metric, the price of delayed updates (PoDU), which shows that the decentralized solution provides results close to the centralized optimum.
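The idea of finding pure-strategy Nash equilibria in a two-sensor static game can be sketched by brute-force enumeration. The payoff model below is a hypothetical stand-in, not the game defined in the paper: each sensor receives an assumed `benefit` if at least one of them updates the server and pays an assumed `cost` if it updates itself, and there is no incentive mechanism. The equilibria it yields therefore need not match the three pure-strategy equilibria reported in the abstract.

```python
from itertools import product

ACTIONS = ("update", "idle")


def payoff(own: str, other: str, benefit: float = 1.0, cost: float = 0.4) -> float:
    """Hypothetical payoff: shared freshness benefit minus own update cost."""
    server_is_fresh = own == "update" or other == "update"
    gain = benefit if server_is_fresh else 0.0
    spent = cost if own == "update" else 0.0
    return gain - spent


def pure_nash_equilibria(benefit: float = 1.0, cost: float = 0.4):
    """Return all action profiles where neither sensor gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(ACTIONS, repeat=2):
        # Sensor 1 cannot improve by switching while sensor 2 keeps a2.
        best1 = all(
            payoff(a1, a2, benefit, cost) >= payoff(d, a2, benefit, cost)
            for d in ACTIONS
        )
        # Sensor 2 cannot improve by switching while sensor 1 keeps a1.
        best2 = all(
            payoff(a2, a1, benefit, cost) >= payoff(d, a1, benefit, cost)
            for d in ACTIONS
        )
        if best1 and best2:
            equilibria.append((a1, a2))
    return equilibria
```

With the default values (cost below benefit), the only stable profiles in this toy model are the two in which exactly one sensor updates; when the cost exceeds the benefit, neither sensor updates.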
- [364] arXiv:2503.21634 [pdf, html, other]
-
Title: When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in MoroccoSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at this https URL) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
- [365] arXiv:2503.21636 [pdf, html, other]
-
Title: KRAFT -- A Knowledge-Graph-Based Resource Allocation FrameworkSubjects: Software Engineering (cs.SE)
Resource allocation in business process management involves assigning resources to open tasks while considering factors such as individual roles, aptitudes, case-specific characteristics, and regulatory constraints. Current information systems for resource allocation often require extensive manual effort to specify and maintain allocation rules, making them rigid and challenging to adapt. In contrast, fully automated approaches provide limited explainability, making it difficult to understand and justify allocation decisions. Knowledge graphs, which represent real-world entities and their relationships, offer a promising solution by capturing complex dependencies and enabling dynamic, context-aware resource allocation. This paper introduces KRAFT, a novel approach that leverages knowledge graphs and reasoning techniques to support resource allocation decisions. We demonstrate that integrating knowledge graphs into resource allocation software allows for adaptable and transparent decision-making based on an evolving knowledge base.
- [366] arXiv:2503.21638 [pdf, other]
-
Title: Data-Driven Extreme Response EstimationComments: From the 35th Symposium on Naval HydrodynamicsSubjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
A method to rapidly estimate extreme ship response events is developed in this paper. The method involves training a Long Short-Term Memory (LSTM) neural network to correct a lower-fidelity hydrodynamic model to the level of a higher-fidelity simulation. More focus is placed on larger responses by isolating the time-series near peak events identified in the lower-fidelity simulations and training on only the shorter time-series around the large event. The method is tested on the estimation of pitch time-series maxima in Sea State 5 (significant wave height of 4.0 meters and modal period of 15.0 seconds) generated by a lower-fidelity hydrodynamic solver known as SimpleCode and a higher-fidelity tool known as the Large Amplitude Motion Program (LAMP). The results are also compared with an LSTM trained without special considerations for large events.
- [367] arXiv:2503.21640 [pdf, html, other]
-
Title: Towards Fully Automated Decision-Making Systems for Greenhouse Control: Challenges and OpportunitiesSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Machine learning has been successful in building control policies that drive a complex system to desired states in various applications (e.g., games and robotics). Specifically, the parameters of a policy can be automatically optimized from observations of the environment so that the policy generates a sequence of decisions leading to the best performance. In this survey paper, we explore such policy-learning techniques for another unique, practical use-case scenario: farming, in which critical decisions (e.g., water supply and heating) must be made in a timely manner to minimize risks (e.g., damage to plants) while maximizing the eventual revenue (e.g., healthy crops). We first provide a broad overview of the latest studies on this topic to identify not only domain-specific challenges but also opportunities with potential solutions, some of which are suggested as promising directions for future research. We then introduce our approach, which ranked second among 46 teams at the "3rd Autonomous Greenhouse Challenge", and use this specific example to discuss the lessons learned about important design considerations for creating autonomous farm-management systems.
- [368] arXiv:2503.21645 [pdf, other]
-
Title: Mapping the Digital Diplomatic Infrastructure: A Comparative Evaluation of Global Online Directories for Diplomatic MissionsComments: 14 pages, 2 tables, 1 chartSubjects: Digital Libraries (cs.DL)
This study provides a comparative evaluation of global diplomatic mission directories. this http URL, this http URL, and this http URL are strategically selected among the top ten global services. After analyzing nearly all available online global diplomatic directory services, these three platforms are selected as they represent fundamentally different approaches to creating worldwide diplomatic mission databases. Using official diplomatic lists from over 150 countries as benchmarks, we assessed data coverage, accuracy, and update frequency across these platforms. DiplomaticMonitor consistently outperforms its counterparts in structure, completeness, and timeliness, accurately reflecting ambassadorial appointment cycles and maintaining high precision across contact and personnel records. EmbassyPages, despite strong search engine visibility and widespread usage, exhibits significant data currency issues, with markedly diminished ambassadorial accuracy attributable to delayed refresh cycles. WikiData offers valuable historical documentation and open-source accessibility but lacks the consistency and verification protocols necessary for reliable real-time diplomatic information. Our findings highlight the critical challenge posed by the absence of a standardized global diplomatic mission registry. In this fragmented landscape, methodologically rigorous third-party platforms can occasionally surpass government-published records in quality and utility. The research demonstrates that in contemporary digital diplomacy, data reliability correlates less with institutional provenance than with disciplined, transparent, and consistent data stewardship practices.
- [369] arXiv:2503.21646 [pdf, html, other]
-
Title: Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation ModelsSubjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Discrete-event simulation (DES) is widely used in healthcare Operations Research, but the models themselves are rarely shared. This limits their potential for reuse and long-term impact in the modelling and healthcare communities. This study explores the feasibility of using generative artificial intelligence (AI) to recreate published models using Free and Open Source Software (FOSS), based on the descriptions provided in an academic journal. Using a structured methodology, we successfully generated, tested and internally reproduced two DES models, including user interfaces. The reported results were replicated for one model, but not the other, likely due to missing information on distributions. These models are substantially more complex than AI-generated DES models published to date. Given the challenges we faced in prompt engineering, code generation, and model testing, we conclude that our iterative approach to model development, systematic comparison and testing, and the expertise of our team were necessary to the success of our recreated simulation models.
- [370] arXiv:2503.21655 [pdf, html, other]
-
Title: Output-sensitive approximate counting via a measure-bounded hyperedge oracle, or: How asymmetry helps estimate $k$-clique counts fasterComments: To appear in STOC 2025Subjects: Data Structures and Algorithms (cs.DS)
Dell, Lapinskas and Meeks [DLM SICOMP 2022] presented a general reduction from approximate counting to decision for a class of fine-grained problems that can be viewed as hyperedge counting or detection problems in an implicit hypergraph, thus obtaining tight equivalences between approximate counting and decision for many key problems such as $k$-clique, $k$-sum and more. Their result is a reduction from approximately counting the number of hyperedges in an implicit $k$-partite hypergraph to a polylogarithmic number of calls to a hyperedge oracle that returns whether a given subhypergraph contains an edge.
The main result of this paper is a generalization of the DLM result for output-sensitive approximate counting, where the running time of the desired counting algorithm is inversely proportional to the number of witnesses. Our theorem is a reduction from approximately counting the (unknown) number of hyperedges in an implicit $k$-partite hypergraph to a polylogarithmic number of calls to a hyperedge oracle called only on subhypergraphs with a small "measure". If a subhypergraph has $u_i$ nodes in the $i$th node partition of the $k$-partite hypergraph, then its measure is $\prod_i u_i$.
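As a small illustration of the measure just defined, the following sketch computes $\prod_i u_i$ for a subhypergraph given its per-partition node counts. The function name `measure` and the example sizes are assumptions made for illustration; they do not come from the paper.

```python
from math import prod

def measure(part_sizes):
    """Measure of a subhypergraph: the product of u_i over the k node partitions.

    part_sizes[i] is u_i, the number of nodes the subhypergraph keeps in the
    i-th node partition of the k-partite hypergraph.
    """
    return prod(part_sizes)

# Hypothetical 3-partite example: 4, 2 and 5 nodes kept in the three partitions.
example_measure = measure([4, 2, 5])
```

Under this definition, shrinking any single partition shrinks the measure proportionally, which is why an oracle restricted to small-measure subhypergraphs can be cheaper to implement.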
Using the new general reduction and by efficiently implementing measure-bounded colorful independence oracles, we obtain new improved output-sensitive approximate counting algorithms for $k$-clique, $k$-dominating set and $k$-sum. In graphs with $n^t$ $k$-cliques, for instance, our algorithm $(1\pm \epsilon)$-approximates the $k$-clique count in time
$$\tilde{O}_\epsilon(n^{\omega(\frac{k-t-1}{3},\frac{k-t}{3},\frac{k-t+2}{3}) }+n^2),$$ where $\omega(a,b,c)$ is the exponent of $n^a\times n^b$ by $n^b\times n^c$ matrix multiplication. For large $k$ and $t>2$, this is a substantial improvement over prior work, even if $\omega=2$.
- [371] arXiv:2503.21657 [pdf, html, other]
-
Title: Model Assembly Learning with Heterogeneous Layer Weight MergingComments: ICLR 2025 Workshop on Neural Network Weights as a New Data ModalitySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Model merging acquires general capabilities without extra data or training by combining multiple models' parameters. Previous approaches achieve linear mode connectivity by aligning parameters into the same loss basin using permutation invariance. In this paper, we introduce Model Assembly Learning (MAL), a novel paradigm for model merging that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance the base model's capabilities. Unlike previous works that require identical architectures, MAL allows the merging of heterogeneous architectures and selective parameters across layers. Specifically, the base model can incorporate parameters from different layers of multiple pre-trained models. We systematically investigate the conditions and fundamental settings of heterogeneous parameter merging, addressing all possible mismatches in layer widths between the base and target models. Furthermore, we establish key laws and provide practical guidelines for effectively implementing MAL.
- [372] arXiv:2503.21659 [pdf, html, other]
-
Title: InteractionMap: Improving Online Vectorized HDMap Construction with InteractionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Vectorized high-definition (HD) maps are essential for autonomous driving systems. State-of-the-art map vectorization methods are mainly based on DETR-like frameworks that generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. First, we explore enhancing DETR-like detectors with an explicit position relation prior from point level to instance level, since map elements contain strong shape priors. Second, we propose a key-frame-based hierarchical temporal fusion module, which lets temporal information interact from local to global. Lastly, separate classification and regression branches lead to misalignment in the output distribution; to address this, we couple semantic information with geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmarks.
- [373] arXiv:2503.21661 [pdf, other]
-
Title: From conceptualization to operationalized meaning via ontological componentsSubjects: Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Ontologies enable knowledge sharing and interdisciplinary collaboration by providing standardized, structured vocabularies for diverse communities. While logical axioms are a cornerstone of ontology design, natural language elements such as annotations are equally critical for conveying intended meaning and ensuring consistent term usage. This paper explores how meaning is represented in ontologies and how it can be effectively represented and communicated, addressing challenges such as indeterminacy of reference and meaning holism. To this end, it proposes an approach founded on the use of a new structure, named 'ontological component' and defined by: a term-centered design; enhanced characterization of both formal and natural language statements; an operationalizable definition of communicated meaning based on general assertions; and the integration of natural language elements into the logical theory. By formalizing the meaning of ontological components, this work seeks to enhance the semantic robustness of terms, improving their clarity and accessibility across domains. Furthermore, it aims to address practical challenges in applied ontologies, such as facilitating reuse and managing versioning, thereby strengthening their role in diverse applications.
- [374] arXiv:2503.21666 [pdf, html, other]
-
Title: Economy and sustainability analysis with a novel modular configurable multi-modal white-box building modelSubjects: Systems and Control (eess.SY)
This paper presents a novel modeling approach for building performance simulation, characterized as a white-box model with a high degree of modularity and flexibility, enabling direct integration into complex large-scale energy system co-simulations. The model is described in detail, with a focus on its modular structure. Configurations covering different building insulation levels, heating methods, occupancy patterns, and weather data are proposed, and energy consumption, CO2 emissions, and heating costs are compared across 36 scenarios. The thermodynamic behavior of the model is shown to be consistent with real-world conditions. The comparison of the scenarios concludes that the use of heat pumps for indoor heating in well-insulated buildings has significant economic and sustainability benefits, whereas the use of natural gas-fueled boilers is more cost-effective for buildings with low energy ratings.
- [375] arXiv:2503.21667 [pdf, other]
-
Title: The Construction of Asymptotic Bode Plots: A New Direct MethodSubjects: Systems and Control (eess.SY)
Bode plots are an essential tool in control and systems engineering. For an initial qualitative analysis of a system, constructing asymptotic Bode plots is often sufficient. The standard methods for constructing asymptotic Bode plots share the same drawbacks: they are not systematic, can be imprecise, and are time-consuming. This is because they require a detailed analysis of the different factors composing the transfer function, so that more and more intermediate steps are needed as the number of factors increases. In this paper, a new method for the construction of asymptotic Bode plots is proposed, which is based on the systematic calculation of the so-called generalized approximating functions and on the use of well-defined properties. The proposed method is referred to as a direct method since it allows the asymptotic Bode magnitude and phase plots of the complete transfer function to be drawn directly, without requiring a detailed analysis or plot construction for each factor. This feature also makes the proposed direct method more systematic, potentially more precise, and less time-consuming than standard methods, especially when the transfer function has a large number of factors. The proposed direct method is compared with the standard approaches in order to examine the benefits it offers.
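For context, the standard factor-by-factor construction that the abstract contrasts against can be sketched as follows. This is not the proposed direct method, whose generalized approximating functions are not given in the abstract. The sketch assumes a transfer function of the form G(s) = gain * prod(1 + s/z) / prod(1 + s/p) with real, positive corner frequencies; the function name and parameters are hypothetical.

```python
import math

def asymptotic_magnitude_db(omega, gain, zeros=(), poles=()):
    """Straight-line (asymptotic) Bode magnitude in dB at frequency omega (rad/s).

    Assumes G(s) = gain * prod(1 + s/z) / prod(1 + s/p), where every corner
    frequency z (zeros) and p (poles) is real and positive.
    """
    def factor_db(corner):
        # A first-order factor contributes 0 dB below its corner frequency
        # and rises at 20 dB/decade above it.
        return max(0.0, 20.0 * math.log10(omega / corner))

    total = 20.0 * math.log10(abs(gain))          # constant-gain term
    total += sum(factor_db(z) for z in zeros)     # each zero bends the plot up
    total -= sum(factor_db(p) for p in poles)     # each pole bends the plot down
    return total
```

Each factor needs its own term here, which illustrates why the number of intermediate steps in the standard approach grows with the number of factors.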
- [376] arXiv:2503.21668 [pdf, html, other]
-
Title: Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AISubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
One of the core components of our world models is 'intuitive physics' - an understanding of objects, space, and causality. This capability enables us to predict events, plan actions and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities and thus do not fully solve the objecthood challenge. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.
- [377] arXiv:2503.21670 [pdf, html, other]
-
Title: COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-MixingSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: this https URL.
- [378] arXiv:2503.21671 [pdf, html, other]
-
Title: A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning ApplicationsComments: Accepted for publication at the IEEE International Symposium on Circuits and Systems (ISCAS '25), May 25-28, London, United KingdomSubjects: Hardware Architecture (cs.AR)
Printed electronics have gained significant traction in recent years, presenting a viable path to integrating computing into everyday items, from disposable products to low-cost healthcare. However, the adoption of computing in these domains is hindered by strict area and power constraints, limiting the effectiveness of general-purpose microprocessors. This paper proposes a bespoke microprocessor design approach to address these challenges by tailoring the design to specific applications and eliminating unnecessary logic. Targeting machine learning applications, we further optimize core operations by integrating a SIMD MAC unit supporting 4 precision configurations that boost the efficiency of microprocessors. Our evaluation across 6 ML models and the large-scale Zero-Riscy core shows that our methodology can achieve improvements of 22.2%, 23.6%, and 33.79% in area, power, and speed, respectively, without compromising accuracy. Against state-of-the-art printed processors, our approach can still offer significant speedups, albeit with some accuracy degradation. This work explores how such trade-offs can enable low-power printed microprocessors for diverse ML applications.
- [379] arXiv:2503.21674 [pdf, html, other]
-
Title: Intelligent IoT Attack Detection Design via ODLLM with Feature Ranking-based Knowledge BaseSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
The widespread adoption of Internet of Things (IoT) devices has introduced significant cybersecurity challenges, particularly with the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks. Traditional machine learning (ML) techniques often fall short in detecting such attacks due to the complexity of blended and evolving patterns. To address this, we propose a novel framework leveraging On-Device Large Language Models (ODLLMs) augmented with fine-tuning and knowledge base (KB) integration for intelligent IoT network attack detection. By implementing feature ranking techniques and constructing both long and short KBs tailored to model capacities, the proposed framework ensures efficient and accurate detection of DDoS attacks while overcoming computational and privacy limitations. Simulation results demonstrate that the optimized framework achieves superior accuracy across diverse attack types, especially when using compact models in edge computing environments. This work provides a scalable and secure solution for real-time IoT security, advancing the applicability of edge intelligence in cybersecurity.
- [380] arXiv:2503.21676 [pdf, html, other]
-
Title: How do language models learn facts? Dynamics, curricula and hallucinationsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.
- [381] arXiv:2503.21677 [pdf, html, other]
-
Title: A tale of two goals: leveraging sequentiality in multi-goal scenariosComments: 14 pages, 5 figuresSubjects: Machine Learning (cs.LG)
Several hierarchical reinforcement learning methods leverage planning to create a graph or sequences of intermediate goals, guiding a lower-level goal-conditioned (GC) policy to reach some final goals. The low-level policy is typically conditioned on the current goal, with the aim of reaching it as quickly as possible. However, this approach can fail when an intermediate goal can be reached in multiple ways, some of which may make it impossible to continue toward subsequent goals. To address this issue, we introduce two instances of Markov Decision Processes (MDPs) where the optimization objective favors policies that not only reach the current goal but also subsequent ones. In the first, the agent is conditioned on both the current and final goals, while in the second, it is conditioned on the next two goals in the sequence. We conduct a series of experiments on navigation and pole-balancing tasks in which sequences of intermediate goals are given. By evaluating policies trained with TD3+HER on both the standard GC-MDP and our proposed MDPs, we show that, in most cases, conditioning on the next two goals improves stability and sample efficiency over other approaches.
- [382] arXiv:2503.21679 [pdf, html, other]
-
Title: JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai CommunityComments: 20 pages, 1 figureSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.
- [383] arXiv:2503.21683 [pdf, other]
-
Title: LLM-Gomoku: A Large Language Model-Based System for Strategic Gomoku with Self-Play and Reinforcement LearningSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
In recent years, large language models (LLMs) have shown significant advancements in natural language processing (NLP), with strong capabilities in generation, comprehension, and reasoning. These models have found applications in education, intelligent decision-making, and gaming. However, effectively utilizing LLMs for strategic planning and decision-making in the game of Gomoku remains a challenge. This study aims to develop a Gomoku AI system based on LLMs, simulating the human learning process of playing chess. The system is designed to understand and apply Gomoku strategies and logic to make rational decisions. The research methods include enabling the model to "read the board," "understand the rules," "select strategies," and "evaluate positions," while enhancing its abilities through self-play and reinforcement learning. The results demonstrate that this approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces process time through parallel position evaluation. After extensive self-play training, the model's Gomoku-playing capabilities have been notably enhanced.
- [384] arXiv:2503.21690 [pdf, html, other]
-
Title: CMED: A Child Micro-Expression DatasetSubjects: Computer Vision and Pattern Recognition (cs.CV)
Micro-expressions are short bursts of emotion that are difficult to hide. Their detection in children is an important cue to assist psychotherapists in conducting better therapy. However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children. This gap is a direct consequence of the lack of a child-based micro-expression dataset, as it is much more challenging to capture children's facial expressions due to the lack of predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind to the best of the authors' knowledge. The dataset is captured in the wild using video conferencing software. This dataset enables us to explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches, comprising hand-created and learning-based approaches.
- [385] arXiv:2503.21691 [pdf, other]
-
Title: Place Capability Graphs: A General-Purpose Model of Rust's Ownership and Borrowing GuaranteesZachary Grannan, Aurel Bíly, Jonáš Fiala, Jasper Geer, Markus de Medeiros, Peter Müller, Alexander J. SummersSubjects: Programming Languages (cs.PL)
Rust's novel type system has proved an attractive target for verification and program analysis tools, due to the rich guarantees it provides for controlling aliasing and mutability. However, fully understanding, extracting and exploiting these guarantees is subtle and challenging: existing models for Rust's type checking either support a smaller idealised language disconnected from real-world Rust code, or come with severe limitations in terms of precise modelling of Rust borrows, composite types storing them, function signatures and loops.
In this paper, we present a novel model of Rust's type-checking called Place Capability Graphs, which lifts these limitations, and which can be directly calculated from the Rust compiler's own programmatic representations and analyses. We demonstrate that our model supports over 98% of Rust functions in the most popular public crates, and show its suitability as a general-purpose basis for verification and program analysis tools by developing promising new prototype versions of the existing Flowistry and Prusti tools.
- [386] arXiv:2503.21692 [pdf, html, other]
-
Title: RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a MillisecondSubjects: Computer Vision and Pattern Recognition (cs.CV)
The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.
- [387] arXiv:2503.21694 [pdf, html, other]
-
Title: Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D DataSubjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming to overcome the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each training iteration, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can also accelerate inference, with the generation model requiring just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only $2.5\%$ trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at this https URL.
- [388] arXiv:2503.21695 [pdf, html, other]
-
Title: AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei SegmentationComments: 13 pages, 4 tables, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications. However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance. Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts. In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations. First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset. Second, we address SAM's inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images. To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation. We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches.
- [389] arXiv:2503.21696 [pdf, html, other]
-
Title: Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive TasksWenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting ZhuangSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied-Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models; e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. In real-world environments, our model also shows superior performance, with fewer repeated searches and logical inconsistency cases.
- [390] arXiv:2503.21697 [pdf, html, other]
-
Title: The commutativity problem for effective varieties of formal series, and applicationsComments: under submissionSubjects: Formal Languages and Automata Theory (cs.FL); Discrete Mathematics (cs.DM); Logic in Computer Science (cs.LO)
A formal series in noncommuting variables $\Sigma$ over the rationals is a mapping $\Sigma^* \to \mathbb Q$. We say that a series is commutative if the value in the output does not depend on the order of the symbols in the input. The commutativity problem for a class of series takes as input (a finite presentation of) a series from the class and amounts to establishing whether it is commutative. This is a very natural, albeit nontrivial, problem, which has not been considered before from an algorithmic perspective.
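To make the definition concrete, the following sketch tests commutativity by brute force on all words up to a bounded length. It is only an illustration of the definition, not the decision procedure of the paper: a bounded check can refute commutativity but cannot prove it in general. The function names and the two example series are assumptions made for illustration.

```python
from itertools import product

def is_commutative_up_to(series, alphabet, max_len):
    """Check that series(w) == series(sorted(w)) for every word w with len(w) <= max_len.

    A series is commutative exactly when its value is the same on all
    reorderings of a word, i.e. when it agrees with its value on the sorted word.
    """
    for n in range(max_len + 1):
        for word in product(alphabet, repeat=n):
            if series(word) != series(tuple(sorted(word))):
                return False
    return True

# Hypothetical example series over the alphabet {a, b}; words are tuples of letters.
def count_a(word):
    # Depends only on how many a's occur, so it is commutative.
    return word.count("a")

def count_ab(word):
    # Counts occurrences of the factor "ab", which depends on letter order.
    return sum(1 for i in range(len(word) - 1) if word[i:i + 2] == ("a", "b"))
```

For instance, `count_ab` gives different values on the words ab and ba, so the bounded check already rejects it at length 2.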
We show that commutativity is decidable for all classes of series that constitute a so-called effective prevariety, a notion generalising Reutenauer's varieties of formal series. For example, the class of rational series, introduced by Schützenberger in the 1960's, is well-known to be an effective (pre)variety, and thus commutativity is decidable for it.
In order to showcase the applicability of our result, we consider classes of formal series generalising the rational ones. We consider polynomial automata, shuffle automata, and infiltration automata, and we show that each of these models recognises an effective prevariety of formal series. Consequently, their commutativity problem is decidable, which is a novel result. We find it remarkable that commutativity can be decided in a uniform way for such disparate computation models.
Finally, we present applications of commutativity outside the theory of formal series. We show that we can decide solvability in sequences and in power series for restricted classes of algebraic difference and differential equations, for which such problems are undecidable in full generality. Thanks to this, we can prove that the syntaxes of multivariate polynomial recursive sequences and of constructible differentially algebraic power series are effective; these are new results that were left open in previous work.
- [391] arXiv:2503.21699 [pdf, html, other]
-
Title: MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeXLiuyue Xie, George Z. Wei, Avik Kuthiala, Ce Zheng, Ananya Bal, Mosam Dabhi, Liting Wen, Taru Rustagi, Ethan Lai, Sushil Khyalia, Rohan Choudhury, Morteza Ziyadi, Xu Zhang, Hao Yang, László A. JeniSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
- [392] arXiv:2503.21704 [pdf, html, other]
-
Title: Learning to Represent Individual Differences for Choice Decision MakingYan-Ying Chen, Yue Weng, Alexandre Filipowicz, Rumen Iliev, Francine Chen, Shabnam Hakimi, Yanxia Zhang, Matthew Lee, Kent Lyons, Charlene WuComments: Published in IJCAI MRC 2022Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Human decision making can be challenging to predict because decisions are affected by a number of complex factors. Adding to this complexity, decision-making processes can differ considerably between individuals, and methods aimed at predicting human decisions need to take individual differences into account. Behavioral science offers methods by which to measure individual differences (e.g., questionnaires, behavioral models), but these are often narrowed down to low dimensions and not tailored to specific prediction tasks. This paper investigates the use of representation learning to measure individual differences from behavioral experiment data. Representation learning offers a flexible approach to create individual embeddings from data that are both structured (e.g., demographic information) and unstructured (e.g., free text), where the flexibility provides more options for individual difference measures for personalization, e.g., free text responses may allow for open-ended questions that are less privacy-sensitive. In the current paper we use representation learning to characterize individual differences in human performance on an economic decision-making task. We demonstrate that models using representation learning to capture individual differences consistently improve decision predictions over models without representation learning, and even outperform well-known theory-based behavioral models used in these environments. Our results suggest that representation learning offers a useful and flexible tool to capture individual differences.
- [393] arXiv:2503.21705 [pdf, html, other]
-
Title: SoK: Towards Reproducibility for Software Packages in Scripting Language EcosystemsComments: 22 pages, 1 figure, submitted to ARES 2025Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
The disconnect between distributed software artifacts and their supposed source code enables attackers to leverage the build process for inserting malicious functionality. Past research in this field focuses on compiled language ecosystems, mostly analysing Linux distribution packages. However, the popular scripting language ecosystems potentially face unique issues given the systematic difference in distributed artifacts. This SoK provides an overview of existing research, aiming to highlight future directions, as well as chances to transfer existing knowledge from compiled language ecosystems. To that end, we work out key aspects in current research, systematize identified challenges for software reproducibility, and map them between the ecosystems. We find that the literature is sparse, focusing on few individual problems and ecosystems. This allows us to effectively identify next steps to improve reproducibility in this field.
- [394] arXiv:2503.21708 [pdf, html, other]
-
Title: Elementwise Layer NormalizationComments: 11 pages, 3 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
A recent paper proposed Dynamic Tanh (DyT) as a drop-in replacement for Layer Normalization. Although the method is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we derive DyT mathematically and show that a well-defined approximation is needed to do so. By dropping said approximation, an alternative element-wise transformation is obtained, which we call Elementwise Layer Normalization (ELN). We demonstrate that ELN resembles Layer Normalization more accurately than DyT does.
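As a rough illustration of the two transformations being compared, the sketch below contrasts Layer Normalization, which uses statistics computed across the whole vector, with Dynamic Tanh in its commonly stated form, gamma * tanh(alpha * x) + beta, which acts on each element independently. The scalar `gamma` and `beta` and the omission of the learnable scale and shift in `layer_norm` are simplifying assumptions. ELN itself is not shown, since its formula is not given in the abstract.

```python
import math


def layer_norm(x, eps=1e-5):
    """Layer Normalization without learnable scale/shift (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dyt(x, alpha=1.0, gamma=1.0, beta=0.0):
    """Dynamic Tanh: gamma * tanh(alpha * v) + beta, applied element-wise.

    gamma and beta are scalars here for brevity (an assumption of this sketch).
    """
    return [gamma * math.tanh(alpha * v) + beta for v in x]


activations = [-3.0, -0.5, 0.2, 4.0]
print(layer_norm(activations))
print(dyt(activations, alpha=0.5))
```

Note that `layer_norm` needs the mean and variance of the full vector, whereas `dyt` can be evaluated one element at a time.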
- [395] arXiv:2503.21709 [pdf, other]
-
Title: Redefining Network Topology in Complex Systems: Merging Centrality Metrics, Spectral Theory, and Diffusion DynamicsSubjects: Other Computer Science (cs.OH)
This paper introduces a novel framework that combines traditional centrality measures with eigenvalue spectra and diffusion processes for a more comprehensive analysis of complex networks. While centrality measures such as degree, closeness, and betweenness have been commonly used to assess nodal importance, they provide limited insight into dynamic network behaviors. By incorporating eigenvalue analysis, which evaluates network robustness and connectivity through spectral properties, and diffusion processes that model information flow, this framework offers a deeper understanding of how networks function under dynamic conditions. Applied to synthetic networks, the approach identifies key nodes not only by centrality but also by their role in diffusion dynamics and vulnerability points, offering a multi-dimensional view that traditional methods alone cannot. This integrated analysis enables a more precise identification of critical nodes and potential weaknesses, with implications for improving network resilience in fields ranging from epidemiology to cybersecurity. Keywords: Centrality measures, eigenvalue spectra, diffusion processes, network analysis, network robustness, information flow, synthetic networks.
- [396] arXiv:2503.21710 [pdf, html, other]
-
Title: Enhancing Repository-Level Software Repair via Repository-Aware Knowledge GraphsSubjects: Software Engineering (cs.SE)
Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. Existing approaches, which mostly depend on large language models (LLMs), suffer from semantic ambiguities, limited structural context understanding, and insufficient reasoning capability. To address these limitations, we propose KGCompass with two innovations: (1) a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and codebase entities (files, classes, and functions), allowing us to effectively narrow down the vast search space to only the 20 most relevant functions with accurate candidate bug locations and contextual information, and (2) a path-guided repair mechanism that leverages KG-mined entity paths, which we trace to augment LLMs with relevant contextual information to generate precise patches along with their explanations. Experimental results on SWE-Bench-Lite demonstrate that KGCompass achieves state-of-the-art repair performance (45.67%) and function-level localization accuracy (51.33%) among open-source approaches, costing only $0.20 per repair. Our analysis reveals that among successfully localized bugs, 69.7% require multi-hop traversals through the knowledge graph, without which LLM-based approaches struggle to accurately locate bugs. The knowledge graph built in KGCompass is language agnostic and can be incrementally updated, making it a practical solution for real-world development environments.
- [397] arXiv:2503.21711 [pdf, html, other]
-
Title: Efficient Computation of the Directional Extremal Boundary of a Union of Equal-Radius CirclesComments: 6 pages, 2 figuresSubjects: Computational Geometry (cs.CG)
This paper focuses on computing the directional extremal boundary of a union of equal-radius circles. We introduce an efficient algorithm that accurately determines this boundary by analyzing the intersections and dominant relationships among the circles. The algorithm has time complexity of O(n log n).
- [398] arXiv:2503.21714 [pdf, html, other]
-
Title: As easy as PIE: understanding when pruning causes language models to disagreeComments: Accepted to NAACL 2025 (Findings)Subjects: Computation and Language (cs.CL)
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: this https URL
- [399] arXiv:2503.21717 [pdf, html, other]
-
Title: CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?Jiefu Ou, William Gantt Walden, Kate Sanders, Zhengping Jiang, Kaiser Sun, Jeffrey Cheng, William Jurayj, Miriam Wanner, Shaobo Liang, Candice Morgan, Seunghoon Han, Weiqi Wang, Chandler May, Hannah Recknor, Daniel Khashabi, Benjamin Van DurmeSubjects: Computation and Language (cs.CL)
A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
- [400] arXiv:2503.21718 [pdf, html, other]
-
Title: Outlier dimensions favor frequent tokens in language modelComments: 9 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
We study last-layer outlier dimensions, i.e., dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
- [401] arXiv:2503.21720 [pdf, html, other]
-
Title: Collab: Controlled Decoding using Mixture of Agents for LLM AlignmentSouradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra GaneshComments: Accepted to ICLR 2025Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance with respect to the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrate the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
- [402] arXiv:2503.21721 [pdf, html, other]
-
Title: Evaluating Text-to-Image Synthesis with a Conditional Fréchet DistanceSubjects: Computer Vision and Pattern Recognition (cs.CV)
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
- [403] arXiv:2503.21722 [pdf, html, other]
-
Title: Energy Minimization for Participatory Federated Learning in IoT Analyzed via Game TheoryComments: 6 pages, 6 figures, 2 tables, conferenceJournal-ref: 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 2024, pp. 249-254Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
The Internet of Things requires intelligent decision making in many scenarios. To this end, resources available at the individual nodes for sensing or computing, or both, can be leveraged. This results in approaches known as participatory sensing and federated learning, respectively. We investigate the simultaneous implementation of both, through a distributed approach based on empowering local nodes with game theoretic decision making. A global objective of energy minimization is combined with the individual node's optimization of local expenditure for sensing and transmitting data over multiple learning rounds. We present extensive evaluations of this technique, based on both a theoretical framework and experiments in a simulated network scenario with real data. Such a distributed approach can reach a desired level of accuracy for federated learning without a centralized supervision of the data collector. However, depending on the weight attributed to the local costs of the single node, it may also result in a significantly high Price of Anarchy (from 1.28 onwards). Thus, we argue for the need of incentive mechanisms, possibly based on Age of Information of the single nodes.
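For readers unfamiliar with the Price of Anarchy, the sketch below computes it for a tiny two-player cost game using the standard definition: the social cost of the worst Nash equilibrium divided by the optimal social cost. The cost table is a made-up example (action 0 = contribute, action 1 = free-ride), not data from the paper, and only pure-strategy equilibria are considered; the function names are hypothetical.

```python
from itertools import product


def pure_nash_equilibria(costs, n_actions):
    """costs[profile] is a tuple of per-player costs; returns stable profiles."""
    equilibria = []
    for profile in product(*(range(n) for n in n_actions)):
        stable = True
        for player, n in enumerate(n_actions):
            for alt in range(n):
                deviated = profile[:player] + (alt,) + profile[player + 1:]
                # A profile is unstable if any player can lower its own cost.
                if costs[deviated][player] < costs[profile][player]:
                    stable = False
        if stable:
            equilibria.append(profile)
    return equilibria


def price_of_anarchy(costs, n_actions):
    """Worst equilibrium social cost over optimal social cost.

    Assumes at least one pure Nash equilibrium exists.
    """
    social = {profile: sum(c) for profile, c in costs.items()}
    optimum = min(social.values())
    worst = max(social[p] for p in pure_nash_equilibria(costs, n_actions))
    return worst / optimum


# Hypothetical energy costs for two nodes.
example_costs = {
    (0, 0): (2, 2),
    (0, 1): (4, 1),
    (1, 0): (1, 4),
    (1, 1): (3, 3),
}
print(price_of_anarchy(example_costs, (2, 2)))  # 1.5
```

In this toy game the only pure equilibrium is mutual free-riding, with social cost 6 against an optimum of 4, giving a ratio of 1.5.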
- [404] arXiv:2503.21723 [pdf, html, other]
-
Title: OccRobNet : Occlusion Robust Network for Accurate 3D Interacting Hand-Object Pose EstimationComments: Accepted in NATIONAL CONFERENCE ON COMMUNICATIONS (NCC) 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Occlusion is one of the challenging issues when estimating 3D hand pose. This problem becomes more prominent when a hand interacts with an object or when two hands are involved. Past works have paid little attention to these occluded regions, yet these regions contain important information that is vital for 3D hand pose estimation. Thus, in this paper, we propose an occlusion-robust and accurate method for estimating 3D hand-object pose from an input RGB image. Our method first localises the hand joints using a CNN-based model and then refines them by extracting contextual information. The self-attention transformer then identifies the specific joints along with the hand identity. This helps the model determine which hand a particular joint belongs to, which in turn helps detect the joint even in the occluded region. These joints with hand identity are then used to estimate the pose using a cross-attention mechanism. Thus, by identifying the joints in the occluded region, the resulting network becomes robust to occlusion. This network achieves state-of-the-art results when evaluated on the InterHand2.6M, HO3D and H$_2$O3D datasets.
- [405] arXiv:2503.21727 [pdf, html, other]
-
Title: Enhancing Underwater Navigation through Cross-Correlation-Aware Deep INS/DVL FusionSubjects: Robotics (cs.RO)
The accurate navigation of autonomous underwater vehicles critically depends on the precision of Doppler velocity log (DVL) velocity measurements. Recent advancements in deep learning have demonstrated significant potential in improving DVL outputs by leveraging spatiotemporal dependencies across multiple sensor modalities. However, integrating these estimates into model-based filters, such as the extended Kalman filter, introduces statistical inconsistencies, most notably, cross-correlations between process and measurement noise. This paper addresses this challenge by proposing a cross-correlation-aware deep INS/DVL fusion framework. Building upon BeamsNet, a convolutional neural network designed to estimate AUV velocity using DVL and inertial data, we integrate its output into a navigation filter that explicitly accounts for the cross-correlation induced between the noise sources. This approach improves filter consistency and better reflects the underlying sensor error structure. Evaluated on two real-world underwater trajectories, the proposed method outperforms both least squares and cross-correlation-neglecting approaches in terms of state uncertainty. Notably, improvements exceed 10% in velocity and misalignment angle confidence metrics. Beyond demonstrating empirical performance, this framework provides a theoretically principled mechanism for embedding deep learning outputs within stochastic filters.
- [406] arXiv:2503.21729 [pdf, html, other]
-
Title: ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented GenerationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).
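The iterate-until-Finish process described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `think` and `retrieve` are hypothetical callables standing in for the reasoning model and the RAG engine, the action names are the strings "search" and "finish", and `max_steps` plays the role of the upper bound on chain length. It is not the authors' implementation.

```python
def reason_with_retrieval(question, think, retrieve, max_steps=5):
    """Iterate Thought -> Action -> Observation until Finish or the step cap."""
    history = []
    for _ in range(max_steps):
        thought, action, argument = think(question, history)
        if action == "finish":
            return argument, history
        # Search action: run the query and record the observation.
        observation = retrieve(argument)
        history.append((thought, argument, observation))
    return None, history  # cap reached without a Finish action


def toy_retrieve(query):
    # Stand-in for a RAG engine backed by a one-entry knowledge base.
    knowledge = {"capital of France": "Paris"}
    return knowledge.get(query, "no result")


def toy_think(question, history):
    # Stand-in for the reasoning model: search once, then finish.
    if not history:
        return "I need to look this up.", "search", "capital of France"
    return "The observation answers it.", "finish", history[-1][2]


answer, trace = reason_with_retrieval(
    "What is the capital of France?", toy_think, toy_retrieve
)
print(answer, len(trace))
```

The step cap guarantees termination even if the model never chooses Finish, at the price of returning no answer in that case.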
- [407] arXiv:2503.21730 [pdf, html, other]
-
Title: Effective Skill Unlearning through Intervention and AbstentionComments: Accepted to NAACL 2025 main conferenceSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at this https URL
- [408] arXiv:2503.21731 [pdf, html, other]
-
Title: Cylindrical Algebraic Decomposition in \textit{Macaulay2}Comments: 16 pages, 9 figuresSubjects: Symbolic Computation (cs.SC); Algebraic Geometry (math.AG)
\texttt{CylindricalAlgebraicDecomposition.m2} is the first implementation of Cylindrical Algebraic Decomposition (CAD) in \textit{Macaulay2}. CAD decomposes space into `cells' where input polynomials are sign-invariant. This package computes an Open CAD (full-dimensional cells only) for sets of real polynomials with rational coefficients, enabling users to solve existential problems involving strict inequalities. With the construction of a full CAD (cells of all dimensions), this tool could be extended to solve any real quantifier elimination problem. The current implementation employs the Lazard projection and introduces a new heuristic for choosing the variable ordering.
- [409] arXiv:2503.21732 [pdf, html, other]
-
Title: SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape ModelingXianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, Yangguang LiComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
- [410] arXiv:2503.21733 [pdf, other]
-
Title: Fully dynamic biconnectivity in $\tilde{\mathcal{O}}(\log^2 n)$ timeSubjects: Data Structures and Algorithms (cs.DS)
We present a deterministic fully-dynamic data structure for maintaining information about the cut-vertices in a graph; i.e. the vertices whose removal would disconnect the graph. Our data structure supports insertion and deletion of edges, as well as queries as to whether a pair of connected vertices are either biconnected, or can be separated by a cut-vertex, and in the latter case we support access to separating cut-vertices. All update operations are supported in amortized $O(\log^2 n \log^2 \log n)$ time, and queries take worst-case $O(\log n \log^2 \log n)$ time. Note that these time bounds match the current best for deterministic dynamic connectivity up to $\log \log n$ factors.
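To illustrate what a cut-vertex is, the sketch below finds all cut-vertices of a static undirected graph with the classical depth-first-search low-link method. This is a linear-time recomputation from scratch, shown only for intuition; it is not the dynamic data structure of the paper. The function name `cut_vertices` and the adjacency-dict input format are assumptions, and the graph is assumed to be simple (no parallel edges).

```python
def cut_vertices(graph):
    """Return the set of cut-vertices of an undirected graph (adjacency dict)."""
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # The subtree under v cannot bypass u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # A DFS root is a cut-vertex only if it has several DFS children.
        if parent is None and children > 1:
            result.add(u)

    for start in graph:
        if start not in disc:
            dfs(start, None)
    return result


# A triangle 0-1-2 with a pendant vertex 3 attached to 2.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(cut_vertices(example))  # {2}
```

Recomputing this after every edge update costs time linear in the graph size, which is the kind of cost a fully dynamic structure with polylogarithmic update time avoids.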
We obtain our improved running time by a series of reductions from the original problem into well-defined data structure problems. While we do apply the well-known techniques for improving running time of two-edge connectivity [STOC'00, SODA'18], these techniques alone do not lead to an update time of $\tilde{O}(\log^3 n)$, let alone the $\tilde{O}(\log^2 n)$ we give as a final result.
Our contributions include a formally defined transient expose operation, which can be thought of as a cheaper read-only expose operation on a top tree. For each vertex in the graph, we maintain a data structure over its neighbors, and in this data structure we apply biasing (twice) to save two $\tilde{O}(\log n)$ factors. One of these biasing techniques is a new biased disjoint sets data structure, which may be of independent interest. Moreover, in this neighborhood data structure, we facilitate that the vertex can select two VIP neighbors that get special treatment, corresponding to its potentially two neighbors on an exposed path, improving a $\log n$-time operation down to constant time. It is this combination of VIP neighbors with the transient expose that saves an $\tilde{O}(\log n)$-factor from another bottleneck.
- [411] arXiv:2503.21735 [pdf, html, other]
-
Title: GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release AnalyticsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems. Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process. However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs. Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; limitations that hinder their direct application in safety-critical scenarios. This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted. Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability. As demonstrated by presented results, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles. Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation. Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.
- [412] arXiv:2503.21745 [pdf, html, other]
-
Title: 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated battle-style platform. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimensional human preference dataset, 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.
- [413] arXiv:2503.21747 [pdf, html, other]
-
Title: CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal
Comments: Accepted at CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
- [414] arXiv:2503.21749 [pdf, html, other]
-
Title: LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
- [415] arXiv:2503.21751 [pdf, html, other]
-
Title: Reconstructing Humans with a Biomechanically Accurate Skeleton
Comments: CVPR 2025. Project Webpage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to produce pseudo ground truth model parameters for single images and implement a training procedure that iteratively refines these pseudo labels. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks, while significantly outperforming them in settings with extreme 3D poses and viewpoints. Additionally, we show that previous reconstruction methods frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom, yielding more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We make the code, models and data available at: this https URL
- [416] arXiv:2503.21755 [pdf, html, other]
-
Title: VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
Comments: Equal contributions from first two authors. Project page: this https URL Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
- [417] arXiv:2503.21756 [pdf, html, other]
-
Title: A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One
Subjects: Machine Learning (cs.LG)
The bridge problem is to find an SDE (or sometimes an ODE) that bridges two given distributions. The application areas of the bridge problem are numerous, among which recent generative modeling (e.g., conditional or unconditional image generation) is the most popular. The famous Schrödinger bridge problem, widely known for a century, is also a special instance of the bridge problem. The two most popular algorithms for tackling bridge problems in the deep learning era are (conditional) flow matching and iterative fitting algorithms, where the former is confined to ODE solutions and the latter applies specifically to the Schrödinger bridge problem. The main contribution of this article is twofold: i) we provide concise reviews of these algorithms with some technical detail; ii) we propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as its special cases. We believe that this unified framework will be useful for viewing bridge problems from a more general and flexible perspective, and in turn can help researchers and practitioners develop new bridge algorithms in their fields.
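For readers unfamiliar with the ODE side of the bridge problem, the following is a minimal sketch of one common instantiation of conditional flow matching, assuming straight-line interpolation paths between paired samples. It is not the unified framework proposed in the paper; the function names and the toy velocity field are illustrative assumptions.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight-line path from x0 to x1 at time t, plus its velocity target."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # d/dt of the linear path
    return x_t, target


def cfm_loss(velocity_fn, x0, x1, t):
    """Squared error between the model velocity at (x_t, t) and the path velocity."""
    x_t, target = cfm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target))


def euler_integrate(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with a single (source, target) pair and the exact velocity field.
source, target_point = [0.0, 1.0], [2.0, -1.0]
oracle = lambda x, t: [2.0, -2.0]  # equals target_point - source everywhere
loss = cfm_loss(oracle, source, target_point, t=0.3)
endpoint = euler_integrate(oracle, source)
```

In practice the velocity function would be a neural network trained by averaging this loss over sampled pairs and times; the toy oracle only shows that a zero-loss field transports the source point to the target point.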
- [418] arXiv:2503.21757 [pdf, html, other]
-
Title: Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, which are used as a direct replacement for the image tokens. The training signal is provided by two losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot results in highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state-of-the-art on image retrieval and compositionality.
- [419] arXiv:2503.21758 [pdf, html, other]
-
Title: Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao
Comments: Tech Report, 21 pages, 12 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performances even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at this https URL.
- [420] arXiv:2503.21760 [pdf, html, other]
-
Title: MemInsight: Autonomous Memory Augmentation for LLM Agents
Subjects: Computation and Language (cs.CL)
Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By applying autonomous augmentation to historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.
- [421] arXiv:2503.21761 [pdf, html, other]
-
Title: Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
Comments: CVPR 2025. Project page (with code): this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
- [422] arXiv:2503.21765 [pdf, html, other]
-
Title: Exploring the Evolution of Physics Cognition in Video Generation: A Survey
Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang
Comments: A comprehensive list of papers studied in this survey is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Video generation has made significant progress recently, especially with the rapid advancement of diffusion models. Despite this, deficiencies in physical cognition have gradually received widespread attention: generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers have increasingly recognized the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".
- [423] arXiv:2503.21766 [pdf, html, other]
-
Title: Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence
Comments: Accepted by CVPR 2025. Homepage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.
- [424] arXiv:2503.21767 [pdf, html, other]
-
Title: Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat's framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantically consistent ground-truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.
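The two-step querying idea can be pictured with a small sketch. This is only one plausible reading of the abstract, assuming cosine similarity in a shared embedding space and a fixed similarity threshold; the function names, the threshold, and the toy features are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_step_query(text_emb, gt_feats, gaussian_feats, threshold=0.9):
    """Step 1: pick the ground-truth feature closest to the text query.
    Step 2: select the Gaussians whose features are close to that feature."""
    best = max(range(len(gt_feats)), key=lambda i: cosine(text_emb, gt_feats[i]))
    anchor = gt_feats[best]
    selected = [
        i for i, feat in enumerate(gaussian_feats)
        if cosine(anchor, feat) >= threshold  # threshold is an assumed hyperparameter
    ]
    return best, selected


# Toy 2-D features standing in for language embeddings.
gt_feats = [[1.0, 0.0], [0.0, 1.0]]
gaussian_feats = [[1.0, 0.05], [0.1, 1.0], [0.8, 0.1], [0.5, 0.5]]
best, selected = two_step_query([0.9, 0.1], gt_feats, gaussian_feats)
```

Querying through an intermediate ground-truth feature, rather than comparing the text embedding to every Gaussian directly, means the final selection is made between features that were distilled from the same source.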
- [425] arXiv:2503.21770 [pdf, html, other]
-
Title: Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Comments: project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.
- [426] arXiv:2503.21771 [pdf, html, other]
-
Title: A Unified Image-Dense Annotation Generation Model for Underwater Scenes
Comments: Accepted by CVPR 2025. The code is available at https://github.com/HongkLin/TIDE
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE.
- [427] arXiv:2503.21772 [pdf, html, other]
-
Title: LOCORE: Image Re-ranking with Long-Context Sequence Modeling
Zilin Xiao, Pavel Suma, Ayush Sachdeva, Hao-Jen Wang, Giorgos Kordopatis-Zilos, Giorgos Tolias, Vicente Ordonez
Comments: CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
- [428] arXiv:2503.21774 [pdf, html, other]
-
Title: Optimal Stepsize for Diffusion Sampling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at this https URL.
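To make the dynamic-programming view concrete, here is a generic sketch of choosing a coarse schedule on a fine time grid. It assumes the total error is a sum of per-jump costs, which is a simplification introduced here; `step_cost` is a hypothetical stand-in for whatever error the method measures against reference trajectories, and the code is not the paper's algorithm.

```python
def optimal_schedule(num_points, num_steps, step_cost):
    """Pick num_steps jumps on grid indices 0..num_points-1, from the first index
    to the last, that minimize the sum of step_cost(i, j) over the chosen jumps."""
    last = num_points - 1
    if not 1 <= num_steps <= last:
        raise ValueError("need 1 <= num_steps <= num_points - 1")
    inf = float("inf")
    # best[k][j]: minimal cost of reaching grid index j using exactly k jumps.
    best = [[inf] * num_points for _ in range(num_steps + 1)]
    prev = [[None] * num_points for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, num_points):
            for i in range(j):
                if best[k - 1][i] == inf:
                    continue
                cand = best[k - 1][i] + step_cost(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    prev[k][j] = i
    # Walk the stored predecessors back from the last index to recover the schedule.
    schedule, j = [last], last
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        schedule.append(j)
    schedule.reverse()
    return best[num_steps][last], schedule


# Toy cost: the error of a jump grows with the square of its length.
cost, schedule = optimal_schedule(7, 3, lambda i, j: (j - i) ** 2)
```

With this toy cost the optimum is the evenly spaced schedule; a cost measured from real trajectories would generally produce uneven steps, which is the case a learned schedule is meant to exploit.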
- [429] arXiv:2503.21775 [pdf, html, other]
-
Title: StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: this https URL
- [430] arXiv:2503.21776 [pdf, html, other]
-
Title: Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, Xiangyu Yue
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.
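As background on the GRPO algorithm mentioned above, the sketch below shows the group-relative advantage it is commonly described as using: rewards for several sampled responses to the same prompt are normalized by the group's mean and standard deviation. The use of the population standard deviation and the `eps` constant are assumptions made here, and the temporal component of T-GRPO is not described in the abstract and is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question, scored by a rule-based reward (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean rather than a learned value function, a group in which every response earns the same reward yields zero advantage for all of them.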
- [431] arXiv:2503.21777 [pdf, html, other]
-
Title: Test-Time Visual In-Context Tuning
Comments: CVPR 2025. Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: this https URL.
- [432] arXiv:2503.21778 [pdf, html, other]
-
Title: HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM
Comments: ICRA 2025. Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes with significant movement or forgetting. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
- [433] arXiv:2503.21779 [pdf, html, other]
-
Title: X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic ReconstructionComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and a 2.25 dB improvement over prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: this https URL.
- [434] arXiv:2503.21780 [pdf, html, other]
-
Title: Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic SegmentationReza Qorbani, Gianluca Villani, Theodoros Panagiotakopoulos, Marc Botet Colomer, Linus Härenstam-Nielsen, Mattia Segu, Pier Luigi Dovesi, Jussi Karlgren, Daniel Cremers, Federico Tombari, Matteo PoggiSubjects: Computer Vision and Pattern Recognition (cs.CV)
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
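The retrieve-and-merge step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function name `retrieve_and_fuse`, the use of cosine similarity, the softmax weighting with a `temperature` parameter, and the representation of each adapter as a pair of low-rank factors `(A, B)` whose update is `B @ A`.

```python
import numpy as np

def retrieve_and_fuse(query_emb, library_embs, adapters, k=2, temperature=0.1):
    """Pick the k adapters nearest to the query embedding and merge them.

    query_emb:    (d,) embedding of the test input (hypothetical CLIP-style vector)
    library_embs: (n, d) one embedding per adapter in the library
    adapters:     list of (A, B) low-rank factors; the weight update is B @ A
    Returns the fused weight update and the per-adapter contribution weights.
    """
    q = np.asarray(query_emb, dtype=float)
    E = np.asarray(library_embs, dtype=float)
    q = q / np.linalg.norm(q)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)

    sims = E @ q                          # cosine similarity to every adapter
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar adapters

    # Softmax over the retained similarities (assumed weighting rule).
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted sum of the low-rank updates; no training is involved.
    fused = sum(w * (adapters[i][1] @ adapters[i][0]) for w, i in zip(weights, top))
    contributions = dict(zip(top.tolist(), weights.tolist()))
    return fused, contributions
```

Returning the contribution weights alongside the fused update mirrors the abstract's point that tracking adapter contributions aids explainability.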
- [435] arXiv:2503.21781 [pdf, html, other]
-
Title: VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion ModelsComments: CVPR 2025. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
- [436] arXiv:2503.21782 [pdf, html, other]
-
Title: Mobile-VideoGPT: Fast and Accurate Video Understanding Language ModelComments: Technical Report. Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: this https URL.
New submissions (showing 436 of 436 entries)
- [437] arXiv:2503.09299 (cross-list from math.ST) [pdf, html, other]
-
Title: Low-Rank Graphon Estimation: Theory and Applications to Graphon GamesSubjects: Statistics Theory (math.ST); Computer Science and Game Theory (cs.GT)
This paper tackles the challenge of estimating a low-rank graphon from sampled network data, employing a singular value thresholding (SVT) estimator to create a piecewise-constant graphon based on the network's adjacency matrix. Under certain assumptions about the graphon's structural properties, we establish bounds on the operator norm distance between the true graphon and its estimator, as well as on the rank of the estimated graphon. In the second part of the paper, we apply our estimator to graphon games. We derive bounds on the suboptimality of interventions in the social welfare problem in graphon games when the intervention is based on the estimated graphon. These bounds are expressed in terms of the operator norm of the difference between the true and estimated graphons. We also emphasize the computational benefits of using the low-rank estimated graphon to solve these problems.
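As a rough illustration of the estimator named in this abstract, the sketch below applies singular value thresholding to an adjacency matrix and reads the result as a piecewise-constant graphon on the unit square. The function names, the choice of `threshold`, and the equal-width block layout are assumptions; the paper's exact scaling and threshold are not given in the abstract.

```python
import numpy as np

def svt_estimate(adjacency, threshold):
    """Keep only the singular values above `threshold` (hypothetical cutoff).

    Returns the low-rank matrix estimate and the number of retained
    singular values, which is the rank of the estimate.
    """
    A = np.asarray(adjacency, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > threshold
    # Rebuild the matrix from the retained singular triplets only.
    P_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return P_hat, int(keep.sum())

def graphon_value(P_hat, x, y):
    """Evaluate the piecewise-constant graphon at (x, y) in [0, 1]^2.

    The unit square is split into n-by-n equal blocks, one per matrix entry.
    """
    n = P_hat.shape[0]
    i = min(int(x * n), n - 1)
    j = min(int(y * n), n - 1)
    return P_hat[i, j]
```

The rank of the estimate is controlled directly by the threshold, which is what makes rank bounds of the kind mentioned in the abstract natural to state.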
- [438] arXiv:2503.20787 (cross-list from q-fin.TR) [pdf, html, other]
-
Title: Advanced Digital Simulation for Financial Market Dynamics: A Case of Commodity FuturesSubjects: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
After decades of evolution, the financial system has increasingly deviated from an idealized framework based on theorems. It necessitates accurate projections of complex market dynamics and human behavioral patterns. With the development of data science and machine intelligence, researchers are trying to digitalize and automate market prediction. However, existing methodologies struggle to represent the diversity of individuals and disregard the domino effects of interactions on market dynamics, leading to poor performance under abnormal market conditions where non-quantitative information dominates the market. Alleviating these disadvantages requires introducing knowledge about how non-quantitative information, like news and policy, affects market dynamics. This study investigates overcoming these challenges by rehearsing potential market trends with financial large language model agents whose behaviors are aligned with their cognition and analyses in markets. We propose a hierarchical knowledge architecture for financial large language model agents, integrating fine-tuned language models and specialized generators optimized for trading scenarios. For financial markets, we develop an advanced interactive behavioral simulation system that enables users to configure agents and automate market simulations. In this work, we take commodity futures as an example to study the effectiveness of our methodologies. Our real-world case simulation succeeds in rehearsing abnormal market dynamics under geopolitical events and reaches an average accuracy of 3.4% across various points in time after the event in predicting futures prices. Experimental results demonstrate that our method effectively leverages diverse information to simulate behaviors and their impact on market dynamics through systematic interaction.
- [439] arXiv:2503.20789 (cross-list from eess.SP) [pdf, other]
-
Title: Neuro-Informed Adaptive Learning (NIAL) Algorithm: A Hybrid Deep Learning Approach for ECG Signal ClassificationComments: 1 figure, 2 pagesSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
The detection of cardiac abnormalities using electrocardiogram (ECG) signals is crucial for early diagnosis and intervention in cardiovascular diseases. Traditional deep learning models often lack adaptability to varying signal patterns. This study introduces the Neuro-Informed Adaptive Learning (NIAL) algorithm, a hybrid approach integrating convolutional neural networks (CNNs) and transformer-based attention mechanisms to enhance ECG signal classification. The algorithm dynamically adjusts learning rates based on real-time validation performance, ensuring efficient convergence. Using the MIT-BIH Arrhythmia and PTB Diagnostic ECG datasets, our model achieves high classification accuracy, outperforming conventional approaches. These findings highlight the potential of NIAL in real-time cardiovascular monitoring applications.
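The abstract does not say how the learning rate is adjusted. One common way to tie it to validation performance is a reduce-on-plateau rule, sketched below as a stand-in. The class name `PlateauLR` and the parameters `factor`, `patience`, and `min_lr` are hypothetical and are not taken from the NIAL paper.

```python
class PlateauLR:
    """Lower the learning rate when validation loss stops improving.

    This is a generic reduce-on-plateau rule, shown only as an example of
    validation-driven learning-rate adjustment.
    """

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplicative decay applied on a plateau
        self.patience = patience    # non-improving epochs tolerated before decay
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

A training loop would call `step` once per epoch with the latest validation loss and pass the returned value to the optimizer.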
- [440] arXiv:2503.20807 (cross-list from stat.ML) [pdf, html, other]
-
Title: Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language ModelsComments: The first two authors contribute equally to this work and are listed in alphabetical orderSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.
- [441] arXiv:2503.20822 (cross-list from eess.IV) [pdf, html, other]
-
Title: Synthetic Video Enhances Physical Fidelity in Video SynthesisSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: this https URL
- [442] arXiv:2503.20824 (cross-list from eess.IV) [pdf, html, other]
-
Title: Exploiting Temporal State Space Sharing for Video Semantic SegmentationSyed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, Xudong JiangComments: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating a shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S represents a significant advance in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at this https URL.
- [443] arXiv:2503.20841 (cross-list from q-bio.QM) [pdf, other]
-
Title: In vitro 2 In vivo : Bidirectional and High-Precision Generation of In Vitro and In Vivo Neuronal Spike DataComments: 17 pages, 5 figuresSubjects: Quantitative Methods (q-bio.QM); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC)
Neurons encode information in a binary manner and process complex signals. However, predicting or generating diverse neural activity patterns remains challenging. In vitro and in vivo studies provide distinct advantages, yet no robust computational framework seamlessly integrates both data types. We address this by applying the Transformer model, widely used in large-scale language models, to neural data. To handle binary data, we introduced Dice loss, enabling accurate cross-domain neural activity generation. Structural analysis revealed how Dice loss enhances learning and identified key brain regions facilitating high-precision data generation. Our findings support the 3Rs principle in animal research, particularly Replacement, and establish a mathematical framework bridging animal experiments and human clinical studies. This work advances data-driven neuroscience and neural activity modeling, paving the way for more ethical and effective experimental methodologies.
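For readers unfamiliar with Dice loss, the sketch below shows the standard soft Dice formulation applied to binary spike data. It is an assumption that the paper uses this exact form; the smoothing term `eps` and the function name are illustrative choices.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|).

    pred:   predicted spike probabilities in [0, 1]
    target: binary ground-truth spikes (0 or 1)
    eps:    small smoothing term that avoids division by zero
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```

Because the loss depends on the overlap between predicted and true spikes rather than on per-bin accuracy, it is less dominated by the many silent bins typical of sparse binary spike trains, which is one plausible reason for choosing it here.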
- [444] arXiv:2503.20879 (cross-list from quant-ph) [pdf, other]
-
Title: Quantum advantage for learning shallow neural networks with natural data distributionsComments: 8 pages, 1 figure + 80-page appendixSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
The application of quantum computers to machine learning tasks is an exciting potential direction to explore in search of quantum advantage. In the absence of large quantum computers to empirically evaluate performance, theoretical frameworks such as the quantum probably approximately correct (PAC) and quantum statistical query (QSQ) models have been proposed to study quantum algorithms for learning classical functions. Despite numerous works investigating quantum advantage in these models, we nevertheless only understand it at two extremes: either exponential quantum advantages for uniform input distributions or no advantage for potentially adversarial distributions. In this work, we study the gap between these two regimes by designing an efficient quantum algorithm for learning periodic neurons in the QSQ model over a broad range of non-uniform distributions, which includes Gaussian, generalized Gaussian, and logistic distributions. To our knowledge, our work is also the first result in quantum learning theory for classical functions that explicitly considers real-valued functions. Recent advances in classical learning theory prove that learning periodic neurons is hard for any classical gradient-based algorithm, giving us an exponential quantum advantage over such algorithms, which are the standard workhorses of machine learning. Moreover, in some parameter regimes, the problem remains hard for classical statistical query algorithms and even general classical algorithms learning under small amounts of noise.
- [445] arXiv:2503.21002 (cross-list from quant-ph) [pdf, html, other]
-
Title: Covert Entanglement Generation and SecrecySubjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
We determine the covert capacity for entanglement generation over a noisy quantum channel. While secrecy guarantees that the transmitted information remains inaccessible to an adversary, covert communication ensures that the transmission itself remains undetectable. The entanglement dimension follows a square root law (SRL) in the covert setting, i.e., $O(\sqrt{n})$ EPR pairs can be distributed covertly and reliably over $n$ channel uses. We begin with covert communication of classical information under a secrecy constraint. We then leverage this result to construct a coding scheme for covert entanglement generation. Consequently, we establish achievability of the same covert entanglement generation rate as the classical information rate without secrecy, albeit with a larger key.
- [446] arXiv:2503.21054 (cross-list from eess.IV) [pdf, html, other]
-
Title: Operating Room Workflow Analysis via Reasoning Segmentation over Digital TwinsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to the existing state of the art.
- [447] arXiv:2503.21128 (cross-list from stat.ML) [pdf, html, other]
-
Title: Squared families: Searching beyond regular probability modelsComments: 43 pages. PreprintSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We introduce squared families, which are families of probability densities obtained by squaring a linear transformation of a statistic. Squared families are singular; however, their singularity can easily be handled so that they form regular models. After handling the singularity, squared families possess many convenient properties. Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator. The Bregman generator is the normalising constant, and yields a statistical divergence on the family. The normalising constant admits a helpful parameter-integral factorisation, meaning that only one parameter-independent integral needs to be computed for all normalising constants in the family, unlike in exponential families. Finally, the squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant. We then describe how squared families are special in the broader class of $g$-families, which are obtained by applying a sufficiently regular function $g$ to a linear transformation of a statistic. After removing special singularities, positively homogeneous families and exponential families are the only $g$-families for which the Fisher information is a conformal transformation of the Hessian metric, where the generator depends on the parameter only through the normalising constant. Even-order monomial families also admit parameter-integral factorisations, unlike exponential families. We study parameter estimation and density estimation in squared families, in the well-specified and misspecified settings. We use a universal approximation property to show that squared families can learn sufficiently well-behaved target densities at a rate of $\mathcal{O}(N^{-1/2})+C n^{-1/4}$, where $N$ is the number of datapoints, $n$ is the number of parameters, and $C$ is some constant.
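The parameter-integral factorisation mentioned in this abstract can be made concrete with a short derivation. The notation below (statistic $t$, base measure $\mu$, parameter $\theta$, kernel matrix $K$) is assumed for illustration and may differ from the paper's.

```latex
% Density obtained by squaring a linear transformation of a statistic t(x):
p_\theta(x) = \frac{\bigl(\theta^\top t(x)\bigr)^2}{z(\theta)},
\qquad
z(\theta) = \int \bigl(\theta^\top t(x)\bigr)^2 \, d\mu(x).

% Expanding the square separates the parameter from the integral:
z(\theta)
  = \int \theta^\top t(x)\, t(x)^\top \theta \, d\mu(x)
  = \theta^\top K \theta,
\qquad
K = \int t(x)\, t(x)^\top \, d\mu(x).
```

Under these assumptions $K$ does not depend on $\theta$, so it is computed once and every normalising constant in the family is then a quadratic form in $\theta$. This is the sense in which only one parameter-independent integral is needed.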
- [448] arXiv:2503.21134 (cross-list from quant-ph) [pdf, html, other]
-
Title: On the Utility of Quantum Entanglement for Joint Communication and Instantaneous DetectionComments: Submitted to the IEEE for possible publicationSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
Entanglement is known to significantly improve the performance (separately) of communication and detection schemes that utilize quantum resources. This work explores the simultaneous utility of quantum entanglement for (joint) communication and detection schemes, over channels that are convex combinations of identity, depolarization and erasure operators, both with perfect and imperfect entanglement assistance. The channel state is binary, rapidly time-varying and unknown to the transmitter. While the communication is delay-tolerant, allowing the use of arbitrarily long codewords to ensure reliable decoding, the channel state detection is required to be instantaneous. The detector is neither co-located with the transmitter, nor able to wait for the decoding in order to learn the transmitted waveform. The results of this work appear in the form of communication-rate vs instantaneous-detection-error tradeoffs, with and without quantum entanglement. Despite the challenges that place the two tasks at odds with each other, the results indicate that quantum entanglement can indeed be simultaneously and significantly beneficial for joint communication and instantaneous detection.
- [449] arXiv:2503.21176 (cross-list from physics.comp-ph) [pdf, html, other]
-
Title: GPU-Accelerated Charge-Equilibration for Shadow Molecular Dynamics in PythonSubjects: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE)
With recent advancements in machine learning for interatomic potentials, Python has become the go-to programming language for exploring new ideas. While machine-learning potentials are often developed in Python-based frameworks, existing molecular dynamics software is predominantly written in lower-level languages. This disparity complicates the integration of machine learning potentials into these molecular dynamics libraries. Additionally, machine learning potentials typically focus on local features, often neglecting long-range electrostatics due to computational complexities. This is a key limitation as applications can require long-range electrostatics and even flexible charges to achieve the desired accuracy. Recent charge equilibration models can address these issues, but they require iterative solvers to assign relaxed flexible charges to the atoms. Conventional implementations also demand very tight convergence to achieve long-term stability, further increasing computational cost. In this work, we present a scalable Python implementation of a recently proposed shadow molecular dynamics scheme based on a charge equilibration model, which avoids the convergence problem while maintaining long-term energy stability and accuracy of observable properties. To deliver a functional and user-friendly Python-based library, we implemented an efficient neighbor list algorithm, Particle Mesh Ewald, and traditional Ewald summation techniques, leveraging the GPU-accelerated power of Triton and PyTorch. We integrated these approaches with the Python-based shadow molecular dynamics scheme, enabling fast charge equilibration for scalable machine learning potentials involving systems with hundreds of thousands of atoms.
- [450] arXiv:2503.21186 (cross-list from quant-ph) [pdf, html, other]
-
Title: DemoQuanDT: A Carrier-Grade QKD NetworkP. Horoschenkoff, J. Henrich, R. Böhn, I. Khan, J. Rödiger, M. Gunkel, M. Bauch, J. Benda, P. Bläcker, E. Eichhammer, U. Eismann, G. Frenck, H. Griesser, W. Jontofsohn, N. Kopshoff, S. Röhrich, F. Seidl, N. Schark, E. Sollner, D. von Blanckenburg, A. Heinemann, M. Stiemerling, M. GärtnerComments: All rights, including for text and data mining (TDM), Artificial Intelligence (AI) training, and similar technologies, are reserved. This project has received funding from the German research ministry "Bundesministerium fuer Bildung, Wissenschaft, Forschung und Technologie" (BMBF) as part of the DemoQuanDT research and innovation programme under grant agreement No. 16KISQ074Subjects: Quantum Physics (quant-ph); Networking and Internet Architecture (cs.NI)
Quantum Key Distribution Networks (QKDN) enable secure communication even in the age of powerful quantum computers. In the hands of a network operator, which can offer its service to many users, the economic viability of a QKDN increases significantly. The highly challenging operator-user relationship in a large-scale network setting demands additional requirements to ensure carrier-grade operation. Addressing this challenge, this work presents a carrier-grade QKDN architecture, which combines the functional QKDN architecture with the operational perspective of a network operator, ultimately enhancing the economic viability of QKDN. The focus is on the network and key management aspects of a QKDN while assuming state-of-the-art commercial QKD modules. The presented architecture was rolled out within an in-field demonstrator, connecting the cities of Berlin and Bonn over a link distance of 923 km across Germany. We show that the proposed network architecture is feasible, integrable, and scalable, making it suitable for deployment in real-world networks. Overall, the presented carrier-grade QKDN architecture promises to serve as a blueprint for network operators providing QKD-based services to their customers.
- [451] arXiv:2503.21194 (cross-list from math.CO) [pdf, html, other]
-
Title: Matchgate signatures under variable permutationsSubjects: Combinatorics (math.CO); Computational Complexity (cs.CC)
In this article, we give a sufficient and necessary condition for determining whether a matchgate signature retains its property under a certain variable permutation, which can be checked in polynomial time. We also define the concept of permutable matchgate signatures, and use it to erase the gap between Pl-\#CSP and \#CSP on planar graphs in the previous study. We provide a detailed characterization of permutable matchgate signatures as well, by presenting their relation to symmetric matchgate signatures. In addition, we prove a dichotomy for Pl-$\#R_D$-CSP where $D\ge 3$ is an integer.
- [452] arXiv:2503.21211 (cross-list from physics.ao-ph) [pdf, other]
-
Title: Interpretable Cross-Sphere Multiscale Deep Learning Predicts ENSO Skilfully Beyond 2 YearsComments: 13 pages, 4 figuresSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
El Niño-Southern Oscillation (ENSO) exerts global climate and societal impacts, but real-time prediction with lead times beyond one year remains challenging. Dynamical models suffer from large biases and uncertainties, while deep learning struggles with interpretability and multi-scale dynamics. Here, we introduce PTSTnet, an interpretable model that unifies dynamical processes and cross-scale spatiotemporal learning in an innovative neural-network framework with physics-encoding learning. PTSTnet produces interpretable predictions significantly outperforming state-of-the-art benchmarks with lead times beyond 24 months, providing physical insights into error propagation in ocean-atmosphere interactions. PTSTnet learns feature representations with physical consistency from sparse data to tackle inherent multi-scale and multi-physics challenges underlying ocean-atmosphere processes, thereby inherently enhancing long-term prediction skill. Our successful realizations mark substantial steps forward in interpretable insights into innovative neural ocean modelling.
- [453] arXiv:2503.21228 (cross-list from q-bio.PE) [pdf, html, other]
-
Title: Value of risk-contact data from digital contact monitoring apps in infectious disease modelingMartijn H. H. Schoot Uiterkamp, Willian J. van Dijk, Hans Heesterbeek, Remco van der Hofstad, Jessica C. Kiefte-de Jong, Nelly LitvakComments: 15 pages, 5 figuresSubjects: Populations and Evolution (q-bio.PE); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
In this paper, we present a simple method to integrate risk-contact data, obtained via digital contact monitoring (DCM) apps, in conventional compartmental transmission models. During the recent COVID-19 pandemic, many such data have been collected for the first time via newly developed DCM apps. However, it is unclear what the added value of these data is, unlike that of traditionally collected data via, e.g., surveys during non-epidemic times. The core idea behind our method is to express the number of infectious individuals as a function of the proportion of contacts that were with infected individuals and use this number as a starting point to initialize the remaining compartments of the model. As an important consequence, using our method, we can estimate key indicators such as the effective reproduction number using only two types of daily aggregated contact information, namely the average number of contacts and the average number of those contacts that were with an infected individual. We apply our method to the recent COVID-19 epidemic in the Netherlands, using self-reported data from the health surveillance app COVID RADAR and proximity-based data from the contact tracing app CoronaMelder. For both data sources, our corresponding estimates of the effective reproduction number agree both in time and magnitude with estimates based on other more detailed data sources such as daily numbers of cases and hospitalizations. This suggests that the use of DCM data in transmission models, regardless of the precise data type and for example via our method, offers a promising alternative for estimating the state of an epidemic, especially when more detailed data are not available.
- [454] arXiv:2503.21242 (cross-list from eess.SP) [pdf, html, other]
-
Title: PLAIN: Scalable Estimation Architecture for Integrated Sensing and CommunicationComments: Submitted to the IEEE Transactions on Wireless Communications. Code available at GitHub: this https URLSubjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity, if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high dimensional input is converted into lower dimensionality, without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.
- [455] arXiv:2503.21252 (cross-list from math.OC) [pdf, html, other]
-
Title: Multi-fidelity Learning of Reduced Order Models for Parabolic PDE Constrained OptimizationComments: 36 pages, 5 figuresSubjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
This article builds on the recently proposed RB-ML-ROM approach for parameterized parabolic PDEs and proposes a novel hierarchical Trust Region algorithm for solving parabolic PDE constrained optimization problems. Instead of using a traditional offline/online splitting approach for model order reduction, we adopt an active learning or enrichment strategy to construct a multi-fidelity hierarchy of reduced order models on-the-fly during the outer optimization loop. The multi-fidelity surrogate model consists of a full order model, a reduced order model and a machine learning model. The proposed hierarchical framework adaptively updates its hierarchy when querying parameters, utilizing a rigorous a posteriori error estimator in an error aware trust region framework. Numerical experiments are given to demonstrate the efficiency of the proposed approach.
- [456] arXiv:2503.21287 (cross-list from math.CO) [pdf, html, other]
-
Title: On Supports for graphs of bounded genusSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
Let $(X,\mathcal{E})$ be a hypergraph. A support is a graph $Q$ on $X$ such that for each $E\in\mathcal{E}$, the subgraph of $Q$ induced on the elements in $E$ is connected. We consider the problem of constructing a support for hypergraphs defined by connected subgraphs of a host graph. For a graph $G=(V,E)$, let $\mathcal{H}$ be a set of connected subgraphs of $G$. Let the vertices of $G$ be partitioned into two sets, the \emph{terminals} $\mathbf{b}(V)$ and the \emph{non-terminals} $\mathbf{r}(V)$. We define a hypergraph on $\mathbf{b}(V)$, where each $H\in\mathcal{H}$ defines a hyperedge consisting of the vertices of $\mathbf{b}(V)$ in $H$.
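As an illustration only (not taken from the paper), the support condition defined above can be checked directly: for every hyperedge, a breadth-first search restricted to that hyperedge's vertices must reach all of them. The function name `is_support` and the edge-list/set input format are assumptions made for this sketch.

```python
from collections import deque


def is_support(edges, hyperedges):
    """Return True if the graph given by `edges` is a support for `hyperedges`.

    `edges` is an iterable of undirected pairs (u, v); `hyperedges` is an
    iterable of vertex sets. Both formats are assumptions of this sketch.
    """
    # Build an adjacency map for the candidate support graph Q.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for hyperedge in hyperedges:
        members = set(hyperedge)
        if not members:
            continue  # an empty hyperedge imposes no constraint
        start = next(iter(members))
        seen = {start}
        queue = deque([start])
        # BFS that only walks through vertices of this hyperedge,
        # i.e. a traversal of the induced subgraph Q[E].
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    queue.append(v)
        if seen != members:
            return False  # the induced subgraph is disconnected
    return True


hyperedges = [{1, 2, 3}, {3, 4}]
print(is_support([(1, 2), (2, 3), (3, 4)], hyperedges))  # True
print(is_support([(1, 2), (3, 4)], hyperedges))          # False
```

The check only verifies a given candidate graph; it says nothing about the genus of the support, which is the subject of the paper's main result.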
We also consider the problem of constructing a support for the \emph{dual hypergraph} - a hypergraph on $\mathcal{H}$ where each $v\in \mathbf{b}(V)$ defines a hyperedge consisting of the subgraphs in $\mathcal{H}$ containing $v$. In fact, we construct supports for a common generalization of the primal and dual settings called the \emph{intersection hypergraph}.
As our main result, we show that if the host graph $G$ has bounded genus and the subgraphs in $\mathcal{H}$ satisfy a condition of being \emph{cross-free}, then there exists a support that also has bounded genus. Our results are a generalization of the results of Raman and Ray (Rajiv Raman, Saurabh Ray: Constructing Planar Support for Non-Piercing Regions. Discret. Comput. Geom. 64(3): 1098-1122 (2020)).
Our techniques imply a unified analysis for packing and covering problems for hypergraphs defined on surfaces of bounded genus. We also describe applications of our results for hypergraph colorings.
- [457] arXiv:2503.21303 (cross-list from physics.ao-ph) [pdf, html, other]
-
Title: Simulation-informed deep learning for enhanced SWOT observations of fine-scale ocean dynamicsEugenio Cutolo (IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY), Carlos Granero-Belinchon (ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE), Ptashanna Thiraux (IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY), Jinbo Wang (JPL), Ronan Fablet (IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY)Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Oceanic processes at fine scales are crucial yet difficult to observe accurately due to limitations in satellite and in-situ measurements. The Surface Water and Ocean Topography (SWOT) mission provides high-resolution Sea Surface Height (SSH) data, though noise patterns often obscure fine scale structures. Current methods struggle with noisy data or require extensive supervised training, limiting their effectiveness on real-world observations. We introduce SIMPGEN (Simulation-Informed Metric and Prior for Generative Ensemble Networks), an unsupervised adversarial learning framework combining real SWOT observations with simulated reference data. SIMPGEN leverages wavelet-informed neural metrics to distinguish noisy from clean fields, guiding realistic SSH reconstructions. Applied to SWOT data, SIMPGEN effectively removes noise, preserving fine-scale features better than existing neural methods. This robust, unsupervised approach not only improves SWOT SSH data interpretation but also demonstrates strong potential for broader oceanographic applications, including data assimilation and super-resolution.
- [458] arXiv:2503.21321 (cross-list from stat.AP) [pdf, other]
-
Title: Explainable Boosting Machine for Predicting Claim Severity and Frequency in Car InsuranceSubjects: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
In a context of constant increase in competition and heightened regulatory pressure, accuracy, actuarial precision, as well as transparency and understanding of the tariff, are key issues in non-life insurance. Traditionally used generalized linear models (GLM) result in a multiplicative tariff that favors interpretability. With the rapid development of machine learning and deep learning techniques, actuaries and the rest of the insurance industry have adopted these techniques widely. However, there is a need to associate them with interpretability techniques. In this paper, our study focuses on introducing an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics and high prediction performance. This approach is described as a glass-box model and relies on the use of a Generalized Additive Model (GAM) and a cyclic gradient boosting algorithm. It accounts for univariate and pairwise interaction effects between features and naturally provides explanations of them. We implement this approach on car insurance frequency and severity data and extensively compare the performance of this approach with classical competitors: a GLM, a GAM, a CART model and an Extreme Gradient Boosting (XGB) algorithm. Finally, we examine the interpretability of these models to capture the main determinants of claim costs.
- [459] arXiv:2503.21422 (cross-list from q-fin.CP) [pdf, html, other]
-
Title: From Deep Learning to LLMs: A survey of AI in Quantitative InvestmentSubjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
Quantitative investment (quant) is an emerging, technology-driven approach in asset management, increasingly shaped by advancements in artificial intelligence. Recent advances in deep learning and large language models (LLMs) for quant finance have improved predictive modeling and enabled agent-based automation, suggesting a potential paradigm shift in this field. In this survey, taking alpha strategy as a representative example, we explore how AI contributes to the quantitative investment pipeline. We first examine the early stage of quant research, centered on human-crafted features and traditional statistical models with an established alpha pipeline. We then discuss the rise of deep learning, which enabled scalable modeling across the entire pipeline from data processing to order execution. Building on this, we highlight the emerging role of LLMs in extending AI beyond prediction, empowering autonomous agents to process unstructured data, generate alphas, and support self-iterative workflows.
- [460] arXiv:2503.21432 (cross-list from hep-ph) [pdf, html, other]
-
Title: Exploring the flavor structure of leptons via diffusion modelsComments: 23 pages, 5 figuresSubjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
We propose a method to explore the flavor structure of leptons using diffusion models, which are a class of generative artificial intelligence (generative AI) models. We consider a simple extension of the Standard Model with the type I seesaw mechanism and train a neural network to generate the neutrino mass matrix. By utilizing transfer learning, the diffusion model generates $10^4$ solutions that are consistent with the neutrino mass squared differences and the leptonic mixing angles. The distributions of the CP phases and the sums of neutrino masses, which are not included in the conditional labels but are calculated from the solutions, exhibit non-trivial tendencies. In addition, the effective mass in neutrinoless double beta decay is concentrated near the boundaries of the existing confidence intervals, allowing us to verify the obtained solutions through future experiments. An inverse approach using the diffusion model is expected to facilitate the experimental verification of flavor models from a perspective distinct from conventional analytical methods.
- [461] arXiv:2503.21434 (cross-list from math.CT) [pdf, other]
-
Title: Elgot Categories and Abacus ProgramsComments: In peer review, although not at MFPS, I'm just using their style files!Subjects: Category Theory (math.CT); Logic in Computer Science (cs.LO)
We introduce Elgot categories, a sort of distributive monoidal category with additional structure in which the partial recursive functions are representable. Moreover, we construct an initial Elgot category, the morphisms of which coincide with a lightly modified version of Lambek's abacus programs. The partial functions that are strongly representable in this initial Elgot category are precisely the partial recursive ones.
- [462] arXiv:2503.21443 (cross-list from stat.ME) [pdf, html, other]
-
Title: Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRIFelix Terhag, Philipp Knechtges, Achim Basermann, Anja Bach, Darius Gerlach, Jens Tank, Raúl TemponeSubjects: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Statistics Theory (math.ST); Applications (stat.AP)
Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable.
To address this challenge, this work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance.
This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).
- [463] arXiv:2503.21469 (cross-list from eess.IV) [pdf, html, other]
-
Title: Embedding Compression Distortion in Video Coding for MachinesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our code and supplementary materials are released at this https URL.
- [464] arXiv:2503.21473 (cross-list from stat.ML) [pdf, html, other]
-
Title: DeepRV: pre-trained spatial priors for accelerated disease mappingSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Recently introduced prior-encoding deep generative models (e.g., PriorVAE, $\pi$VAE, and PriorCVAE) have emerged as powerful tools for scalable Bayesian inference by emulating complex stochastic processes like Gaussian processes (GPs). However, these methods remain largely a proof-of-concept and inaccessible to practitioners. We propose DeepRV, a lightweight, decoder-only approach that accelerates training, and enhances real-world applicability in comparison to current VAE-based prior encoding approaches. Leveraging probabilistic programming frameworks (e.g., NumPyro) for inference, DeepRV achieves significant speedups while also improving the quality of parameter inference, closely matching full MCMC sampling. We showcase its effectiveness in process emulation and spatial analysis of the UK using simulated data, gender-wise cancer mortality rates for individuals under 50, and HIV prevalence in Zimbabwe. To bridge the gap between theory and practice, we provide a user-friendly API, enabling scalable and efficient Bayesian inference.
- [465] arXiv:2503.21479 (cross-list from quant-ph) [pdf, other]
-
Title: Quantum umlaut informationComments: 52 pagesSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Mathematical Physics (math-ph)
We study the quantum umlaut information, a correlation measure defined for bipartite quantum states $\rho_{AB}$ as a reversed variant of the quantum mutual information: $U(A;B)_\rho = \min_{\sigma_B} D(\rho_A\otimes \sigma_B\|\rho_{AB})$ in terms of the quantum relative entropy $D$. As in the classical case [Girardi et al., arXiv:2503.18910], this definition allows for a closed-form expression and has an operational interpretation as the asymptotic error exponent in the hypothesis testing task of deciding whether a given bipartite state is product or not. We generalise the umlaut information to quantum channels, where it also extends the notion of `oveloh information' [Nuradha et al., arXiv:2404.16101]. We prove that channel umlaut information is additive for classical-quantum channels, while we observe additivity violations for fully quantum channels. Inspired by recent results in entanglement theory, we then show as our main result that the regularised umlaut information constitutes a fundamental measure of the quality of classical information transmission over a quantum channel -- as opposed to the capacity, which quantifies the quantity of information that can be sent. This interpretation applies to coding assisted by activated non-signalling correlations, and the channel umlaut information is in general larger than the corresponding expression for unassisted communication as obtained by Dalai for the classical-quantum case [IEEE Trans. Inf. Theory 59, 8027 (2013)]. Combined with prior works on non-signalling--assisted zero-error channel capacities, our findings imply a dichotomy between the settings of zero-rate error exponents and zero-error communication. While our results are single-letter only for classical-quantum channels, we also give a single-letter bound for fully quantum channels in terms of the `geometric' version of umlaut information.
- [466] arXiv:2503.21501 (cross-list from eess.IV) [pdf, html, other]
-
Title: Double Blind Imaging with Generative ModelingSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Blind inverse problems in imaging arise from uncertainties in the system used to collect (noisy) measurements of images. Recovering clean images from these measurements typically requires identifying the imaging system, either implicitly or explicitly. A common solution leverages generative models as priors for both the images and the imaging system parameters (e.g., a class of point spread functions). To learn these priors in a straightforward manner requires access to a dataset of clean images as well as samples of the imaging system. We propose an AmbientGAN-based generative technique to identify the distribution of parameters in unknown imaging systems, using only unpaired clean images and corrupted measurements. This learned distribution can then be used in model-based recovery algorithms to solve blind inverse problems such as blind deconvolution. We successfully demonstrate our technique for learning Gaussian blur and motion blur priors from noisy measurements and show their utility in solving blind deconvolution with diffusion posterior sampling.
- [467] arXiv:2503.21514 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantitative Evaluation of Quantum/Classical Neural Network Using a Game Solver MetricComments: 11 pages, 16 figuresSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
To evaluate the performance of quantum computing systems relative to classical counterparts and explore the potential for quantum advantage, we propose a game-solving benchmark based on Elo ratings in the game of tic-tac-toe. We compare classical convolutional neural networks (CNNs), quantum convolutional neural networks (QCNNs), and hybrid classical-quantum models by assessing their performance against a random-move agent in automated matches. Additionally, we implement a QCNN integrated with quantum communication and evaluate its performance to quantify the overhead introduced by noisy quantum channels. Our results show that the classical-quantum hybrid model achieves Elo ratings comparable to those of classical CNNs, while the standalone QCNN underperforms under current hardware constraints. The communication overhead was found to be modest. These findings demonstrate the viability of using game-based benchmarks for evaluating quantum computing systems and suggest that quantum communication can be incorporated with limited impact on performance, providing a foundation for future hybrid quantum applications.
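For readers unfamiliar with the metric, the standard Elo update that such a benchmark builds on can be sketched as follows. This is the textbook formula, not the paper's code; the K-factor of 32, the starting ratings, and the function names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The K-factor k=32 is an assumed default, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated players; A wins.
print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Because the update is zero-sum, the total rating of the two players is preserved after each game.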
- [468] arXiv:2503.21526 (cross-list from stat.ML) [pdf, other]
-
Title: Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasetsComments: Accepted for the 4th Conference on Causal Learning and Reasoning (CLeaR 2025)Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
In this paper we consider the use of tiered background knowledge within constraint-based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e. allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the 'tiered FCI' (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the 'tiered IOD' (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.
- [469] arXiv:2503.21528 (cross-list from stat.ML) [pdf, html, other]
-
Title: Bayesian Pseudo Posterior Mechanism for Differentially Private Machine LearningRobert Chew, Matthew R. Williams, Elan A. Segarra, Alexander J. Preiss, Amanda Konet, Terrance D. SavitskySubjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real world distributions, including highly imbalanced or small labeled training sets. In this work, we propose a new scalable DP mechanism for deep learning models, SWAG-PPM, by using a pseudo posterior distribution that downweights by-record likelihood contributions proportionally to their disclosure risks as the randomized mechanism. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.
- [470] arXiv:2503.21535 (cross-list from math.NT) [pdf, html, other]
-
Title: Computing Isomorphisms between Products of Supersingular Elliptic CurvesSubjects: Number Theory (math.NT); Cryptography and Security (cs.CR)
The Deligne-Ogus-Shioda theorem guarantees the existence of isomorphisms between products of supersingular elliptic curves over finite fields. In this paper, we present methods for explicitly computing these isomorphisms in polynomial time, given the endomorphism rings of the curves involved. Our approach leverages the Deuring correspondence, enabling us to reformulate computational isogeny problems into algebraic problems in quaternions. Specifically, we reduce the computation of isomorphisms to solving systems of quadratic and linear equations over the integers derived from norm equations. We develop $\ell$-adic techniques for solving these equations when we have access to a low discriminant subring. Combining these results leads to the description of an efficient probabilistic Las Vegas algorithm for computing the desired isomorphisms. Under GRH, it is proved to run in expected polynomial time.
- [471] arXiv:2503.21538 (cross-list from math.OC) [pdf, other]
-
Title: Formation Shape Control using the Gromov-Wasserstein MetricComments: To appear in the proceedings of Learning for Dynamics and Control (L4DC) conference, PMLR, 2025Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
This article introduces a formation shape control algorithm, in the optimal control framework, for steering an initial population of agents to a desired configuration via employing the Gromov-Wasserstein distance. The underlying dynamical system is assumed to be a constrained linear system and the objective function is a sum of quadratic control-dependent stage cost and a Gromov-Wasserstein terminal cost. The inclusion of the Gromov-Wasserstein cost transforms the resulting optimal control problem into a well-known NP-hard problem, making it both numerically demanding and difficult to solve with high accuracy. Towards that end, we employ a recent semi-definite relaxation-driven technique to tackle the Gromov-Wasserstein distance. A numerical example is provided to illustrate our results.
- [472] arXiv:2503.21546 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: consexpressionR: an R package for consensus differential gene expression analysisSubjects: Genomics (q-bio.GN); Systems and Control (eess.SY)
Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Computational methods have been developed for each stage of the identification of differentially expressed genes. Nevertheless, there are few studies exploring the association between different types of methods. In this study, we evaluated the impact of the association of methodologies on the results of differential expression analysis. Using two data sets with qPCR data (as a gold-standard reference), seven methods implemented in R packages (EBSeq, edgeR, DESeq2, limma, SAMseq, NOISeq, and Knowseq) were run and assessed separately and in association. The results were evaluated against the adopted qPCR data. Results: Here, we introduce consexpressionR, an R package that automates differential expression analysis using consensus of at least seven methodologies, producing more assertive results with a significant reduction in false positives. Availability: consexpressionR is an R package; source code and support are available at GitHub (this https URL).
- [473] arXiv:2503.21576 (cross-list from math.PR) [pdf, other]
-
Title: Empirical Measures and Strong Laws of Large Numbers in Categorical ProbabilityComments: 54 pagesSubjects: Probability (math.PR); Logic in Computer Science (cs.LO); Category Theory (math.CT); Statistics Theory (math.ST)
The Glivenko-Cantelli theorem is a uniform version of the strong law of large numbers. It states that for every IID sequence of random variables, the empirical measure converges to the underlying distribution (in the sense of uniform convergence of the CDF). In this work, we provide tools to study such limits of empirical measures in categorical probability.
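The classical statement can be observed numerically. The sketch below, an illustration that is not part of the paper, draws IID Uniform(0, 1) samples and computes the sup-norm distance between the empirical CDF and the true CDF $F(x) = x$; the choice of distribution, the seed, and the function name are assumptions.

```python
import random


def sup_distance_uniform(samples):
    """Sup-norm distance between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(samples)
    n = len(xs)
    # The empirical CDF jumps from i/n to (i+1)/n at the i-th order statistic,
    # so the supremum is attained just before or at one of these jumps.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 10_000):
    draws = [rng.random() for _ in range(n)]
    print(n, sup_distance_uniform(draws))
```

The printed distance should shrink as `n` grows, which is the uniform convergence the theorem asserts almost surely.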
We propose two axioms, permutation invariance and empirical adequacy, that a morphism of type $X^\mathbb{N} \to X$ should satisfy to be interpretable as taking an infinite sequence as input and producing a sample from its empirical measure as output. Since not all sequences have a well-defined empirical measure, such ``empirical sampling morphisms'' live in quasi-Markov categories, which, unlike Markov categories, allow partial morphisms. Given an empirical sampling morphism and a few other properties, we prove representability as well as abstract versions of the de Finetti theorem, the Glivenko-Cantelli theorem and the strong law of large numbers.
We provide several concrete constructions of empirical sampling morphisms as partially defined Markov kernels on standard Borel spaces. Instantiating our abstract results then recovers the standard Glivenko-Cantelli theorem and the strong law of large numbers for random variables with finite first moment. Our work thus provides a joint proof of these two theorems in conjunction with the de Finetti theorem from first principles.
- [474] arXiv:2503.21585 (cross-list from stat.ML) [pdf, html, other]
-
Title: Probabilistic Functional Neural NetworksSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
High-dimensional functional time series (HDFTS) are often characterized by nonlinear trends and high spatial dimensions. Such data poses unique challenges for modeling and forecasting due to the nonlinearity, nonstationarity, and high dimensionality. We propose a novel probabilistic functional neural network (ProFnet) to address these challenges. ProFnet integrates the strengths of feedforward and deep neural networks with probabilistic modeling. The model generates probabilistic forecasts using Monte Carlo sampling and also enables the quantification of uncertainty in predictions. While capturing both temporal and spatial dependencies across multiple regions, ProFnet offers a scalable and unified solution for large datasets. Applications to Japan's mortality rates demonstrate superior performance. This approach enhances predictive accuracy and provides interpretable uncertainty estimates, making it a valuable tool for forecasting complex high-dimensional functional data and HDFTS.
- [475] arXiv:2503.21603 (cross-list from physics.optics) [pdf, other]
-
Title: All-Optical High-speed Programmable Nonlinear Activation Functions using a Fabry-Perot LaserMladen Banović (1), Petar Atanasijević (1), Antonios Prapas (2), Christos Pappas (2), Jasna Crnjanski (1), Apostolos Tsakyridis (2), Miltiadis Moralis-Pegios (2), Konstantinos Vyrsokinos (2), Milanka Lović (1), Nina Zdravković (1), Milena Mićić (1), Marko Krstić (1), Slobodan Petričević (1), Nikos Pleros (2), Dejan Gvozdić (1) ((1) University of Belgrade - School of Electrical Engineering, 11120 Belgrade, Serbia and (2) Centre for Interdisciplinary Research and Innovation, Informatics Dept. Aristotle University of Thessaloniki, Greece)Subjects: Optics (physics.optics); Emerging Technologies (cs.ET)
The threads of photonics are eagerly awaited to redefine the future of neuromorphic data processing, especially as the computing-intensive artificial intelligence models become an unavoidable part of our everyday lives. Still, there is much to be improved within the domain of photonic nonlinear activation functions, as the programmable, all-optical, energy-efficient nonlinearities remain beyond the grasp of today's state of the art. In this paper, we address the issue at hand and propose a novel approach in the realization of high-performing all-optical photonic activations. Through simulations and experiments, we show that Fabry-Perot laser diodes (FP-LDs) exhibit richness and high programmability of their nonlinear response to input optical pulses with widths as low as 25 ps. We demonstrate a variety of sigmoid-like and inverted PReLU-like trends to be used as all-optical activation functions in photonic neural networks, testing their performance in stringent, real-life training scenarios with randomized data patterns at repetition rates up to 10 GHz. The programmability of activations is shown using a multitude of experimental operating parameters, among which we highlight the power variation of an additional continuous wave laser, injected into the FP-LD, enriching our approach with all-optical control of all-optical activations. With very low static power consumption of our active element, we achieve a record-breaking energy draw on the order of pJ to hundreds of fJ per nonlinear operation.
- [476] arXiv:2503.21608 (cross-list from stat.ML) [pdf, html, other]
-
Title: Nonlinear Multiple Response Regression and Learning of Latent SpacesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Identifying low-dimensional latent structures within high-dimensional data has long been a central topic in the machine learning community, driven by the need for data compression, storage, transmission, and deeper data understanding. Traditional methods, such as principal component analysis (PCA) and autoencoders (AE), operate in an unsupervised manner, ignoring label information even when it is available. In this work, we introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. We formulate the problem as a nonlinear multiple-response regression within an index model context. By applying the generalized Stein's lemma, the latent space can be estimated without knowing the nonlinear link functions. Our method can be viewed as a nonlinear generalization of PCA. Moreover, unlike AE and other neural network methods that operate as "black boxes", our approach not only offers better interpretability but also reduces computational complexity while providing strong theoretical guarantees. Comprehensive numerical experiments and real data analyses demonstrate the superior performance of our method.
- [477] arXiv:2503.21653 (cross-list from math.PR) [pdf, html, other]
-
Title: Strong convergence and stability of stochastic theta method for time-changed stochastic differential equations with local Lipschitz coefficientsSubjects: Probability (math.PR); Numerical Analysis (math.NA)
In this paper, the stochastic theta (ST) method is investigated for a class of stochastic differential equations driven by a time-changed Brownian motion, whose coefficients are time-space-dependent and satisfy the local Lipschitz condition. It is proved that under the local Lipschitz and some additional assumptions, the ST method with $\theta\in[1/2,1]$ is strongly convergent. It is also obtained that, for all positive stepsizes, the ST method with $\theta\in[1/2,1]$ is asymptotically mean square stable under a coercivity condition. With some restrictions on the stepsize, the ST method with $\theta\in[0,1/2)$ is asymptotically mean square stable under a stronger assumption. Some numerical simulations are presented to illustrate the theoretical results.
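As a rough illustration of the ST update described above, the sketch below applies it to the scalar linear test equation dX = aX dt + bX dW. This is a simplification and not the paper's setting: it assumes an ordinary (not time-changed) Brownian motion and globally Lipschitz linear coefficients, for which the implicit step can be solved in closed form. The function name and all parameters are hypothetical.

```python
import math
import random


def stochastic_theta_linear(a, b, x0, h, n_steps, theta, rng):
    """Stochastic theta scheme for dX = a*X dt + b*X dW (illustrative only).

    One step reads
        X_{n+1} = X_n + ((1 - theta)*a*X_n + theta*a*X_{n+1})*h + b*X_n*dW_n,
    which for this linear drift can be solved for X_{n+1} explicitly.
    Assumption: W is a standard Brownian motion, so dW_n ~ N(0, h).
    """
    denom = 1.0 - theta * a * h
    if denom == 0.0:
        raise ValueError("implicit step is singular for this theta, a, h")
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over one step
        x = x * (1.0 + (1.0 - theta) * a * h + b * dw) / denom
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # theta = 1 is the fully implicit (backward Euler type) variant.
    print(stochastic_theta_linear(a=-2.0, b=0.5, x0=1.0, h=0.01,
                                  n_steps=1000, theta=1.0, rng=rng))
```

Setting theta = 0 recovers the explicit Euler-Maruyama step, while theta in [1/2, 1] corresponds to the range for which the abstract reports stability for all positive stepsizes.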
- [478] arXiv:2503.21672 (cross-list from math.CO) [pdf, html, other]
-
Title: The Avoider-Enforcer game on hypergraphs of rank 3Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
In the Avoider-Enforcer convention of positional games, two players, Avoider and Enforcer, take turns selecting vertices from a hypergraph H. Enforcer wins if, by the time all vertices of H have been selected, Avoider has completely filled an edge of H with her vertices; otherwise, Avoider wins. In this paper, we first give some general results, in particular regarding the outcome of the game and disjoint unions of hypergraphs. We then determine which player has a winning strategy for all hypergraphs of rank 2, and for linear hypergraphs of rank 3 when Avoider plays the last move. The structural characterisations we obtain yield polynomial-time algorithms.
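To make the rules above concrete, here is a minimal brute-force solver for tiny hypergraphs. It is an exhaustive game-tree search and not the paper's polynomial-time algorithm. It assumes each player selects exactly one vertex per turn, and the function name and the `avoider_moves_first` parameter are hypothetical.

```python
from functools import lru_cache


def enforcer_wins(vertices, edges, avoider_moves_first=True):
    """Return True if Enforcer has a winning strategy (exponential time).

    Assumption: players alternate, each picking exactly one unselected
    vertex per turn, until every vertex has been selected.
    """
    vertices = tuple(vertices)
    edges = [frozenset(e) for e in edges]

    @lru_cache(maxsize=None)
    def solve(avoider, enforcer, avoider_to_move):
        # Once Avoider owns a whole edge, she cannot undo it: Enforcer wins.
        if any(e <= avoider for e in edges):
            return True
        remaining = [v for v in vertices
                     if v not in avoider and v not in enforcer]
        if not remaining:
            return False  # all vertices selected, no edge filled by Avoider
        if avoider_to_move:
            # Avoider escapes if at least one of her moves avoids a loss.
            return all(solve(avoider | {v}, enforcer, False)
                       for v in remaining)
        # Enforcer needs just one move that forces a win.
        return any(solve(avoider, enforcer | {v}, True) for v in remaining)

    return solve(frozenset(), frozenset(), avoider_moves_first)


if __name__ == "__main__":
    triangle = [{1, 2}, {2, 3}, {1, 3}]
    print(enforcer_wins([1, 2, 3], triangle, avoider_moves_first=True))
```

On the triangle with Avoider moving first, Avoider ends up with two of the three vertices, and any two vertices form an edge, so the search reports an Enforcer win.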
- [479] arXiv:2503.21681 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: A Comprehensive Benchmark for RNA 3D Structure-Function ModelingSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
The RNA structure-function relationship has recently garnered significant attention within the deep learning community, promising to grow in importance as nucleic acid structure models advance. However, the absence of standardized and accessible benchmarks for deep learning on RNA 3D structures has impeded the development of models for RNA functional characteristics.
In this work, we introduce a set of seven benchmarking datasets for RNA structure-function prediction, designed to address this gap. Our library builds on the established Python library rnaglib, and offers easy data distribution and encoding, splitters and evaluation methods, providing a convenient all-in-one framework for comparing models. Datasets are implemented in a fully modular and reproducible manner, facilitating community contributions and customization. Finally, we provide initial baseline results for all tasks using a graph neural network.
Source code: this https URL
Documentation: this https URL
- [480] arXiv:2503.21686 (cross-list from quant-ph) [pdf, html, other]
-
Title: Molecular Quantum TransformerComments: 13 pages, 8 figuresSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
The Transformer model, renowned for its powerful attention mechanism, has achieved state-of-the-art performance in various artificial intelligence tasks but faces challenges such as high computational cost and memory usage. Researchers are exploring quantum computing to enhance the Transformer's design, though it still shows limited success with classical data. With a growing focus on leveraging quantum machine learning for quantum data, particularly in quantum chemistry, we propose the Molecular Quantum Transformer (MQT) for modeling interactions in molecular quantum systems. By utilizing quantum circuits to implement the attention mechanism on the molecular configurations, MQT can efficiently calculate ground-state energies for all configurations. Numerical demonstrations show that in calculating ground-state energies for H_2, LiH, BeH_2, and H_4, MQT outperforms the classical Transformer, highlighting the promise of quantum effects in Transformer structures. Furthermore, its pretraining capability on diverse molecular data facilitates the efficient learning of new molecules, extending its applicability to complex molecular systems with minimal additional effort. Our method offers an alternative to existing quantum algorithms for estimating ground-state energies, opening new avenues in quantum chemistry and materials science.
Cross submissions (showing 44 of 44 entries)
- [481] arXiv:1511.03086 (replaced) [pdf, html, other]
-
Title: The CTU Prague Relational Learning RepositoryComments: 9 pagesSubjects: Machine Learning (cs.LG); Databases (cs.DB)
The aim of the Prague Relational Learning Repository is to support machine learning research with multi-relational data. The repository currently contains 148 SQL databases hosted on a public MySQL server located at this https URL. The server is provided by the Czech Technical University (CTU). A searchable meta-database provides metadata (e.g., the number of tables in the database, the number of rows and columns in the tables, the number of self-relationships).
- [482] arXiv:1910.14067 (replaced) [pdf, html, other]
-
Title: Spectral properties of kernel matrices in the flat limitComments: 41 pages, 8 picturesJournal-ref: Siam J. Matrix Anal. Appl., 42(1):17-57, 2021Subjects: Numerical Analysis (math.NA); Spectral Theory (math.SP); Statistics Theory (math.ST)
Kernel matrices are of central importance to many applied fields. In this manuscript, we focus on spectral properties of kernel matrices in the so-called "flat limit", which occurs when points are close together relative to the scale of the kernel. We establish asymptotic expressions for the determinants of the kernel matrices, which we then leverage to obtain asymptotic expressions for the main terms of the eigenvalues. Analyticity of the eigenprojectors yields expressions for limiting eigenvectors, which are strongly tied to discrete orthogonal polynomials. Both smooth and finitely smooth kernels are covered, with stronger results available in the finite smoothness case.
- [483] arXiv:2008.08025 (replaced) [pdf, html, other]
-
Title: How to organize an in-person, online or hybrid hackathon -- A revised planning kitComments: 37 pages, 0 figuresSubjects: Computers and Society (cs.CY); Software Engineering (cs.SE)
Hackathons and similar time-bounded events are a global phenomenon. Their proliferation in various domains and their usefulness for a variety of goals has led to the emergence of different formats. While there are a multitude of guidelines available on how to prepare and run a hackathon, most of them focus on a particular format that was created for a specific purpose within a domain for a certain type of participant. This makes it difficult, in particular, for novice organizers to decide how to run an event that fits their needs. To address this gap we developed the original version of this planning kit in 2020 which focused on in-person events that were the dominant form of hackathons then. That planning kit was organized around 12 key decisions that organizers need to take when preparing for, running, and following up on a hackathon. Fast forward to 2025, after going through a global pandemic that forced all events to move online, we now see different forms of events - in-person, online, and hybrid - taking place across the globe, and while they can be all valuable, they have different affordances and require different considerations when planning. To account for these differences, we decided to update the original planning kit by adding a section that discusses the affordances and requirements of in-person, online, and hybrid events to each of the 12 decisions. In addition, we modified the original example timelines to include different forms and types of events. We also updated the planning kit in general based on insights we gained through continuing to organize and study hackathons. The main planning kit is available online while this report is meant to be a downloadable and citable resource.
- [484] arXiv:2012.04726 (replaced) [pdf, html, other]
-
Title: Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual MisinformationComments: ACL 2021Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems.
We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
- [485] arXiv:2109.05237 (replaced) [pdf, html, other]
-
Title: Physics-based Deep LearningComments: PBDL v0.3, online version: this https URLSubjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
This document is a hands-on, comprehensive guide to deep learning in the realm of physical simulations. Rather than just theory, we emphasize practical application: every concept is paired with interactive Jupyter notebooks to get you up and running quickly. Beyond traditional supervised learning, we dive into physical loss-constraints, differentiable simulations, diffusion-based approaches for probabilistic generative AI, as well as reinforcement learning and advanced neural network architectures. These foundations are paving the way for the next generation of scientific foundation models. We are living in an era of rapid transformation. These methods have the potential to redefine what's possible in computational science.
- [486] arXiv:2111.06880 (replaced) [pdf, html, other]
-
Title: Robust Eigenvectors of Symmetric TensorsComments: 22 pages, 3 figuresJournal-ref: SIAM J. Matrix Anal. Appl., 43(4):1784--1805, 2022Subjects: Numerical Analysis (math.NA); Algebraic Geometry (math.AG); Spectral Theory (math.SP)
The tensor power method generalizes the matrix power method to higher order arrays, or tensors. Like in the matrix case, the fixed points of the tensor power method are the eigenvectors of the tensor. While every real symmetric matrix has an eigendecomposition, the vectors generating a symmetric decomposition of a real symmetric tensor are not always eigenvectors of the tensor.
In this paper we show that whenever an eigenvector is a generator of the symmetric decomposition of a symmetric tensor, then (if the order of the tensor is sufficiently high) this eigenvector is robust, i.e., it is an attracting fixed point of the tensor power method. We exhibit new classes of symmetric tensors whose symmetric decomposition consists of eigenvectors. Generalizing orthogonally decomposable tensors, we consider equiangular tight frame decomposable and equiangular set decomposable tensors. Our main result implies that such tensors can be decomposed using the tensor power method.
- [487] arXiv:2204.07958 (replaced) [pdf, html, other]
-
Title: Convergence analysis of a solver for the linear Poisson--Boltzmann modelSubjects: Numerical Analysis (math.NA)
This work investigates the convergence of a domain decomposition method for the Poisson-Boltzmann model that can be formulated as an interior-exterior transmission problem. To study its convergence, we introduce an interior-exterior constant providing an upper bound of the $L^2$ norm of any harmonic function in the interior, and establish a spectral equivalence for related Dirichlet-to-Neumann operators to estimate the spectrum of interior-exterior iteration operator. This analysis is nontrivial due to the unboundedness of the exterior subdomain, which distinguishes it from the classical analysis of the Schwarz alternating method with nonoverlapping bounded subdomains. It is proved that for the linear Poisson-Boltzmann solvent model in reality, the convergence of interior-exterior iteration is ensured when the relaxation parameter lies between 0 and 2. This convergence result interprets the good performance of ddLPB method developed in [SIAM Journal on Scientific Computing, 41 (2019), pp. B320-B350] where the relaxation parameter is set to 1. Numerical simulations are conducted to verify our convergence analysis and to investigate the optimal relaxation parameter for the interior-exterior iteration.
- [488] arXiv:2204.08005 (replaced) [pdf, other]
-
Title: A Survey on Location-Driven Influence MaximizationTaotao Cai, Quan Z.Sheng, Xiangyu Song, Jian Yang, Shuang Wang, Wei Emma Zhang, Jia Wu, Philip S. YuComments: Plan to update and extend this manuscriptSubjects: Social and Information Networks (cs.SI); Computer Science and Game Theory (cs.GT)
Influence Maximization (IM), which aims to select a set of users from a social network to maximize the expected number of influenced users, is an evergreen hot research topic. Its research outcomes significantly impact real-world applications such as business marketing. The booming location-based network platforms of the last decade have led researchers to embed location information into traditional IM research. In this survey, we provide a comprehensive review of the existing location-driven IM studies from the perspective of the following key aspects: (1) a review of the application scenarios of these works, (2) the diffusion models to evaluate the influence propagation, and (3) a comprehensive study of the approaches to deal with the location-driven IM problems together with a particular focus on the accelerating techniques. In the end, we outline prospective directions for future IM research.
- [489] arXiv:2204.09298 (replaced) [pdf, other]
-
Title: Exploring Widevine for Fun and ProfitGwendal Patat (SPICY, IRISA-D1), Mohamed Sabt (SPICY, IRISA-D1), Pierre-Alain Fouque (CAPSULE, IRISA-D1)Journal-ref: 16th IEEE Workshop on Offensive Technologies, WOOT 2022, Aug 2022, San Francisco, CA, United StatesSubjects: Cryptography and Security (cs.CR)
For years, Digital Rights Management (DRM) systems have been used as the go-to solution for media content protection against piracy. With the growing consumption of content using Over-the-Top platforms, such as Netflix or Prime Video, DRMs have been deployed on numerous devices considered as potentially hostile environments. In this paper, we focus on the most widespread solution, the closed-source Widevine DRM. Installed on billions of devices, Widevine relies on cryptographic operations to protect content. Our work presents a study of Widevine internals on Android, mapping its distinct components and bringing out the different cryptographic keys involved in content decryption. We provide a structural view of Widevine as a protocol with its complete key ladder. Based on our insights, we develop WideXtractor, a tool based on Frida to trace Widevine function calls and intercept messages for inspection. Using this tool, we analyze Netflix usage of Widevine as a proof-of-concept, and raise privacy concerns on user-tracking. In addition, we leverage our knowledge to bypass the obfuscation of Android Widevine software-only version, namely L3, and recover its Root-of-Trust.
- [490] arXiv:2206.05183 (replaced) [pdf, html, other]
-
Title: GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension ReductionsComments: 15 figures, related to non-archival proceedings communicationSubjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. The approaches learn nonlinear state-space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and other architectures. Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning reduced dimensional representations of the nonlinear Burgers Equations, Constrained Mechanical Systems, and spatial fields of Reaction-Diffusion Systems. GD-VAEs provide methods that can be used to obtain representations in manifold latent spaces for diverse learning tasks involving dynamics.
- [491] arXiv:2210.03166 (replaced) [pdf, other]
-
Title: The Power of Greedy for Online Minimum Cost Matching on the LineSubjects: Data Structures and Algorithms (cs.DS)
We consider the online minimum cost matching problem on the line, in which there are $n$ servers and, at each of $n$ time steps, a request arrives and must be irrevocably matched to a server that has not yet been matched to, with the goal of minimizing the sum of the distances between the matched pairs. Despite achieving a worst-case competitive ratio that is exponential in $n$, the simple greedy algorithm, which matches each request to its nearest available free server, performs very well in practice. A major question is thus to explain greedy's strong empirical performance. In this paper, we aim to understand the performance of greedy over instances that are at least partially random. When both the requests and the servers are drawn uniformly and independently from $[0,1]$, we show that greedy is constant competitive, which improves over the previously best-known $O(\sqrt{n})$ bound. We extend this constant competitive ratio to a setting with a linear excess of servers, which improves over the previously best-known $O(\log^3{n})$ bound. We moreover show that in the semi-random model where the requests are still drawn uniformly and independently but where the servers are chosen adversarially, greedy achieves an $O(\log{n})$ competitive ratio. When the requests arrive in a random order but are chosen adversarially, it was previously known that greedy is $O(n)$-competitive. Even though this one-sided randomness allows a large improvement in greedy's competitive ratio compared to the model where requests are adversarial and arrive in a random order, we show that it is not sufficient to obtain a constant competitive ratio by giving a tight $\Omega(\log{n})$ lower bound. These results invite further investigation about how much randomness is necessary and sufficient to obtain strong theoretical guarantees for the greedy algorithm for online minimum cost matching, on the line and beyond.
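The greedy rule described above can be sketched in a few lines. The snippet below is an illustrative simulation and not the authors' code: it matches each arriving request to its nearest free server and compares the total cost with the offline optimum, which on the line is obtained by matching sorted requests to sorted servers. Function names are hypothetical.

```python
import random


def greedy_matching_cost(servers, requests):
    """Total cost when each request, in arrival order, is irrevocably
    matched to its nearest free server (ties go to the earliest listed)."""
    free = list(servers)
    total = 0.0
    for r in requests:
        i = min(range(len(free)), key=lambda k: abs(free[k] - r))
        total += abs(free[i] - r)
        free.pop(i)  # the chosen server is no longer available
    return total


def optimal_matching_cost(servers, requests):
    """Offline optimum on the line: pair the i-th smallest request with
    the i-th smallest server (assumes equally many of each)."""
    return sum(abs(s - r) for s, r in zip(sorted(servers), sorted(requests)))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    # Fully random model: servers and requests i.i.d. uniform on [0, 1].
    servers = [rng.random() for _ in range(n)]
    requests = [rng.random() for _ in range(n)]
    greedy = greedy_matching_cost(servers, requests)
    opt = optimal_matching_cost(servers, requests)
    print(f"greedy = {greedy:.3f}, optimal = {opt:.3f}, ratio = {greedy / opt:.2f}")
```

The ratio printed for one random instance is only an empirical observation. The competitive-ratio bounds in the abstract are theoretical statements about expectations, which a single simulation does not verify.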
- [492] arXiv:2302.00058 (replaced) [pdf, html, other]
-
Title: Graph Anomaly Detection in Time Series: A SurveyComments: 20 pages, 7 figures, 6 tablesSubjects: Machine Learning (cs.LG)
With the recent advances in technology, a wide range of systems continue to collect a large amount of data over time and thus generate time series. Time-Series Anomaly Detection (TSAD) is an important task in various time-series applications such as e-commerce, cybersecurity, vehicle maintenance, and healthcare monitoring. However, this task is very challenging as it requires considering both the intra-variable dependency (relationships within a variable over time) and the inter-variable dependency (relationships between multiple variables) existing in time-series data. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of TSAD using graphs, referred to as G-TSAD. First, we explore the significant potential of graph representation for time-series data and its contributions to facilitating anomaly detection. Then, we review state-of-the-art graph anomaly detection techniques, mostly leveraging deep learning architectures, in the context of time series. For each method, we discuss its strengths, limitations, and the specific applications where it excels. Finally, we address both the technical and application challenges currently facing the field, and suggest potential future directions for advancing research and improving practical outcomes.
- [493] arXiv:2302.13997 (replaced) [pdf, html, other]
-
Title: Host Community Respecting Refugee HousingComments: A preliminary version appeared in AAMAS '23Subjects: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Theoretical Economics (econ.TH)
We propose a novel model for refugee housing respecting the preferences of the accepting community and refugees themselves. In particular, we are given a topology representing the local community, a set of inhabitants occupying some vertices of the topology, and a set of refugees that should be housed on the empty vertices of the graph. Both the inhabitants and the refugees have preferences over the structure of their neighborhood.
We are specifically interested in the problem of finding housing such that the preferences of every individual are met; using game-theoretical words, we are looking for housing that is stable with respect to some well-defined notion of stability. We investigate conditions under which the existence of equilibria is guaranteed and study the computational complexity of finding such a stable outcome. As the problem is NP-hard even in very simple settings, we employ the parameterized complexity framework to give a finer-grained view of the problem's complexity with respect to natural parameters and structural restrictions of the given topology.
- [494] arXiv:2304.01107 (replaced) [pdf, html, other]
-
Title: Process Channels: A New Layer for Process Enactment Based on Blockchain State ChannelsComments: Accepted at BPM 2023Journal-ref: In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds) Business Process Management. BPM 2023Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
For the enactment of inter-organizational business processes, blockchain can guarantee the enforcement of process models and the integrity of execution traces. However, existing solutions come with downsides regarding throughput scalability, latency, and suboptimal tradeoffs between confidentiality and transparency. To address these issues, we propose to change the foundation of blockchain-based business process execution: from on-chain smart contracts to state channels, an overlay network on top of a blockchain. State channels allow conducting most transactions off-chain while mostly retaining the core security properties offered by blockchain. Our proposal, process channels, is a model-driven approach to enacting processes on state channels, with the aim to retain the desired blockchain properties while reducing the on-chain footprint as much as possible. We here focus on the principled approach of state channels as a platform, to enable manifold future optimizations in various directions, like latency and confidentiality. We implement our approach prototypically and evaluate it both qualitatively (w.r.t. assumptions and guarantees) and quantitatively (w.r.t. correctness and gas cost). In short, while the initial deployment effort is higher with state channels, it typically pays off after a few process instances; and as long as the new assumptions hold, so do the guarantees.
- [495] arXiv:2304.10286 (replaced) [pdf, html, other]
-
Title: On the Computational Power of Particle MethodsComments: 17 pages, 24 appendix pagesSubjects: Formal Languages and Automata Theory (cs.FL); Numerical Analysis (math.NA)
We investigate the computational power of particle methods, a well-established class of algorithms with applications in scientific computing and computer simulation. The computational power of a compute model determines the class of problems it can solve. Automata theory allows describing the computational power of abstract machines (automata) and the problems they can solve. At the top of the Chomsky hierarchy of formal languages and grammars are Turing machines, which resemble the concept on which most modern computers are built. Although particle methods can be interpreted as automata based on their formal definition, their computational power has so far not been studied. We address this by analyzing Turing completeness of particle methods. In particular, we prove two sets of restrictions under which a particle method is still Turing powerful, and we show when it loses Turing powerfulness. This contributes to understanding the theoretical foundations of particle methods and provides insight into the powerfulness of computer simulations.
- [496] arXiv:2306.13255 (replaced) [pdf, html, other]
-
Title: Precise Asymptotic Generalization for Multiclass Classification with Overparameterized Linear ModelsComments: NeurIPS 2023, 56 pages v3: fixed typos in sparse Hanson-Wright theorem statementSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model introduced in Subramanian et al. '22, where the number of data points, features, and classes all grow together. We fully resolve the conjecture posed in Subramanian et al. '22, matching the predicted regimes for generalization. Furthermore, our new lower bounds are akin to an information-theoretic strong converse: they establish that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence of our tight results is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal.
The key to our tight analysis is a new variant of the Hanson-Wright inequality which is broadly useful for multiclass problems with sparse labels. As an application, we show that the same type of analysis can be used to analyze the related multilabel classification problem under the same bi-level ensemble.
- [497] arXiv:2307.15220 (replaced) [pdf, html, other]
-
Title: Learning Multi-modal Representations by Watching Hundreds of Surgical Video LecturesKun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, Nicolas PadoySubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. The [training code](this https URL) and [weights](this https URL) are public.
- [498] arXiv:2310.01791 (replaced) [pdf, html, other]
-
Title: Online POMDP Planning with Anytime Deterministic GuaranteesSubjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Decision-making under uncertainty is a critical aspect of many practical autonomous systems due to incomplete information. Partially Observable Markov Decision Processes (POMDPs) offer a mathematically principled framework for formulating decision-making problems under such conditions. However, finding an optimal solution for a POMDP is generally intractable. In recent years, there has been significant progress in scaling approximate solvers from small to moderately sized problems, using online tree search solvers. Often, such approximate solvers are limited to probabilistic or asymptotic guarantees towards the optimal solution. In this paper, we derive a deterministic relationship for discrete POMDPs between an approximated and the optimal solution. We show that at any time, we can derive bounds relating the existing solution to the optimal one. We show that our derivations provide an avenue for a new set of algorithms and can be attached to existing algorithms that have a certain structure to provide them with deterministic guarantees with marginal computational overhead. In return, not only do we certify the solution quality, but we demonstrate that making a decision based on the deterministic guarantee may result in superior performance compared to the original algorithm without the deterministic certification.
- [499] arXiv:2310.04722 (replaced) [pdf, html, other]
-
Title: A Holistic Evaluation of Piano Sound Quality
Comments: 15 pages, 9 figures
Journal-ref: Proceedings of the 10th Conference on Sound and Music Technology. CSMT 2023. Lecture Notes in Electrical Engineering, vol 1268. Springer, Singapore
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper aims to develop a holistic evaluation method for piano sound quality to assist in purchasing decisions. Unlike previous studies that focused on the effect of piano performance techniques on sound quality, this study evaluates the inherent sound quality of different pianos. To derive quality evaluation systems, the study uses subjective questionnaires based on a piano sound quality dataset. The method selects the optimal piano classification models by comparing the fine-tuning results of different pre-training models of Convolutional Neural Networks (CNN). To improve the interpretability of the models, the study applies Equivalent Rectangular Bandwidth (ERB) analysis. The results reveal that musically trained individuals are better able to distinguish between the sound quality differences of different pianos. The best fine-tuned CNN pre-trained backbone achieves a high accuracy of 98.3% as the piano classifier. However, the dataset is limited, and the audio is sliced to increase its quantity, resulting in a lack of diversity and balance, so we use focal loss to reduce the impact of data imbalance. To optimize the method, the dataset will be expanded, or few-shot learning techniques will be employed in future research.
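The abstract does not state which focal loss variant is used. As a minimal sketch, assuming the standard per-sample form $-\alpha\,(1 - p_t)^{\gamma}\log p_t$, the following shows how the loss down-weights samples the classifier already gets right. The function name, the defaults `gamma=2.0` and `alpha=1.0`, and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample (assumed form, not taken from the paper).

    probs  -- predicted class probabilities, summing to 1
    target -- index of the true class
    """
    # Clamp the true-class probability so log(0) cannot occur.
    p_t = min(max(probs[target], 1e-12), 1.0)
    # (1 - p_t)**gamma shrinks the loss of confidently correct samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A well-classified sample contributes far less than a misclassified one.
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy, so `gamma` controls how strongly easy samples are discounted.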
- [500] arXiv:2312.00206 (replaced) [pdf, html, other]
-
Title: SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting
Comments: Version accepted to 3DV 2025. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
3D Gaussian Splatting (3DGS) has recently enabled real-time rendering of unbounded 3D scenes for novel view synthesis. However, this technique requires dense training views to accurately reconstruct 3D geometry. A limited number of input views will significantly degrade reconstruction quality, resulting in artifacts such as "floaters" and "background collapse" at unseen viewpoints. In this work, we introduce SparseGS, an efficient training pipeline designed to address the limitations of 3DGS in scenarios with sparse training views. SparseGS incorporates depth priors, novel depth rendering techniques, and a pruning heuristic to mitigate floater artifacts, alongside an Unseen Viewpoint Regularization module to alleviate background collapses. Our extensive evaluations on the Mip-NeRF360, LLFF, and DTU datasets demonstrate that SparseGS achieves high-quality reconstruction in both unbounded and forward-facing scenarios, with as few as 12 and 3 input images, respectively, while maintaining fast training and real-time rendering capabilities.
- [501] arXiv:2312.03858 (replaced) [pdf, html, other]
-
Title: Empowering WebAssembly with Thin Kernel Interfaces
Comments: This work is published at EuroSys 2025, Rotterdam, Netherlands (March 30 - April 3) 14 pages, 8 figures
Journal-ref: Twentieth European Conference on Computer Systems (EuroSys 2025)
Subjects: Operating Systems (cs.OS); Software Engineering (cs.SE)
Wasm is gaining popularity outside the Web as a well-specified low-level binary format with ISA portability, low memory footprint and polyglot targetability, enabling efficient in-process sandboxing of untrusted code. Despite these advantages, Wasm adoption for new domains is often hindered by the lack of many standard system interfaces which precludes reusability of existing software and slows ecosystem growth.
This paper proposes thin kernel interfaces for Wasm, which directly expose OS userspace syscalls without breaking intra-process sandboxing, enabling a new class of virtualization with Wasm as a universal binary format. By virtualizing the bottom layer of userspace, kernel interfaces enable effortless application ISA portability, compiler backend reusability, and armor programs with Wasm's built-in control flow integrity and arbitrary code execution protection. Furthermore, existing capability-based APIs for Wasm, such as WASI, can be implemented as a Wasm module over kernel interfaces, improving reuse, robustness, and portability through better layering. We present an implementation of this concept for two kernels -- Linux and Zephyr -- by extending a modern Wasm engine and evaluate our system's performance on a number of sophisticated applications which can run for the first time on Wasm.
- [502] arXiv:2312.07669 (replaced) [pdf, html, other]
-
Title: GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits
Comments: Project page: this https URL. This work has been submitted to the IEEE journal for possible publication
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Synthesizing high-fidelity and emotion-controllable talking video portraits, with audio-lip sync, vivid expressions, realistic head poses, and eye blinks, has been an important and challenging task in recent years. Most existing methods suffer in achieving personalized and precise emotion control, smooth transitions between different emotion states, and the generation of diverse motions. To tackle these challenges, we present GMTalker, a Gaussian mixture-based emotional talking portraits generation framework. Specifically, we propose a Gaussian mixture-based expression generator that can construct a continuous and disentangled latent space, achieving more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator pretrained on a large dataset with a wide-range motion to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that can synthesize high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.
- [503] arXiv:2401.13174 (replaced) [pdf, html, other]
-
Title: Towards Complementary Knowledge Distillation for Efficient Dense Image Prediction
Comments: under submission
Subjects: Computer Vision and Pattern Recognition (cs.CV)
It has been revealed that small efficient dense image prediction (EDIP) models, trained using the knowledge distillation (KD) framework, encounter two key challenges, including maintaining boundary region completeness and preserving target region connectivity, despite their favorable capacity to recognize main object regions. In this work, we propose a complementary boundary and context distillation (BCD) method within the KD framework for EDIPs, which facilitates the targeted knowledge transfer from large accurate teacher models to compact efficient student models. Specifically, the boundary distillation component focuses on extracting explicit object-level semantic boundaries from the hierarchical feature maps of the backbone network to enhance the student model's mask quality in boundary regions. Concurrently, the context distillation component leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our proposed BCD method is specifically designed for EDIP tasks and is characterized by its simplicity and efficiency. Extensive experimental results across semantic segmentation, object detection, and instance segmentation on various representative datasets demonstrate that our method can outperform existing methods without requiring extra supervisions or incurring increased inference costs, resulting in well-defined object boundaries and smooth connecting regions.
- [504] arXiv:2401.16413 (replaced) [pdf, other]
-
Title: The geometric error is less than the pollution error when solving the high-frequency Helmholtz equation with high-order FEM on curved domains
Subjects: Numerical Analysis (math.NA)
We consider the $h$-version of the finite-element method, where accuracy is increased by decreasing the meshwidth $h$ while keeping the polynomial degree $p$ constant, applied to the Helmholtz equation. Although the question "how quickly must $h$ decrease as the wavenumber $k$ increases to maintain accuracy?" has been studied intensively since the 1990s, none of the existing rigorous wavenumber-explicit analyses take into account the approximation of the geometry. In this paper we prove that for nontrapping problems solved using straight elements the geometric error is order $kh$, which is then less than the pollution error $k(kh)^{2p}$ when $k$ is large; this fact is then illustrated in numerical experiments. More generally, we prove that, even for problems with strong trapping, using degree four (in 2-d) or degree five (in 3-d) polynomials and isoparametric elements ensures that the geometric error is smaller than the pollution error for most large wavenumbers.
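The comparison of the two error orders can be made concrete with a short calculation. This is only a sketch: it ignores all constants, uses just the two orders quoted in the abstract, and the tolerance $\varepsilon \in (0,1)$ is notation introduced here for illustration. Suppose $h$ is chosen so that the pollution error stays at $\varepsilon$:

$$
k(kh)^{2p} = \varepsilon \quad\Longleftrightarrow\quad kh = \left(\frac{\varepsilon}{k}\right)^{1/(2p)} .
$$

The geometric error for straight elements is then of order

$$
kh = \left(\frac{\varepsilon}{k}\right)^{1/(2p)} \longrightarrow 0 \quad \text{as } k \to \infty,
\qquad\text{and}\qquad
kh < \varepsilon \;\Longleftrightarrow\; k > \varepsilon^{\,1-2p} .
$$

Under these assumptions, once the wavenumber exceeds $\varepsilon^{1-2p}$, a mesh fine enough to control the pollution error also keeps the geometric error below it.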
- [505] arXiv:2402.03664 (replaced) [pdf, html, other]
-
Title: Partial Gromov-Wasserstein Metric
Comments: Published at ICLR 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The Gromov-Wasserstein (GW) distance has gained increasing interest in the machine learning community in recent years, as it allows for the comparison of measures in different metric spaces. To overcome the limitations imposed by the equal mass requirements of the classical GW problem, researchers have begun exploring its application in unbalanced settings. However, Unbalanced GW (UGW) can only be regarded as a discrepancy rather than a rigorous metric/distance between two metric measure spaces (mm-spaces). In this paper, we propose a particular case of the UGW problem, termed Partial Gromov-Wasserstein (PGW). We establish that PGW is a well-defined metric between mm-spaces and discuss its theoretical properties, including the existence of a minimizer for the PGW problem and the relationship between PGW and GW, among others. We then propose two variants of the Frank-Wolfe algorithm for solving the PGW problem and show that they are mathematically and computationally equivalent. Moreover, based on our PGW metric, we introduce the analogous concept of barycenters for mm-spaces. Finally, we validate the effectiveness of our PGW metric and related solvers in applications such as shape matching, shape retrieval, and shape interpolation, comparing them against existing baselines. Our code is available at this https URL.
- [506] arXiv:2402.06289 (replaced) [pdf, html, other]
-
Title: FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning
Comments: 14 pages, 6 figures; Accepted by CVPR 2025
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Federated Learning (FL) is a promising approach for training machine learning models on decentralized data while preserving privacy. However, privacy risks, particularly Membership Inference Attacks (MIAs), which aim to determine whether a specific data point belongs to a target client's training set, remain a significant concern. Existing methods for implementing MIAs in FL primarily analyze updates from the target client, focusing on metrics such as loss, gradient norm, and gradient difference. However, these methods fail to leverage updates from non-target clients, potentially underutilizing available information. In this paper, we first formulate a one-tailed likelihood-ratio hypothesis test based on the likelihood of updates from non-target clients. Building upon this formulation, we introduce a three-step Membership Inference Attack (MIA) method, called FedMIA, which follows the "all for one" principle, leveraging updates from all clients across multiple communication rounds to enhance MIA effectiveness. Both theoretical analysis and extensive experimental results demonstrate that FedMIA outperforms existing MIAs in both classification and generative tasks. Additionally, it can be integrated as an extension to existing methods and is robust against various defense strategies, Non-IID data, and different federated structures. Our code is available at this https URL.
- [507] arXiv:2402.08514 (replaced) [pdf, other]
-
Title: Counterfactual Influence in Markov Decision Processes
Comments: 12 pages, 6 figures
Subjects: Artificial Intelligence (cs.AI)
Our work addresses a fundamental problem in the context of counterfactual inference for Markov Decision Processes (MDPs). Given an MDP path $\tau$, this kind of inference allows us to derive counterfactual paths $\tau'$ describing what-if versions of $\tau$ obtained under different action sequences than those observed in $\tau$. However, as the counterfactual states and actions deviate from the observed ones over time, the observation $\tau$ may no longer influence the counterfactual world, meaning that the analysis is no longer tailored to the individual observation, resulting in interventional outcomes rather than counterfactual ones. Even though this issue specifically affects the popular Gumbel-max structural causal model used for MDP counterfactuals, it has remained overlooked until now. In this work, we introduce a formal characterisation of influence based on comparing counterfactual and interventional distributions. We devise an algorithm to construct counterfactual models that automatically satisfy influence constraints. Leveraging such models, we derive counterfactual policies that are not just optimal for a given reward structure but also remain tailored to the observed path. Even though there is an unavoidable trade-off between policy optimality and strength of influence constraints, our experiments demonstrate that it is possible to derive (near-)optimal policies while remaining under the influence of the observation.
- [508] arXiv:2402.09117 (replaced) [pdf, html, other]
-
Title: Deterministic identification over channels with finite output: a dimensional perspective on superlinear rates
Comments: 24 pages, 5 figures. This work has been accepted for publication in IEEE Transactions on Information Theory, and a preliminary version was presented at ISIT 2024, Athens (Greece)
Subjects: Information Theory (cs.IT); Quantum Physics (quant-ph)
Following initial work by JaJa, Ahlswede and Cai, and inspired by a recent renewed surge in interest in deterministic identification (DI) via noisy channels, we consider the problem in its generality for memoryless channels with finite output, but arbitrary input alphabets. Such a channel is essentially given by its output distributions as a subset in the probability simplex. Our main findings are that the maximum length of messages thus identifiable scales superlinearly as $R\,n\log n$ with the block length $n$, and that the optimal rate $R$ is bounded in terms of the covering (aka Minkowski, or Kolmogorov, or entropy) dimension $d$ of a certain algebraic transformation of the output set: $\frac14 d \leq R \leq \frac12 d$. Remarkably, both the lower and upper Minkowski dimensions play a role in this result. Along the way, we present a "Hypothesis Testing Lemma" showing that it is sufficient to ensure pairwise reliable distinguishability of the output distributions to construct a DI code. Although we do not know the exact capacity formula, we can conclude that the DI capacity exhibits superactivation: there exist channels whose capacities individually are zero, but whose product has positive capacity. We also generalise these results to classical-quantum channels with finite-dimensional output quantum system, in particular to quantum channels on finite-dimensional quantum systems under the constraint that the identification code can only use tensor product inputs.
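To see what the superlinear scaling means for the number of messages, the following short calculation may help. It is a sketch to leading order in the exponent only, and $N$ (the number of identifiable messages at block length $n$) is notation introduced here, not taken from the abstract. If the message length scales as $\log N \approx R\,n\log n$, then

$$
N \approx n^{Rn},
\qquad\text{and with}\qquad
\tfrac14 d \le R \le \tfrac12 d:
\qquad
n^{dn/4} \;\lesssim\; N \;\lesssim\; n^{dn/2} .
$$

For comparison, a code whose message length grows only linearly in $n$ supports a number of messages that is exponential in $n$, which is smaller than $n^{Rn}$ for large $n$.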
- [509] arXiv:2402.11317 (replaced) [pdf, html, other]
-
Title: Debiased Offline Representation Learning for Fast Online Adaptation in Non-stationary Dynamics
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate between changes in the environment dynamics and shifts in the behavior policy, often leading to context misassociations. To address this issue, we introduce a novel approach called Debiased Offline Representation for fast online Adaptation (DORA). DORA incorporates an information bottleneck principle that maximizes mutual information between the dynamics encoding and the environmental data, while minimizing mutual information between the dynamics encoding and the actions of the behavior policy. We present a practical implementation of DORA, leveraging tractable bounds of the information bottleneck principle. Our experimental evaluation across six benchmark MuJoCo tasks with variable parameters demonstrates that DORA not only achieves a more precise dynamics encoding but also significantly outperforms existing baselines.
- [510] arXiv:2402.13255 (replaced) [pdf, other]
-
Title: How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: a Survey
Fabio Tosi, Youmin Zhang, Ziren Gong, Erik Sandström, Stefano Mattoccia, Martin R. Oswald, Matteo Poggi
Comments: Updated to November 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Over the past two decades, research in the field of Simultaneous Localization and Mapping (SLAM) has undergone a significant evolution, highlighting its critical role in enabling autonomous exploration of unknown environments. This evolution ranges from hand-crafted methods, through the era of deep learning, to more recent developments focused on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) representations. Recognizing the growing body of research and the absence of a comprehensive survey on the topic, this paper aims to provide the first comprehensive overview of SLAM progress through the lens of the latest advancements in radiance fields. It sheds light on the background, evolutionary path, inherent strengths and limitations, and serves as a fundamental reference to highlight the dynamic progress and specific challenges.
- [511] arXiv:2402.13901 (replaced) [pdf, other]
-
Title: Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\mathcal O(1/\epsilon^2)$ rate of vanilla diffusion models, where $\epsilon$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed to all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie's formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.
- [512] arXiv:2403.04635 (replaced) [pdf, html, other]
-
Title: Virtuoso: Enabling Fast and Accurate Virtual Memory Research via an Imitation-based Operating System Simulation Methodology
Konstantinos Kanellopoulos, Konstantinos Sgouras, F. Nisa Bostanci, Andreas Kosmas Kakolyris, Berkin Kerim Konar, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Nandita Vijaykumar, Onur Mutlu
Subjects: Hardware Architecture (cs.AR); Operating Systems (cs.OS)
The unprecedented growth in data demand from emerging applications has turned virtual memory (VM) into a major performance bottleneck. Researchers explore new hardware/OS co-designs to optimize VM across diverse applications and systems. To evaluate such designs, researchers rely on various simulation methodologies to model VM. However, current simulation tools (i) either lack the desired accuracy in modeling VM's software components or (ii) are too slow and complex to prototype and evaluate schemes that span across the hardware/software boundary.
We introduce Virtuoso, a new simulation framework that enables quick and accurate prototyping and evaluation of the software and hardware components of the VM subsystem. The key idea of Virtuoso is to employ a lightweight userspace OS kernel, called MimicOS, that (i) accelerates simulation time by imitating only the desired kernel functionalities, (ii) facilitates the development of new OS routines that imitate real ones, using an accessible high-level programming interface, and (iii) enables accurate and flexible evaluation of the application- and system-level implications of VM after integrating Virtuoso into a desired architectural simulator.
We integrate Virtuoso into five diverse architectural simulators, each specializing in different aspects of system design, and heavily enrich it with multiple state-of-the-art VM schemes. Our validation shows that Virtuoso ported on top of Sniper, a state-of-the-art microarchitectural simulator, models the memory management unit of a real high-end server-grade CPU and the page fault latency of a real Linux kernel with high accuracy. Consequently, Virtuoso models the IPC performance of a real high-end server-grade CPU with 21% higher accuracy than the baseline version of Sniper. The source code of Virtuoso is freely available at this https URL.
- [513] arXiv:2403.05944 (replaced) [pdf, html, other]
-
Title: Model-Predictive Trajectory Generation for Aerial Search and Coverage
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper introduces a trajectory planning algorithm for search and coverage missions with an Unmanned Aerial Vehicle (UAV) based on an uncertainty map that represents prior knowledge of the target region, modeled by a Gaussian Mixture Model (GMM). The trajectory planning problem is formulated as an Optimal Control Problem (OCP), which aims to maximize the uncertainty reduction within a specified mission duration. However, this results in an intractable OCP whose objective functional cannot be expressed in closed form. To address this, we propose a Model Predictive Control (MPC) algorithm based on a relaxed formulation of the objective function to approximate the optimal solutions. This relaxation promotes efficient map exploration by penalizing overlaps in the UAV's visibility regions along the trajectory. The algorithm can produce efficient and smooth trajectories, and it can be efficiently implemented using standard Nonlinear Programming solvers, being suitable for real-time planning. Unlike traditional methods, which often rely on discretizing the mission space and using complex mixed-integer formulations, our approach is computationally efficient and easier to implement. The MPC algorithm is initially assessed in MATLAB, followed by Gazebo simulations and actual experimental tests conducted in an outdoor environment. The results demonstrate that the proposed strategy can generate efficient and smooth trajectories for search and coverage missions.
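To illustrate what a GMM-based uncertainty map looks like in practice, here is a minimal sketch that evaluates a two-dimensional mixture on a grid. It assumes isotropic components purely for brevity; the abstract does not specify the covariance structure, and the function names, component values, and grid are hypothetical.

```python
import math


def gmm_density(x, y, components):
    """Evaluate a 2D isotropic Gaussian mixture at the point (x, y).

    components -- list of (weight, mean_x, mean_y, sigma) tuples;
                  the isotropic covariance is an assumption for brevity.
    """
    total = 0.0
    for weight, mean_x, mean_y, sigma in components:
        sq_dist = (x - mean_x) ** 2 + (y - mean_y) ** 2
        norm = 2.0 * math.pi * sigma * sigma
        total += weight * math.exp(-sq_dist / (2.0 * sigma * sigma)) / norm
    return total


def uncertainty_grid(components, xs, ys):
    """Sample the mixture on a grid; rows follow ys, columns follow xs."""
    return [[gmm_density(x, y, components) for x in xs] for y in ys]


# Hypothetical prior: two regions where the target is likely to be.
prior = [(0.7, 2.0, 2.0, 1.0), (0.3, 8.0, 5.0, 2.0)]
grid = uncertainty_grid(prior, xs=range(0, 11), ys=range(0, 8))
```

A planner of the kind described above would then favor trajectories whose visibility regions cover the high-density cells of such a map.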
- [514] arXiv:2403.12922 (replaced) [pdf, html, other]
-
Title: Contextual AD Narration with Interleaved Multimodal Sequence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The Audio Description (AD) task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, like movies. With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the storyline of the movie. To achieve this goal, we propose to leverage pre-trained foundation models through a simple and unified framework, termed Uni-AD, that generates ADs with an interleaved multimodal sequence as input. To enhance the alignment of features across various modalities with finer granularity, we introduce a simple and lightweight module that maps video features into the textual feature space. Moreover, we also propose a character-refinement module to provide more precise information by identifying the main characters who play more significant roles in the video context. With these unique designs, we further incorporate contextual information and a contrastive loss into our architecture to generate smoother and more contextually appropriate ADs. Experiments on multiple AD datasets show that Uni-AD performs well on AD generation, which demonstrates the effectiveness of our approach. Our code is available at: this https URL.
- [515] arXiv:2403.13770 (replaced) [pdf, html, other]
-
Title: A convergent adaptive finite element stochastic Galerkin method based on multilevel expansions of random fields
Comments: 27 pages, 4 figures
Subjects: Numerical Analysis (math.NA)
The subject of this work is an adaptive stochastic Galerkin finite element method for parametric or random elliptic partial differential equations, which generates sparse product polynomial expansions with respect to the parametric variables of solutions. For the corresponding spatial approximations, an independently refined finite element mesh is used for each polynomial coefficient. The method relies on multilevel expansions of input random fields and achieves error reduction with uniform rate. In particular, the saturation property for the refinement process is ensured by the algorithm. The results are illustrated by numerical experiments, including cases with random fields of low regularity.
- [516] arXiv:2403.16711 (replaced) [pdf, html, other]
-
Title: Predictable Interval MDPs through Entropy Regularization
Comments: This paper has been presented at the 2024 63rd IEEE Conference on Decision and Control (CDC)
Subjects: Systems and Control (eess.SY)
Regularization of control policies using entropy can be instrumental in adjusting predictability of real-world systems. Applications benefiting from such approaches range from, e.g., cybersecurity, which aims at maximal unpredictability, to human-robot interaction, where predictable behavior is highly desirable. In this paper, we consider entropy regularization for interval Markov decision processes (IMDPs). IMDPs are uncertain MDPs, where transition probabilities are only known to belong to intervals. Lately, IMDPs have gained significant popularity in the context of abstracting stochastic systems for control design. In this work, we address robust minimization of the linear combination of entropy and a standard cumulative cost in IMDPs, thereby establishing a trade-off between optimality and predictability. We show that optimal deterministic policies exist, and devise a value-iteration algorithm to compute them. The algorithm solves a number of convex programs at each step. Finally, through an illustrative example we show the benefits of penalizing entropy in IMDPs.
- [517] arXiv:2403.18886 (replaced) [pdf, html, other]
-
Title: Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
Comments: Code available at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Continual learning (CL) aims to continually accumulate knowledge from a non-stationary data stream without catastrophic forgetting of learned knowledge, requiring a balance between stability and adaptability. Relying on the generalizable representation in pre-trained models (PTMs), PTM-based CL methods perform effective continual adaptation on downstream tasks by adding learnable adapters or prompts upon the frozen PTMs. However, many existing PTM-based CL methods use restricted adaptation on a fixed set of these modules to avoid forgetting, and thus suffer from limited CL ability. Periodically adding task-specific modules results in a linear model growth rate and impaired knowledge reuse. We propose Self-Expansion of pre-trained models with Modularized Adaptation (SEMA), a novel approach to enhance the control of stability-plasticity balance in PTM-based CL. SEMA automatically decides to reuse or add adapter modules on demand in CL, depending on whether a significant distribution shift that cannot be handled is detected at different representation levels. We design a modular adapter consisting of a functional adapter and a representation descriptor. The representation descriptors are trained as distribution shift indicators and used to trigger self-expansion signals. To better compose the adapters, an expandable weighting router is learned jointly to mix the adapter outputs. SEMA enables better knowledge reuse and a sub-linear expansion rate. Extensive experiments demonstrate the effectiveness of the proposed self-expansion method, achieving state-of-the-art performance compared to PTM-based CL methods without memory rehearsal. Code is available at this https URL.
- [518] arXiv:2403.19647 (replaced) [pdf, html, other]
-
Title: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Journal-ref: International Conference on Learning Representations, 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
- [519] arXiv:2404.01901 (replaced) [pdf, html, other]
-
Title: Learning-based model augmentation with LFRs
Comments: Accepted for ECC 2025
Subjects: Systems and Control (eess.SY)
Nonlinear system identification (NL-SI) has proven to be effective in obtaining accurate models for highly complex systems. In particular, recent encoder-based methods for artificial neural network state-space (ANN-SS) models have achieved state-of-the-art performance on various benchmarks, while offering consistency and computational efficiency. Inclusion of prior knowledge of the system can be exploited to increase (i) estimation speed, (ii) accuracy, and (iii) interpretability of the resulting models. This paper proposes an encoder-based model augmentation method that incorporates prior knowledge from first-principles (FP) models. We introduce a novel linear-fractional-representation (LFR) model structure that allows for the unified representation of various augmentation structures, including the ones that are commonly used in the literature, and an identification algorithm for estimating the proposed structure together with appropriate initialization methods. The performance and generalization capabilities of the proposed method are demonstrated in a hardening mass-spring-damper simulation.
- [520] arXiv:2404.06511 (replaced) [pdf, html, other]
-
Title: MoReVQA: Exploring Modular Reasoning Models for Video Question AnsweringComments: CVPR 2024; updated NExT-GQA results in AppendixSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
- [521] arXiv:2404.07977 (replaced) [pdf, html, other]
-
Title: Gaga: Group Any Gaussians via 3D-aware Memory BankComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, particularly beneficial for sparsely sampled images, ensuring precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.
- [522] arXiv:2404.14221 (replaced) [pdf, html, other]
-
Title: Sequential Outlier Hypothesis Testing under Universality ConstraintsComments: v2 was published in ITW 2024, v3 is the full version with results for both cases of known and unknown number of outliers, and v4 presents the results for the known number of outliersSubjects: Information Theory (cs.IT)
We revisit sequential outlier hypothesis testing and derive bounds on achievable exponents when both the nominal and anomalous distributions are unknown. The task of outlier hypothesis testing is to identify the set of outliers that are generated from an anomalous distribution among all observed sequences, where the remaining majority are generated from a nominal distribution. In the sequential setting, one obtains a sample from each sequence per unit time until a reliable decision can be made. For the case with exactly one outlier, our exponent bounds are tight, providing an exact large deviations characterization of sequential tests and strengthening a previous result of Li, Nitinawarat and Veeravalli (2017). In particular, the average sample size of our sequential test is bounded universally under any pair of nominal and anomalous distributions and our sequential test achieves a larger Bayesian exponent than the fixed-length test, which could not be guaranteed by the sequential test of Li, Nitinawarat and Veeravalli (2017). For the case with at most one outlier, we propose a threshold-based test that has bounded expected stopping time under mild conditions and we bound the error exponents under each non-null and the null hypotheses. Our sequential test resolves the error exponents tradeoff for the fixed-length test of Zhou, Wei and Hero (TIT 2022). Finally, with a further step towards practical applications, we generalize our results to the cases of multiple outliers and show that there is a penalty in the error exponents when the number of outliers is unknown.
- [523] arXiv:2404.14963 (replaced) [pdf, html, other]
-
Title: Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word ProblemsComments: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: { https://doi.org/10.1007/s11704-025-41102-z }Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. However, CoT still falls short in dealing with complex math word problems, as it usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. Prior studies involve addressing the calculation errors and step-missing errors, but neglect the semantic misunderstanding errors, which is the major factor limiting the reasoning performance of LLMs. To this end, we propose a simple-yet-effective method, namely Deeply Understanding the Problems (DUP), to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors. The core of our method is to encourage the LLMs to deeply understand the problems and extract the key problem-solving information used for better reasoning. Extensive experiments on 10 diverse reasoning benchmarks show that our DUP method consistently outperforms the other counterparts by a large margin. More encouragingly, DUP achieves a new SOTA result on the GSM8K benchmark, with an accuracy of 97.1% under the zero-shot setting.
- [524] arXiv:2404.16224 (replaced) [pdf, html, other]
-
Title: Tractable Conjunctive Queries over Static and Dynamic RelationsComments: Polished versionSubjects: Databases (cs.DB)
We investigate the evaluation of conjunctive queries over static and dynamic relations. While static relations are given as input and do not change, dynamic relations are subject to inserts and deletes.
We characterise syntactically three classes of queries that admit constant update time and constant enumeration delay. We call such queries tractable. Depending on the class, the preprocessing time is linear, polynomial, or exponential (under data complexity, so the query size is constant).
To decide whether a query is tractable, it does not suffice to analyse separately the sub-queries over the static relations and over the dynamic relations, respectively. Instead, we need to take the interaction between the static and the dynamic relations into account. Even when the sub-query over the dynamic relations is not tractable, the overall query can become tractable if the dynamic relations are sufficiently constrained by the static ones.
- [525] arXiv:2405.00205 (replaced) [pdf, html, other]
-
Title: A Logic for Reasoning About Aggregate-Combine Graph Neural NetworksComments: arXiv admin note: text overlap with arXiv:2307.05150Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
We propose a modal logic in which counting modalities appear in linear inequalities. We show that each formula can be transformed into an equivalent graph neural network (GNN). We also show that a broad class of GNNs can be transformed efficiently into a formula, thus significantly improving upon the literature about the logical expressiveness of GNNs. We also show that the satisfiability problem is PSPACE-complete. These results bring together the promise of using standard logical methods for reasoning about GNNs and their properties, particularly in applications such as GNN querying, equivalence checking, etc. We prove that such natural problems can be solved in polynomial space.
- [526] arXiv:2405.00393 (replaced) [pdf, html, other]
-
Title: Unleashing the Power of LLM to Infer State Machine from the Protocol ImplementationHaiyang Wei, Ligeng Chen, Zhengjie Du, Yuhan Wu, Haohui Huang, Yue Liu, Guang Cheng, Fengyuan Xu, Linzhang Wang, Bing MaoSubjects: Cryptography and Security (cs.CR)
State machines are essential for enhancing protocol analysis to identify vulnerabilities. However, inferring state machines from network protocol implementations is challenging due to complex code syntax and semantics. Traditional dynamic analysis methods often miss critical state transitions due to limited coverage, while static analysis faces path explosion issues. To overcome these challenges, we introduce a novel state machine inference approach utilizing Large Language Models (LLMs), named ProtocolGPT. This method employs retrieval augmented generation technology to enhance a pre-trained model with specific knowledge from protocol implementations. Through effective prompt engineering, we accurately identify and infer state machines. To the best of our knowledge, our approach represents the first state machine inference that leverages the source code of protocol implementations. Our evaluation of six protocol implementations shows that our method achieves a precision of over 90%, outperforming the baselines by more than 30%. Furthermore, integrating our approach with protocol fuzzing improves coverage by more than 20% and uncovers two 0-day vulnerabilities compared to baseline methods.
- [527] arXiv:2405.01105 (replaced) [pdf, html, other]
-
Title: Image segmentation of treated and untreated tumor spheroids by Fully Convolutional NetworksMatthias Streller, Soňa Michlíková, Willy Ciecior, Katharina Lönnecke, Leoni A. Kunz-Schughart, Steffen Lange, Anja Voss-BöhmeComments: 30 pages, 23 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
Multicellular tumor spheroids (MCTS) are advanced cell culture systems for assessing the impact of combinatorial radio(chemo)therapy. They exhibit therapeutically relevant in-vivo-like characteristics from 3D cell-cell and cell-matrix interactions to radial pathophysiological gradients related to proliferative activity and nutrient/oxygen supply, altering cellular radioresponse. State-of-the-art assays quantify long-term curative endpoints based on collected brightfield image time series from large treated spheroid populations per irradiation dose and treatment arm. Here, spheroid control probabilities are documented analogous to in-vivo tumor control probabilities based on Kaplan-Meier curves. These analyses require laborious spheroid segmentation of up to 100,000 images per treatment arm to extract relevant structural information from the images, e.g., diameter, area, volume and circularity. While several image analysis algorithms are available for spheroid segmentation, they all focus on compact MCTS with a clearly distinguishable outer rim throughout growth. However, treated MCTS may partly be detached and destroyed and are usually obscured by dead cell debris. We successfully train two Fully Convolutional Networks, UNet and HRNet, and optimize their hyperparameters to develop an automatic segmentation for both untreated and treated MCTS. We systematically validate the automatic segmentation on larger, independent data sets of spheroids derived from two human head-and-neck cancer cell lines. We find an excellent overlap between manual and automatic segmentation for most images, quantified by Jaccard indices at around 90%. For images with smaller overlap of the segmentations, we demonstrate that this error is comparable to the variations across segmentations from different biological experts, suggesting that these images represent biologically unclear or ambiguous cases.
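The overlap measure named in the abstract above, the Jaccard index, is the ratio of intersection to union of two segmentation masks. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `jaccard_index` and the toy masks are illustrative assumptions.

```python
def jaccard_index(mask_a, mask_b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two equally sized binary masks.

    Masks are lists of rows; any truthy pixel counts as foreground.
    """
    same_shape = len(mask_a) == len(mask_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Convention chosen here: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


# Hypothetical toy masks standing in for a manual and an automatic segmentation.
manual = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
automatic = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
score = jaccard_index(manual, automatic)  # 4 shared pixels / 5 pixels in the union
```

In this toy case the two masks differ by a single pixel, so the score is 4/5; a value around 0.9, as reported in the abstract, corresponds to masks whose shared area is about 90% of their combined area.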
- [528] arXiv:2405.04118 (replaced) [pdf, html, other]
-
Title: Policy Learning with a Language BottleneckComments: 21 pages, 15 figures, updated with robot manipulation taskSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance, but often lack human-like generalization, interpretability, and inter-operability with human users. Inspired by the rich interactions between language and decision-making in humans, we introduce Policy Learning with a Language Bottleneck (PLLB), a framework enabling AI agents to generate linguistic rules that capture the high-level strategies underlying rewarding behaviors. PLLB alternates between a *rule generation* step guided by language models, and an *update* step where agents learn new policies guided by rules, even when a rule is insufficient to describe an entire complex policy. Across five diverse tasks, including a two-player signaling game, maze navigation, image reconstruction, and robot grasp planning, we show that PLLB agents are not only able to learn more interpretable and generalizable behaviors, but can also share the learned rules with human users, enabling more effective human-AI coordination. We provide source code for our experiments at this https URL .
- [529] arXiv:2405.07538 (replaced) [pdf, other]
-
Title: Mirroring the Parking Target: An Optimal-Control-Based Parking Motion Planner with Strengthened Parking Reliability and Faster Parking CompletionComments: IEEE Transactions on Intelligent Transportation Systems (2024)Subjects: Robotics (cs.RO)
Automated Parking Assist (APA) systems are now facing great challenges of low adoption in applications, due to users' concerns about parking capability, reliability, and completion efficiency. To upgrade the conventional APA planners and enhance user acceptance, this research proposes an optimal-control-based parking motion planner. Its highlight lies in its control logic: planning trajectories by mirroring the parking target. This method enables: i) parking capability in narrow spaces; ii) better parking reliability by expanding the Operation Design Domain (ODD); iii) faster completion of the parking process; iv) enhanced computational efficiency; v) universality across all types of parking. A comprehensive evaluation is conducted. Results demonstrate that the proposed planner enhances parking success rate by 40.6%, improves parking completion efficiency by 18.0%, and expands ODD by 86.1%. It shows its superiority in difficult parking cases, such as the parallel parking scenario and narrow spaces. Moreover, the average computation time of the proposed planner is 74 milliseconds. Results indicate that the proposed planner is ready for real-time commercial applications.
- [530] arXiv:2405.07556 (replaced) [pdf, other]
-
Title: Safety-Aware Human-Lead Vehicle Platooning by Proactively Reacting to Uncertain Human BehavingJournal-ref: Transportation Research Part C: Emerging Technologies, 170, 104941 (2025)Subjects: Robotics (cs.RO)
Human-Lead Cooperative Adaptive Cruise Control (HL-CACC) is regarded as a promising vehicle platooning technology in real-world implementation. By utilizing a Human-driven Vehicle (HV) as the platoon leader, HL-CACC reduces the cost and enhances the reliability of perception and decision-making. However, state-of-the-art HL-CACC technology still has a great limitation on driving safety because it does not consider the leading human driver's uncertain behavior. In this study, an HL-CACC controller is designed based on Stochastic Model Predictive Control (SMPC). It is able to predict the driving intention of the leading Connected Human-Driven Vehicle (CHV). The proposed controller has the following features: i) enhanced perceived safety in oscillating traffic; ii) guaranteed safety against hard brakes; iii) computational efficiency for real-time implementation. The proposed controller is evaluated on a PreScan&Simulink simulation platform. Real vehicle trajectory data is collected for the calibration of the simulation. Results reveal that the proposed controller: i) improves perceived safety by 19.17% in oscillating traffic; ii) enhances actual safety by 7.76% against hard brakes; iii) is confirmed with string stability. The computation time is approximately 3.2 milliseconds when running on a laptop equipped with an Intel i5-13500H CPU. This indicates the proposed controller is ready for real-time implementation.
- [531] arXiv:2405.09400 (replaced) [pdf, html, other]
-
Title: Flow updates for domain decomposition of entropic optimal transportComments: RevisionSubjects: Numerical Analysis (math.NA)
Domain decomposition has been shown to be a computationally efficient distributed method for solving large scale entropic optimal transport problems. However, a naive implementation of the algorithm can freeze in the limit of very fine partition cells (i.e. it asymptotically becomes stationary and does not find the global minimizer), since information can only travel slowly between cells. In practice this can be avoided by a coarse-to-fine multiscale scheme. In this article we introduce flow updates as an alternative approach. Flow updates can be interpreted as a variant of the celebrated algorithm by Angenent, Haker, and Tannenbaum, and can be combined canonically with domain decomposition. We prove convergence to the global minimizer and provide a formal discussion of its continuity limit. We give a numerical comparison with naive and multiscale domain decomposition, and show that the flow updates prevent freezing in the regime of very many cells. While the multiscale scheme is observed to be faster than the hybrid approach in general, the latter could be a viable alternative in cases where a good initial coupling is available. Our numerical experiments are based on a novel GPU implementation of domain decomposition that we describe in the appendix.
- [532] arXiv:2405.12802 (replaced) [pdf, html, other]
-
Title: Stochastic Inference of Plate Bending from Heterogeneous Data: Physics-informed Gaussian Processes via Kirchhoff-Love TheoryComments: 25 pages, 11 figuresJournal-ref: ASCE J. Eng. Mech. 151(4) (2025) 04025005Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Advancements in machine learning and an abundance of structural monitoring data have inspired the integration of mechanical models with probabilistic models to identify a structure's state and quantify the uncertainty of its physical parameters and response. In this paper, we propose an inference methodology for classical Kirchhoff-Love plates via physics-informed Gaussian Processes (GP). A probabilistic model is formulated as a multi-output GP by placing a GP prior on the deflection and deriving the covariance function using the linear differential operators of the plate governing equations. The posteriors of the flexural rigidity, hyperparameters, and plate response are inferred in a Bayesian manner using Markov chain Monte Carlo (MCMC) sampling from noisy measurements. We demonstrate the applicability with two examples: a simply supported plate subjected to a sinusoidal load and a fixed plate subjected to a uniform load. The results illustrate how the proposed methodology can be employed to perform stochastic inference for plate rigidity and physical quantities by integrating measurements from various sensor types and qualities. Potential applications of the presented methodology are in structural health monitoring and uncertainty quantification of plate-like structures.
- [533] arXiv:2405.13088 (replaced) [pdf, html, other]
-
Title: Combining Relevance and Magnitude for Resource-Aware DNN PruningSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Pruning neural networks, i.e., removing some of their parameters whilst retaining their accuracy, is one of the main ways to reduce the latency of a machine learning pipeline, especially in resource- and/or bandwidth-constrained scenarios. In this context, the pruning technique, i.e., how to choose the parameters to remove, is critical to the system performance. In this paper, we propose a novel pruning approach, called FlexRel and predicated upon combining training-time and inference-time information, namely, parameter magnitude and relevance, in order to improve the resulting accuracy whilst saving both computational resources and bandwidth. Our performance evaluation shows that FlexRel is able to achieve higher pruning factors, saving over 35% bandwidth for typical accuracy targets.
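The idea of ranking parameters by both magnitude and relevance can be sketched as follows. The abstract does not state how FlexRel combines the two signals, so the convex combination controlled by `alpha`, the max-normalization, and all names below are assumptions made purely for illustration.

```python
def combined_scores(magnitudes, relevances, alpha=0.5):
    """Blend normalized |magnitude| and |relevance| into one importance score.

    alpha = 1.0 reduces to plain magnitude pruning; alpha = 0.0 uses relevance only.
    The blending rule is a hypothetical choice, not taken from the paper.
    """
    def normalize(values):
        top = max(values)
        return [v / top if top > 0 else 0.0 for v in values]

    mags = normalize([abs(m) for m in magnitudes])
    rels = normalize([abs(r) for r in relevances])
    return [alpha * m + (1 - alpha) * r for m, r in zip(mags, rels)]


def prune_mask(scores, prune_fraction):
    """Return a keep-mask that drops the lowest-scoring fraction of parameters."""
    n_prune = int(len(scores) * prune_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(scores))]


# Toy values: the second weight is small in magnitude but highly relevant.
weights = [0.9, -0.05, 0.4, 0.02]
relevance = [0.1, 0.9, 0.2, 0.01]

mixed_mask = prune_mask(combined_scores(weights, relevance, alpha=0.5), 0.5)
magnitude_only_mask = prune_mask(combined_scores(weights, relevance, alpha=1.0), 0.5)
```

With these toy numbers, magnitude-only pruning removes the small but relevant second weight, whereas the blended score keeps it and removes the third weight instead. That difference is the kind of effect a combined criterion is meant to capture.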
- [534] arXiv:2405.15474 (replaced) [pdf, other]
-
Title: Unlearning during Learning: An Efficient Federated Machine Unlearning MethodComments: Accepted by IJCAI 2024Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
In recent years, Federated Learning (FL) has garnered significant attention as a distributed machine learning paradigm. To facilitate the implementation of the right to be forgotten, the concept of federated machine unlearning (FMU) has also emerged. However, current FMU approaches often involve additional time-consuming steps and may not offer comprehensive unlearning capabilities, which renders them less practical in real FL scenarios. In this paper, we introduce FedAU, an innovative and efficient FMU framework aimed at overcoming these limitations. Specifically, FedAU incorporates a lightweight auxiliary unlearning module into the learning process and employs a straightforward linear operation to facilitate unlearning. This approach eliminates the requirement for extra time-consuming steps, rendering it well-suited for FL. Furthermore, FedAU exhibits remarkable versatility. It not only enables multiple clients to carry out unlearning tasks concurrently but also supports unlearning at various levels of granularity, including individual data samples, specific classes, and even at the client level. We conducted extensive experiments on MNIST, CIFAR10, and CIFAR100 datasets to evaluate the performance of FedAU. The results demonstrate that FedAU effectively achieves the desired unlearning effect while maintaining model accuracy. Our code is available at this https URL.
- [535] arXiv:2405.15668 (replaced) [pdf, html, other]
-
Title: What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. Using multimodal LLMs, we generate comprehensive textual representations from input images. These textual representations are then utilized to generate fixed-dimensional features in a cross-modal embedding space. Subsequently, these features are fused together to perform zero-shot classification using a linear classifier. Our method does not require prompt engineering for each dataset; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several datasets and our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets. On average, for ten benchmarks, our method achieved an accuracy gain of 6.2 percentage points, with an increase of 6.8 percentage points on the ImageNet dataset, compared to prior methods re-evaluated with the same setup. Our findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.
- [536] arXiv:2405.16439 (replaced) [pdf, html, other]
-
Title: Multi-Agent Inverse Reinforcement Learning in Real World Unstructured Pedestrian CrowdsSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Social robot navigation in crowded public spaces such as university campuses, restaurants, grocery stores, and hospitals, is an increasingly important area of research. One of the core strategies for achieving this goal is to understand humans' intent--underlying psychological factors that govern their motion--by learning their reward functions, typically via inverse reinforcement learning (IRL). Despite significant progress in IRL, learning reward functions of multiple agents simultaneously in dense unstructured pedestrian crowds has remained intractable due to the nature of the tightly coupled social interactions that occur in these scenarios, e.g., passing, intersections, swerving, weaving, etc. In this paper, we present a new multi-agent maximum entropy inverse reinforcement learning algorithm for real world unstructured pedestrian crowds. Key to our approach is a simple but effective mathematical trick, which we call the tractability-rationality trade-off trick, that achieves tractability at the cost of a slight reduction in accuracy. We compare our approach to the classical single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction methods on several datasets including the ETH, UCY, SCAND, JRDB, and a new dataset, called Speedway, collected at a busy intersection on a University campus focusing on dense, complex agent interactions. Our key findings show that, on the dense Speedway dataset, our approach ranks 1st among top 7 baselines with >2X improvement over single-agent IRL, and is competitive with state-of-the-art large transformer-based encoder-decoder models on sparser datasets such as ETH/UCY (ranks 3rd among top 7 baselines).
- [537] arXiv:2405.17712 (replaced) [pdf, html, other]
-
Title: A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language ModelsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
This paper presents a novel approach named Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM) for handling missing data in tabular datasets. Instead of relying on traditional numerical estimations, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs' strengths, allowing large LMs to generate these descriptors and small LMs to be fine-tuned on the enriched datasets for enhanced downstream task performance. Our evaluations demonstrate CRILM's superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
- [538] arXiv:2406.02166 (replaced) [pdf, html, other]
-
Title: Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic SupervisionComments: Accepted by IEEE-TASLPSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we release the code, models and data for the entire pipeline of Whistle at this https URL.
- [539] arXiv:2406.08756 (replaced) [pdf, html, other]
-
Title: Optimizing Large Model Training through Overlapped Activation RecomputationPing Chen, Wenjie Zhang, Shuibing He, Weijian Chen, Siling Yang, Kexin Huang, Yanlong Yin, Xuan Zhan, Yingjie Gu, Zhuwei Peng, Yi Zheng, Zhefeng Wang, Gang ChenComments: 13 pagesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
- [540] arXiv:2406.09631 (replaced) [pdf, html, other]
-
Title: Towards Optimizing a Convex Cover of Collision-Free Space for Trajectory GenerationSubjects: Robotics (cs.RO)
We propose an online iterative algorithm to optimize a convex cover to under-approximate the free space for autonomous navigation to delineate Safe Flight Corridors (SFC). The convex cover consists of a set of polytopes such that the union of the polytopes represents obstacle-free space, allowing us to find trajectories for robots that lie within the convex cover. In order to find the SFC that facilitates trajectory optimization, we iteratively find overlapping polytopes of maximum volumes that include specified waypoints initialized by a geometric or kinematic planner. Constraints at waypoints appear in two alternating stages of a joint optimization problem, which is solved by a novel heuristic-based iterative algorithm with partially distributed variables. We validate the effectiveness of our proposed algorithm using a range of parameterized environments and show its applications for two-stage motion planning.
- [541] arXiv:2406.12257 (replaced) [pdf, html, other]
-
Title: CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language ModelsYuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha PoovendranComments: This paper is presented at EMNLP 2024Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CLEANGEN against five SOTA backdoor attacks. Our results show that CLEANGEN achieves lower attack success rates (ASR) compared to five SOTA baseline defenses for all five backdoor attacks. Moreover, LLMs deploying CLEANGEN maintain helpfulness in their responses when serving benign user queries with minimal added computational overhead.
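The token-probability comparison described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `filter_tokens`, the threshold value, and the precomputed per-step distributions are all assumptions made for illustration, and real decoding would query both models autoregressively on the shared prefix.

```python
def filter_tokens(target_steps, reference_steps, threshold=4.0):
    """Greedy decoding with a per-token suspicion check (illustrative only).

    Each step is a dict mapping token -> probability. `target_steps` comes
    from the possibly backdoored model, `reference_steps` from another model.
    `threshold` is a hypothetical cutoff on the probability ratio.
    """
    output = []
    for target_probs, reference_probs in zip(target_steps, reference_steps):
        # Token the target model prefers at this step.
        candidate = max(target_probs, key=target_probs.get)
        # How much more likely the target model finds it than the reference.
        reference_p = max(reference_probs.get(candidate, 0.0), 1e-9)
        ratio = target_probs[candidate] / reference_p
        if ratio > threshold:
            # Suspicious token: fall back to the reference model's choice.
            candidate = max(reference_probs, key=reference_probs.get)
        output.append(candidate)
    return output


# Made-up distributions: the second step mimics a backdoor-favoured token.
target_steps = [
    {"visit": 0.6, "see": 0.4},
    {"evil.example": 0.9, "docs": 0.1},
]
reference_steps = [
    {"visit": 0.5, "see": 0.5},
    {"evil.example": 0.01, "docs": 0.99},
]
print(filter_tokens(target_steps, reference_steps))
```

In this toy run the first token is kept because both models roughly agree on it, while the second is replaced because only the target model strongly favours it.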
- [542] arXiv:2406.12831 (replaced) [pdf, html, other]
-
Title: VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
Comments: 18 pages, 16 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Video editing serves as a fundamental pillar of digital media, spanning applications in entertainment, education, and professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, we design test-time editing adaptation, which adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation that recursively gathers consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.
- [543] arXiv:2406.14263 (replaced) [pdf, html, other]
-
Title: Scalable and RISC-V Programmable Near-Memory Computing Architectures for Edge Nodes
Authors: Michele Caon (1), Clément Choné (2), Pasquale Davide Schiavone (2), Alexandre Levisse (2), Guido Masera (1), Maurizio Martina (1), David Atienza (2) ((1) Politecnico di Torino, (2) École Polytechnique Fédérale de Lausanne (EPFL))
Comments: 15 pages, 13 figures, accepted in IEEE Transactions on Emerging Topics in Computing
Subjects: Hardware Architecture (cs.AR)
The widespread adoption of data-centric algorithms, particularly Artificial Intelligence (AI) and Machine Learning (ML), has exposed the limitations of centralized processing infrastructures, driving a shift towards edge computing. This necessitates stringent constraints on energy efficiency, which traditional von Neumann architectures struggle to meet. The Compute-In-Memory (CIM) paradigm has emerged as a superior candidate due to its efficient exploitation of available memory bandwidth. However, existing CIM solutions require high implementation effort and lack flexibility from a software integration standpoint. This work proposes a novel, software-friendly, general-purpose, and low-integration-effort Near-Memory Computing (NMC) approach, paving the way for the adoption of CIM-based systems in the next generation of edge computing nodes. Two architectural variants, NM-Caesar and NM-Carus, are proposed and characterized to target different trade-offs in area efficiency, performance, and flexibility, covering a wide range of embedded microcontrollers. Post-layout simulations show up to $28.0\times$ and $53.9\times$ lower execution time and $25.0\times$ and $35.6\times$ higher energy efficiency at the system level, respectively, compared to executing the same tasks on a state-of-the-art RISC-V CPU (RV32IMC). NM-Carus achieves a peak energy efficiency of $306.7$ GOPS/W in 8-bit matrix multiplications, surpassing recent state-of-the-art in- and near-memory circuits.
- [544] arXiv:2406.15341 (replaced) [pdf, html, other]
-
Title: GenoTEX: A Benchmark for Automated Gene Expression Data Analysis in Alignment with Bioinformaticians
Comments: 29 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at this https URL.
- [545] arXiv:2406.16685 (replaced) [pdf, other]
-
Title: A locking-free isogeometric thin shell formulation based on higher order accurate diagonalized strain projection via approximate dual splines
Subjects: Computational Engineering, Finance, and Science (cs.CE)
We present a novel isogeometric discretization approach for the Kirchhoff-Love shell formulation based on the Hellinger-Reissner variational principle. For mitigating membrane locking, we discretize the independent strains with spline basis functions that are one degree lower than those used for the displacements. To enable computationally efficient condensation of the independent strains, we first discretize the variations of the independent strains with approximate dual splines to obtain a projection matrix that is close to a diagonal matrix. We then diagonalize this strain projection matrix via row-sum lumping. Due to this diagonalization, the static condensation of the independent strain fields becomes computationally inexpensive, as no matrix needs to be inverted. At the same time, our approach maintains higher-order accuracy at optimal rates of convergence. We illustrate the numerical properties and the performance of our approach through numerical benchmarks, including a curved Euler-Bernoulli beam and the examples of the shell obstacle course.
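The row-sum lumping step mentioned in this abstract can be shown on a small example. This is a generic sketch of lumping only, not the authors' shell formulation: the matrix entries and the function names `row_sum_lump` and `apply_lumped_inverse` are invented for illustration.

```python
def row_sum_lump(matrix):
    """Return the diagonal of the row-sum lumped matrix: d[i] = sum_j matrix[i][j]."""
    return [sum(row) for row in matrix]


def apply_lumped_inverse(matrix, rhs):
    """Approximate the solution of matrix @ x = rhs using only the lumped diagonal.

    No matrix is inverted: each unknown is a single division.
    """
    diagonal = row_sum_lump(matrix)
    return [b / d for b, d in zip(rhs, diagonal)]


# A made-up projection matrix that is already close to diagonal.
projection = [
    [2.0, 0.25, 0.0],
    [0.25, 1.5, 0.25],
    [0.0, 0.25, 2.0],
]
rhs = [4.5, 2.0, 9.0]

print(row_sum_lump(projection))
print(apply_lumped_inverse(projection, rhs))
```

The closer the original matrix is to diagonal, the smaller the error introduced by replacing it with its lumped diagonal, which is why the abstract pairs lumping with a basis that makes the projection matrix nearly diagonal.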
- [546] arXiv:2406.17819 (replaced) [pdf, other]
-
Title: Automatically Adaptive Conformal Risk Control
Authors: Vincent Blot (LISN, CNRS), Anastasios N Angelopoulos (UC Berkeley), Michael I Jordan (UC Berkeley, Inria), Nicolas J-B Brunel (ENSIIE)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Science and technology have a growing need for effective mechanisms that ensure reliable, controlled performance from black-box machine learning algorithms. These performance guarantees should ideally hold conditionally on the input; that is, the performance guarantees should hold, at least approximately, no matter what the input. However, beyond stylized discrete groupings such as ethnicity and gender, the right notion of conditioning can be difficult to define. For example, in problems such as image segmentation, we want the uncertainty to reflect the intrinsic difficulty of the test sample, but this may be difficult to capture via a conditioning event. Building on the recent work of Gibbs et al. [2023], we propose a methodology for achieving approximate conditional control of statistical risks (the expected value of loss functions) by adapting to the difficulty of test samples. Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning. We apply this framework to various regression and segmentation tasks, enabling finer-grained control over model performance and demonstrating that by continuously monitoring and adjusting these parameters, we can achieve superior precision compared to conventional risk-control methods.
- [547] arXiv:2407.03006 (replaced) [pdf, html, other]
-
Title: Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation
Comments: Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI 2024)
Journal-ref: Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(3), 1824-1832
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features bearing different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands bridge the source image and the T2I generated image in different correlations (e.g., style, structure, layout, contour, etc.), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: this https URL.
- [548] arXiv:2407.03314 (replaced) [pdf, html, other]
-
Title: BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Authors: Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, Pingyu Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, Yue Yang, Hongyang Zhang, Yu Liu, Fan Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Databases (cs.DB)
Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other SOTA VLM models in generating high-quality captions. Besides, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall scores on open-vocabulary object detection tasks compared to leading methods.
- [549] arXiv:2407.03608 (replaced) [pdf, html, other]
-
Title: Gaussian process regression with log-linear scaling for common non-stationary kernels
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
We introduce a fast algorithm for Gaussian process regression in low dimensions, applicable to a widely-used family of non-stationary kernels. The non-stationarity of these kernels is induced by arbitrary spatially-varying vertical and horizontal scales. In particular, any stationary kernel can be accommodated as a special case, and we focus especially on the generalization of the standard Matérn kernel. Our subroutine for kernel matrix-vector multiplications scales almost optimally as $O(N\log N)$, where $N$ is the number of regression points. Like the recently developed equispaced Fourier Gaussian process (EFGP) methodology, which is applicable only to stationary kernels, our approach exploits non-uniform fast Fourier transforms (NUFFTs). We offer a complete analysis controlling the approximation error of our method, and we validate the method's practical performance with numerical experiments. In particular, we demonstrate improved scalability compared to state-of-the-art rank-structured approaches in spatial dimension $d>1$.
- [550] arXiv:2407.05608 (replaced) [pdf, html, other]
-
Title: A Benchmark for Multi-speaker Anonymization
Comments: Accepted by TIFS
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus particularly on single-speaker scenarios. However, they lack practicality for real-world applications, i.e., multi-speaker scenarios. In this paper, we present an initial attempt to provide a multi-speaker anonymization benchmark by defining the task and evaluation protocol, proposing benchmarking solutions, and discussing the privacy leakage of overlapping conversations. The proposed benchmark solutions are based on a cascaded system that integrates spectral-clustering-based speaker diarization and disentanglement-based speaker anonymization using a selection-based anonymizer. To improve utility, the benchmark solutions are further enhanced by two conversation-level speaker vector anonymization methods. The first method minimizes the differential similarity across speaker pairs in the original and anonymized conversations, which maintains original speaker relationships in the anonymized version. The other minimizes the aggregated similarity across anonymized speakers, which achieves better differentiation between speakers. Experiments conducted on both non-overlap simulated and real-world datasets demonstrate the effectiveness of the multi-speaker anonymization system with the proposed speaker anonymizers. Additionally, we analyzed overlapping speech regarding privacy leakage and provided potential solutions.
- [551] arXiv:2407.05715 (replaced) [pdf, other]
-
Title: The Size-Change Principle for Mixed Inductive and Coinductive types
Authors: Pierre Hyvernat (LAMA)
Subjects: Logic in Computer Science (cs.LO)
This paper shows how to use Lee, Jones and Ben Amram's size-change principle to check correctness of arbitrary recursive definitions in an ML/Haskell-like programming language with inductive and coinductive types. Using the size-change principle to check productivity and termination is straightforward but unsound when inductive and coinductive types are nested. We can however adapt the size-change principle to check "totality", which corresponds exactly to correctness with respect to the corresponding (co)inductive type.
- [552] arXiv:2407.09362 (replaced) [pdf, html, other]
-
Title: Structure and Independence in Hyperbolic Uniform Disk Graphs
Authors: Thomas Bläsius, Jean-Pierre von der Heydt, Sándor Kisfaludi-Bak, Marcus Wilhelm, Geert van Wordragen
Comments: 31 pages, 11 figures, full version of extended abstract accepted at SoCG 2025
Subjects: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS)
We consider intersection graphs of disks of radius $r$ in the hyperbolic plane. Unlike the Euclidean setting, these graph classes are different for different values of $r$, where very small $r$ corresponds to an almost-Euclidean setting and $r \in \Omega(\log n)$ corresponds to a firmly hyperbolic setting. We observe that larger values of $r$ create simpler graph classes, at least in terms of separators and the computational complexity of the \textsc{Independent Set} problem.
First, we show that intersection graphs of disks of radius $r$ in the hyperbolic plane can be separated with $\mathcal{O}((1+1/r)\log n)$ cliques in a balanced manner. Our second structural insight concerns Delaunay complexes in the hyperbolic plane and may be of independent interest. We show that for any set $S$ of $n$ points with pairwise distance at least $2r$ in the hyperbolic plane the corresponding Delaunay complex has outerplanarity $1+\mathcal{O}(\frac{\log n}{r})$, which implies a similar bound on the balanced separators and treewidth of such Delaunay complexes.
Using this outerplanarity (and treewidth) bound we prove that \textsc{Independent Set} can be solved in $n^{\mathcal{O}(1+\frac{\log n}{r})}$ time. The algorithm is based on dynamic programming on some unknown sphere cut decomposition that is based on the solution. The resulting algorithm is a far-reaching generalization of a result of Kisfaludi-Bak (SODA 2020), and it is tight under the Exponential Time Hypothesis. In particular, \textsc{Independent Set} is polynomial-time solvable in the firmly hyperbolic setting of $r\in \Omega(\log n)$. Finally, in the case when the disks have ply (depth) at most $\ell$, we give a PTAS for \textsc{Maximum Independent Set} that has only quasi-polynomial dependence on $1/\varepsilon$ and $\ell$. Our PTAS is a further generalization of our exact algorithm.
- [553] arXiv:2407.09891 (replaced) [pdf, html, other]
-
Title: Blow-up in Non-Deterministic Automata
Subjects: Formal Languages and Automata Theory (cs.FL)
In this paper we examine the difficulty of finding an equivalent deterministic automaton when confronted with a non-deterministic one. While for some automata the exponential blow-up in their number of states is unavoidable, we show that in general, any approximation of state complexity with polynomial precision remains PSPACE-hard. The same is true when using the subset construction to determinize the NFA, meaning that it is PSPACE-hard to predict whether subset construction will produce an exponential "blow-up" in the number of states or not. To give an explanation for its behaviour, we propose the notion of subset complexity, which serves as an upper bound on the size of subset construction. Due to its simple and intuitive nature, it allows identifying large classes of automata which can have limited non-determinism and completely avoid the "blow-up". Subset complexity also remains invariant under NFA reversal and allows predicting how the introduction or removal of transitions from the NFA will affect its size.
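For readers unfamiliar with the subset construction discussed in this abstract, the following is a minimal sketch of the textbook algorithm, together with a classic family of NFAs on which it blows up. It does not implement the paper's subset complexity measure; the helper names `determinize` and `kth_from_last_nfa` and the dictionary encoding of transitions are assumptions made for illustration.

```python
from collections import deque


def determinize(nfa, start, alphabet):
    """Subset construction: explore the DFA states reachable from {start}.

    `nfa` maps (state, symbol) -> set of successor states. The result maps
    each reachable DFA state (a frozenset of NFA states) to its transitions.
    """
    start_set = frozenset([start])
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue  # already expanded
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(
                nxt for state in current for nxt in nfa.get((state, symbol), ())
            )
            dfa[current][symbol] = target
            if target not in dfa:
                queue.append(target)
    return dfa


def kth_from_last_nfa(k):
    """NFA with k + 1 states for words over {a, b} whose k-th symbol from the end is 'a'."""
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa


for k in range(1, 6):
    dfa = determinize(kth_from_last_nfa(k), 0, "ab")
    print(k + 1, "NFA states ->", len(dfa), "reachable DFA states")
```

For this family the number of reachable DFA states is `2**k`, because the DFA has to remember which of the last `k` symbols were `a`. An NFA that is already deterministic keeps its size under the same procedure.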
- [554] arXiv:2407.11309 (replaced) [pdf, html, other]
-
Title: Gaussian Splatting Lucas-Kanade
Comments: International Conference on Learning Representations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Gaussian Splatting and its dynamic extensions are effective for reconstructing 3D scenes from 2D images when there is significant camera movement to facilitate motion parallax and when scene objects remain relatively static. However, in many real-world scenarios, these conditions are not met. As a consequence, data-driven semantic and geometric priors have been favored as regularizers, despite their bias toward training data and their neglect of broader movement dynamics.
Departing from this practice, we propose a novel analytical approach that adapts the classical Lucas-Kanade method to dynamic Gaussian splatting. By leveraging the intrinsic properties of the forward warp field network, we derive an analytical velocity field that, through time integration, facilitates accurate scene flow computation. This enables the precise enforcement of motion constraints on warp fields, thus constraining both 2D motion and 3D positions of the Gaussians. Our method excels in reconstructing highly dynamic scenes with minimal camera movement, as demonstrated through experiments on both synthetic and real-world scenes.
- [555] arXiv:2407.15260 (replaced) [pdf, html, other]
-
Title: On the Viability of Semi-Supervised Segmentation Methods for Statistical Shape Modeling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Statistical Shape Models (SSMs) excel at identifying population-level anatomical variations, which is at the core of various clinical and biomedical applications, including morphology-based diagnostics and surgical planning. However, the effectiveness of SSMs is often constrained by the necessity for expert-driven manual segmentation, a process that is both time-intensive and expensive, thereby restricting their broader application and utility. Recent deep learning approaches enable the direct estimation of SSMs from unsegmented images. While these models can predict SSMs without segmentation during deployment, they do not address the challenge of acquiring the manual annotations needed for training, particularly in resource-limited settings. Semi-supervised models for anatomy segmentation can mitigate the annotation burden. Yet, despite the abundance of available approaches, there are no established guidelines to inform end-users on their effectiveness for the downstream task of constructing SSMs. In this study, we systematically evaluate the potential of semi-supervised methods as viable alternatives to manual segmentations for building SSMs. We establish a new performance benchmark by employing various semi-supervised methods for anatomy segmentation under low annotation settings, utilizing the predicted segmentations for the task of SSM. Our results indicate that some methods produce noisy segmentation, which is very unfavorable for SSM tasks, while others can capture the correct modes of variations in the population cohort with a 60-80% reduction in required manual annotation.
- [556] arXiv:2408.00279 (replaced) [pdf, other]
-
Title: MESA: Effective Matching Redundancy Reduction by Semantic Area Segmentation
Comments: 18pages+suppl
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose MESA and DMESA as novel feature matching methods, which utilize the Segment Anything Model (SAM) to effectively mitigate matching redundancy. The key insight of our methods is to establish implicit-semantic area matching prior to point matching, based on the advanced image understanding of SAM. Then, informative area matches with consistent internal semantics are able to undergo dense feature comparison, facilitating precise inside-area point matching. Specifically, MESA adopts a sparse matching framework and first obtains candidate areas from SAM results through a novel Area Graph (AG). Then, area matching among the candidates is formulated as graph energy minimization and solved by graphical models derived from AG. To address the efficiency issue of MESA, we further propose DMESA as its dense counterpart, applying a dense matching framework. After candidate areas are identified by AG, DMESA establishes area matches through generating dense matching distributions. The distributions are produced from off-the-shelf patch matching utilizing the Gaussian Mixture Model and refined via Expectation Maximization. With less repetitive computation, DMESA showcases a speed improvement of nearly five times compared to MESA, while maintaining competitive accuracy. Our methods are extensively evaluated on five datasets encompassing indoor and outdoor scenes. The results illustrate consistent performance improvements from our methods for five distinct point matching baselines across all datasets. Furthermore, our methods exhibit promising generalization and improved robustness against image resolution variations. The code is publicly available at this https URL.
- [557] arXiv:2408.01072 (replaced) [pdf, html, other]
-
Title: A Survey on Self-play Methods in Reinforcement Learning
Authors: Ruize Zhang, Zelai Xu, Chengdong Ma, Chao Yu, Wei-Wei Tu, Wenhao Tang, Shiyu Huang, Deheng Ye, Wenbo Ding, Yaodong Yang, Yu Wang
Subjects: Artificial Intelligence (cs.AI)
Self-play, characterized by agents' interactions with copies or past versions of themselves, has recently gained prominence in reinforcement learning (RL). This paper first clarifies the preliminaries of self-play, including the multi-agent reinforcement learning framework and basic game theory concepts. Then, it provides a unified framework and classifies existing self-play algorithms within this framework. Moreover, the paper bridges the gap between the algorithms and their practical implications by illustrating the role of self-play in different scenarios. Finally, the survey highlights open challenges and future research directions in self-play. This paper is an essential guide map for understanding the multifaceted landscape of self-play in RL.
- [558] arXiv:2408.09769 (replaced) [pdf, html, other]
-
Title: Integrating Naturalistic Insights in Objective Multi-Vehicle Safety Framework
Subjects: Robotics (cs.RO)
As autonomous vehicle technology advances, the precise assessment of safety in complex traffic scenarios becomes crucial, especially in mixed-vehicle environments where human perception of safety must be taken into account. This paper presents a framework designed for assessing traffic safety in multi-vehicle situations, facilitating the simultaneous utilization of diverse objective safety metrics. Additionally, it allows the integration of subjective perception of safety by adjusting model parameters. The framework was applied to evaluate various model configurations in car-following scenarios on a highway, utilizing naturalistic driving datasets. The evaluation of the model showed outstanding performance, particularly when integrating multiple objective safety measures. Furthermore, the performance was significantly enhanced when considering all surrounding vehicles.
- [559] arXiv:2408.09833 (replaced) [pdf, html, other]
-
Title: Automated Vehicle Driver Monitoring Dataset from Real-World Scenarios
Comments: 6 pages
Subjects: Robotics (cs.RO)
From SAE Level 3 of automation onwards, drivers are allowed to engage in activities that are not directly related to driving during their travel. However, in level 3, a misunderstanding of the capabilities of the system might lead drivers to engage in secondary tasks, which could impair their ability to react to challenging traffic situations.
Anticipating driver activity allows for early detection of risky behaviors, to prevent accidents. To be able to predict the driver activity, a Deep Learning network needs to be trained on a dataset. However, the use of datasets based on simulation for training and the migration to real-world data for prediction has proven to be suboptimal. Hence, this paper presents a real-world driver activity dataset, openly accessible on IEEE Dataport, which encompasses various activities that occur in autonomous driving scenarios under various illumination and weather conditions. Results from the training process showed that the dataset provides an excellent benchmark for implementing models for driver activity recognition.
- [560] arXiv:2408.10672 (replaced) [pdf, html, other]
-
Title: Neural Exploratory Landscape Analysis for Meta-Black-Box-Optimization
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Recent research in Meta-Black-Box Optimization (MetaBBO) has shown that meta-trained neural networks can effectively guide the design of black-box optimizers, significantly reducing the need for expert tuning and delivering robust performance across complex problem distributions. Despite this success, a paradox remains: MetaBBO still relies on human-crafted Exploratory Landscape Analysis features to inform the meta-level agent about the low-level optimization progress. To address the gap, this paper proposes Neural Exploratory Landscape Analysis (NeurELA), a novel framework that dynamically profiles landscape features through a two-stage, attention-based neural network, executed in an entirely end-to-end fashion. NeurELA is pre-trained over a variety of MetaBBO algorithms using a multi-task neuroevolution strategy. Extensive experiments show that NeurELA achieves consistently superior performance when integrated into different and even unseen MetaBBO tasks and can be efficiently fine-tuned for further performance boost. This advancement marks a pivotal step in making MetaBBO algorithms more autonomous and broadly applicable. The source code of NeurELA can be accessed at this https URL.
- [561] arXiv:2408.12692 (replaced) [pdf, html, other]
-
Title: Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion
Comments: 19 pages; First two authors contributed equally; Accepted at CVPR 2025
Subjects: Artificial Intelligence (cs.AI)
Recent text-to-image models, such as Stable Diffusion (SD), show significant demographic biases. Existing de-biasing techniques rely heavily on additional training, which imposes high computational costs and risks compromising core image generation functionality. This hinders them from being widely adopted in real-world applications. In this paper, we explore Stable Diffusion's overlooked potential to reduce bias without requiring additional training. Through our analysis, we uncover that initial noises associated with minority attributes form "minority regions" rather than being scattered. We view these "minority regions" as opportunities in SD to reduce bias. To unlock the potential, we propose a novel de-biasing method called 'weak guidance,' carefully designed to guide a random noise to the minority regions without compromising semantic integrity. Through analysis and experiments on various versions of SD, we demonstrate that our proposed approach effectively reduces bias without additional training, achieving both efficiency and preservation of core image generation functionality.
- [562] arXiv:2408.16315 (replaced) [pdf, other]
-
Title: Passenger hazard perception based on EEG signals for highly automated driving vehicles
Ashton Yu Xuan Tan, Yingkai Yang, Xiaofei Zhang, Bowen Li, Xiaorong Gao, Sifa Zheng, Jianqiang Wang, Xinyu Gu, Jun Li, Yang Zhao, Yuxin Zhang, Tania Stathaki
Comments: We have decided to withdraw this submission due to ongoing revisions and further refinements in our research. A revised version may be resubmitted in the future. We appreciate the feedback and interest from the community
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and the Passenger EEG Decoding Strategy (PEDS). Central to PEDS is a novel Convolutional Recurrent Neural Network (CRNN) that captures spatial and temporal EEG data patterns. The CRNN, combined with stacking algorithms, achieves an accuracy of $85.0\% \pm 3.18\%$. Our findings highlight the predictive power of pre-event EEG data, enhancing the detection of hazardous scenarios and offering a network-driven framework for safer autonomous vehicles.
- [563] arXiv:2408.16863 (replaced) [pdf, html, other]
-
Title: Data-Driven Law Firm Rankings to Reduce Information Asymmetry in Legal Disputes
Subjects: Computers and Society (cs.CY)
Selecting capable counsel can shape the outcome of litigation, yet evaluating law firm performance remains challenging. Widely used rankings prioritize prestige, size, and revenue rather than empirical litigation outcomes, offering little practical guidance. To address this gap, we build on the Bradley-Terry model and introduce a new ranking framework that treats each lawsuit as a competitive game between plaintiff and defendant law firms. Leveraging a newly constructed dataset of 60,540 U.S. civil lawsuits involving 54,541 law firms, our findings show that existing reputation-based rankings correlate poorly with actual litigation success, whereas our outcome-based ranking substantially improves predictive accuracy. These findings establish a foundation for more transparent, data-driven assessments of legal performance.
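As a rough illustration of the Bradley-Terry idea mentioned in this abstract, the sketch below fits one strength per competitor from pairwise win/loss outcomes using the classic minorization-maximization update. The firm names, the toy outcomes, and the function names `fit_bradley_terry` and `win_probability` are hypothetical; this is not the paper's ranking framework or dataset, only the textbook model it builds on.

```python
from collections import defaultdict


def fit_bradley_terry(outcomes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Applies the standard MM update
        p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
    and rescales after every sweep so the strengths sum to 1.
    """
    wins = defaultdict(int)    # W_i: total wins of competitor i
    games = defaultdict(int)   # n_ij: number of contests between i and j
    players = set()
    for winner, loser in outcomes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            denom = 0.0
            for j in players:
                if i == j:
                    continue
                n_ij = games.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Keep the old value for a competitor with no recorded contests.
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Bradley-Terry probability that competitor a beats competitor b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical toy data: each tuple is (winning firm, losing firm).
toy_outcomes = [
    ("FirmA", "FirmB"), ("FirmA", "FirmB"), ("FirmB", "FirmA"),
    ("FirmA", "FirmC"), ("FirmC", "FirmA"),
    ("FirmB", "FirmC"), ("FirmC", "FirmB"), ("FirmB", "FirmC"),
]
toy_strength = fit_bradley_terry(toy_outcomes)
```

Sorting `toy_strength` by value gives an outcome-based ranking, and `win_probability` turns two strengths into a predicted chance of prevailing, which is the kind of quantity one could compare against reputation-based rankings.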
- [564] arXiv:2408.17258 (replaced) [pdf, html, other]
-
Title: Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
Journal-ref: Transportation Research Part E: Logistics and Transportation Review, 2025
Subjects: Machine Learning (cs.LG)
The proliferation of e-commerce and urbanization has significantly intensified delivery operations in urban areas, boosting the volume and complexity of delivery demand. Data-driven predictive methods, especially those utilizing machine learning techniques, have emerged to handle these complexities in urban delivery demand management problems. One particularly pressing issue that has yet to be sufficiently addressed is the joint estimation and prediction of city-wide delivery demand, as well as the generalization of the model to new cities. To this end, we formulate this problem as a transferable graph-based spatiotemporal learning task. First, an individual-collective message-passing neural network model is formalized to capture the interaction between demand patterns of associated regions. Second, by exploiting recent advances in large language models (LLMs), we extract general geospatial knowledge encodings from the unstructured locational data using the embedding generated by LLMs. Last, to encourage the cross-city generalization of the model, we integrate the encoding into the demand predictor in a transferable way. Comprehensive empirical evaluation results on two real-world delivery datasets, including eight cities in China and the US, demonstrate that our model significantly outperforms state-of-the-art baselines in accuracy, efficiency, and transferability.
- [565] arXiv:2409.00346 (replaced) [pdf, html, other]
-
Title: SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation
Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shounjun Zhou
Comments: Accepted by IEEE BIBM 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: this https URL.
- [566] arXiv:2409.02482 (replaced) [pdf, html, other]
-
Title: Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes
Stefano Esposito, Anpei Chen, Christian Reiser, Samuel Rota Bulò, Lorenzo Porzi, Katja Schwarz, Christian Richardt, Michael Zollhöfer, Peter Kontschieder, Andreas Geiger
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
High-quality view synthesis relies on volume rendering, splatting, or surface rendering. While surface rendering is typically the fastest, it struggles to accurately model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes rendered in a fixed order. First, we model surface layers as signed distance function (SDF) shells with optimal spacing learned during training. Then, we bake them as meshes and fit UV textures. Unlike single-surface methods, our multi-layer representation effectively models fuzzy objects. In contrast to volume and splatting-based methods, our approach enables real-time rendering on low-power laptops and smartphones.
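To make the "fixed order, sorting-free" point concrete, here is a minimal per-pixel sketch of front-to-back alpha compositing over a small, bounded stack of semi-transparent layers. The function name `composite_layers` and the colour and opacity values are hypothetical; the SDF shells, mesh baking, and UV textures described in the abstract are not modelled.

```python
def composite_layers(layers, background=(0.0, 0.0, 0.0)):
    """Front-to-back alpha compositing of a fixed, ordered list of layers.

    Each layer is (rgb, alpha) for one pixel, ordered nearest to farthest.
    Because the order is fixed in advance, no per-frame sorting is needed
    and the number of samples per ray is bounded by len(layers).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer layers
    for rgb, alpha in layers:
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    # Light that passes through every layer picks up the background colour.
    return tuple(color[c] + transmittance * background[c] for c in range(3))


# Hypothetical two-layer pixel: half-opaque red in front of half-opaque green,
# over a blue background.
pixel = composite_layers(
    [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.5)],
    background=(0.0, 0.0, 1.0),
)
```

The nearer red layer contributes half of the final colour, and the green layer and the background split the remaining transmittance equally.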
- [567] arXiv:2409.07067 (replaced) [pdf, html, other]
-
Title: Structure Modeling Activation Free Fourier Network for Spacecraft Image Denoising
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Spacecraft image denoising is a crucial fundamental technology closely related to aerospace research. However, existing deep learning-based image denoising methods are primarily designed for natural images and fail to adequately consider the characteristics of spacecraft images (e.g., low-light conditions and repetitive periodic structures), resulting in suboptimal performance on the spacecraft image denoising task. To address these problems, we propose a Structure modeling Activation Free Fourier Network (SAFFN), an efficient spacecraft image denoising method comprising a Structure Modeling Block (SMB) and an Activation Free Fourier Block (AFFB). We present SMB to effectively extract edge information and model structure for better identification of spacecraft components in the dark regions of noisy spacecraft images. We present AFFB, which uses an improved Fast Fourier block to extract repetitive periodic features and long-range information from noisy spacecraft images. Extensive experimental results demonstrate that SAFFN performs competitively with state-of-the-art methods on noisy spacecraft image datasets. The code is available at: this https URL.
- [568] arXiv:2409.09430 (replaced) [pdf, html, other]
-
Title: Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
Comments: 37 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Medical image retrieval refers to the task of finding similar images for given query images in a database, with applications such as diagnosis support. While traditional medical image retrieval relied on clinical metadata, content-based medical image retrieval (CBMIR) depends on image features, which can be extracted automatically or semi-automatically. Many approaches have been proposed for CBMIR, and among them, using pre-trained convolutional neural networks (CNNs) is a widely utilized approach. However, considering the recent advances in the development of foundation models for various computer vision tasks, their application for CBMIR can also be investigated.
In this study, we used several pre-trained feature extractors from well-known pre-trained CNNs and pre-trained foundation models and investigated the CBMIR performance on eight types of two-dimensional (2D) and three-dimensional (3D) medical images. Furthermore, we investigated the effect of image size on the CBMIR performance.
Our results show that, overall, for the 2D datasets, foundation models deliver superior performance by a large margin compared to CNNs, with the general-purpose self-supervised model for computational pathology (UNI) providing the best overall performance across all datasets and image sizes. For 3D datasets, CNNs and foundation models deliver more competitive performance, with the contrastive learning from captions for histopathology (CONCH) model achieving the best overall performance. Moreover, our findings confirm that while using larger image sizes (especially for 2D datasets) yields slightly better performance, competitive CBMIR performance can still be achieved even with smaller image sizes. Our code to reproduce the results is available at: this https URL.
- [569] arXiv:2409.11538 (replaced) [pdf, html, other]
-
Title: Chain-of-Thought Prompting for Speech Translation
Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and an encoder-decoder structure Megatron-T5. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with encoded speech for prompting, we guide the speech translation in a two-step process like chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used for the T5 LLM for model adaptation and shows superior performance to full model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
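For readers unfamiliar with low-rank adaptation, the following is a minimal sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank product, so the layer computes W x + (alpha / r) B (A x). The class name `LoRALinear`, the dimensions, and the initialisation constants are assumptions made for illustration; this does not reproduce the Megatron-T5 setup used in the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear map plus a trainable rank-r update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight              # frozen pretrained weights
        self.scale = alpha / rank         # conventional LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 4x6 frozen weight matrix adapted with rank 2.
base_weight = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(base_weight, rank=2, alpha=4.0)
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y = layer.forward(x)
```

In this toy case the adapter has 20 trainable values against 24 in the full matrix; the saving grows quickly with layer size because the adapter scales with r times (in + out) rather than in times out.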
- [570] arXiv:2409.11593 (replaced) [pdf, html, other]
-
Title: Self-Contrastive Forward-Forward Algorithm
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Agents that operate autonomously benefit from lifelong learning capabilities. However, compatible training algorithms must comply with the decentralized nature of these systems, which imposes constraints on both the parameter counts and the computational resources. The Forward-Forward (FF) algorithm is one of these. FF relies only on feedforward operations, the same used for inference, for optimizing layer-wise objectives. This purely forward approach eliminates the need for transpose operations required in traditional backpropagation. Despite its potential, FF has failed to reach state-of-the-art performance on most standard benchmark tasks, in part due to unreliable negative data generation methods for unsupervised learning.
In this work, we propose the Self-Contrastive Forward-Forward (SCFF) algorithm, a competitive training method aimed at closing this performance gap. Inspired by standard self-supervised contrastive learning for vision tasks, SCFF generates positive and negative inputs applicable across various datasets. The method demonstrates superior performance compared to existing unsupervised local learning algorithms on several benchmark datasets, including MNIST, CIFAR-10, STL-10, and Tiny ImageNet. We extend FF's application to training recurrent neural networks, expanding its utility to sequential data tasks. These findings pave the way for high-accuracy, real-time learning on resource-constrained edge devices.
- [571] arXiv:2409.11867 (replaced) [pdf, html, other]
-
Title: StableMamba: Distillation-free Scaling of Large SSMs for Images and Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV)
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
- [572] arXiv:2409.12249 (replaced) [pdf, html, other]
-
Title: GCA-SUNet: A Gated Context-Aware Swin-UNet for Exemplar-Free Counting
Comments: Accepted by ICME 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Exemplar-Free Counting aims to count objects of interest without intensive annotations of objects or exemplars. To achieve this, we propose a Gated Context-Aware Swin-UNet (GCA-SUNet) to directly map an input image to the density map of countable objects. Specifically, a set of Swin transformers form an encoder to derive a robust feature representation, and a Gated Context-Aware Modulation block is designed to suppress irrelevant objects or background through a gate mechanism and exploit the attentive support of objects of interest through a self-similarity matrix. The gate strategy is also incorporated into the bottleneck network and the decoder of the Swin-UNet to highlight the features most relevant to objects of interest. By explicitly exploiting the attentive support among countable objects and eliminating irrelevant features through the gate mechanisms, the proposed GCA-SUNet focuses on and counts objects of interest without relying on predefined categories or exemplars. Experimental results on the real-world datasets such as FSC-147 and CARPK demonstrate that GCA-SUNet significantly and consistently outperforms state-of-the-art methods. The code is available at this https URL.
- [573] arXiv:2409.12259 (replaced) [pdf, html, other]
-
Title: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Comments: CVPR 2025, Project Page this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization model and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with more than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline in achieving smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available at this https URL.
- [574] arXiv:2409.15272 (replaced) [pdf, html, other]
-
Title: OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at our repo (this https URL).
- [575] arXiv:2409.15404 (replaced) [pdf, html, other]
-
Title: Renaming in distributed certification
Comments: 14 pages, 1 figure: v2: added a number of applications
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
Local certification is the area of distributed network computing asking the following question: How to certify to the nodes of a network that a global property holds, if they are limited to a local verification?
In this area, it is often essential to have identifiers, that is, unique integers assigned to the nodes. In this short paper, we show how to reduce the range of the identifiers, in three different settings. More precisely, we show how to rename identifiers in the classical local certification setting, when we can (resp. cannot) choose the new identifiers, and we show how a global certificate can help to encode very compactly a new identifier assignment that is not injective in general, but still useful in applications.
We conclude with a number of applications of these results: For every $\ell$, there are local certification schemes for the properties of having clique number at most $\ell$, having diameter at most $\ell$, and having independence number at most 2, with certificates of size $O(n)$. We also show that there is a global certification scheme for bipartiteness with certificates of size $O(n)$. All these results are optimal.
- [576] arXiv:2409.15848 (replaced) [pdf, html, other]
-
Title: iGAiVA: Integrated Generative AI and Visual Analytics in a Machine Learning Workflow for Text Classification
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
In developing machine learning (ML) models for text classification, one common challenge is that the collected data is often not ideally distributed, especially when new classes are introduced in response to changes of data and tasks. In this paper, we present a solution for using visual analytics (VA) to guide the generation of synthetic data using large language models. As VA enables model developers to identify data-related deficiency, data synthesis can be targeted to address such deficiency. We discuss different types of data deficiency, describe different VA techniques for supporting their identification, and demonstrate the effectiveness of targeted data synthesis in improving model accuracy. In addition, we present a software tool, iGAiVA, which maps four groups of ML tasks into four VA views, integrating generative AI and VA into an ML workflow for developing and improving text classification models.
- [577] arXiv:2409.17606 (replaced) [pdf, html, other]
-
Title: FlooNoC: A 645 Gbps/link 0.15 pJ/B/hop Open-Source NoC with Wide Physical Links and End-to-End AXI4 Parallel Multi-Stream Support
Journal-ref: IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Volume: 33, Issue: 4, April 2025)
Subjects: Hardware Architecture (cs.AR)
The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this paper, we address this critical need by introducing the FlooNoC Network-on-Chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, non-blocking transactions are supported for latency tolerance. Additionally, a novel end-to-end ordering approach for AXI4, enabled by a multi-stream capable Direct Memory Access (DMA) engine, simplifies network interfaces and eliminates inter-stream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12nm FinFET technology demonstrates the physical feasibility and the power, performance, and area (PPA) benefits of our approach. Utilizing wide links on high levels of metal, we achieve a bandwidth of 645 Gbps per link and a total aggregate bandwidth of 103 Tbps for an 8x4 mesh of processor cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared to state-of-the-art NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared to a traditional AXI4-based multi-layer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in GFLOPSDP within the same floorplan.
- [578] arXiv:2409.17924 (replaced) [pdf, html, other]
-
Title: Neural Light Spheres for Implicit Image Stitching and View Synthesis
Comments: Project site: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Challenging to capture, and challenging to display on a cellphone screen, the panorama paradoxically remains both a staple and an underused feature of modern mobile camera applications. In this work we address both of these challenges with a spherical neural light field model for implicit panoramic image stitching and re-rendering, which can accommodate depth parallax, view-dependent lighting, and local scene motion and color changes during capture. Fit at test time to an arbitrary-path panoramic video capture (vertical, horizontal, or random-walk), these neural light spheres jointly estimate the camera path and a high-resolution scene reconstruction to produce novel wide field-of-view projections of the environment. Our single-layer model avoids expensive volumetric sampling and decomposes the scene into compact view-dependent ray offset and color components, with a total model size of 80 MB per scene and real-time (50 FPS) rendering at 1080p resolution. We demonstrate improved reconstruction quality over traditional image stitching and radiance field methods, with significantly higher tolerance to scene motion and non-ideal capture settings.
- [579] arXiv:2409.18119 (replaced) [pdf, html, other]
-
Title: Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography
Comments: This paper is accepted by IPMI 2025 for Oral Presentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Contrastive Language-Image Pre-training (CLIP) demonstrates strong potential in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities underexplored. Here, we propose one of the first adaptations of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and class-wise imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% of the model size of the largest baseline. The code is available at this https URL.
- [580] arXiv:2409.19804 (replaced) [pdf, html, other]
-
Title: Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
Comments: Published at COLING 2025
Subjects: Computation and Language (cs.CL)
Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at this GitHub repository: this https URL.
- [581] arXiv:2410.01672 (replaced) [pdf, html, other]
-
Title: Practicing Stress Relief for the Everyday: Designing Social Simulation Using VR, AR, and LLMs
Subjects: Human-Computer Interaction (cs.HC)
Stress is an inevitable part of day-to-day life, yet many find themselves unable to manage it themselves, particularly when professional or peer support is not readily available. As self-care becomes increasingly vital for mental well-being, this paper explores the potential of social simulation as a safe, virtual environment for practicing stress relief for everyday situations. Leveraging the immersive capabilities of VR, AR, and LLMs, we developed eight interactive prototypes for various everyday stressful scenarios (e.g., public speaking), then conducted prototype-driven semi-structured interviews with 19 participants. We reveal that people currently lack effective means to support themselves through everyday stress and find that social simulation fills a gap for simulating real environments for training mental health practices. We outline key considerations for future development of simulation for self-care, including risks of trauma from hyper-realism, distrust of LLM-recommended timing for mental health recommendations, and the value of accessibility for self-care interventions.
- [582] arXiv:2410.02619 (replaced) [pdf, html, other]
-
Title: GI-GS: Global Illumination Decomposition on Gaussian Splatting for Inverse Rendering
Comments: Camera-ready version. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present GI-GS, a novel inverse rendering framework that leverages 3D Gaussian Splatting (3DGS) and deferred shading to achieve photo-realistic novel view synthesis and relighting. In inverse rendering, accurately modeling the shading processes of objects is essential for achieving high-fidelity results. Therefore, it is critical to incorporate global illumination to account for indirect lighting that reaches an object after multiple bounces across the scene. Previous 3DGS-based methods have attempted to model indirect lighting by characterizing indirect illumination as learnable lighting volumes or additional attributes of each Gaussian, while using baked occlusion to represent shadow effects. These methods, however, fail to accurately model the complex physical interactions between light and objects, making it impossible to construct realistic indirect illumination during relighting. To address this limitation, we propose to calculate indirect lighting using efficient path tracing with deferred shading. In our framework, we first render a G-buffer to capture the detailed geometry and material properties of the scene. Then, we perform physically-based rendering (PBR) only for direct lighting. With the G-buffer and previous rendering results, the indirect lighting can be calculated through lightweight path tracing. Our method effectively models indirect lighting under any given lighting conditions, thereby achieving better novel view synthesis and competitive relighting. Quantitative and qualitative results show that our GI-GS outperforms existing baselines in both rendering quality and efficiency.
- [583] arXiv:2410.03973 (replaced) [pdf, html, other]
-
Title: Efficient Training of Neural Stochastic Differential Equations by Matching Finite Dimensional Distributions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural Stochastic Differential Equations (Neural SDEs) have emerged as powerful mesh-free generative models for continuous stochastic processes, with critical applications in fields such as finance, physics, and biology. Previous state-of-the-art methods have relied on adversarial training, such as GANs, or on minimizing distance measures between processes using signature kernels. However, GANs suffer from issues like instability, mode collapse, and the need for specialized training techniques, while signature kernel-based methods require solving linear PDEs and backpropagating gradients through the solver, whose computational complexity scales quadratically with the discretization steps. In this paper, we identify a novel class of strictly proper scoring rules for comparing continuous Markov processes. This theoretical finding naturally leads to a novel approach called Finite Dimensional Matching (FDM) for training Neural SDEs. Our method leverages the Markov property of SDEs to provide a computationally efficient training objective. This scoring rule allows us to bypass the computational overhead associated with signature kernels and reduces the training complexity from $O(D^2)$ to $O(D)$ per epoch, where $D$ represents the number of discretization steps of the process. We demonstrate that FDM achieves superior performance, consistently outperforming existing methods in terms of both computational efficiency and generative quality.
- [584] arXiv:2410.08522 (replaced) [pdf, html, other]
-
Title: Evaluating the effects of Data Sparsity on the Link-level Bicycling Volume Estimation: A Graph Convolutional Neural Network Approach
Subjects: Machine Learning (cs.LG)
Accurate bicycling volume estimation is crucial for making informed decisions and planning about future investments in bicycling infrastructure. However, traditional link-level volume estimation models are effective for motorized traffic but face significant challenges when applied to the bicycling context because of sparse data and the intricate nature of bicycling mobility patterns. To the best of our knowledge, we present the first study to utilize a Graph Convolutional Network (GCN) architecture to model link-level bicycling volumes and systematically investigate the impact of varying levels of data sparsity (0%–99%) on model performance, simulating real-world scenarios. We have leveraged Strava Metro data as the primary source of bicycling counts across 15,933 road segments/links in the City of Melbourne, Australia. To evaluate the effectiveness of the GCN model, we benchmark it against traditional machine learning models, such as linear regression, support vector machines, and random forest. Our results show that the GCN model outperforms these traditional models in predicting Annual Average Daily Bicycle (AADB) counts, demonstrating its ability to capture the spatial dependencies inherent in bicycle traffic networks. While GCN remains robust up to 80% sparsity, its performance declines sharply beyond this threshold, highlighting the challenges of extreme data sparsity. These findings underscore the potential of GCNs in enhancing bicycling volume estimation, while also emphasizing the need for further research on methods to improve model resilience under high-sparsity conditions. Our findings offer valuable insights for city planners aiming to improve bicycling infrastructure and promote sustainable transportation.
- [585] arXiv:2410.12399 (replaced) [pdf, html, other]
-
Title: SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, postulating standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections within the fitted targets of flow matching, which presents challenges to model training and increases the curvature of the learned generated trajectories. These curved trajectories restrict the capacity of ODE models for generating desirable samples with a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by jointly training it with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms the state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page: this https URL.
- [586] arXiv:2410.13746 (replaced) [pdf, other]
-
Title: Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.
- [587] arXiv:2410.14138 (replaced) [pdf, html, other]
-
Title: ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose the visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on the MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.
- [588] arXiv:2410.14379 (replaced) [pdf, html, other]
-
Title: AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
Comments: Accepted at CVPR2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, multi-class anomaly classification has garnered increasing attention. Previous methods directly cluster anomalies but often struggle due to the lack of anomaly-prior knowledge. Acquiring this knowledge faces two issues: anomalies are non-prominent and have weak semantics. In this paper, we propose AnomalyNCD, a multi-class anomaly classification network compatible with different anomaly detection methods. To address the non-prominence of anomalies, we design main element binarization (MEBin) to obtain anomaly-centered images, ensuring anomalies are learned while avoiding the impact of incorrect detections. Next, to learn anomalies with weak semantics, we design mask-guided representation learning, which focuses on isolated anomalies guided by masks and reduces confusion from erroneous inputs through corrected pseudo labels. Finally, to enable flexible classification at both region and image levels, we develop a region merging strategy that determines the overall image category based on the classified anomaly regions. Our method outperforms the state-of-the-art works on the MVTec AD and MTD datasets. Compared with the current methods, AnomalyNCD combined with a zero-shot anomaly detection method achieves a 10.8% $F_1$ gain, 8.8% NMI gain, and 9.5% ARI gain on MVTec AD, and 12.8% $F_1$ gain, 5.7% NMI gain, and 10.8% ARI gain on MTD. Code is available at this https URL.
- [589] arXiv:2410.14770 (replaced) [pdf, html, other]
-
Title: A Survey on Computational Solutions for Reconstructing Complete Objects by Reassembling Their Fractured Parts
Comments: 36 pages, 22 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Reconstructing a complete object from its parts is a fundamental problem in many scientific domains. The purpose of this article is to provide a systematic survey on this topic. The reassembly problem requires understanding the attributes of individual pieces and establishing matches between different pieces. Many approaches also model priors of the underlying complete object. Existing approaches are tightly connected to the problems of shape segmentation, shape matching, and learning shape priors. We present existing algorithms in this context and emphasize their similarities and differences to general-purpose approaches. We also survey the trends from early non-deep learning approaches to more recent deep learning approaches. In addition to algorithms, this survey also describes existing datasets, open-source software packages, and applications. To the best of our knowledge, this is the first comprehensive survey on this topic in computer graphics.
- [590] arXiv:2410.15660 (replaced) [pdf, html, other]
-
Title: SPARC: Prediction-Based Safe Control for Coupled Controllable and Uncontrollable Agents with Conformal Predictions
Subjects: Systems and Control (eess.SY)
We investigate the problem of safe control synthesis for systems operating in environments with uncontrollable agents whose dynamics are unknown but coupled with those of the controlled system. This scenario naturally arises in various applications, such as autonomous driving and human-robot collaboration, where the behavior of uncontrollable agents, like pedestrians, cannot be directly controlled but is influenced by the actions of the autonomous vehicle or robot. In this paper, we present SPARC (Safe Prediction-Based Robust Controller for Coupled Agents), a novel framework designed to ensure safe control in the presence of coupled uncontrollable agents. SPARC leverages conformal prediction to quantify uncertainty in data-driven prediction of agent behavior. Particularly, we introduce a joint distribution-based approach to account for the coupled dynamics of the controlled system and uncontrollable agents. By integrating the control barrier function (CBF) technique, SPARC provides provable safety guarantees at a high confidence level. We illustrate our framework with a case study involving an autonomous driving scenario with walking pedestrians.
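The uncertainty-quantification step mentioned in this abstract can be illustrated with generic split conformal prediction. The sketch below is not SPARC's joint-distribution approach; it only shows the standard construction, assuming scalar prediction errors from a held-out calibration set. The error values and the names `conformal_quantile`, `calibration_errors`, and `radius` are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score.

    Returns math.inf when the calibration set is too small for the
    requested coverage level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

# Hypothetical calibration errors (metres) between predicted and observed
# pedestrian positions; these numbers are made up for illustration.
calibration_errors = [0.12, 0.30, 0.05, 0.22, 0.41, 0.18, 0.09, 0.27, 0.35]

# Radius of a prediction region intended to cover a new error with
# probability at least 1 - alpha, assuming exchangeable data.
radius = conformal_quantile(calibration_errors, alpha=0.2)
```

Under the exchangeability assumption, a new prediction error falls within `radius` with probability at least 1 - alpha; a safety constraint could then be tightened by this margin.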
- [591] arXiv:2410.17821 (replaced) [pdf, other]
-
Title: On the formalization of the notion of a concurrent algorithm
Comments: There are several flaws in the definitions and proofs that are very serious and that most likely can only be remedied by starting from scratch
Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Logic in Computer Science (cs.LO)
Previous papers give accounts of quests for satisfactory formalizations of the classical informal notion of an algorithm and the contemporary informal notion of an interactive algorithm. In this paper, an attempt is made to generalize the results of the former quest to the contemporary informal notion of a concurrent algorithm. The notion of a concurrent proto-algorithm is introduced. The thought is that concurrent algorithms are equivalence classes of concurrent proto-algorithms under an appropriate equivalence relation. Three equivalence relations are defined. Two of them are deemed to be bounds for an appropriate equivalence relation and the third is likely an appropriate one. The connection between concurrency and non-determinism in the presented setting is also addressed.
- [592] arXiv:2410.21630 (replaced) [pdf, html, other]
-
Title: Constrained Nonlinear Kaczmarz Projection on Intersections of Manifolds for Coordinated Multi-Robot Mobile Manipulation
Comments: Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA) 2025
Subjects: Robotics (cs.RO)
Cooperative manipulation tasks impose various structure-, task-, and robot-specific constraints on mobile manipulators. However, current methods struggle to model and solve these myriad constraints simultaneously. We propose a twofold solution: first, we model constraints as a family of manifolds amenable to simultaneous solving. Second, we introduce the constrained nonlinear Kaczmarz (cNKZ) projection technique to produce constraint-satisfying solutions. Experiments show that cNKZ dramatically outperforms baseline approaches, which cannot find solutions at all. We integrate cNKZ with a sampling-based motion planning algorithm to generate complex, coordinated motions for 3 to 6 mobile manipulators (18–36 DoF), with cNKZ solving up to 80 nonlinear constraints simultaneously and achieving up to a 92% success rate in cluttered environments. We also demonstrate our approach on hardware using three Turtlebot3 Waffle Pi robots with OpenMANIPULATOR-X arms.
- [593] arXiv:2410.21897 (replaced) [pdf, html, other]
-
Title: Semi-Supervised Self-Learning Enhanced Music Emotion Recognition
Comments: 12 pages, 2 figures
Journal-ref: Proceedings of the 11th Conference on Sound and Music Technology. CSMT 2024. Lecture Notes in Electrical Engineering. Springer, Singapore
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. However, the public datasets currently available for MER have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. The predicted segment-level results are then aggregated to obtain the prediction for the entire song. Most commonly, each segment inherits the label of the clip containing it, but music emotion is not constant throughout a clip. Doing so introduces label noise and makes training prone to overfitting. To handle the noisy label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.
- [594] arXiv:2410.23749 (replaced) [pdf, html, other]
-
Title: LSEAttention is All You Need for Time Series Forecasting
Comments: 8 pages with referencing, 1 figure, 5 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Transformer-based architectures have achieved remarkable success in natural language processing and computer vision. However, their performance in multivariate long-term forecasting often falls short compared to simpler linear baselines. Previous research has identified the traditional attention mechanism as a key factor limiting their effectiveness in this domain. To bridge this gap, we introduce LATST, a novel approach designed to mitigate entropy collapse and training instability, common challenges in Transformer-based time series forecasting. We rigorously evaluate LATST across multiple real-world multivariate time series datasets, demonstrating its ability to outperform existing state-of-the-art Transformer models. Notably, LATST manages to achieve competitive performance with fewer parameters than some linear models on certain datasets, highlighting its efficiency and effectiveness.
- [595] arXiv:2411.01739 (replaced) [pdf, html, other]
-
Title: Not Just Object, But State: Compositional Incremental Learning without Forgetting
Comments: NeurIPS 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Most incremental learners excessively prioritize coarse classes of objects while neglecting various kinds of states (e.g., color and material) attached to the objects. As a result, they are limited in their ability to reason about the fine-grained compositionality of state-object pairs. To remedy this limitation, we propose a novel task called Compositional Incremental Learning (composition-IL), enabling the model to recognize state-object compositions as a whole in an incremental learning fashion. Given the lack of suitable benchmarks, we re-organize two existing datasets and tailor them for composition-IL. Then, we propose a prompt-based Composition Incremental Learner (CompILer) to overcome the ambiguous composition boundary problem, which is a major challenge for composition-IL. Specifically, we exploit multi-pool prompt learning, which is regularized by inter-pool prompt discrepancy and intra-pool prompt diversity. In addition, we devise object-injected state prompting by using object prompts to guide the selection of state prompts. Furthermore, we fuse the selected prompts by a generalized-mean strategy to eliminate irrelevant information learned in the prompts. Extensive experiments on two datasets exhibit state-of-the-art performance achieved by CompILer.
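The generalized-mean fusion mentioned in this abstract can be illustrated with the standard power mean, M_p(x) = ((1/n) Σ x_i^p)^(1/p). The abstract does not give the exact formulation used in CompILer, so the sketch below is only a generic illustration. It assumes strictly positive prompt entries (real prompt embeddings may be negative and would need a different treatment), and the names `generalized_mean`, `fuse_prompts`, and the toy vectors are hypothetical.

```python
def generalized_mean(values, p):
    """Power mean M_p = (mean(v**p))**(1/p) for positive values and p != 0."""
    if p == 0:
        raise ValueError("p must be non-zero; the p -> 0 limit is the geometric mean")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def fuse_prompts(prompts, p):
    """Fuse equal-length prompt vectors element-wise with a generalized mean."""
    return [generalized_mean(column, p) for column in zip(*prompts)]

# Two hypothetical selected prompts (toy 2-dimensional vectors).
prompts = [[1.0, 4.0], [3.0, 4.0]]

arithmetic = fuse_prompts(prompts, p=1)  # p = 1 is the plain average
sharper = fuse_prompts(prompts, p=3)     # larger p weights larger entries more
```

With p = 1 the fusion reduces to averaging; increasing p moves the result toward the element-wise maximum, which is one way a fusion rule can suppress weaker entries.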
- [596] arXiv:2411.03055 (replaced) [pdf, html, other]
-
Title: ATM: Improving Model Merging by Alternating Tuning and Merging
Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Fabrizio Silvestri, Emanuele Rodolà
Comments: Main paper: 9 Pages, 9 figures, 1 table
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, if the optimization is performed via gradient descent, task vectors are after one step mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that the effectiveness of task vectors is largely driven by the first epoch's gradient. Given this parallel between task vectors and gradients, we propose viewing model merging as a single step in an iterative process that alternates between tuning and merging (ATM). We then propose two ways to utilize ATM. The first is to replace multi-task learning with ATM in scenarios where data sharing is prohibited, such as federated learning. The second is to improve the outcome of any model merging algorithm by applying a few post-hoc iterations of ATM on a small validation dataset, which is commonly available for hyperparameter tuning. Finally, we provide both empirical and theoretical support for the effectiveness of ATM, demonstrating that it minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
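The one-step equivalence between task vectors and multi-task gradients described in this abstract can be checked on a toy problem. The sketch below is a minimal illustration and not the authors' code: it assumes two quadratic task losses L_t(θ) = ½‖θ − c_t‖², and the centers `centers`, learning rate `eta`, and merge coefficient `alpha` are hypothetical values. Each task is fine-tuned for one gradient-descent step from a shared initialization, and the task-arithmetic merge is compared with one gradient step on the summed loss.

```python
# Toy check: after one gradient-descent step per task, merging task vectors
# equals one gradient step on the summed multi-task loss.
# All values (centers, eta, alpha) are hypothetical, chosen for illustration.

def grad(theta, center):
    """Gradient of the toy loss 0.5 * ||theta - center||^2."""
    return [t - c for t, c in zip(theta, center)]

def gd_step(theta, g, lr):
    """One plain gradient-descent update."""
    return [t - lr * gi for t, gi in zip(theta, g)]

theta0 = [1.0, -2.0]                   # shared pretrained initialization
centers = [[3.0, 0.0], [-1.0, 4.0]]    # one toy optimum per task
eta, alpha = 0.1, 0.5                  # fine-tuning lr and merge coefficient

# Task vectors: tau_t = theta_t - theta0 = -eta * grad L_t(theta0).
task_vectors = []
for c in centers:
    theta_t = gd_step(theta0, grad(theta0, c), eta)
    task_vectors.append([a - b for a, b in zip(theta_t, theta0)])

# Task arithmetic: theta0 + alpha * sum_t tau_t.
merged = [t0 + alpha * sum(tv[i] for tv in task_vectors)
          for i, t0 in enumerate(theta0)]

# Multi-task step: gradient of sum_t L_t at theta0, with step size alpha * eta.
multi_grad = [sum(grad(theta0, c)[i] for c in centers)
              for i in range(len(theta0))]
multi_task = gd_step(theta0, multi_grad, alpha * eta)

# Largest coordinate-wise difference between the two results.
max_gap = max(abs(a - b) for a, b in zip(merged, multi_task))
```

The two results agree up to floating-point error because every task gradient is evaluated at the same point `theta0`. After further steps the per-task iterates diverge, which is consistent with the abstract's statement that task vectors only approximate the multi-task gradients in later epochs.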
- [597] arXiv:2411.07286 (replaced) [pdf, other]
-
Title: Multiple scales analysis of a nonlinear timestepping instability in simulations of solitons
Comments: 28 pages, 13 figures, 3 tables
Journal-ref: Journal of Computational Physics, 531, 113923 (2025)
Subjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
The susceptibility of timestepping algorithms to numerical instabilities is an important consideration when simulating partial differential equations (PDEs). Here we identify and analyze a pernicious numerical instability arising in pseudospectral simulations of nonlinear wave propagation resulting in finite-time blow-up. The blow-up time scale is independent of the spatial resolution and spectral basis but sensitive to the timestepping scheme and the timestep size. The instability appears in multi-step and multi-stage implicit-explicit (IMEX) timestepping schemes of different orders of accuracy and has been found to manifest in simulations of soliton solutions of the Korteweg-de Vries (KdV) equation and traveling wave solutions of a nonlinear generalized Klein-Gordon equation. Focusing on the case of KdV solitons, we show that modal predictions from linear stability theory are unable to explain the instability because the spurious growth from linear dispersion is small and nonlinear sources of error growth converge too slowly in the limit of small timestep size. We then develop a novel multi-scale asymptotic framework that captures the slow, nonlinear accumulation of timestepping errors. The framework allows the solution to vary with respect to multiple time scales related to the timestep size and thus recovers the instability as a function of a slow time scale dictated by the order of accuracy of the timestepping scheme. We show that this approach correctly describes our simulations of solitons by making accurate predictions of the blow-up time scale and transient features of the instability. Our work demonstrates that studies of long-time simulations of nonlinear waves should exercise caution when validating their timestepping schemes.
- [598] arXiv:2411.10107 (replaced) [pdf, other]
-
Title: Monotone Contractions
Comments: To appear in STOC'25
Subjects: Computational Complexity (cs.CC)
We study functions $f : [0, 1]^d \rightarrow [0, 1]^d$ that are both monotone and contracting, and we consider the problem of finding an $\varepsilon$-approximate fixed point of $f$. We show that the problem lies in the complexity class UEOPL. We give an algorithm that finds an $\varepsilon$-approximate fixed point of a three-dimensional monotone contraction using $O(\log (1/\varepsilon))$ queries to $f$. We also give a decomposition theorem that allows us to use this result to obtain an algorithm that finds an $\varepsilon$-approximate fixed point of a $d$-dimensional monotone contraction using $O((c \cdot \log (1/\varepsilon))^{\lceil d / 3 \rceil})$ queries to $f$ for some constant $c$. Moreover, each step of both of our algorithms takes time that is polynomial in the representation of $f$. These results are strictly better than the best-known results for functions that are only monotone, or only contracting.
All of our results also apply to Shapley stochastic games, which are known to be reducible to the monotone contraction problem. Thus we put Shapley games in UEOPL, and we give a faster algorithm for approximating the value of a Shapley game.
- [599] arXiv:2411.10659 (replaced) [pdf, html, other]
-
Title: Spineless Traversal for Layout Invalidation
Subjects: Programming Languages (cs.PL)
Latency is a major concern for web rendering engines like those in Chrome, Safari, and Firefox. These engines reduce latency by using an incremental layout algorithm to redraw the page when the user interacts with it. In such an algorithm, elements that change frame-to-frame are marked dirty; only the dirty elements need be processed to draw the next frame, dramatically reducing latency. However, the standard incremental layout algorithm must search the page for dirty elements, accessing a number of auxiliary elements in the process. These auxiliary elements add cache misses and stalled cycles, and are responsible for a sizable fraction of all layout latency. We introduce a new, faster incremental layout algorithm called Spineless Traversal. Spineless Traversal uses a more computationally demanding priority queue algorithm to avoid the need to access auxiliary nodes and thus reduces cache traffic and stalls. This leads to dramatic speedups on the most latency-critical interactions such as hovering, typing, or animations. Moreover, thanks to numerous low-level optimizations, we are able to make Spineless Traversal competitive across the whole spectrum of incremental layout workloads. As a result, across 2216 benchmarks, Spineless Traversal is faster on 78.2% of the benchmarks, with a mean speedup of 3.23x concentrated in the most latency-critical interactions such as hovering, typing, and animations.
- [600] arXiv:2411.13768 (replaced) [pdf, html, other]
-
Title: Evaluation-Driven Development of LLM Agents: A Process Model and Reference Architecture
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have enabled the emergence of LLM agents: autonomous systems capable of achieving under-specified goals and adapting post-deployment, often without explicit code or model changes. Evaluating these agents is critical to ensuring their performance and safety, especially given their dynamic, probabilistic, and evolving nature. However, traditional approaches such as predefined test cases and standard redevelopment pipelines struggle to address the unique challenges of LLM agent evaluation. These challenges include capturing open-ended behaviors, handling emergent outcomes, and enabling continuous adaptation over the agent's lifecycle. To address these issues, we propose an evaluation-driven development approach, inspired by test-driven and behavior-driven development but reimagined for the unique characteristics of LLM agents. Through a multivocal literature review (MLR), we synthesize the limitations of existing LLM evaluation methods and introduce a novel process model and reference architecture tailored for evaluation-driven development of LLM agents. Our approach integrates online (runtime) and offline (redevelopment) evaluations, enabling adaptive runtime adjustments and systematic iterative refinement of pipelines, artifacts, system architecture, and LLMs themselves. By continuously incorporating evaluation results, including fine-grained feedback from human and AI evaluators, into each stage of development and operation, this framework ensures that LLM agents remain aligned with evolving goals, user needs, and governance standards.
- [601] arXiv:2411.13990 (replaced) [pdf, html, other]
-
Title: Repository-level Code Translation Benchmark Targeting Rust
Subjects: Software Engineering (cs.SE)
Recent advancements in large language models (LLMs) have demonstrated impressive capabilities in code translation, typically evaluated using benchmarks like CodeTransOcean. However, these benchmarks fail to capture real-world complexities by focusing primarily on simple function-level translations and overlooking repository-level context (e.g., dependencies). Moreover, LLMs' effectiveness in translating to newer, low-resource languages like Rust remains largely underexplored. To address this gap, we introduce RustRepoTrans, the first repository-level code translation benchmark, comprising 375 tasks translating into Rust from C++, Java, and Python. Using this benchmark, we evaluate four state-of-the-art LLMs, analyzing their errors to assess limitations in complex translation scenarios. Among them, Claude-3.5 performs best with 43.5% Pass@1, excelling in both basic functionality and additional translation abilities, such as noise robustness and syntactical difference identification. However, even Claude-3.5 experiences a 30.8-percentage-point performance drop (Pass@1 from 74.3% to 43.5%) when handling repository-level context compared to previous benchmarks without such context. We also find that LLMs struggle with language differences in complex tasks, and dependencies further increase translation difficulty.
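The drop quoted in this abstract is the difference between the two Pass@1 figures, so it is measured in percentage points; relative to the 74.3% baseline the decline is larger. The short calculation below uses only the two figures from the abstract; the variable names are illustrative.

```python
# Pass@1 figures quoted in the abstract (percent).
pass_at_1_without_context = 74.3
pass_at_1_with_context = 43.5

# Absolute drop in percentage points.
absolute_drop = pass_at_1_without_context - pass_at_1_with_context

# Relative drop as a percentage of the baseline score.
relative_drop = 100 * absolute_drop / pass_at_1_without_context

print(f"absolute drop: {absolute_drop:.1f} points")  # 30.8
print(f"relative drop: {relative_drop:.1f}%")        # 41.5
```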
- [602] arXiv:2411.14522 (replaced) [pdf, html, other]
-
Title: GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Authors: Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.
- [603] arXiv:2411.14847 (replaced) [pdf, html, other]
-
Title: Dynamics-Aware Gaussian Splatting Streaming Towards Fast On-the-Fly 4D Reconstruction
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The recent development of 3D Gaussian Splatting (3DGS) has led to great interest in 4D dynamic spatial reconstruction. Existing approaches mainly rely on full-length multi-view videos, while there has been limited exploration of online reconstruction methods that enable on-the-fly training and per-timestep streaming. Current 3DGS-based streaming methods treat the Gaussian primitives uniformly and constantly renew the densified Gaussians, thereby overlooking the difference between dynamic and static features as well as neglecting the temporal continuity in the scene. To address these limitations, we propose a novel three-stage pipeline for iterative streamable 4D dynamic spatial reconstruction. Our pipeline comprises a selective inheritance stage to preserve temporal continuity, a dynamics-aware shift stage to distinguish dynamic and static primitives and optimize their movements, and an error-guided densification stage to accommodate emerging objects. Our method achieves state-of-the-art performance in online 4D reconstruction, demonstrating the fastest on-the-fly training, superior representation quality, and real-time rendering capability. Project page: this https URL
- [604] arXiv:2411.15482 (replaced) [pdf, html, other]
-
Title: SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB/depth/flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic object identification by distilling features from 2D foundation models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo and KITTI Datasets validate SplatFlow's state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios.
- [605] arXiv:2411.15638 (replaced) [pdf, other]
-
Title: Learning state and proposal dynamics in state-space models using differentiable particle filters and neural networks
Comments: update to accepted version
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
State-space models are a popular statistical framework for analysing sequential data. Within this framework, particle filters are often used to perform inference on non-linear state-space models. We introduce a new method, StateMixNN, that uses a pair of neural networks to learn the proposal distribution and transition distribution of a particle filter. Both distributions are approximated using multivariate Gaussian mixtures. The component means and covariances of these mixtures are learnt as outputs of learned functions. Our method is trained targeting the log-likelihood, thereby requiring only the observation series, and combines the interpretability of state-space models with the flexibility and approximation power of artificial neural networks. The proposed method significantly improves recovery of the hidden state in comparison with the state-of-the-art, showing greater improvement in highly non-linear scenarios.
- [606] arXiv:2411.16180 (replaced) [pdf, html, other]
-
Title: Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
- [607] arXiv:2411.16199 (replaced) [pdf, html, other]
-
Title: VIRES: Video Instance Repainting via Sketch and Text Guided Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings. Project page: this https URL
- [608] arXiv:2411.18620 (replaced) [pdf, html, other]
-
Title: Cross-modal Information Flow in Multimodal Large Language Models
Journal-ref: CVPR2025
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing. Our code and collected dataset are released here: this https URL.
- [609] arXiv:2411.18711 (replaced) [pdf, other]
-
Title: Evaluating Vision-Language Models as Evaluators in Path Planning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.
- [610] arXiv:2411.19472 (replaced) [pdf, html, other]
-
Title: A Catalog of Micro Frontends Anti-patterns
Subjects: Software Engineering (cs.SE)
Micro frontend (MFE) architectures have gained significant popularity for promoting independence and modularity in development. Despite their widespread adoption, the field remains relatively unexplored, especially concerning identifying problems and documenting best practices. Drawing on both established microservice (MS) anti-patterns and the analysis of real problems faced by software development teams that adopt MFE, this paper presents a catalog of 12 MFE anti-patterns. We composed an initial version of the catalog by recognizing parallels between MS anti-patterns and recurring issues in MFE projects to map and adapt MS anti-patterns to the context of MFE. To validate the identified problems and proposed solutions, we conducted a survey with industry practitioners, collecting valuable feedback to refine the anti-patterns. Additionally, we asked participants if they had encountered these problems in practice and to rate their harmfulness on a 10-point Likert scale. The survey results revealed that participants had encountered all the proposed anti-patterns in real-world MFE architectures, with only one reported by less than 50% of participants. They stated that the catalog can serve as a valuable guide for both new and experienced developers, with the potential to enhance MFE development quality. The collected feedback led to the development of an improved version of the anti-patterns catalog. Furthermore, we developed a web application designed to not only showcase the anti-patterns but also to actively foster collaboration and engagement within the MFE community. The proposed catalog is a valuable resource for identifying and mitigating potential pitfalls in MFE development. It empowers developers of all experience levels to create more robust, maintainable, and well-designed MFE applications.
- [611] arXiv:2411.19835 (replaced) [pdf, html, other]
-
Title: Feedback-driven object detection and iterative model improvement
Journal-ref: https://www.gfai.de/fileadmin/Downloads/Tagungsband/gfai-tagungsband-2024.pdf
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Automated object detection has become increasingly valuable across diverse applications, yet efficient, high-quality annotation remains a persistent challenge. In this paper, we present the development and evaluation of a platform designed to interactively improve object detection models. The platform allows uploading and annotating images as well as fine-tuning object detection models. Users can then manually review and refine annotations, further creating improved snapshots that are used for automatic object detection on subsequent image uploads - a process we refer to as semi-automatic annotation resulting in a significant gain in annotation efficiency.
Whereas iterative refinement of model results to speed up annotation has become common practice, we are the first to quantitatively evaluate its benefits with respect to time, effort, and interaction savings. Our experimental results show clear evidence for a significant time reduction of up to 53% for semi-automatic compared to manual annotation. Importantly, these efficiency gains did not compromise annotation quality: semi-automatic annotations matched or occasionally even exceeded the accuracy of manual annotations. These findings demonstrate the potential of our lightweight annotation platform for creating high-quality object detection datasets and provide best practices to guide future development of annotation platforms.
The platform is open-source, with the frontend and backend repositories available on GitHub. To support the understanding of our labeling process, we have created an explanatory video demonstrating the methodology using microscopy images of E. coli bacteria as an example.
- [612] arXiv:2412.00493 (replaced) [pdf, html, other]
-
Title: Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Comments: Accepted by CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. In addition, we have implemented a maximum coverage sampling technique to optimize the trade-off between computational cost and performance. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
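The maximum coverage sampling mentioned above can be illustrated with the classic greedy heuristic for the maximum coverage problem. This is only a sketch under stated assumptions: the abstract does not say how the authors solve the selection problem, and the frame-to-voxel mapping, the `budget` parameter, and the function name `greedy_max_coverage` are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_max_coverage(frame_voxels: Dict[int, Set[int]], budget: int) -> List[int]:
    """Pick up to `budget` frames that together cover as many voxels as possible.

    Greedy rule: repeatedly take the frame that adds the most voxels not yet covered.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = dict(frame_voxels)
    while remaining and len(chosen) < budget:
        # Frame with the largest marginal gain; ties go to the lowest frame id.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical toy scene: frame id -> ids of the voxels visible in that frame.
frames = {
    0: {1, 2, 3},
    1: {3, 4},
    2: {4, 5, 6, 7},
    3: {1, 7},
}
selected = greedy_max_coverage(frames, budget=2)
print(selected)  # [2, 0]
```

A smaller `budget` lowers the number of frames passed to the model at the price of scene coverage, which is the cost/performance trade-off the abstract refers to.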
- [613] arXiv:2412.00692 (replaced) [pdf, html, other]
-
Title: MCBLT: Multi-Camera Multi-Object 3D Tracking in Long Videos
Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, Laura Leal-Taixé
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly handling important 3D information by multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named MCBLT, which first aggregates multi-view images with necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, MCBLT has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed MCBLT establishes a new state-of-the-art on the AICity'24 dataset with 81.22 HOTA, and on the WildTrack dataset with 95.6 IDF1.
- [614] arXiv:2412.01095 (replaced) [pdf, html, other]
-
Title: VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Comments: Accepted in CVPR 2025
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.
- [615] arXiv:2412.01477 (replaced) [pdf, html, other]
-
Title: Improving Object Detection by Modifying Synthetic Data with Explainable AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Limited real-world data severely impacts model performance in many computer vision domains, particularly for samples that are underrepresented in training. Synthetically generated images are a promising solution, but 1) it remains unclear how to design synthetic training data to optimally improve model performance (e.g., whether and where to introduce more realism or more abstraction) and 2) the domain expertise, time and effort required from human operators for this design and optimisation process represents a major practical challenge. Here we propose a novel conceptual approach to improve the efficiency of designing synthetic images, by using robust Explainable AI (XAI) techniques to guide a human-in-the-loop process of modifying 3D mesh models used to generate these images. Importantly, this framework allows both modifications that increase and decrease realism in synthetic data, which can both improve model performance. We illustrate this concept using a real-world example where data are sparse: detection of vehicles in infrared imagery. We fine-tune an initial YOLOv8 model on the ATR DSIAC infrared dataset and synthetic images generated from 3D mesh models in the Unity gaming engine, and then use XAI saliency maps to guide modification of our Unity models. We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6% (to mAP50 = 94.6%). We further improve performance by an additional 1.5% (to 96.1%) through our new XAI-guided approach, which reduces misclassifications through both increasing and decreasing the realism of different parts of the synthetic data. Our proof-of-concept results pave the way for fine, XAI-controlled curation of synthetic datasets tailored to improve object detection performance, whilst simultaneously reducing the burden on human operators in designing and optimising these datasets.
- [616] arXiv:2412.02340 (replaced) [pdf, other]
-
Title: PAPAYA Federated Analytics Stack: Engineering Privacy, Scalability and Practicality
Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, Shoubhik Bhattacharya
Subjects: Machine Learning (cs.LG)
Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users' devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA's broad relevance, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build and deploy a system that overcomes these limitations. Our FA system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.
- [617] arXiv:2412.02479 (replaced) [pdf, html, other]
-
Title: OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain complex Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 facial recognition models and 3 commercial APIs, along with extended physical experiments on face masks to assess their robustness. Next, we explore potential solutions from two perspectives: defense strategies and Vision-Language Models (VLMs). Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.
- [618] arXiv:2412.02798 (replaced) [pdf, html, other]
-
Title: Grayscale to Hyperspectral at Any Resolution Using a Phase-Only Lens
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
We consider the problem of reconstructing an H×W×31 hyperspectral image from an H×W grayscale snapshot measurement that is captured using only a single diffractive optic and a filterless panchromatic photosensor. This problem is severely ill-posed, but we present the first model that produces high-quality results. We make efficient use of limited data by training a conditional denoising diffusion model that operates on small patches in a shift-invariant manner. During inference, we synchronize per-patch hyperspectral predictions using guidance derived from the optical point spread function (PSF). Surprisingly, our experiments reveal that patch sizes as small as the PSF's support achieve excellent results, and they show that local optical cues are sufficient to capture full spectral information. Moreover, by drawing multiple samples, our model provides per-pixel uncertainty estimates that strongly correlate with reconstruction error. Our work lays the foundation for a new class of high-resolution snapshot hyperspectral imagers that are compact and light-efficient.
- [619] arXiv:2412.03044 (replaced) [pdf, html, other]
-
Title: Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision, commonly tackled through reconstruction-based methods. However, these methods struggle with two key limitations: (1) insufficient robustness in open-set scenarios, where unseen normal motions are frequently misclassified as anomalies, and (2) an overemphasis on, but restricted capacity for, reconstructing local motions, which are inherently difficult to capture accurately due to their diversity. To overcome these challenges, we introduce a novel frequency-guided diffusion model with perturbation training. First, we enhance robustness by training a generator to produce perturbed samples, which are similar to normal samples and target the weakness of the reconstruction model. This training paradigm expands the reconstruction domain of the model, improving its generalization to unseen normal motions. Second, to address the overemphasis on motion details, we employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components. By guiding the diffusion model with observed high-frequency information, we prioritize the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. Extensive experiments on five widely used VAD datasets demonstrate that our approach surpasses state-of-the-art methods, underscoring its effectiveness in open-set scenarios and diverse motion contexts. Our project website is this https URL.
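The frequency separation described above can be sketched with a plain orthonormal 2D DCT-II. The following is a minimal illustration, not the authors' implementation: the array layout (rows as frames, columns as joint coordinates), the `u + v < cutoff` rule for what counts as low frequency, and all function names are assumptions made for this example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def dct2(x: Matrix) -> Matrix:
    """2D DCT computed as C_rows @ x @ C_cols^T."""
    c_rows, c_cols = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(c_rows, x), transpose(c_cols))


def idct2(coeffs: Matrix) -> Matrix:
    """Inverse 2D DCT; the basis is orthonormal, so the inverse is the transpose."""
    c_rows, c_cols = dct_matrix(len(coeffs)), dct_matrix(len(coeffs[0]))
    return matmul(matmul(transpose(c_rows), coeffs), c_cols)


def split_frequencies(x: Matrix, cutoff: int) -> Tuple[Matrix, Matrix]:
    """Return (low, high) signals; coefficient (u, v) counts as low if u + v < cutoff."""
    coeffs = dct2(x)
    low = [[c if u + v < cutoff else 0.0 for v, c in enumerate(row)]
           for u, row in enumerate(coeffs)]
    high = [[c - l for c, l in zip(rc, rl)] for rc, rl in zip(coeffs, low)]
    return idct2(low), idct2(high)


# Hypothetical toy motion: 4 frames (rows) x 4 joint coordinates (columns).
motion = [
    [0.0, 1.0, 2.0, 3.0],
    [0.1, 1.2, 1.9, 3.1],
    [0.2, 0.9, 2.1, 2.8],
    [0.3, 1.1, 2.0, 3.2],
]
low_part, high_part = split_frequencies(motion, cutoff=2)
```

Because the transform is linear, `low_part` and `high_part` sum back to `motion`, so a model can be conditioned on one band while reconstructing the other.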
- [620] arXiv:2412.03215 (replaced) [pdf, other]
-
Title: Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
- [621] arXiv:2412.06602 (replaced) [pdf, html, other]
-
Title: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
Comments: A comprehensive survey on controllable TTS, 26 pages, 7 tables, 6 figures, 317 references. Under review
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. In addition, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this work, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industrial practitioners.
- [622] arXiv:2412.06779 (replaced) [pdf, html, other]
-
Title: AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
Comments: Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Performing general language-conditioned bimanual manipulation tasks is of great importance for many applications ranging from household service to industrial assembly. However, collecting bimanual manipulation data is expensive due to the high-dimensional action space, which makes it difficult for conventional methods to handle general bimanual manipulation tasks. In contrast, unimanual policies have recently demonstrated impressive generalizability across a wide range of tasks because of scaled model parameters and training data, and can provide sharable manipulation knowledge for bimanual systems. To this end, we propose a plug-and-play method named AnyBimanual, which transfers a pre-trained unimanual policy to a general bimanual manipulation policy with few bimanual demonstrations. Specifically, we first introduce a skill manager to dynamically schedule the skill representations discovered from the pre-trained unimanual policy for bimanual manipulation tasks, which linearly combines skill primitives with task-oriented compensation to represent the bimanual manipulation instruction. To mitigate the observation discrepancy between unimanual and bimanual systems, we present a visual aligner to generate soft masks for the visual embedding of the workspace, which aims to align the visual input of the unimanual policy model for each arm with that seen during the pretraining stage. AnyBimanual shows superiority on 12 simulated tasks from RLBench2 with a sizable 12.67% improvement in success rate over previous methods. Experiments on 9 real-world tasks further verify its practicality with an average success rate of 84.62%.
- [623] arXiv:2412.07534 (replaced) [pdf, html, other]
-
Title: ReCap: Better Gaussian Relighting with Cross-Environment Captures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate relighting of 3D objects in diverse unseen environments is crucial for realistic virtual object placement. Due to the albedo-lighting ambiguity, existing methods often fall short in producing faithful relights. Without proper constraints, observed training views can be explained by numerous combinations of lighting and material attributes, lacking physical correspondence with the actual environment maps used for relighting. In this work, we present ReCap, treating cross-environment captures as a multi-task target to provide the missing supervision that cuts through the entanglement. Specifically, ReCap jointly optimizes multiple lighting representations that share a common set of material attributes. This naturally harmonizes a coherent set of lighting representations around the mutual material attributes, exploiting commonalities and differences across varied object appearances. Such coherence enables physically sound lighting reconstruction and robust material estimation - both essential for accurate relighting. Together with a streamlined shading function and effective post-processing, ReCap outperforms all leading competitors on an expanded relighting benchmark.
- [624] arXiv:2412.07776 (replaced) [pdf, html, other]
-
Title: Video Motion Transfer with Diffusion Transformers
Comments: CVPR 2025 - Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free, manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.
- [625] arXiv:2412.08503 (replaced) [pdf, html, other]
-
Title: StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
Comments: Accepted by CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly with overfitting to reference styles, limiting stylistic control, and misaligning with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
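For readers unfamiliar with Adaptive Instance Normalization, the sketch below shows the standard single-modality AdaIN operation on one feature channel: the content features are normalized by their own mean and standard deviation, then rescaled to the style statistics. This is not the cross-modal variant proposed in the paper, whose details the abstract does not give; the function names and toy inputs are assumptions.

```python
import math


def mean_std(values, eps=1e-5):
    """Population mean and standard deviation; eps keeps the std away from zero."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content, style, eps=1e-5):
    """Standard AdaIN for one channel: style_std * (x - content_mean) / content_std + style_mean."""
    c_mean, c_std = mean_std(content, eps)
    s_mean, s_std = mean_std(style, eps)
    return [s_std * (c - c_mean) / c_std + s_mean for c in content]


# Toy example: the output keeps the ordering of the content values
# but takes on the mean and spread of the style values.
stylized = adain([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

In image models the same statistics are computed per channel over the spatial dimensions of a feature map.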
- [626] arXiv:2412.09599 (replaced) [pdf, html, other]
-
Title: RatBodyFormer: Rat Body Surface from Keypoints
Ayaka Higami, Karin Oshima, Tomoyo Isoguchi Shiramatsu, Hirokazu Takahashi, Shohei Nobuhara, Ko Nishino
Comments: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Analyzing rat behavior lies at the heart of many scientific studies. Past methods for automated rodent modeling have focused on 3D pose estimation from keypoints, e.g., face and appendages. The pose, however, does not capture the rich body surface movement encoding the subtle rat behaviors like curling and stretching. The body surface lacks features that can be visually defined, evading these established keypoint-based methods. In this paper, we introduce the first method for reconstructing the rat body surface as a dense set of points by learning to predict it from the sparse keypoints that can be detected with past methods. Our method consists of two key contributions. The first is RatDome, a novel multi-camera system for rat behavior capture, and a large-scale dataset captured with it that consists of pairs of 3D keypoints and 3D body surface points. The second is RatBodyFormer, a novel network to transform detected keypoints to 3D body surface points. RatBodyFormer is agnostic to the exact locations of the 3D body surface points in the training data and is trained with masked-learning. We experimentally validate our framework with a number of real-world experiments. Our results collectively serve as a novel foundation for automated rat behavior analysis.
- [627] arXiv:2412.09603 (replaced) [pdf, html, other]
-
Title: Do Multimodal Large Language Models See Like Humans?
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. Diverse human participants attained strong performance, significantly outperforming MLLMs, which further underscores the benchmark's high quality. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.
- [628] arXiv:2412.10655 (replaced) [pdf, html, other]
-
Title: Optimal Static Dictionary with Worst-Case Constant Query Time
Comments: 31 pages, 4 figures, in STOC 2025
Subjects: Data Structures and Algorithms (cs.DS)
In this paper, we design a new succinct static dictionary with worst-case constant query time. A dictionary data structure stores a set of key-value pairs with distinct keys in $[U]$ and values in $[\sigma]$, such that given a query $x\in [U]$, it quickly returns if $x$ is one of the input keys, and if so, also returns its associated value. The textbook solution to dictionaries is hash tables. On the other hand, the (information-theoretical) optimal space to encode such a set of key-value pairs is only $\text{OPT} := \log\binom{U}{n}+n\log \sigma$.
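To make the space bound concrete, the following sketch evaluates $\text{OPT} = \log\binom{U}{n} + n\log\sigma$ for given parameters. It assumes base-2 logarithms (so the result is in bits) and that $[U]$ and $[\sigma]$ contain $U$ and $\sigma$ elements respectively; the function name and the example parameters are illustrative only.

```python
import math


def optimal_bits(universe_size, num_keys, num_values):
    """Information-theoretic space in bits: log2(C(U, n)) + n * log2(sigma).

    Assumes base-2 logarithms; math.comb returns an exact integer,
    and math.log2 accepts arbitrarily large integers.
    """
    key_bits = math.log2(math.comb(universe_size, num_keys))
    value_bits = num_keys * math.log2(num_values)
    return key_bits + value_bits


# Example: 100 keys from a universe of 2**16, each with an 8-bit value.
opt = optimal_bits(2**16, 100, 256)
print(f"OPT is about {opt:.1f} bits")
```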
We construct a dictionary that uses $\text{OPT} + n^{\epsilon}$ bits of space, and answers queries in constant time in worst case. Previously, constant-time dictionaries are only known with $\text{OPT} + n/\text{poly}\log n$ space [Pǎtraşcu 2008], or with $\text{OPT}+n^{\epsilon}$ space but expected constant query time [Yu 2020]. We emphasize that most of the extra $n^{\epsilon}$ bits are used to store a lookup table that does not depend on the input, and random bits for hash functions. The "main" data structure only occupies $\text{OPT}+\text{poly}\log n$ bits.
- [629] arXiv:2412.11890 (replaced) [pdf, html, other]
-
Title: SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
Comments: CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at this https URL.
- [630] arXiv:2412.15215 (replaced) [pdf, html, other]
-
Title: EnvGS: Modeling View-Dependent Appearance with Environment Gaussian
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Reconstructing complex reflections in real-world scenes from 2D images is essential for achieving photorealistic novel view synthesis. Existing methods that utilize environment maps to model reflections from distant lighting often struggle with high-frequency reflection details and fail to account for near-field reflections. In this work, we introduce EnvGS, a novel approach that employs a set of Gaussian primitives as an explicit 3D representation for capturing reflections of environments. These environment Gaussian primitives are incorporated with base Gaussian primitives to model the appearance of the whole scene. To efficiently render these environment Gaussian primitives, we developed a ray-tracing-based renderer that leverages the GPU's RT core for fast rendering. This allows us to jointly optimize our model for high-quality reconstruction while maintaining real-time rendering speeds. Results from multiple real-world and synthetic datasets demonstrate that our method produces significantly more detailed reflections, achieving the best rendering quality in real-time novel view synthesis. The code is available at this https URL.
- [631] arXiv:2412.15239 (replaced) [pdf, html, other]
-
Title: Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN); Methodology (stat.ME)
Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs of what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.
- [632] arXiv:2412.15814 (replaced) [pdf, other]
-
Title: Unveiling the Mechanisms of DAI: A Logic-Based Approach to Stablecoin Analysis
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO)
Stablecoins are digital assets designed to maintain a stable value, typically pegged to traditional currencies. Despite their growing prominence, many stablecoins have struggled to consistently meet stability expectations, and their underlying mechanisms often remain opaque and challenging to analyze. This paper focuses on the DAI stablecoin, which combines crypto-collateralization and algorithmic mechanisms. We propose a formal logic-based framework for representing the policies and operations of DAI, implemented in Prolog and released as open-source software. Our framework enables detailed analysis and simulation of DAI's stability mechanisms, providing a foundation for understanding its robustness and identifying potential vulnerabilities.
- [633] arXiv:2412.16218 (replaced) [pdf, html, other]
-
Title: GNN-Transformer Cooperative Architecture for Trustworthy Graph Contrastive Learning
Comments: In Proceedings of AAAI 2025
Subjects: Machine Learning (cs.LG)
Graph contrastive learning (GCL) has become a hot topic in the field of graph representation learning. In contrast to traditional supervised learning relying on a large number of labels, GCL exploits augmentation strategies to generate multiple views and positive/negative pairs, both of which greatly influence the performance. Unfortunately, commonly used random augmentations may disturb the underlying semantics of graphs. Moreover, traditional GNNs, a type of widely employed encoders in GCL, are inevitably confronted with over-smoothing and over-squashing problems. To address these issues, we propose GNN-Transformer Cooperative Architecture for Trustworthy Graph Contrastive Learning (GTCA), which inherits the advantages of both GNN and Transformer, incorporating graph topology to obtain comprehensive graph representations. Theoretical analysis verifies the trustworthiness of the proposed method. Extensive experiments on benchmark datasets demonstrate state-of-the-art empirical performance.
- [634] arXiv:2412.16604 (replaced) [pdf, html, other]
-
Title: OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Feed-forward 3D Gaussian splatting (3DGS) models have gained significant popularity due to their ability to generate scenes immediately without needing per-scene optimization. Although omnidirectional images are becoming more popular since they reduce the computation required for image stitching to composite a holistic scene, existing feed-forward models are only designed for perspective images. The unique optical properties of omnidirectional images make it difficult for feature encoders to correctly understand the context of the image and make the Gaussian non-uniform in space, which hinders the image quality synthesized from novel views. We propose OmniSplat, a training-free fast feed-forward 3DGS generation framework for omnidirectional images. We adopt a Yin-Yang grid and decompose images based on it to reduce the domain gap between omnidirectional and perspective images. The Yin-Yang grid can use the existing CNN structure as it is, but its quasi-uniform characteristic allows the decomposed image to be similar to a perspective image, so it can exploit the strong prior knowledge of the learned feed-forward network. OmniSplat demonstrates higher reconstruction accuracy than existing feed-forward networks trained on perspective images. Our project page is available on: this https URL.
- [635] arXiv:2412.16822 (replaced) [pdf, html, other]
-
Title: Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin
Comments: Accepted by CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One major efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffCR, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in efficient DiTs. Specifically, DiffCR integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is fine-tuned jointly with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the image becomes clearer. Extensive experiments on text-to-image and inpainting tasks show that DiffCR effectively captures dynamism across token, layer, and timestep axes, achieving superior trade-offs between generation quality and efficiency compared to prior works. The project website is available at this https URL.
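The token-level routing idea in (1) can be pictured with a hard top-k selection: tokens with the highest importance scores are processed by the layer, and the rest pass through unchanged. This is a simplified sketch, not DiffCR itself; the paper describes learned routers and differentiable ratios, while the fixed `keep_ratio`, the scalar "tokens", and the function name below are assumptions made for illustration.

```python
def route_tokens(tokens, scores, keep_ratio, layer_fn):
    """Apply layer_fn only to the highest-scoring tokens; others bypass the layer.

    keep_ratio is a hypothetical fixed fraction of tokens to process.
    """
    num_keep = max(0, min(len(tokens), round(keep_ratio * len(tokens))))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:num_keep])
    return [layer_fn(t) if i in keep else t for i, t in enumerate(tokens)]


# Toy example: scalar "tokens" and a layer that multiplies by 10.
# Only the two highest-scoring tokens (indices 0 and 2) are processed.
out = route_tokens([1, 2, 3, 4], [0.9, 0.1, 0.8, 0.2], 0.5, lambda t: t * 10)
```

A hard top-k is not differentiable with respect to the ratio, so a trainable version would need a relaxation of this selection step.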
- [636] arXiv:2412.17411 (replaced) [pdf, html, other]
-
Title: Pretraining with random noise for uncertainty calibration
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Uncertainty calibration is crucial for various machine learning applications, yet it remains challenging. Many models exhibit hallucinations - confident yet inaccurate responses - due to miscalibrated confidence. Here, we show that the common practice of random initialization in deep learning, often considered a standard technique, is an underlying cause of this miscalibration, leading to excessively high confidence in untrained networks. Our method, inspired by developmental neuroscience, addresses this issue by simply pretraining networks with random noise and labels, reducing overconfidence and bringing initial confidence levels closer to chance. This ensures optimal calibration, aligning confidence with accuracy during subsequent data training, without the need for additional pre- or post-processing. Pre-calibrated networks excel at identifying "unknown data," showing low confidence for out-of-distribution inputs, thereby resolving confidence miscalibration.
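One common way to quantify how well confidence aligns with accuracy is the expected calibration error (ECE). The abstract does not state which metric the authors use, so the sketch below is only an assumed illustration of the concept: predictions are binned by confidence, and the gap between mean confidence and accuracy is averaged across bins, weighted by bin size.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Overconfident toy model: 90% confidence but only 75% accuracy.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

For a K-class problem, an untrained network with chance-level confidence would report roughly 1/K for every input.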
- [637] arXiv:2412.17667 (replaced) [pdf, html, other]
-
Title: VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at this https URL.
- [638] arXiv:2412.17696 (replaced) [pdf, html, other]
-
Title: Understanding the Logic of Direct Preference Alignment through Logic
Subjects: Computation and Language (cs.CL)
Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
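For reference, the original DPO loss that these variants build on can be written for a single preference pair as $-\log\sigma\big(\beta[(\log\pi_\theta(y_w|x) - \log\pi_{\text{ref}}(y_w|x)) - (\log\pi_\theta(y_l|x) - \log\pi_{\text{ref}}(y_l|x))]\big)$. The sketch below computes it from sequence log-probabilities; it illustrates the standard loss only, not the symbolic formalism of the paper, and the argument names and default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# The loss is log(2) when the policy matches the reference model,
# and falls as the policy favors the chosen response more than the reference does.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```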
- [639] arXiv:2412.18609 (replaced) [pdf, html, other]
-
Title: Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Comments: CVPR 2025 camera-ready version
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a 6.5$\times$ reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4$\times$ faster processing speeds than previous methods. Code is available at this https URL.
- [640] arXiv:2412.20104 (replaced) [pdf, html, other]
-
Title: SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
Comments: 26 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
- [641] arXiv:2501.01409 (replaced) [pdf, html, other]
-
Title: JOG3R: Towards 3D-Consistent Video Generators
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator - OpenSora in our case - can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
- [642] arXiv:2501.01855 (replaced) [pdf, html, other]
-
Title: UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and $\text{AP}_{50}$ by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: this https URL
- [643] arXiv:2501.02471 (replaced) [pdf, html, other]
-
Title: Hengqin-RA-v1: Advanced Large Language Model for Diagnosis and Treatment of Rheumatoid Arthritis with Dataset based Traditional Chinese Medicine
Yishen Liu, Shengda Luo, Zishao Zhong, Tongtong Wu, Jianguo Zhang, Peiyao Ou, Yong Liang, Liang Liu, Hudan Pan
Comments: 8 pages, 5 figures, AAAI-2025 Workshop
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs), primarily trained on English texts, often face biases and inaccuracies in Chinese contexts. Their limitations are pronounced in fields like Traditional Chinese Medicine (TCM), where cultural and clinical subtleties are vital, and are further compounded by a lack of domain-specific data for conditions such as rheumatoid arthritis (RA). To address these issues, this paper introduces Hengqin-RA-v1, the first large language model specifically tailored for TCM with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a comprehensive RA-specific dataset curated from ancient Chinese medical literature, classical texts, and modern clinical studies. This dataset empowers Hengqin-RA-v1 to deliver accurate and culturally informed responses, effectively bridging the gaps left by general-purpose models. Extensive experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models, even surpassing the diagnostic accuracy of TCM practitioners in certain cases.
- [644] arXiv:2501.04765 (replaced) [pdf, html, other]
-
Title: TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training efficiency, each of these methods comes with a tradeoff: they achieve enhanced performance at the expense of increased computational cost or vice versa. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to the common transformer-based model - it can also be applied to state-space models and achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, which improves upon the DiT, without architectural changes.
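The routing mechanism described above, in which randomly selected tokens are carried from an early layer to a deeper one, can be sketched as follows. This is a toy illustration under stated assumptions: the function name, the `start`/`end` layer indices, the `route_ratio`, and the per-token scalar "layers" are all hypothetical, and real attention layers would also see a shorter sequence while tokens are routed around them.

```python
import random


def forward_with_route(tokens, layers, start, end, route_ratio, rng):
    """Run tokens through layers; a random subset skips layers[start:end] unchanged."""
    num_routed = int(route_ratio * len(tokens))
    routed = set(rng.sample(range(len(tokens)), num_routed))
    for depth, layer in enumerate(layers):
        if start <= depth < end:
            # Routed tokens are carried past this layer; the rest are processed.
            tokens = [t if i in routed else layer(t) for i, t in enumerate(tokens)]
        else:
            # Outside the route every token is processed as usual.
            tokens = [layer(t) for t in tokens]
    return tokens


# Toy example: four layers that each add 1. Routed tokens skip layers 1 and 2,
# so they end at 2 while the others end at 4.
layers = [lambda t: t + 1] * 4
out = forward_with_route([0, 0, 0, 0], layers, 1, 3, 0.5, random.Random(0))
```

Because routing only changes which tokens each layer receives, a scheme of this kind adds no parameters, which is consistent with the abstract's claim of no architectural modifications.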
- [645] arXiv:2501.05037 (replaced) [pdf, html, other]
-
Title: LongViTU: Instruction Tuning for Long-Form Video Understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also offer explicit timestamp annotations of relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results prove the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty presented by LongViTU questions. Performing supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data results in average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).
- [646] arXiv:2501.05510 (replaced) [pdf, html, other]
-
Title: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
Comments: CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at this https URL.
- [647] arXiv:2501.09766 (replaced) [pdf, html, other]
-
Title: iTool: Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning
Comments: under review ACL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Augmenting large language models (LLMs) with external tools is known as a promising approach to enhancing their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve it. Nevertheless, our investigation reveals that (1) training gains significantly decay as synthetic data increases. The model struggles to benefit from more synthetic data due to potential data diversity issues, resulting in poor performance in complex scenarios. Moreover, we find that (2) this challenge primarily manifests as minor discrepancies between the model's output and the ground truth response (termed as deficiency), such as errors in parameter values that require complex reasoning from the context to resolve. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate these challenges. This strategy involves: (1) enhancing the diversity of synthetic data through path exploration of Monte Carlo Tree Search; and (2) iteratively identifying deficiency-related data, constructing fine-grained preference pairs to pinpoint deficiencies, and then applying preference optimization to optimize these deficiencies. Our experiments show that models trained using our method achieve about 12% better performance than baseline models, outperforming larger open-source and closed-source models.
- [648] arXiv:2501.10356 (replaced) [pdf, html, other]
-
Title: DexForce: Extracting Force-informed Actions from Kinesthetic Demonstrations for Dexterous Manipulation
Comments: Videos can be found here: this https URL
Subjects: Robotics (cs.RO)
Imitation learning requires high-quality demonstrations consisting of sequences of state-action pairs. For contact-rich dexterous manipulation tasks, the actions in these state-action pairs must produce the right forces. Current widely-used methods for collecting dexterous manipulation demonstrations are difficult to use for demonstrating contact-rich tasks due to unintuitive human-to-robot motion retargeting and the lack of direct haptic feedback. Motivated by these concerns, we propose DexForce. DexForce leverages contact forces, measured during kinesthetic demonstrations, to compute force-informed actions for policy learning. We collect demonstrations for six tasks and show that policies trained on our force-informed actions achieve an average success rate of 76% across all tasks. In contrast, policies trained directly on actions that do not account for contact forces have near-zero success rates. We also conduct a study ablating the inclusion of force data in policy observations. We find that while using force data never hurts policy performance, it helps most for tasks that require advanced levels of precision and coordination, like opening an AirPods case and unscrewing a nut.
- [649] arXiv:2501.11441 (replaced) [pdf, html, other]
-
Title: Ontology Matching with Large Language Models and Prioritized Depth-First Search
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Ontology matching (OM) plays a key role in enabling data interoperability and knowledge sharing, but it remains challenging due to the need for large training datasets and limited vocabulary processing in machine learning approaches. Recently, methods based on Large Language Models (LLMs) have shown great promise in OM, particularly through the use of a retrieve-then-prompt pipeline. In this approach, relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite their potential, these systems still present limited performance and high computational overhead. To address these issues, we introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. We evaluated MILA using the biomedical challenge proposed in the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%. It also performed better than or comparable to the leading supervised OM systems. MILA further exhibited task-agnostic performance, remaining stable across all tasks and settings, while significantly reducing LLM requests. These findings highlight that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without the need for domain-specific heuristics or fine-tuning.
- [650] arXiv:2501.11841 (replaced) [pdf, html, other]
-
Title: Survey on Monocular Metric Depth Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Monocular Depth Estimation (MDE) is fundamental to computer vision, enabling spatial understanding, 3D reconstruction, and autonomous driving. Deep learning-based MDE predicts relative depth from a single image, but the lack of metric scale introduces inconsistencies, limiting applicability in tasks such as visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) overcomes this limitation by enabling precise scene-scale inference, improving depth consistency, enhancing stability in sequential tasks, and streamlining integration into practical systems. This paper systematically reviews the evolution of depth estimation, from traditional geometric methods to deep learning breakthroughs, emphasizing scale-agnostic approaches in zero-shot generalization, which is crucial for advancing MMDE. Recent progress in zero-shot MMDE is examined, focusing on challenges such as model generalization and boundary detail loss. To address these issues, researchers have explored unlabeled data augmentation, image patching, architectural optimization, and generative techniques. This review analyzes these developments, assessing their impact and limitations. Key findings are synthesized, unresolved challenges outlined, and future research directions proposed. By providing a clear technical roadmap and insight into emerging trends, this work aims to drive innovation and expand the real-world applications of MMDE.
- [651] arXiv:2501.12911 (replaced) [pdf, html, other]
-
Title: A Selective Homomorphic Encryption Approach for Faster Privacy-Preserving Federated Learning
Comments: 23 pages, 32 figures
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Federated learning is a machine learning method that supports training models on decentralized devices or servers, where each holds its local data, removing the need for data exchange. This approach is especially useful in healthcare, as it enables training on sensitive data without needing to share them. The nature of federated learning necessitates robust security precautions due to data leakage concerns during communication. To address this issue, we propose a new approach that employs selective encryption, homomorphic encryption, differential privacy, and bit-wise scrambling to minimize data leakage while achieving good execution performance. Our technique, FAS (fast and secure federated learning), is used to train deep learning models on medical imaging data. We implemented our technique using the Flower framework and compared it with a state-of-the-art federated learning approach that also uses selective homomorphic encryption. Our experiments were run in a cluster of eleven physical machines to create a real-world federated learning scenario on different datasets. We observed that our approach is up to 90% faster than applying fully homomorphic encryption on the model weights. In addition, we can avoid the pretraining step that is required by our competitor and can save up to 46% in terms of total execution time. While our approach was faster, it obtained similar security results as the competitor.
- [652] arXiv:2501.13470 (replaced) [pdf, html, other]
-
Title: Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation
Comments: 11 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: this https URL.
- [653] arXiv:2501.15303 (replaced) [pdf, other]
-
Title: Guarded Negation Transitive Closure Logic is 2-EXPTIME-complete
Subjects: Logic in Computer Science (cs.LO)
We consider guarded negation transitive closure logic (GNTC). In this paper, we show that the satisfiability problem for GNTC is in 2-EXPTIME (hence, 2-EXPTIME-complete from existing lower bound results), which improves the previously known non-elementary time upper bound. This extends previously known 2-EXPTIME upper bound results, e.g., for the guarded negation fragment of first-order logic, the unary negation fragment of first-order logic with regular path expressions, propositional dynamic logic (PDL) with intersection and converse, and CPDL+ (an extension of PDL with conjunctive queries) of bounded treewidth. To this end, we present a sound and complete local model checker on tree decompositions. This system has a closure property of size single exponential, and it induces a reduction from the satisfiability problem for GNTC into the non-emptiness problem for 2-way (weak) alternating parity tree automata in single exponential time. Additionally, we investigate the complexity of satisfiability and model checking for fragments of GNTC, such as guarded (quantification) fragments, unary negation fragments, and existential positive fragments.
- [654] arXiv:2501.16391 (replaced) [pdf, html, other]
-
Title: Inductive-Associative Meta-learning Pipeline with Human Cognitive Patterns for Unseen Drug-Target Interaction Prediction
Xiaoqing Lian, Jie Zhu, Tianxu Lv, Shiyun Nie, Hang Fan, Guosheng Wu, Yunjun Ge, Lihua Li, Xiangxiang Zeng, Xiang Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Significant differences in protein structures hinder the generalization of existing drug-target interaction (DTI) models, which often rely heavily on pre-learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive-Associative pipeline inspired by the workflow of scientists who base their accumulated expertise on drawing insights into novel drug-target pairs from weakly related references. BioBridge predicts novel drug-target interactions using limited sequence data, incorporating multi-level encoders with adversarial training to accumulate transferable binding principles. On the basis of these principles, BioBridge employs a dynamic prototype meta-learning framework to associate insights from weakly related annotations, enabling robust predictions for previously unseen drug-target pairs. Extensive experiments demonstrate that BioBridge surpasses existing models, especially for unseen proteins. Notably, when only homologous protein binding data is available, BioBridge proves effective for virtual screening of the epidermal growth factor receptor and adenosine receptor, underscoring its potential in drug discovery.
- [655] arXiv:2501.18038 (replaced) [pdf, other]
-
Title: A Case Study in Acceleration AI Ethics: The TELUS GenAI Conversational Agent
Subjects: Computers and Society (cs.CY)
Acceleration ethics addresses the tension between innovation and safety in artificial intelligence. The acceleration argument is that risks raised by innovation should be answered with still more innovating. This paper summarizes the theoretical position, and then shows how acceleration ethics works in a real case. To begin, the paper summarizes acceleration ethics as composed of five elements: innovation solves innovation problems, innovation is intrinsically valuable, the unknown is encouraging, governance is decentralized, ethics is embedded. Subsequently, the paper illustrates the acceleration framework with a use-case, a generative artificial intelligence language tool developed by the Canadian telecommunications company Telus. While the purity of theoretical positions is blurred by real-world ambiguities, the Telus experience indicates that acceleration AI ethics is a way of maximizing social responsibility through innovation, as opposed to sacrificing social responsibility for innovation, or sacrificing innovation for social responsibility.
- [656] arXiv:2501.18504 (replaced) [pdf, other]
-
Title: CLEAR: Cue Learning using Evolution for Accurate Recognition Applied to Sustainability Data Extraction
Comments: 9 pages plus 2 pages of supplemental material
Journal-ref: Proceedings of the Genetic and Evolutionary Computation Conference 2025 (GECCO 25). ACM, Malaga, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Large Language Model (LLM) image recognition is a powerful tool for extracting data from images, but accuracy depends on providing sufficient cues in the prompt - requiring a domain expert for specialized tasks. We introduce Cue Learning using Evolution for Accurate Recognition (CLEAR), which uses a combination of LLMs and evolutionary computation to generate and optimize cues such that recognition of specialized features in images is improved. It achieves this by auto-generating a novel domain-specific representation and then using it to optimize suitable textual cues with a genetic algorithm. We apply CLEAR to the real-world task of identifying sustainability data from interior and exterior images of buildings. We investigate the effects of using a variable-length representation compared to fixed-length and show how LLM consistency can be improved by refactoring from categorical to real-valued estimates. We show that CLEAR enables higher accuracy compared to expert human recognition and human-authored prompts in every task with error rates improved by up to two orders of magnitude and an ablation study evincing solution concision.
- [657] arXiv:2502.01894 (replaced) [pdf, html, other]
-
Title: SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Bird's-eye view (BEV) perception has garnered significant attention in autonomous driving in recent years, in part because BEV representation facilitates multi-modal sensor fusion. BEV representation enables a variety of perception tasks including BEV segmentation, a concise view of the environment useful for planning a vehicle's trajectory. However, this representation is not fully supported by existing datasets, and creation of new datasets for this purpose can be a time-consuming endeavor. To address this challenge, we introduce SimBEV. SimBEV is a randomized synthetic data generation tool that is extensively configurable and scalable, supports a wide array of sensors, incorporates information from multiple sources to capture accurate BEV ground truth, and enables a variety of perception tasks including BEV segmentation and 3D object detection. SimBEV is used to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios. SimBEV and the SimBEV dataset are open and available to the public.
- [658] arXiv:2502.02389 (replaced) [pdf, html, other]
-
Title: Rate-reliability functions for deterministic identification
Comments: 12 pages, 2 figures. A preliminary version of this work has been accepted for presentation at the 2025 IEEE International Conference on Communications, Montreal (Canada) 8-12 June 2025
Subjects: Information Theory (cs.IT); Quantum Physics (quant-ph)
We investigate deterministic identification over arbitrary memoryless channels under the constraint that the error probabilities of first and second kind are exponentially small in the block length $n$, controlled by reliability exponents $E_1,E_2 \geq 0$. In contrast to the regime of slowly vanishing errors, where the identifiable message length scales as $\Theta(n\log n)$, here we find that for positive exponents linear scaling is restored, now with a rate that is a function of the reliability exponents. We give upper and lower bounds on the ensuing rate-reliability function in terms of (the logarithm of) the packing and covering numbers of the channel output set, which for small error exponents $E_1,E_2>0$ can be expanded in leading order as the product of the Minkowski dimension of a certain parametrisation of the channel output set and $\log\min\{E_1,E_2\}$. These allow us to recover the previously observed slightly superlinear identification rates, and offer a different perspective for understanding them in more traditional information theory terms. We further illustrate our results with a discussion of the case of dimension zero, and extend them to classical-quantum channels and quantum channels with tensor product input restriction.
- [659] arXiv:2502.05628 (replaced) [pdf, html, other]
-
Title: AnyEdit: Edit Any Knowledge Encoded in Language Models
Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, Tat-seng Chua
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token's hidden state, a limitation we term "efficacy barrier". To solve this, we propose AnyEdit, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5% on benchmarks including UnKEBench, AKEW, and our new EditEverything dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing.
- [660] arXiv:2502.05759 (replaced) [pdf, html, other]
-
Title: Reinforced Lifelong Editing for Language Models
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: this https URL.
- [661] arXiv:2502.06352 (replaced) [pdf, html, other]
-
Title: LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Comments: ICLR 2025 Workshop at SCOPE (Oral), 16 pages, 5 figures, short paper (6 pages exclude reference and appendix)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Speculative decoding has been widely used to accelerate auto-regressive (AR) text generation. However, its effectiveness for visual AR models remains limited due to token selection ambiguity, where multiple tokens share similarly low probabilities and thus reduce acceptance rates. Recently, relaxed speculative decoding with dynamic tree drafting was proposed to mitigate this ambiguity, demonstrating promising results in accelerating visual AR models. However, we observe that token selection ambiguity still negatively affects dynamic tree drafting, resulting in shallow draft trees and limited acceleration. To overcome this issue, we introduce LANTERN++, a refined framework that integrates static tree drafting with a tailored relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables the acceptance of deeper sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to $\mathbf{\times 2.56}$ speedup over standard AR decoding while maintaining high image quality. The code is publicly available at this https URL.
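For orientation, the acceptance step that speculative decoding relies on can be sketched as follows. Standard speculative sampling accepts a drafted token with probability min(1, p/q), where p and q are the target and draft model probabilities of that token. The abstract does not spell out the relaxed condition LANTERN++ uses, so the `neighbors` argument below is purely a hypothetical stand-in: it pools target probability over a set of interchangeable tokens to show how a relaxation might raise acceptance when probability mass is spread thinly.

```python
def acceptance_probability(p_target, q_draft, token, neighbors=()):
    """Probability of accepting a drafted `token`.

    p_target, q_draft: dicts mapping token -> probability.
    neighbors: hypothetical set of tokens treated as interchangeable with
    `token`; with no neighbors this reduces to the standard min(1, p/q) rule.
    """
    # Pool the target model's probability over the token and its neighbors.
    mass = p_target[token] + sum(p_target[n] for n in neighbors if n != token)
    return min(1.0, mass / q_draft[token])


# Toy distributions (made-up numbers, for illustration only).
p = {"a": 0.1, "b": 0.1, "c": 0.8}
q = {"a": 0.4, "b": 0.3, "c": 0.3}
strict = acceptance_probability(p, q, "a")
relaxed = acceptance_probability(p, q, "a", neighbors=("b",))
```

In this toy case the relaxed variant accepts the draft more often than the strict rule, which is the qualitative effect a relaxed condition aims for.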
- [662] arXiv:2502.06608 (replaced) [pdf, html, other]
-
Title: TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
- [663] arXiv:2502.06818 (replaced) [pdf, html, other]
-
Title: Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
Comments: Under review
Subjects: Machine Learning (cs.LG)
Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. Differently, in this paper, we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. To realize the goal, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make Value embeddings of the last-block attention module more semantically correlated. To realize this, we design a novel channel suppression this http URL experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.
- [664] arXiv:2502.06874 (replaced) [pdf, html, other]
-
Title: Group Reasoning Emission Estimation Networks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Accurate greenhouse gas (GHG) emission reporting is critical for governments, businesses, and investors. However, adoption remains limited, particularly among small and medium enterprises, due to high implementation costs, fragmented emission factor databases, and a lack of robust sector classification methods. To address these challenges, we introduce Group Reasoning Emission Estimation Networks (GREEN), an AI-driven carbon accounting framework that standardizes enterprise-level emission estimation, constructs a large-scale benchmark dataset, and leverages a novel reasoning approach with large language models (LLMs). Specifically, we compile textual descriptions for 20,850 companies with validated North American Industry Classification System (NAICS) labels and align these with an economic model of carbon intensity factors. By reframing sector classification as an information retrieval task, we fine-tune Sentence-BERT models using a contrastive learning loss. To overcome the limitations of single-stage models in handling thousands of hierarchical categories, we propose a Group Reasoning method that ensembles LLM classifiers based on the natural NAICS ontology, decomposing the task into multiple sub-classification steps. We theoretically prove that this approach reduces classification uncertainty and computational complexity. Experiments on 1,114 NAICS categories yield state-of-the-art performance (83.68% Top-1, 91.47% Top-10 accuracy), and case studies on 20 companies report a mean absolute percentage error (MAPE) of 45.88%. The project is available at: this https URL.
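The decomposition into sub-classification steps can be pictured with a small sketch. NAICS codes are hierarchical by digit prefix (2-digit sectors down to 6-digit industries), so a classifier can narrow the candidate set one level at a time instead of choosing among every code at once. The function name `hierarchical_classify` and the `choose` callback are hypothetical: `choose` stands in for whatever classifier (for example, an LLM prompt) picks one option at each level, and nothing here reflects the ensembling details of the actual Group Reasoning method.

```python
def hierarchical_classify(description, codes, choose, levels=(2, 3, 4, 5, 6)):
    """Pick a NAICS-style code by refining a digit prefix level by level.

    description: free-text company description.
    codes: iterable of full 6-digit code strings.
    choose: hypothetical callable (description, options) -> one of options.
    """
    prefix = ""
    for depth in levels:
        # Candidate prefixes at this depth that extend the current choice.
        options = sorted({c[:depth] for c in codes if c.startswith(prefix)})
        prefix = choose(description, options)
    return prefix
```

Each call to `choose` sees only the children of the current prefix, so the per-step option list stays short even when the full label space has more than a thousand categories.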
- [665] arXiv:2502.08180 (replaced) [pdf, html, other]
-
Title: Enhancing LLM Character-Level Manipulation via Divide and Conquer
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the critical role of such operations in data preprocessing and code generation. Through systematic analysis, we derive two key insights: (1) LLMs face significant difficulties in leveraging intrinsic token knowledge for character-level reasoning, and (2) atomized word structures can substantially enhance LLMs' ability to process token-level structural information. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation. Our method decomposes complex operations into explicit character-level subtasks coupled with controlled token reconstruction phases, leading to significant improvements in accuracy. Without additional training, our method significantly improves accuracies on the $\texttt{Deletion}$, $\texttt{Insertion}$, and $\texttt{Substitution}$ tasks. To support further research, we open-source our implementation and benchmarks.
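The three stages named in the abstract (atomize the word, operate on individual characters, reconstruct the token) have a direct deterministic analogue, sketched below. In the paper these stages are carried out by prompting an LLM; the plain-Python functions here, and all of their names, are illustrative assumptions meant only to show what each stage is expected to produce.

```python
def atomize(word):
    """Split a word into its individual characters."""
    return list(word)


def reconstruct(chars):
    """Join characters back into a single string."""
    return "".join(chars)


def delete_char(word, ch):
    """Remove every occurrence of `ch` from `word`."""
    return reconstruct(c for c in atomize(word) if c != ch)


def insert_after(word, anchor, new):
    """Insert `new` after every occurrence of `anchor`."""
    out = []
    for c in atomize(word):
        out.append(c)
        if c == anchor:
            out.append(new)
    return reconstruct(out)


def substitute(word, old, new):
    """Replace every occurrence of `old` with `new`."""
    return reconstruct(new if c == old else c for c in atomize(word))
```

For example, `delete_char("strawberry", "r")` works on the list `['s', 't', 'r', ...]` rather than on whatever subword tokens a tokenizer would produce, which is the gap the atomization step is meant to close.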
- [666] arXiv:2502.08356 (replaced) [pdf, other]
-
Title: Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAGKushagra Bhushan, Yatin Nandwani, Dinesh Khandelwal, Sonam Gupta, Gaurav Pandey, Dinesh Raghu, Sachindra JoshiComments: 22 pages, 14 tables, to be published in NAACL 2025Subjects: Computation and Language (cs.CL)
Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways -- context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we fine-tune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10% relative gain in token-level recall while preserving the LLM's generalization capabilities.
- [667] arXiv:2502.08745 (replaced) [pdf, html, other]
-
Title: IHEval: Evaluating Language Models on Following the Instruction HierarchyZhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, Meng JiangComments: Accepted to NAACL 2025 for oral presentation. Our project page is located at this https URLSubjects: Computation and Language (cs.CL)
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
- [668] arXiv:2502.08972 (replaced) [pdf, html, other]
-
Title: Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context LearningHyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan MayComments: NAACL 2025 FindingsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Language models are aligned to the collective voice of many, resulting in generic outputs that do not align with specific users' styles. In this work, we present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method that personalizes language models for text generation tasks with fewer than 10 examples per user. TICL iteratively expands an in-context learning prompt via a trial-error-explain process, adding model-generated negative samples and explanations that provide fine-grained guidance towards a specific user's style. In pairwise comparisons with an LLM-as-a-judge, TICL achieves win rates of up to 91.5% against the previous state-of-the-art and outperforms competitive tuning-free baselines on the personalized alignment tasks of writing emails, essays and news articles. Both lexical and qualitative analyses show that the negative samples and explanations enable language models to learn stylistic context more effectively and overcome the bias towards structural and formal phrases observed in their zero-shot outputs. By front-loading inference compute to create a user-specific in-context learning prompt that does not require extra generation steps at test time, TICL presents a novel yet simple approach for personalized alignment.
- [669] arXiv:2502.09042 (replaced) [pdf, html, other]
-
Title: Typhoon T1: An Open Thai Reasoning ModelComments: 25 pages, 6 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper introduces Typhoon T1, an open effort to develop an open Thai reasoning model. A reasoning model is a relatively new type of generative model built on top of large language models (LLMs). A reasoning model generates a long chain of thought before arriving at a final answer, an approach found to improve performance on complex tasks. However, details on developing such a model are limited, especially for reasoning models that can generate traces in a low-resource language. Typhoon T1 presents an open effort that dives into the details of developing a reasoning model in a more cost-effective way by leveraging supervised fine-tuning using open datasets, instead of reinforcement learning. This paper shares the details about synthetic data generation and training, as well as our dataset and model weights. Additionally, we provide insights gained from developing a reasoning model that generalizes across domains and is capable of generating reasoning traces in a low-resource language, using Thai as an example. We hope this open effort provides a foundation for further research in this field.
- [670] arXiv:2502.09056 (replaced) [pdf, html, other]
-
Title: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging -- An Open RecipeComments: 9 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a particular focus on the Thai LLM. Our goal is to enhance the reasoning capabilities of language-specific LLMs while maintaining their target language abilities. DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese. However, low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations, which limit performance in these languages. This limitation results in unreliable code-switching and diminished effectiveness on tasks in low-resource languages. Meanwhile, local and regional LLM initiatives have attempted to bridge this gap by developing language-specific LLMs that focus on improving local linguistic fidelity. We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks.
- [671] arXiv:2502.09535 (replaced) [pdf, html, other]
-
Title: Entropy Collapse in Mobile Sensors: The Hidden Risks of Sensor-Based SecuritySubjects: Cryptography and Security (cs.CR)
Mobile sensor data has been proposed for security-critical applications such as device pairing, proximity detection, and continuous authentication. However, the foundational assumption that these signals provide sufficient entropy remains under-explored. In this work, we systematically analyse the entropy of mobile sensor data across four diverse datasets spanning multiple application contexts. Our findings reveal pervasive biases, with single-sensor mean min-entropy values ranging from 3.408-4.483 bits (S.D.=1.018-1.574) despite Shannon entropy being several multiples higher. We further demonstrate that correlations between sensor modalities reduce the worst-case entropy of using multiple sensors by up to approx. 75% compared to average-case Shannon entropy. This brings joint min-entropy well below 10 bits in many cases and, in the best case, yielding only approx. 24 bits of min-entropy when combining 20 sensor modalities. These results call into question the widely held assumption that adding more sensors inherently yields higher security. We ultimately caution against relying on raw sensor data as a primary source of randomness.
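The gap between Shannon entropy and min-entropy mentioned above can be illustrated with a short sketch using the standard textbook definitions. The function names and the skewed sample distribution are assumptions made for illustration only; they are not taken from the paper's datasets or code.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Return the relative frequency of each distinct sample value."""
    counts = Counter(samples)
    total = len(samples)
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """Average-case entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def min_entropy(probs):
    """Worst-case entropy in bits: -log2 of the most likely outcome."""
    return -math.log2(max(probs))


# Hypothetical quantized sensor readings: one value dominates.
readings = [0] * 8 + [1, 2, 3, 4, 5, 6, 7, 8]
probs = empirical_distribution(readings)
print(shannon_entropy(probs), min_entropy(probs))
```

Because min-entropy depends only on the single most probable outcome, a biased source can score well on Shannon entropy while remaining easy to guess, which is the concern the abstract raises.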
- [672] arXiv:2502.10581 (replaced) [pdf, html, other]
-
Title: Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical PerspectiveSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
As large language models have evolved, it has become crucial to distinguish between process supervision and outcome supervision -- two key reinforcement learning approaches to complex reasoning tasks. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data.
In this paper, we take steps towards resolving this debate. Our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision, up to polynomial factors in horizon. At the core of this result lies the novel Change of Trajectory Measure Lemma -- a technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a direct connection between outcome and process supervision. These findings suggest that the empirically observed performance gap -- if any -- between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data collection and algorithm design for reinforcement learning.
- [673] arXiv:2502.11316 (replaced) [pdf, html, other]
-
Title: Standalone FPGA-Based QAOA Emulator for Weighted-MaxCut on Embedded DevicesComments: 9 pages, 6 figures, 3 tablesSubjects: Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Quantum computing (QC) emulation is crucial for advancing QC applications, especially given the scalability constraints of current devices. FPGA-based designs offer an efficient and scalable alternative to traditional large-scale platforms, but most are tightly integrated with high-performance systems, limiting their use in mobile and edge environments. This study introduces a compact, standalone FPGA-based QC emulator designed for embedded systems, leveraging the Quantum Approximate Optimization Algorithm (QAOA) to solve the Weighted-MaxCut problem. By restructuring QAOA operations for hardware compatibility, the proposed design reduces time complexity from O(N^2) to O(N), where N equals 2^n for n qubits. This reduction, coupled with a pipeline architecture, significantly minimizes resource consumption, enabling support for up to nine qubits on mid-tier FPGAs, roughly three times more than comparable designs. Additionally, the emulator achieved energy savings ranging from 1.53 times for two-qubit configurations to up to 852 times for nine-qubit configurations, compared to software-based QAOA on embedded processors. These results highlight the practical scalability and resource efficiency of the proposed design, providing a robust foundation for QC emulation in resource-constrained edge devices.
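As background on the Weighted-MaxCut objective that QAOA targets, the sketch below evaluates the cut value classically and enumerates all N = 2^n node assignments by brute force. It is only an illustration of the problem and of how the search space grows with n; it is not the paper's FPGA design or a QAOA implementation, and the function names and the three-node graph are hypothetical.

```python
from itertools import product


def cut_value(edges, assignment):
    """Sum the weights of edges whose endpoints lie on opposite sides."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def brute_force_maxcut(n, edges):
    """Enumerate all 2**n assignments; return (best_value, best_assignment)."""
    best_value, best_assignment = float("-inf"), None
    for assignment in product((0, 1), repeat=n):
        value = cut_value(edges, assignment)
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment


# Hypothetical weighted triangle: (node_u, node_v, weight).
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0)]
print(brute_force_maxcut(3, edges))
```

Each of the 2^n assignments corresponds to one computational basis state of an n-qubit register, which is why emulation cost is naturally expressed in terms of N = 2^n.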
- [674] arXiv:2502.11748 (replaced) [pdf, other]
-
Title: ILIAS: Instance-Level Image retrieval At ScaleGiorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, Giorgos ToliasComments: CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and a performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e. the compilation date of YFCC100M. An extensive benchmarking is performed with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS; ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-language models; iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter; iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case. Website: this https URL
- [675] arXiv:2502.12767 (replaced) [pdf, html, other]
-
Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge GraphsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks are often rigid, struggling to adapt to KG or task changes. They also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from the KG, which significantly enhances reliability. Experiments across multiple KG-based reasoning tasks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of the LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability while reducing inference cost. However, it also leads to a higher abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning. It reduces reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at this https URL.
- [676] arXiv:2502.12902 (replaced) [pdf, html, other]
-
Title: Probabilistic neural operators for functional uncertainty quantificationSubjects: Machine Learning (cs.LG)
Neural operators aim to approximate the solution operator of a system of differential equations purely from data. They have shown immense success in modeling complex dynamical systems across various domains. However, the occurrence of uncertainties inherent in both model and data has so far rarely been taken into account, a critical limitation in complex, chaotic systems such as weather forecasting. In this paper, we introduce the probabilistic neural operator (PNO), a framework for learning probability distributions over the output function space of neural operators. PNO extends neural operators with generative modeling based on strictly proper scoring rules, integrating uncertainty information directly into the training process. We provide a theoretical justification for the approach and demonstrate improved performance in quantifying uncertainty across different domains and with respect to different baselines. Furthermore, PNO requires minimal adjustment to existing architectures, shows improved performance for most probabilistic prediction tasks, and leads to well-calibrated predictive distributions and adequate uncertainty representations even for long dynamical trajectories. Implementing our approach into large-scale models for physical applications can lead to improvements in corresponding uncertainty quantification and extreme event identification, ultimately leading to a deeper understanding of the prediction of such surrogate models.
- [677] arXiv:2502.12920 (replaced) [pdf, html, other]
-
Title: Lightweight Online Adaption for Time Series Foundation Model ForecastsComments: 8 pages, PreprintSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose AdapTS to answer this question.
AdapTS is a lightweight mechanism for the online adaption of FM forecasts in response to online feedback. AdapTS consists of two parts: a) the AdapTS-Forecaster which is used to learn the current data distribution; and b) the AdapTS-Weighter which is used to combine the forecasts of the FM and the AdapTS-Forecaster. We evaluate the performance of AdapTS in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using AdapTS improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
- [678] arXiv:2502.13012 (replaced) [pdf, html, other]
-
Title: Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing AgentsChaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Yanfang Ye, Toby Jia-Jun Li, Dakuo WangSubjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human-like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.
- [679] arXiv:2502.13731 (replaced) [pdf, other]
-
Title: Robust Counterfactual Inference in Markov Decision ProcessesComments: Fixed typo in Equation (5)Subjects: Artificial Intelligence (cs.AI)
This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
- [680] arXiv:2502.15682 (replaced) [pdf, html, other]
-
Title: ELIP: Enhanced Visual-Language Foundation Models for Image RetrievalSubjects: Computer Vision and Pattern Recognition (cs.CV)
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
- [681] arXiv:2502.16398 (replaced) [pdf, html, other]
-
Title: Computing the Polytope Diameter is Even Harder than NP-hard (Already for Perfect Matchings)Subjects: Computational Complexity (cs.CC); Computational Geometry (cs.CG)
The diameter of a polytope is a fundamental geometric parameter that plays a crucial role in understanding the efficiency of the simplex method. Despite its central nature, the computational complexity of computing the diameter of a given polytope is poorly understood. Already in 1994, Frieze and Teng [Comp. Compl.] recognized the possibility that this task could potentially be harder than NP-hard, and asked whether the corresponding decision problem is complete for the second stage of the polynomial hierarchy, i.e. $\Pi^p_2$-complete. In the following years, partial results could be obtained. In a cornerstone result, Frieze and Teng themselves proved weak NP-hardness for a family of custom defined polytopes. Sanità [FOCS18] in a break-through result proved that already for the much simpler fractional matching polytope the problem is strongly NP-hard. Very recently, Steiner and Nöbel [SODA25] generalized this result to the even simpler bipartite perfect matching polytope and the circuit diameter. In this paper, we finally show that computing the diameter of the bipartite perfect matching polytope is $\Pi^p_2$-hard. Since the corresponding decision problem is also trivially contained in $\Pi^p_2$, this decidedly answers Frieze and Teng's 30 year old question. Our results also hold when the diameter is replaced by the circuit diameter. As our second main result, we prove that for some $\varepsilon > 0$ the (circuit) diameter of the bipartite perfect matching polytope cannot be approximated by a factor better than $(1 + \varepsilon)$. This answers a recent question by Nöbel and Steiner. It is the first known inapproximability result for the circuit diameter, and extends Sanità's inapproximability result of the diameter to the totally unimodular case.
- [682] arXiv:2502.18410 (replaced) [pdf, html, other]
-
Title: TSKANMixer: Kolmogorov-Arnold Networks with MLP-Mixer Model for Time Series ForecastingComments: 8 pages, 4 figures, 7 tables and accepted at the AI4TS: AI for Time Series Analysis workshop, AAAI 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Time series forecasting has long been a focus of research across diverse fields, including economics, energy, healthcare, and traffic management. Recent works have introduced innovative architectures for time series models, such as the Time-Series Mixer (TSMixer), which leverages multi-layer perceptrons (MLPs) to enhance prediction accuracy by effectively capturing both spatial and temporal dependencies within the data. In this paper, we investigate the capabilities of the Kolmogorov-Arnold Networks (KANs) for time-series forecasting by modifying TSMixer with a KAN layer (TSKANMixer). Experimental results demonstrate that TSKANMixer tends to improve prediction accuracy over the original TSMixer across multiple datasets, ranking among the top-performing models compared to other time series approaches. Our results show that the KANs are promising alternatives to improve the performance of time series forecasting by replacing or extending traditional MLPs.
- [683] arXiv:2502.19680 (replaced) [pdf, html, other]
-
Title: M-LLM Based Video Frame Selection for Efficient Video UnderstandingKai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, Trishul ChilimbiSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into an M-LLM, particularly for long-context videos. However, uniform sampling can lose crucial context in certain periods of a video, so the downstream M-LLM may not have sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which a single-frame importance score is obtained by prompting an M-LLM; and (ii) a temporal signal, in which multiple frames are selected by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
- [684] arXiv:2503.00379 (replaced) [pdf, html, other]
-
Title: Improving clustering quality evaluation in noisy Gaussian mixturesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets.
We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features.
The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
- [685] arXiv:2503.00590 (replaced) [pdf, html, other]
-
Title: Characterizing LLM-Empowered Personalized Story-Reading and Interaction for Children: Insights from Multi-Stakeholder PerspectivesJiaju Chen, Minglong Tang, Yuxuan Lu, Bingsheng Yao, Elissa Fan, Xiaojuan Ma, Ying Xu, Dakuo Wang, Yuling Sun, Liang HeComments: Accepted at CHI 2025Subjects: Human-Computer Interaction (cs.HC)
Personalized interaction is highly valued by parents in their story-reading activities with children. While AI-empowered story-reading tools have been increasingly used, their abilities to support personalized interaction with children are still limited. Recent advances in large language models (LLMs) show promise in facilitating personalized interactions, but little is known about how to effectively and appropriately use LLMs to enhance children's personalized story-reading experiences. This work explores this question through a design-based study. Drawing on a formative study, we designed and developed StoryMate, an LLM-empowered personalized interactive story-reading tool for children, following an empirical study with children, parents, and education experts. Our participants valued the personalized features in StoryMate, and also highlighted the need to support personalized content, guiding mechanisms, reading context variations, and interactive interfaces. Based on these findings, we propose a series of design recommendations for better using LLMs to empower children's personalized story reading and interaction.
- [686] arXiv:2503.01107 (replaced) [pdf, html, other]
-
Title: VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative PriorsComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Generative methods for image and video editing use generative models as priors to perform edits despite incomplete information, such as changing the composition of 3D objects shown in a single image. Recent methods have shown promising composition editing results in the image setting, but in the video setting, editing methods have focused on editing objects' appearance and motion, or camera motion, and as a result, methods to edit object composition in videos are still missing. We propose VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion. Our approach allows editing the 3D position of a 3D object across all frames of a video in a temporally consistent manner. This is achieved by lifting intermediate features of a generative model to a 3D reconstruction that is shared between all frames, editing the reconstruction, and projecting the features on the edited reconstruction back to each frame. To the best of our knowledge, this is the first generative approach to edit object compositions in videos. Our approach is simple and training-free, while outperforming state-of-the-art image editing baselines.
- [687] arXiv:2503.01263 (replaced) [pdf, html, other]
-
Title: Generalizable Prompt Learning of CLIP: A Brief Overview
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Existing vision-language models (VLMs) such as CLIP have showcased an impressive capability to generalize well across various downstream tasks. These models leverage the synergy between visual and textual information, enabling them to understand and reason about the content present in images and text in a unified manner. This article provides a brief overview of CLIP based on few-shot prompt learning, including experimental data and technical characteristics of some methods. The purpose of this review is to provide a reference for researchers who have just started their research in generalizable prompting of CLIP through few-shot training for classification across 15 datasets and also to facilitate the integration of this field by researchers in other downstream tasks.
- [688] arXiv:2503.01877 (replaced) [pdf, html, other]
-
Title: Starjob: Dataset for LLM-Driven Job Shop Scheduling
Comments: arXiv admin note: substantial text overlap with arXiv:2408.06993
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have shown remarkable capabilities across various domains, but their potential for solving combinatorial optimization problems remains largely unexplored. In this paper, we investigate the applicability of LLMs to the Job Shop Scheduling Problem (JSSP), a classic challenge in combinatorial optimization that requires efficient job allocation to machines to minimize makespan. To this end, we introduce Starjob, the first supervised dataset for JSSP, comprising 130k instances specifically designed for training LLMs. Leveraging this dataset, we fine-tune the LLaMA 8B 4-bit quantized model with the LoRA method to develop an end-to-end scheduling approach. Our evaluation on standard benchmarks demonstrates that the proposed LLM-based method not only surpasses traditional Priority Dispatching Rules (PDRs) but also achieves notable improvements over state-of-the-art neural approaches like L2D, with an average improvement of 15.36% on DMU and 7.85% on Taillard benchmarks. These results highlight the untapped potential of LLMs in tackling combinatorial optimization problems, paving the way for future advancements in this area.
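For readers unfamiliar with the makespan objective mentioned above, the following is a minimal sketch of how the makespan of a job shop schedule can be evaluated. It is not taken from the Starjob paper: the function name `makespan`, the `(machine, duration)` job encoding, and the operation-based dispatch order are all illustrative assumptions.

```python
def makespan(jobs, order):
    """Evaluate a job shop schedule (illustrative encoding, not from the paper).

    jobs[j] is the ordered list of (machine, duration) operations of job j.
    order is a sequence of job ids; the k-th occurrence of job j dispatches
    the k-th operation of that job.
    """
    next_op = [0] * len(jobs)      # index of the next unscheduled operation per job
    job_ready = [0] * len(jobs)    # time at which each job's previous operation ends
    machine_ready = {}             # time at which each machine becomes free
    for j in order:
        machine, duration = jobs[j][next_op[j]]
        # An operation starts once both its job and its machine are available.
        start = max(job_ready[j], machine_ready.get(machine, 0))
        end = start + duration
        job_ready[j] = end
        machine_ready[machine] = end
        next_op[j] += 1
    return max(job_ready)


if __name__ == "__main__":
    toy_jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]  # two jobs, two machines
    print(makespan(toy_jobs, [0, 1, 0, 1]))
    print(makespan(toy_jobs, [1, 1, 0, 0]))
```

Different dispatch orders for the same instance generally give different makespans, which is the quantity a scheduling method tries to minimize.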
- [689] arXiv:2503.02841 (replaced) [pdf, html, other]
-
Title: Boltzmann Attention Sampling for Image Analysis with Small Objects
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
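The general idea of sampling from a Boltzmann distribution under an annealing schedule can be sketched as follows. This is a generic illustration, not the BoltzFormer implementation: the function names, the geometric temperature schedule, the toy scores, and sampling with replacement are all assumptions made for this example.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Return softmax(scores / temperature) as a list of probabilities."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def annealed_temperatures(t_start, t_end, num_layers):
    """Geometric schedule from t_start to t_end (assumes num_layers >= 2)."""
    ratio = t_end / t_start
    return [t_start * ratio ** (i / (num_layers - 1)) for i in range(num_layers)]


def sample_positions(scores, temperature, k, rng):
    """Draw k positions (with replacement) from the Boltzmann distribution."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_scores = [0.1, 0.2, 3.0, 0.3, 2.5, 0.0]  # hypothetical relevance scores
    for layer, temp in enumerate(annealed_temperatures(4.0, 0.25, 4)):
        picked = sample_positions(toy_scores, temp, k=3, rng=rng)
        print(layer, round(temp, 3), sorted(set(picked)))
```

At a high temperature the distribution is close to uniform, so samples spread over many positions; as the temperature drops, probability mass concentrates on the highest-scoring positions.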
- [690] arXiv:2503.03384 (replaced) [pdf, html, other]
-
Title: GNNMerge: Merging of GNN Models Without Accessing Training Data
Subjects: Machine Learning (cs.LG)
Model merging has gained prominence in machine learning as a method to integrate multiple trained models into a single model without accessing the original training data. While existing approaches have demonstrated success in domains such as computer vision and NLP, their application to Graph Neural Networks (GNNs) remains unexplored. These methods often rely on the assumption of shared initialization, which is seldom applicable to GNNs. In this work, we undertake the first benchmarking study of model merging algorithms for GNNs, revealing their limited effectiveness in this context. To address these challenges, we propose GNNMerge, which utilizes a task-agnostic node embedding alignment strategy to merge GNNs. Furthermore, we establish that under a mild relaxation, the proposed optimization objective admits direct analytical solutions for widely used GNN architectures, significantly enhancing its computational efficiency. Empirical evaluations across diverse datasets, tasks, and architectures establish GNNMerge to be up to 24% more accurate than existing methods while delivering over 2 orders of magnitude speed-up compared to training from scratch.
- [691] arXiv:2503.03708 (replaced) [pdf, other]
-
Title: Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, its training process often relies on complex multi-stage training tricks that go beyond basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training processes and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity, and adopt a sampling acceleration technique to enhance efficiency. CDT is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss; extensive experiments demonstrate that it achieves state-of-the-art performance in video reconstruction tasks with just single-step sampling. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at this https URL.
- [692] arXiv:2503.05500 (replaced) [pdf, other]
-
Title: EuroBERT: Scaling Multilingual Encoders for European Languages
Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
Comments: 28 pages, 8 figures, 13 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively support sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.
- [693] arXiv:2503.05704 (replaced) [pdf, html, other]
-
Title: Evaluating Prediction-based Interventions with Human Decision Makers In Mind
Comments: To be presented at AISTATS 2025
Subjects: Computers and Society (cs.CY)
Automated decision systems (ADS) are broadly deployed to inform and support human decision-making across a wide range of consequential settings. However, various context-specific details complicate the goal of establishing meaningful experimental evaluations for prediction-based interventions. Notably, current experiment designs rely on simplifying assumptions about human decision making in order to derive causal estimates. In reality, specific experimental design decisions may induce cognitive biases in human decision makers, which could then significantly alter the observed effect sizes of the prediction intervention. In this paper, we formalize and investigate various models of human decision-making in the presence of a predictive model aid. We show that each of these behavioural models produces dependencies across decision subjects and results in the violation of existing assumptions, with consequences for treatment effect estimation. This work aims to further advance the scientific validity of intervention-based evaluation schemes for the assessment of ADS deployments.
- [694] arXiv:2503.06635 (replaced) [pdf, html, other]
-
Title: Deep Cut-informed Graph Embedding and Clustering
Zhiyuan Ning, Zaitian Wang, Ran Zhang, Ping Xu, Kunpeng Liu, Pengyang Wang, Wei Ju, Pengfei Wang, Yuanchun Zhou, Erik Cambria, Chong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNNs). However, GNNs are designed for general graph encoding, and representation collapse is a common issue in existing GNN-based deep graph clustering algorithms. We attribute this issue to two main causes: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, this bias results in erroneous message passing and leads to biased clustering; (ii) the clustering-guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution that assigns all data points to a single label, thus making all samples less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative, non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize optimal transport theory to obtain the clustering assignments, which can balance the guidance of "proximity to the pre-learned cluster center". With these two tailored designs, DCGC is better suited to the graph clustering task: it effectively alleviates the problem of representation collapse and achieves better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.
- [695] arXiv:2503.07091 (replaced) [pdf, html, other]
-
Title: FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset
Shuhe Wang, Xiaoya Li, Jiwei Li, Guoyin Wang, Xiaofei Sun, Bob Zhu, Han Qiu, Mo Yu, Shengjie Shen, Tianwei Zhang, Eduard Hovy
Comments: arXiv admin note: text overlap with arXiv:2501.15407
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Due to the data-driven nature of current face identity (FaceID) customization methods, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets are publicly available, which restricts transparency and hinders further advancements in the field.
To address this issue, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset containing 6 million high-quality text-image pairs. Filtered from LAION-5B, FaceID-6M undergoes rigorous image and text filtering steps to ensure dataset quality, including resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and a keyword-based strategy to retain descriptions containing human-related terms (e.g., nationality, professions and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, facilitating advancements in the field by offering an open resource for research and development.
We conduct extensive experiments to show the effectiveness of FaceID-6M, demonstrating that models trained on it achieve performance that is comparable to, and slightly better than, currently available industrial models. Additionally, to support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available at this https URL.
- [696] arXiv:2503.07101 (replaced) [pdf, html, other]
-
Title: SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
Comments: Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel's richer signal to enhance local details, aligning with the human eye's sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection. Code is available at this https URL.
- [697] arXiv:2503.10095 (replaced) [pdf, html, other]
-
Title: Cognitive-Mental-LLM: Evaluating Reasoning in Large Language Models for Mental Health Prediction via Online Text
Comments: 8 pages, 4 Figures, 3 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as Zero Shot non-CoT Prompting, fine-tuned pre-trained transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines in Depression Severity and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
- [698] arXiv:2503.10212 (replaced) [pdf, html, other]
-
Title: MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis
Teng Xu, Taotao Zhou, Youjia Wang, Peng Yang, Simin Tang, Kuixiang Shao, Zifeng Tang, Yifei Liu, Xinyuan Chen, Hongshuang Wang, Xiaohui Wang, Huoqing Luo, Jingya Wang, Ji Hu, Jingyi Yu
Comments: 53 pages, 5 figures, 7 extended figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset, which incorporates pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions, MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor-intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.
- [699] arXiv:2503.11108 (replaced) [pdf, html, other]
-
Title: Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
The key-value (KV) cache in the tensor version of transformers presents a significant bottleneck during inference. While previous work analyzes the fundamental space complexity barriers in standard attention mechanisms [Haris and Onak, 2025], our work generalizes the space complexity barrier result to the tensor attention version. Our theoretical contributions rely on a reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory for two scenarios. Overall, our work provides a theoretical foundation for understanding the time-memory trade-off of KV-Cache compression in tensor attention decoding and offers further perspectives for developing more memory-efficient tensor attention Transformer architectures.
- [700] arXiv:2503.11240 (replaced) [pdf, html, other]
-
Title: Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu
Comments: Accepted to CVPR 2025, add references
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: Backward progressive training and Branch-based sampling. First, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. Second, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
- [701] arXiv:2503.12051 (replaced) [pdf, html, other]
-
Title: TLUE: A Tibetan Language Understanding Evaluation Benchmark
Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Comments: 6 figures, 21 pages
Subjects: Computation and Language (cs.CL)
Large language models (LLMs) have made tremendous progress in recent years, but low-resource languages, such as Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present TLUE (A Tibetan Language Understanding Evaluation Benchmark), the first large-scale benchmark for assessing LLMs' capabilities in Tibetan. TLUE comprises two major components: (1) a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and (2) a safety benchmark covering 7 subdomains. We evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most LLMs perform below the random baseline, highlighting the considerable challenges LLMs face in processing Tibetan, a low-resource language. TLUE provides an essential foundation for driving future research and progress in Tibetan language understanding and underscores the need for greater inclusivity in LLM development.
- [702] arXiv:2503.12101 (replaced) [pdf, html, other]
-
Title: MUSE: A Real-Time Multi-Sensor State Estimator for Quadruped Robots
Comments: Accepted for publication in IEEE Robotics and Automation Letters
Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
This paper introduces an innovative state estimator, MUSE (MUlti-sensor State Estimator), designed to enhance the accuracy and real-time performance of state estimation in quadruped robot navigation. The proposed state estimator builds upon our previous work presented in [1]. It integrates data from a range of onboard sensors, including IMUs, encoders, cameras, and LiDARs, to deliver a comprehensive and reliable estimation of the robot's pose and motion, even in slippery scenarios. We tested MUSE on a Unitree Aliengo robot, successfully closing the locomotion control loop in difficult scenarios, including slippery and uneven terrain. Benchmarking against Pronto [2] and VILENS [3] showed 67.6% and 26.7% reductions in translational errors, respectively. Additionally, MUSE outperformed DLIO [4], a LiDAR-inertial odometry system, in rotational errors and frequency, while the proprioceptive version of MUSE (P-MUSE) outperformed TSIF [5], with a 45.9% reduction in absolute trajectory error (ATE).
- [703] arXiv:2503.13366 (replaced) [pdf, html, other]
-
Title: Follow-the-Regularized-Leader with Adversarial Constraints
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
Constrained Online Convex Optimization (COCO) can be seen as a generalization of the standard Online Convex Optimization (OCO) framework. At each round, a cost function and constraint function are revealed after a learner chooses an action. The goal is to minimize both the regret and cumulative constraint violation (CCV) against an adaptive adversary. We show for the first time that it is possible to obtain the optimal $O(\sqrt{T})$ bound on both regret and CCV, improving the best known bounds of $O \left( \sqrt{T} \right)$ and $\tilde{O} \left( \sqrt{T} \right)$ for the regret and CCV, respectively.
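For reference, the two quantities bounded above are commonly defined as follows in the COCO literature. The notation ($f_t$ for the round-$t$ cost, $g_t$ for the round-$t$ constraint, $x_t$ for the learner's action, $\mathcal{X}$ for the action set) is an assumption made here and is not taken from the abstract; the paper's own definitions may differ in detail.

```latex
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x),
\qquad
\mathrm{CCV}_T \;=\; \sum_{t=1}^{T} \max\bigl(0,\; g_t(x_t)\bigr),
\]
\[
\text{where } \mathcal{X}^\star \;=\; \{\, x \in \mathcal{X} : g_t(x) \le 0 \ \text{for all } t = 1, \dots, T \,\}.
\]
```

Under these definitions, a bound of $O(\sqrt{T})$ on both quantities means that the average per-round regret and the average per-round violation both vanish as $T$ grows.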
- [704] arXiv:2503.13985 (replaced) [pdf, html, other]
-
Title: DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection
Comments: Accepted to CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Developing effective visual inspection models remains challenging due to the scarcity of defect data. While image generation models have been used to synthesize defect images, producing highly realistic defects remains difficult. We propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. It leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions incorporating defect, object, and attention terms. It enables precise capture of detailed, localized defect features and their seamless integration into defect-free objects. Additionally, our Low-Fidelity Selection method further enhances the defect sample quality. Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.
- [705] arXiv:2503.14001 (replaced) [pdf, html, other]
-
Title: Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight
Yi Xiao, Qiannan Han, Gang Shu, Guiping Liang, Hongyan Zhang, Song Wang, Zhihao Xu, Weican Wan, Chuang Li, Guitao Jiang, Wenbo Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an $R^2$ of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.
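The two reported metrics follow standard definitions, which can be sketched as below. This is a generic illustration using made-up numbers, not the authors' evaluation code or data; the function names and toy values are assumptions.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical measurements and predictions, for illustration only.
    measured = [100.0, 200.0, 400.0]
    predicted = [110.0, 190.0, 400.0]
    print(mape(measured, predicted))
    print(r2_score(measured, predicted))
```

A lower MAPE and an $R^2$ closer to 1 both indicate predictions that track the measured values more closely.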
- [706] arXiv:2503.14222 (replaced) [pdf, html, other]
-
Title: Stacked-Residual PINN for State Reconstruction of Hyperbolic Systems
Subjects: Systems and Control (eess.SY)
In a more connected world, modeling multi-agent systems with hyperbolic partial differential equations (PDEs) offers a potential solution to the curse of dimensionality. However, classical control tools need adaptation for these complex systems. Physics-informed neural networks (PINNs) provide a powerful framework to fix this issue by inferring solutions to PDEs by embedding governing equations into the neural network. A major limitation of original PINNs is their inability to capture steep gradients and discontinuities in hyperbolic PDEs. This paper proposes a stacked residual PINN method enhanced with a vanishing viscosity mechanism. Initially, a basic PINN with a small viscosity coefficient provides a stable, low-fidelity solution. Residual correction blocks with learnable scaling parameters then iteratively refine this solution, progressively decreasing the viscosity coefficient to transition from parabolic to hyperbolic PDEs. Applying this method to traffic state reconstruction improved results by an order of magnitude in relative $\mathcal{L}^2$ error, demonstrating its potential to accurately estimate solutions where original PINNs struggle with instability and low fidelity.
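The vanishing viscosity idea can be illustrated on a generic scalar conservation law. The flux $f$, state $u$, and viscosity coefficient $\varepsilon$ below are illustrative notation assumed here; the abstract does not give the specific traffic model used by the authors.

```latex
\[
\partial_t u + \partial_x f(u) \;=\; \varepsilon\, \partial_{xx} u,
\qquad \varepsilon > 0,
\]
\[
\text{and as } \varepsilon \to 0 \text{ the parabolic equation approaches the hyperbolic law }
\partial_t u + \partial_x f(u) = 0 .
\]
```

For $\varepsilon > 0$ the diffusion term smooths shocks, which makes the solution easier to fit; progressively decreasing $\varepsilon$ moves the problem from the parabolic regime toward the hyperbolic one, matching the transition described in the abstract.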
- [707] arXiv:2503.14340 (replaced) [pdf, html, other]
-
Title: MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration
Comments: 10 pages
Subjects: Software Engineering (cs.SE)
Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution. In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability. Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements), drawn from 10 representative Java projects, covers the six most prevalent refactoring operations. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM (RawGPT), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT. Moreover, in comparison to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations. A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code, and in certain cases, even more favorable. These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM-based systems in advancing the automation of software refactoring tasks.
- [708] arXiv:2503.14734 (replaced) [pdf, html, other]
-
Title: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
NVIDIA: Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang, Zu Wang, Jing Wang, Qi Wang, Jiannan Xiang, Yuqi Xie, Yinzhen Xu, Zhenjia Xu, Seonghyeon Ye, Zhiding Yu, Ao Zhang, Hao Zhang, Yizhou Zhao, Ruijie Zheng, Yuke Zhu
Comments: Authors are listed alphabetically. Project leads are Linxi "Jim" Fan and Yuke Zhu. For more information, see this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
- [709] arXiv:2503.15454 (replaced) [pdf, html, other]
-
Title: Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems
Subjects: Computation and Language (cs.CL)
Medical Question Answering systems based on Retrieval Augmented Generation are promising for clinical decision support because they can integrate external knowledge, thus reducing inaccuracies inherent in standalone large language models (LLMs). However, these systems may unintentionally propagate or amplify biases associated with sensitive demographic attributes like race, gender, and socioeconomic factors. This study systematically evaluates demographic biases within medical RAG pipelines across multiple QA benchmarks, including MedQA, MedMCQA, MMLU, and EquityMedQA. We quantify disparities in retrieval consistency and answer correctness by generating and analyzing queries sensitive to demographic variations. We further implement and compare several bias mitigation strategies to address identified biases, including Chain of Thought reasoning, Counterfactual filtering, Adversarial prompt refinement, and Majority Vote aggregation. Experimental results reveal significant demographic disparities and show that Majority Vote aggregation notably improves accuracy and fairness metrics. Our findings underscore the critical need for explicitly fairness-aware retrieval methods and prompt engineering strategies to develop truly equitable medical QA systems.
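The abstract does not spell out how Majority Vote aggregation is applied. One plausible reading, shown here only as a hedged sketch, is to pose the same question under several demographic variants and keep the most common answer. The function names, the question template, and the stub `toy_answer_fn` are all hypothetical and stand in for a real RAG pipeline.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable")


def aggregate_over_variants(
    template: str,
    demographics: List[str],
    answer_fn: Callable[[str], str],
) -> str:
    """Ask the same question once per demographic variant and vote on the result."""
    answers = [answer_fn(template.format(demographic=d)) for d in demographics]
    return majority_vote(answers)


def toy_answer_fn(query: str) -> str:
    """Hypothetical stand-in for a RAG pipeline whose answer flips on one variant."""
    return "A" if "low-income" in query else "B"


final = aggregate_over_variants(
    "{demographic} patient presents with chest pain. What is the best next step?",
    ["A", "A low-income", "An elderly"],
    toy_answer_fn,
)
print(final)
```

In this toy setup the single variant-sensitive answer is outvoted, which is the effect the aggregation is meant to have.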
- [710] arXiv:2503.15469 (replaced) [pdf, other]
-
Title: Dynamic Bi-Elman Attention Networks: A Dual-Directional Context-Aware Test-Time Learning for Text Classification
Comments: 11 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Text classification, a fundamental task in natural language processing, aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. However, the advent of deep learning, particularly recurrent neural networks and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite these improvements, existing models still exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. To address these challenges, this paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN). DBEAN integrates bidirectional temporal modeling with self-attention mechanisms. It dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.
- [711] arXiv:2503.16176 (replaced) [pdf, html, other]
-
Title: Nonnegative Biquadratic Tensors
Subjects: Numerical Analysis (math.NA)
An M-eigenvalue of a nonnegative biquadratic tensor is referred to as an M$^+$-eigenvalue if it has a pair of nonnegative M-eigenvectors. If furthermore that pair of M-eigenvectors is positive, then that M$^+$-eigenvalue is called an M$^{++}$-eigenvalue. A nonnegative biquadratic tensor has at least one M$^+$-eigenvalue, and the largest M$^+$-eigenvalue is both the largest M-eigenvalue and the M-spectral radius. For irreducible nonnegative biquadratic tensors, all the M$^+$-eigenvalues are M$^{++}$-eigenvalues. Although the M$^+$-eigenvalues of irreducible nonnegative biquadratic tensors are not unique in general, we establish a sufficient condition to ensure their uniqueness. For an irreducible nonnegative biquadratic tensor, the largest M$^+$-eigenvalue has a max-min characterization, while the smallest M$^+$-eigenvalue has a min-max characterization. A Collatz algorithm for computing the largest M$^+$-eigenvalues is proposed. Numerical results are reported.
- [712] arXiv:2503.16400 (replaced) [pdf, html, other]
-
Title: ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
Haolin Yang, Feilong Tang, Ming Hu, Yulong Li, Yexin Liu, Zelin Peng, Junjun He, Zongyuan Ge, Imran Razzak
Subjects: Machine Learning (cs.LG)
Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves the high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently possess flexible adjustments of computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
- [713] arXiv:2503.16541 (replaced) [pdf, html, other]
-
Title: Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
Subjects: Computation and Language (cs.CL)
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance. The proposed Poly-FEVER is available at: this https URL.
- [714] arXiv:2503.16655 (replaced) [pdf, html, other]
-
Title: Accelerating Antibiotic Discovery with Large Language Models and Knowledge Graphs
Comments: 11 pages, 9 figures, 3 tables fix: table, typos and error analysis
Subjects: Computation and Language (cs.CL)
The discovery of novel antibiotics is critical to address the growing antimicrobial resistance (AMR). However, pharmaceutical industries face high costs (over $1 billion), long timelines, and a high failure rate, worsened by the rediscovery of known compounds. We propose an LLM-based pipeline that acts as an alarm system, detecting prior evidence of antibiotic activity to prevent costly rediscoveries. The system integrates organism and chemical literature into a Knowledge Graph (KG), ensuring taxonomic resolution, synonym handling, and multi-level evidence classification. We tested the pipeline on a private list of 73 potential antibiotic-producing organisms, disclosing 12 negative hits for evaluation. The results highlight the effectiveness of the pipeline for evidence reviewing, reducing false negatives, and accelerating decision-making. The KG for negative hits and the user interface for interactive exploration will be made publicly available.
- [715] arXiv:2503.17038 (replaced) [pdf, other]
-
Title: Arm DynamIQ Shared Unit and Real-Time: An Empirical Evaluation
Ashutosh Pradhan, Daniele Ottaviano, Yi Jiang, Haozheng Huang, Alexander Zuepke, Andrea Bastoni, Marco Caccamo
Comments: Accepted for publication in the Proceedings of the 31st IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2025)
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR)
The increasing complexity of embedded hardware platforms poses significant challenges for real-time workloads. Architectural features such as Intel RDT, Arm QoS, and Arm MPAM are either unavailable on commercial embedded platforms or designed primarily for server environments optimized for average-case performance and might fail to deliver the expected real-time guarantees. Arm DynamIQ Shared Unit (DSU) includes isolation features (among others, hardware per-way cache partitioning) that can improve the real-time guarantees of complex embedded multicore systems and facilitate real-time analysis. However, the DSU also targets average cases, and its real-time capabilities have not yet been evaluated. This paper presents the first comprehensive analysis of three real-world deployments of the Arm DSU on Rockchip RK3568, Rockchip RK3588, and NVIDIA Orin platforms. We integrate support for the DSU at the operating system and hypervisor level and conduct a large-scale evaluation using both synthetic and real-world benchmarks with varying types and intensities of interference. Our results make extensive use of performance counters and indicate that, although effective, the quality of partitioning and isolation provided by the DSU depends on the type and the intensity of the interfering workloads. In addition, we uncover and analyze in detail the correlation between benchmarks and different types and intensities of interference.
- [716] arXiv:2503.17125 (replaced) [pdf, other]
-
Title: LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning
Comments: This paper is currently under security review and will be re-released once the review is complete
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although the latest research has attempted to address this by guiding agents back to in-distribution states, their reliance on uncertainty estimation hinders scalability in complex environments. To overcome this limitation, we introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task, leveraging the capabilities of LVLMs in image description, logical reasoning, and code generation. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks and even generalizes effectively to complex environments, including humanoid locomotion and mobile manipulation, where existing methods struggle. The code and supplementary materials are available at this https URL.
- [717] arXiv:2503.17132 (replaced) [pdf, html, other]
-
Title: Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
This paper explores the promising interplay between spiking neural networks (SNNs) and event-based cameras for privacy-preserving human action recognition (HAR). The unique feature of event cameras in capturing only the outlines of motion, combined with SNNs' proficiency in processing spatiotemporal data through spikes, establishes a highly synergistic compatibility for event-based HAR. Previous studies, however, have been limited by SNNs' ability to process long-term temporal information, essential for precise HAR. In this paper, we introduce two novel frameworks to address this: temporal segment-based SNN (TS-SNN) and 3D convolutional SNN (3D-SNN). The TS-SNN extracts long-term temporal information by dividing actions into shorter segments, while the 3D-SNN replaces 2D spatial elements with 3D components to facilitate the transmission of temporal information. To promote further research in event-based HAR, we create a dataset, FallingDetection-CeleX, collected using the high-resolution CeleX-V event camera $(1280 \times 800)$, comprising 7 distinct actions. Extensive experimental results show that our proposed frameworks surpass state-of-the-art SNN methods on our newly collected dataset and three other neuromorphic datasets, showcasing their effectiveness in handling long-range temporal information for event-based HAR.
- [718] arXiv:2503.17695 (replaced) [pdf, html, other]
-
Title: MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.
- [719] arXiv:2503.17922 (replaced) [pdf, html, other]
-
Title: WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Subjects: Computation and Language (cs.CL)
With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
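To make the idea of keeping whole windows of consecutive tokens under a fixed cache budget concrete, here is a minimal sketch. It is not the WindowKV algorithm: the per-token importance scores, the fixed non-overlapping windows, the mean-score ranking, and the function name `select_windows` are all assumptions made for illustration, and the task-adaptive and group-wise sharing parts of the method are not modeled.

```python
from typing import List


def select_windows(scores: List[float], window: int, budget_ratio: float) -> List[int]:
    """Keep whole windows of consecutive tokens, highest mean score first,
    until the token budget is used up. Returns the kept token indices, sorted."""
    n = len(scores)
    budget = max(1, round(n * budget_ratio))  # number of tokens we may retain
    # Split positions into non-overlapping windows of consecutive tokens.
    windows = [(start, min(start + window, n)) for start in range(0, n, window)]
    ranked = sorted(
        windows,
        key=lambda w: sum(scores[w[0]:w[1]]) / (w[1] - w[0]),
        reverse=True,
    )
    kept: List[int] = []
    for start, end in ranked:
        if len(kept) + (end - start) > budget:
            continue  # this window would exceed the budget; try the next one
        kept.extend(range(start, end))
    return sorted(kept)


# Toy importance scores for 100 cached tokens (hypothetical values).
scores = [0.0] * 100
scores[8:12] = [1.0] * 4
scores[40:44] = [0.9] * 4
scores[96:100] = [0.8] * 4

kept = select_windows(scores, window=4, budget_ratio=0.12)
print(kept)
```

Selecting windows instead of individual top-scoring tokens keeps each retained span contiguous, which is the property the abstract ties to semantic coherence.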
- [720] arXiv:2503.18297 (replaced) [pdf, html, other]
-
Title: Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. In this paper, we propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, even surpassing pre-trained large language models on some metrics.
- [721] arXiv:2503.18305 (replaced) [pdf, html, other]
-
Title: Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented
Guangsheng Ou, Mingwei Liu, Yuxuan Chen, Xueying Du, Shengbo Wang, Zekai Zhang, Xin Peng, Zibin Zheng
Subjects: Software Engineering (cs.SE)
Large language models (LLMs) have performed well in function-level code translation without repository-level context. However, the performance of LLMs in repository-level context code translation remains suboptimal due to complex dependencies and context, hindering their adoption in industrial settings. In this work, we propose a novel LLM-based code translation technique, K-Trans, which leverages triple knowledge augmentation to enhance the LLM's translation quality under repository context in real-world software development. First, K-Trans constructs a translation knowledge base by extracting relevant information from target-language codebases, the repository being translated, and prior translation results. Second, for each function to be translated, K-Trans retrieves relevant triple knowledge, including target-language code samples, dependency usage examples, and successful translation function pairs, serving as references to enhance the LLM for translation. Third, K-Trans constructs a knowledge-augmented translation prompt using the retrieved triple knowledge and employs LLMs to generate the translated code while preserving repository context. It further leverages LLMs for self-debugging, enhancing translation correctness.
The experiments show that K-Trans substantially outperforms the baseline adapted from previous work by 19.4%/40.2% relative improvement in pass@1 and 0.138 in CodeBLEU. The results also show that each type of knowledge contributes significantly to K-Trans's effectiveness in handling repository-level context code translation, with dependency usage examples making the most notable contribution. Moreover, as the self-evolution process progresses, the knowledge base continuously enhances the LLM's performance across various aspects of the repository-level code translation.
- [722] arXiv:2503.18684 (replaced) [pdf, html, other]
-
Title: Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters
Comments: Project link: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Continual adaptation is essential for general autonomous agents. For example, a household robot pretrained with a repertoire of skills must still adapt to unseen tasks specific to each household. Motivated by this, building upon parameter-efficient fine-tuning in language models, prior works have explored lightweight adapters to adapt pretrained policies, which can preserve learned features from the pretraining phase and demonstrate good adaptation performances. However, these approaches treat task learning separately, limiting knowledge transfer between tasks. In this paper, we propose Online Meta-Learned adapters (OMLA). Instead of applying adapters directly, OMLA can facilitate knowledge transfer from previously learned tasks to current learning tasks through a novel meta-learning objective. Extensive experiments in both simulated and real-world environments demonstrate that OMLA can lead to better adaptation performances compared to the baseline methods. The project link: this https URL.
- [723] arXiv:2503.18769 (replaced) [pdf, html, other]
-
Title: AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning
Subjects: Computation and Language (cs.CL); Robotics (cs.RO)
This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific (x, y, z) coordinates. Experimental results suggest that AlphaSpace demonstrates promising potential for improving manipulation tasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. These results demonstrate the potential of structured spatial encoding for manipulation tasks and warrant further exploration.
- [724] arXiv:2503.18869 (replaced) [pdf, html, other]
-
Title: Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design
Comments: 9 pages, 11 figures
Subjects: Hardware Architecture (cs.AR)
The efficiency of Large Language Model (LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity and/or bandwidth consumption at the cost of slight degradation in inference quality. This paper introduces a design solution that further alleviates memory bottlenecks by enhancing the on-chip memory controller in AI accelerators to achieve two main objectives: (1) significantly reducing memory capacity and bandwidth usage through lossless block compression (e.g., LZ4 and ZSTD) of model weights and key-value (KV) cache without compromising inference quality, and (2) enabling memory bandwidth and energy consumption to scale proportionally with context-dependent dynamic quantization. These goals are accomplished by equipping the on-chip memory controller with mechanisms to improve fine-grained bit-level accessibility and compressibility of weights and KV cache through LLM-aware configuration of in-memory placement and representation. Experimental results on publicly available LLMs demonstrate the effectiveness of this approach, showing memory footprint reductions of 25.2% for model weights and 46.9% for KV cache. In addition, our hardware prototype at 4 GHz and 32 lanes (7 nm) achieves 8 TB/s throughput with a modest area overhead (under 3.8 mm$^2$), which underscores the viability of LLM-aware memory control as a key to efficient large-scale inference.
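As a rough software illustration of lossless block compression and of how a footprint reduction figure is computed, the sketch below compresses a byte buffer in fixed-size blocks and checks that it round-trips exactly. Several things here are assumptions: `zlib` from the Python standard library is swapped in for the LZ4 and ZSTD codecs named in the abstract, the 4096-byte block size is arbitrary, and the toy buffer is far more compressible than real model weights, so the printed percentage says nothing about the paper's results.

```python
import zlib
from typing import List


def blockwise_compress(data: bytes, block_size: int = 4096) -> List[bytes]:
    """Compress each fixed-size block independently so blocks can be fetched alone."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def blockwise_decompress(blocks: List[bytes]) -> bytes:
    """Reverse blockwise_compress; the result is bit-identical to the input."""
    return b"".join(zlib.decompress(block) for block in blocks)


def footprint_reduction(original: bytes, blocks: List[bytes]) -> float:
    """Percentage by which the compressed blocks are smaller than the original."""
    compressed_size = sum(len(block) for block in blocks)
    return 100.0 * (1.0 - compressed_size / len(original))


# Toy stand-in for a weight buffer: 64 KiB of a repeating byte pattern.
weights = bytes(range(16)) * 4096
blocks = blockwise_compress(weights)
restored = blockwise_decompress(blocks)

print(f"lossless: {restored == weights}, "
      f"reduction: {footprint_reduction(weights, blocks):.1f}%")
```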
- [725] arXiv:2503.18940 (replaced) [pdf, html, other]
-
Title: Training-free Diffusion Acceleration with Bottleneck Sampling
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics.
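The high-low-high workflow described above can be pictured as a per-step resolution schedule. The following is a minimal sketch under stated assumptions: the function name `resolution_schedule`, the step counts, and the two resolutions are hypothetical, and the refinement of transition points and timestep shifting mentioned in the abstract is not modeled.

```python
from typing import List


def resolution_schedule(
    num_steps: int,
    high_res: int,
    low_res: int,
    head_steps: int,
    tail_steps: int,
) -> List[int]:
    """Return the resolution used at each denoising step: high resolution for
    the first head_steps and last tail_steps, low resolution in between."""
    if head_steps + tail_steps > num_steps:
        raise ValueError("head_steps + tail_steps must not exceed num_steps")
    schedule = []
    for step in range(num_steps):
        in_head = step < head_steps
        in_tail = step >= num_steps - tail_steps
        schedule.append(high_res if (in_head or in_tail) else low_res)
    return schedule


# Hypothetical configuration: 10 steps, 2 high-resolution steps up front, 3 at the end.
schedule = resolution_schedule(
    num_steps=10, high_res=1024, low_res=512, head_steps=2, tail_steps=3
)
print(schedule)
```

With this configuration half of the steps run at the lower resolution, which is where the savings in attention cost would come from.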
- [726] arXiv:2503.18943 (replaced) [pdf, html, other]
-
Title: SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
Comments: Technical report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
- [727] arXiv:2503.19176 (replaced) [pdf, html, other]
-
Title: SoK: How Robust is Audio Watermarking in Generative AI models?
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at this https URL.
- [728] arXiv:2503.19180 (replaced) [pdf, html, other]
-
Title: "Test, Build, Deploy" -- A CI/CD Framework for Open-Source Hardware Designs
Comments: 6 pages, 3 figures, under submission at ICECET'25
Subjects: Hardware Architecture (cs.AR)
Addressing TEDx, Amber Huffman made an impassioned case that "none of us is as smart as all of us" and that open-source hardware is the future. A major contributor to software quality, open source and otherwise, is the systems design methodology of Continuous Integration and Delivery (CI/CD), which we propose to systematically bring to hardware designs and their specifications. To do so, we automatically generate specifications using specification mining, "a machine learning approach to discovering formal specifications" which dramatically impacted the ability of software engineers to achieve quality, verification, and security. Yet applying the same techniques to hardware is non-trivial. We present a technique for generalized, continuous integration (CI) of hardware specification designs that continually deploys (CD) a hardware specification. As a proof-of-concept, we demonstrate Myrtha, a cloud-based specification generator based on established hardware and software quality tools.
- [729] arXiv:2503.19285 (replaced) [pdf, html, other]
-
Title: No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism
Comments: 10 pages, 3 figures, submitted to AMIA 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Despite the outstanding performance of deep learning models in clinical prediction tasks, explainability remains a significant challenge. Inspired by transformer architectures, we introduce the Temporal-Feature Cross Attention Mechanism (TFCAM), a novel deep learning framework designed to capture dynamic interactions among clinical features across time, enhancing both predictive accuracy and interpretability. In an experiment with 1,422 patients with Chronic Kidney Disease, predicting progression to End-Stage Renal Disease, TFCAM outperformed LSTM and RETAIN baselines, achieving an AUROC of 0.95 and an F1-score of 0.69. Beyond performance gains, TFCAM provides multi-level explainability by identifying critical temporal periods, ranking feature importance, and quantifying how features influence each other across time before affecting predictions. Our approach addresses the "black box" limitations of deep learning in healthcare, offering clinicians transparent insights into disease progression mechanisms while maintaining state-of-the-art predictive performance.
- [730] arXiv:2503.19316 (replaced) [pdf, html, other]
-
Title: A Social Dynamical System for Twitter Analysis
Comments: will be submitted to a journal soon
Subjects: Social and Information Networks (cs.SI)
Understanding the evolution of public opinion is crucial for informed decision-making in various domains, particularly public affairs. The rapid growth of social networks, such as Twitter (now rebranded as X), provides an unprecedented opportunity to analyze public opinion at scale without relying on traditional surveys. With the rise of deep learning, Graph Neural Networks (GNNs) have shown great promise in modeling online opinion dynamics. Notably, classical opinion dynamics models, such as DeGroot, can be reformulated within a GNN framework.
We introduce Latent Social Dynamical System (LSDS), a novel framework for modeling the latent dynamics of social media users' opinions based on textual content. Since expressed opinions may not fully reflect underlying beliefs, LSDS first encodes post content into latent representations. It then leverages a GraphODE framework, using a GNN-based ODE function to predict future opinions. A decoder subsequently utilizes these predicted latent opinions to perform downstream tasks, such as interaction prediction, which serve as benchmarks for model evaluation. Our framework is highly flexible, supporting various opinion dynamic models as ODE functions, provided they can be adapted into a GNN-based form. It also accommodates different encoder architectures and is compatible with diverse downstream tasks.
To validate our approach, we constructed dynamic datasets from Twitter data. Experimental results demonstrate the effectiveness of LSDS, highlighting its potential for future applications. We plan to publicly release our dataset and code upon the publication of this paper.
- [731] arXiv:2503.19470 (replaced) [pdf, other]
-
Title: ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, Weipeng Chen
Comments: Work in progress
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
- [732] arXiv:2503.19654 (replaced) [pdf, html, other]
-
Title: RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
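The difference between the two accuracy metrics mentioned in this abstract can be illustrated with a small sketch. The abstract does not spell out the exact grouping rule, so the code below assumes that a skill group (here keyed by a hypothetical `sample` id and `skill` name) only counts as correct when every question in it is answered correctly. The record layout and function names are illustrative and are not taken from the benchmark's code.

```python
from collections import defaultdict


def question_level_accuracy(records):
    """Fraction of individual Yes/No questions answered correctly."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def skill_level_accuracy(records):
    """Fraction of (sample, skill) groups in which every question is correct.

    Assumption: a group scores only if all of its questions are right,
    which makes this metric stricter than question-level accuracy.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["sample"], r["skill"])].append(r["pred"] == r["gold"])
    return sum(all(g) for g in groups.values()) / len(groups)


# Hypothetical toy predictions: one wrong answer in the "counting" group.
records = [
    {"sample": 0, "skill": "counting", "pred": "yes", "gold": "yes"},
    {"sample": 0, "skill": "counting", "pred": "no", "gold": "yes"},
    {"sample": 0, "skill": "scene", "pred": "no", "gold": "no"},
    {"sample": 1, "skill": "scene", "pred": "yes", "gold": "yes"},
]

q_acc = question_level_accuracy(records)
s_acc = skill_level_accuracy(records)
```

Under this assumption, a single wrong answer invalidates its whole group, so the skill-level score can never exceed the question-level one when groups are the same size, and it penalizes inconsistent answering more heavily.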
- [733] arXiv:2503.19690 (replaced) [pdf, html, other]
-
Title: Risk-Aware Reinforcement Learning for Autonomous Driving: Improving Safety When Driving through Intersection
Comments: 11 pages, 10 figures
Subjects: Robotics (cs.RO)
Applying reinforcement learning to autonomous driving has garnered widespread attention. However, classical reinforcement learning methods optimize policies by maximizing expected rewards but lack sufficient safety considerations, often putting agents in hazardous situations. This paper proposes a risk-aware reinforcement learning approach for autonomous driving to improve safety performance when crossing intersections. Safe critics are constructed to evaluate driving risk and work in conjunction with the reward critic to update the actor. Based on this, a Lagrangian relaxation method and cyclic gradient iteration are combined to project actions into a feasible safe region. Furthermore, a Multi-hop and Multi-Layer Perceptron (MLP) mixed Attention Mechanism (MMAM) is incorporated into the actor-critic network, enabling the policy to adapt to dynamic traffic and overcome permutation sensitivity challenges. This allows the policy to focus more effectively on surrounding potential risks while enhancing the identification of passing opportunities. Simulation tests are conducted on different tasks at unsignalized intersections. The results show that the proposed approach effectively reduces collision rates and improves crossing efficiency in comparison to baseline algorithms. Additionally, our ablation experiments demonstrate the benefits of incorporating risk-awareness and MMAM into RL.
- [734] arXiv:2503.19721 (replaced) [pdf, html, other]
-
Title: EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mamba-based vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba--a specialized model designed for EBVR tasks. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba's robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.
- [735] arXiv:2503.20074 (replaced) [pdf, html, other]
-
Title: Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience
Comments: 14 pages, 7 figures
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI)
The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
- [736] arXiv:2503.20083 (replaced) [pdf, html, other]
-
Title: Cross-Tokenizer Distillation via Approximate Likelihood Matching
Comments: Preprint
Subjects: Computation and Language (cs.CL)
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tokenizer between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss as the main objective, instead purely maximizing the student predictions' similarity to the teacher's predictions (known as pure distillation), while also being robust to large mismatches between the teacher and the student tokenizer function and vocabulary. Empirically, our method enables substantially improved performance as tested on two use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers. We transfer (subword-level) Llama and Gemma models to byte-level tokenization more effectively than prior methods transfer to a similar subword tokenizer under a comparable training budget. Transferring different base models to the same tokenizer also enables ensembling them (e.g., via averaging their predicted probabilities) which boosts performance. Second, we use our cross-tokenizer distillation method to distil a large maths-specialized LLM into a smaller model, achieving competitive maths problem-solving performance. Overall, our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
- [737] arXiv:2503.20235 (replaced) [pdf, html, other]
-
Title: Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection
Comments: Accepted to CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Symmetry plays a vital role in understanding structural patterns, aiding object recognition and scene interpretation. This paper focuses on rotation symmetry, where objects remain unchanged when rotated around a central axis, requiring detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks detect rotation centers but struggle with 3D geometric consistency due to viewpoint distortions. To overcome this, we propose a model that directly predicts rotation centers and vertices in 3D space and projects the results back to 2D while preserving structural integrity. By incorporating a vertex reconstruction stage enforcing 3D geometric priors -- such as equal side lengths and interior angles -- our model enhances robustness and accuracy. Experiments on the DENDI dataset show superior performance in rotation axis detection and validate the impact of 3D priors through ablation studies.
- [738] arXiv:2503.20262 (replaced) [pdf, html, other]
-
Title: From the CDC to emerging infectious disease publics: The long-now of polarizing and complex health crises
Subjects: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
As the COVID-19 pandemic evolved, the Centers for Disease Control and Prevention used Twitter to share updates about the virus and safety guidelines, reaching millions instantly, in what we call the CDC public. We analyze two years of tweets from, to, and about the CDC using a mixed-methods approach to characterize the nature and credibility of COVID-19 discourse and audience engagement. We found that the CDC is not engaging in two-way communication with the CDC publics and that discussions about COVID-19 reflected societal divisions and political polarization. We introduce a crisis message journey concept showing how the CDC public responds to the changing nature of the crisis (e.g., new variants) using "receipts" of earlier, and at times contradictory, guidelines. We propose design recommendations to support the CDC in tailoring messages to specific users and publics (e.g., users interested in racial equity) and in managing misinformation, especially in reaction to crisis flashpoints.
- [739] arXiv:2503.20275 (replaced) [pdf, html, other]
-
Title: Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
The growing scale of data requires efficient memory subsystems with large memory capacity and high memory performance. Disaggregated architecture has become a promising solution for today's cloud and edge computing for its scalability and elasticity. As a critical part of disaggregation, disaggregated memory faces design challenges in many dimensions, including hardware scalability, architecture structure, software system design, application programmability, resource allocation, power management, etc. These challenges inspire a number of novel solutions at different system levels to improve system efficiency. In this paper, we provide a comprehensive review of disaggregated memory, including the methodology and technologies of disaggregated memory system foundation, optimization, and management. We study the technical essentials of disaggregated memory systems and analyze them from the hardware, architecture, system, and application levels. Then, we compare the design details of typical cross-layer designs on disaggregated memory. Finally, we discuss the challenges and opportunities for future disaggregated memory work that better serves next-generation elastic and efficient datacenters.
- [740] arXiv:2503.20286 (replaced) [pdf, html, other]
-
Title: Bridging Evolutionary Multiobjective Optimization and GPU Acceleration via Tensorization
Comments: Accepted by IEEE TEVC
Subjects: Neural and Evolutionary Computing (cs.NE)
Evolutionary multiobjective optimization (EMO) has made significant strides over the past two decades. However, as problem scales and complexities increase, traditional EMO algorithms face substantial performance limitations due to insufficient parallelism and scalability. While most work has focused on algorithm design to address these challenges, little attention has been given to hardware acceleration, thereby leaving a clear gap between EMO algorithms and advanced computing devices, such as GPUs. To bridge the gap, we propose to parallelize EMO algorithms on GPUs via the tensorization methodology. By employing tensorization, the data structures and operations of EMO algorithms are transformed into concise tensor representations, which seamlessly enables automatic utilization of GPU computing. We demonstrate the effectiveness of our approach by applying it to three representative EMO algorithms: NSGA-III, MOEA/D, and HypE. To comprehensively assess our methodology, we introduce a multiobjective robot control benchmark using a GPU-accelerated physics engine. Our experiments show that the tensorized EMO algorithms achieve speedups of up to 1113x compared to their CPU-based counterparts, while maintaining solution quality and effectively scaling population sizes to hundreds of thousands. Furthermore, the tensorized EMO algorithms efficiently tackle complex multiobjective robot control tasks, producing high-quality solutions with diverse behaviors. Source code is available at this https URL.
- [741] arXiv:2503.20308 (replaced) [pdf, html, other]
-
Title: Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Comments: CVPR 2025
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at this https URL.
- [742] arXiv:2503.20313 (replaced) [pdf, html, other]
-
Title: TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone.
In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of a frontend and a backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves speedups of $1.17\times$ to $20.76\times$ over non-overlapping baselines and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
- [743] arXiv:2503.20321 (replaced) [pdf, html, other]
-
Title: Recovering Dynamic 3D Sketches from Videos
Comments: Accepted to CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage. The project page is accessible via: this https URL
- [744] arXiv:2503.20336 (replaced) [pdf, html, other]
-
Title: Power Minimization for NOMA-assisted Pinching Antenna Systems With Multiple Waveguides
Subjects: Information Theory (cs.IT)
The integration of pinching antenna systems with non-orthogonal multiple access (NOMA) has emerged as a promising technique for future 6G applications. This paper is the first to investigate power minimization for NOMA-assisted pinching antenna systems utilizing multiple dielectric waveguides. We formulate a total power minimization problem constrained by each user's minimum data requirements, addressing a classical challenge. To efficiently solve the non-convex optimization problem, we propose an iterative algorithm. Furthermore, we demonstrate that the interference function of this algorithm is standard, ensuring convergence to a unique fixed point. Numerical simulations validate that our developed algorithm converges within a few steps and significantly outperforms benchmark strategies across various data rate requirements. The results also indicate that the minimum transmit power, as a function of the interval between the waveguides, exhibits an approximately oscillatory decay with a negative trend.
- [745] arXiv:2503.20349 (replaced) [pdf, html, other]
-
Title: Consistency Trajectory Matching for One-Step Generative Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.
- [746] arXiv:2503.20519 (replaced) [pdf, html, other]
-
Title: MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
Comments: Accepted to CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient upscaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).
- [747] arXiv:2503.20578 (replaced) [pdf, other]
-
Title: LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
Subjects: Software Engineering (cs.SE)
Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs -- LLaMA, Qwen, and Qwen-Coder -- in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
- [748] arXiv:2503.20639 (replaced) [pdf, html, other]
-
Title: PVLens: Enhancing Pharmacovigilance Through Automated Label Extraction
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Reliable drug safety reference databases are essential for pharmacovigilance, yet existing resources like SIDER are outdated and static. We introduce PVLens, an automated system that extracts labeled safety information from FDA Structured Product Labels (SPLs) and maps terms to MedDRA. PVLens integrates automation with expert oversight through a web-based review tool. In validation against 97 drug labels, PVLens achieved an F1 score of 0.882, with high recall (0.983) and moderate precision (0.799). By offering a scalable, more accurate, and continuously updated alternative to SIDER, PVLens enhances real-time pharmacovigilance with improved accuracy and contemporaneous insights.
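As a quick consistency check on the figures quoted in this abstract, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values given above; the function name is illustrative, and because the inputs are already rounded to three decimals the result is only expected to agree with the reported score up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract.
precision = 0.799
recall = 0.983

f1 = f1_score(precision, recall)
f1_rounded = round(f1, 3)
```

The recomputed value rounds to the 0.882 stated in the abstract, so the three reported numbers are mutually consistent.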
- [749] arXiv:2503.20646 (replaced) [pdf, html, other]
-
Title: Immersive and Wearable Thermal Rendering for Augmented Reality
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
In augmented reality (AR), where digital content is overlaid onto the real world, realistic thermal feedback has been shown to enhance immersion. Yet current thermal feedback devices, heavily influenced by the needs of virtual reality, often hinder physical interactions and are ineffective for immersion in AR. To bridge this gap, we have identified three design considerations relevant for AR thermal feedback: indirect feedback to maintain dexterity, thermal passthrough to preserve real-world temperature perception, and spatiotemporal rendering for dynamic sensations. We then created a unique and innovative thermal feedback device that satisfies these criteria. Human subject experiments assessing perceptual sensitivity, object temperature matching, spatial pattern recognition, and moving thermal stimuli demonstrated the impact of our design, enabling realistic temperature discrimination, virtual object perception, and enhanced immersion. These findings demonstrate that carefully designed thermal feedback systems can bridge the sensory gap between physical and virtual interactions, enhancing AR realism and usability.
- [750] arXiv:2503.20652 (replaced) [pdf, html, other]
-
Title: Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification
Comments: 13 pages, 4 figures. Accepted for MIDL 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.
- [751] arXiv:2503.20660 (replaced) [pdf, html, other]
-
Title: DR-PETS: Learning-Based Control With Planning in Adversarial Environments
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Ensuring robustness against epistemic, possibly adversarial, perturbations is essential for reliable real-world decision-making. While the Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm inherently handles uncertainty via ensemble-based probabilistic models, it lacks guarantees against structured adversarial or worst-case uncertainty distributions. To address this, we propose DR-PETS, a distributionally robust extension of PETS that certifies robustness against adversarial perturbations. We formalize uncertainty via a p-Wasserstein ambiguity set, enabling worst-case-aware planning through a min-max optimization framework. While PETS passively accounts for stochasticity, DR-PETS actively optimizes robustness via a tractable convex approximation integrated into the PETS planning loop. Experiments on pendulum stabilization and cart-pole balancing show that DR-PETS certifies robustness against adversarial parameter perturbations, achieving consistent performance in worst-case scenarios where PETS deteriorates.
- [752] arXiv:2503.20673 (replaced) [pdf, html, other]
-
Title: Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.
- [753] arXiv:2503.20685 (replaced) [pdf, html, other]
-
Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Comments: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves performance comparable to fully-supervised learning algorithms.
- [754] arXiv:2503.20749 (replaced) [pdf, html, other]
-
Title: Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs
Subjects: Computation and Language (cs.CL)
Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLMs' objective "accuracy" rather than the subjective "believability" in the web action generation task, leveraging a large-scale, real-world dataset collected from online shopping human actions. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.
- [755] arXiv:2503.20752 (replaced) [pdf, html, other]
-
Title: Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning
Comments: 35 pages, 22 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation. Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines. Project website: this https URL
- [756] arXiv:2503.20768 (replaced) [pdf, html, other]
-
Title: An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) enables distributed ML model training on private user data at the global scale. Despite the potential of FL demonstrated in many domains, an in-depth view of its impact on model accuracy remains unclear. In this paper, we investigate, systematically, how this learning paradigm can affect the accuracy of state-of-the-art ML models for a variety of ML tasks. We present an empirical study that involves various data types: text, image, audio, and video, and FL configuration knobs: data distribution, FL scale, client sampling, and local and global computations. Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human efforts and resource investments. Based on the results, we perform a quantitative analysis of the impact of FL, and highlight challenging scenarios where applying FL degrades the accuracy of the model drastically and identify cases where the impact is negligible. The detailed and extensive findings can benefit practical deployments and future development of FL.
- [757] arXiv:2305.15364 (replaced) [pdf, html, other]
-
Title: LQG Risk-Sensitive Single-Agent and Major-Minor Mean-Field Game Systems: A Variational Framework
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Probability (math.PR); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
We develop a variational approach to address risk-sensitive optimal control problems with an exponential-of-integral cost functional in a general linear-quadratic-Gaussian (LQG) single-agent setup, offering new insights into such problems. Our analysis leads to the derivation of a nonlinear necessary and sufficient condition of optimality, expressed in terms of martingale processes. Subject to specific conditions, we find an equivalent risk-neutral measure, under which a linear state feedback form can be obtained for the optimal control. It is then shown that the obtained feedback control is consistent with the imposed condition and remains optimal under the original measure. Building upon this development, we (i) propose a variational framework for general LQG risk-sensitive mean-field games (MFGs) and (ii) advance the LQG risk-sensitive MFG theory by incorporating a major agent in the framework. The major agent interacts with a large number of minor agents, and unlike the minor agents, its influence on the system remains significant even with an increasing number of minor agents. We derive the Markovian closed-loop best-response strategies of agents in the limiting case where the number of agents goes to infinity. We establish that the set of obtained best-response strategies yields a Nash equilibrium in the limiting case and an $\varepsilon$-Nash equilibrium in the finite-player case.
- [758] arXiv:2308.10977 (replaced) [pdf, html, other]
-
Title: An elementary proof of Bridy's theorem
Comments: 31 pages, 2 figures, 2 tables; publication version
Subjects: Number Theory (math.NT); Formal Languages and Automata Theory (cs.FL); Symbolic Computation (cs.SC)
Christol's theorem states that a power series with coefficients in a finite field is algebraic if and only if its coefficient sequence is automatic. A natural question is how the size of a polynomial describing such a sequence relates to the size of an automaton describing the same sequence. Bridy used tools from algebraic geometry to bound the size of the minimal automaton for a sequence, given its minimal polynomial. We produce a new proof of Bridy's bound by embedding algebraic sequences as diagonals of rational functions.
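A standard illustration of Christol's theorem (not taken from the paper itself) is the Thue-Morse sequence over the field with two elements. It is generated by a 2-state automaton reading the binary digits of n, and its generating function F satisfies the polynomial equation (1+x)^3 F^2 + (1+x)^2 F + x = 0 modulo 2. The sketch below checks both facts on a truncated series; all function names are illustrative.

```python
def thue_morse(n_terms):
    """t(n) = parity of the number of 1 bits in n."""
    return [bin(n).count("1") % 2 for n in range(n_terms)]


def automaton_value(n):
    """Two-state automaton: the state flips on every 1 digit of n in base 2."""
    state = 0
    for digit in bin(n)[2:]:
        if digit == "1":
            state ^= 1
    return state


def mul_mod2(a, b, n_terms):
    """Product of two coefficient lists over GF(2), truncated to n_terms."""
    out = [0] * n_terms
    for i, ai in enumerate(a):
        if not ai:
            continue
        for j, bj in enumerate(b):
            if i + j >= n_terms:
                break
            if bj:
                out[i + j] ^= 1
    return out


def christol_residual(n_terms):
    """Coefficients of (1+x)^3 F^2 + (1+x)^2 F + x modulo 2, truncated.

    Requires n_terms >= 2. Truncation only discards higher-order terms,
    so every returned coefficient is exact.
    """
    f = thue_morse(n_terms)
    f_squared = mul_mod2(f, f, n_terms)
    term_a = mul_mod2([1, 1, 1, 1], f_squared, n_terms)  # (1+x)^3 = 1+x+x^2+x^3 mod 2
    term_b = mul_mod2([1, 0, 1], f, n_terms)             # (1+x)^2 = 1+x^2 mod 2
    x = [0, 1] + [0] * (n_terms - 2)
    return [p ^ q ^ r for p, q, r in zip(term_a, term_b, x)]


print(thue_morse(8))
print(all(c == 0 for c in christol_residual(64)))
```

Here the automaton has 2 states and the polynomial has degree 2 in F and degree 3 in x, which is the kind of size relationship the abstract asks about.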
- [759] arXiv:2312.15574 (replaced) [pdf, html, other]
-
Title: Clustered Switchback Designs for Experimentation Under Spatio-temporal Interference
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
We consider experimentation in the presence of non-stationarity, inter-unit (spatial) interference, and carry-over effects (temporal interference), where we wish to estimate the global average treatment effect (GATE), the difference between average outcomes having exposed all units at all times to treatment or to control. We suppose spatial interference is described by a graph, where a unit's outcome depends on its neighborhood's treatments, and that temporal interference is described by an MDP, where the transition kernel under either treatment (action) satisfies a rapid mixing condition. We propose a clustered switchback design, where units are grouped into clusters and time steps are grouped into blocks, and each whole cluster-block combination is assigned a single random treatment. Under this design, we show that for graphs that admit good clustering, a truncated Horvitz-Thompson estimator achieves a $\tilde O(1/NT)$ mean squared error (MSE), matching the lower bound up to logarithmic terms for sparse graphs. Our results simultaneously generalize the results from \citet{hu2022switchback,ugander2013graph} and \citet{leung2022rate}. Simulation studies validate the favorable performance of our approach.
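The assignment step of the clustered switchback design described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the fixed block length, and the Bernoulli(p) draw with p = 0.5 are assumptions made here, and the truncated Horvitz-Thompson estimator is not implemented.

```python
import random


def clustered_switchback(cluster_of_unit, n_steps, block_len, p=0.5, seed=0):
    """Return a treatment matrix w[unit][time] with entries in {0, 1}.

    Units are grouped by cluster label, time steps are grouped into
    consecutive blocks of length block_len, and every (cluster, block)
    pair receives a single independent Bernoulli(p) treatment.
    """
    rng = random.Random(seed)
    n_blocks = -(-n_steps // block_len)  # ceiling division
    clusters = sorted(set(cluster_of_unit))
    draw = {
        (c, b): int(rng.random() < p)
        for c in clusters
        for b in range(n_blocks)
    }
    return [
        [draw[(c, t // block_len)] for t in range(n_steps)]
        for c in cluster_of_unit
    ]


# Five units in two clusters, six time steps in blocks of three.
assignment = clustered_switchback([0, 0, 1, 1, 1], n_steps=6, block_len=3)
for row in assignment:
    print(row)
```

Every unit in a cluster shares the same row, and each row is constant within a block, which is the property the design relies on.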
- [760] arXiv:2403.02095 (replaced) [pdf, html, other]
-
Title: Homotopy Methods for Convex Optimization
Comments: 28 pages, 8 figures, v2: close to the published version
Subjects: Optimization and Control (math.OC); Algebraic Geometry (math.AG); Numerical Analysis (math.NA)
Convex optimization encompasses a wide range of optimization problems that contain many efficiently solvable subclasses. Interior point methods are currently the state-of-the-art approach for solving such problems, particularly effective for classes like semidefinite programming, quadratic programming, and geometric programming. However, their success hinges on the construction of self-concordant barrier functions for feasible sets. In this work, we investigate and develop a homotopy-based approach to solve convex optimization problems. While homotopy methods have been considered in optimization before, their potential for general convex programs remains underexplored. This approach gradually transforms the feasible set of a trivial optimization problem into the target one while tracking solutions by solving a differential equation, in contrast to traditional central path methods. We establish a criterion that ensures that the homotopy method correctly solves the optimization problem and prove the existence of such homotopies for several important classes, including semidefinite and hyperbolic programs. Furthermore, we demonstrate that our approach numerically outperforms state-of-the-art methods in hyperbolic programming, highlighting its practical advantages.
- [761] arXiv:2405.05222 (replaced) [pdf, html, other]
-
Title: Brooks-type colourings of digraphs in linear time
Comments: 26 pages, 5 figures
Subjects: Combinatorics (math.CO); Data Structures and Algorithms (cs.DS)
Brooks' Theorem is a fundamental result on graph colouring, stating that the chromatic number of a graph is almost always upper bounded by its maximal degree. Lovász showed that such a colouring may then be computed in linear time when it exists. Many analogues are known for variants of (di)graph colouring, notably for list-colouring and partitions into subgraphs with prescribed degeneracy. One of the most general results of this kind is due to Borodin, Kostochka, and Toft, when asking for classes of colours to satisfy "variable degeneracy" constraints. An extension of this result to digraphs has recently been proposed by Bang-Jensen, Schweser, and Stiebitz, by considering colourings as partitions into "variable weakly degenerate" subdigraphs. Unlike earlier variants, there exists no linear-time algorithm to produce colourings for these generalisations.
We introduce the notion of (variable) bidegeneracy for digraphs, capturing multiple (di)graph degeneracy variants. We define the corresponding concept of $F$-dicolouring, where $F = (f_1,...,f_s)$ is a vector of functions, and an $F$-dicolouring requires vertices coloured $i$ to induce a "strictly-$f_i$-bidegenerate" subdigraph. We prove an analogue of Brooks' theorem for $F$-dicolouring, generalising the result of Bang-Jensen et al., and earlier analogues in turn.
Our new approach provides a linear-time algorithm that, given a digraph $D$, either produces an $F$-dicolouring of $D$, or correctly certifies that none exist. This yields the first linear-time algorithms to compute (di)colourings corresponding to the aforementioned generalisations of Brooks' theorem. In turn, it gives a unified framework to compute such colourings for various intermediate generalisations of Brooks' theorem such as list-(di)colouring and partitioning into (variable) degenerate sub(di)graphs.
- [762] arXiv:2407.00258 (replaced) [pdf, html, other]
-
Title: Topological Graph Simplification Solutions to the Street Intersection Miscount Problem
Journal-ref: Transactions in GIS, 2025
Subjects: Physics and Society (physics.soc-ph); Discrete Mathematics (cs.DM); Systems and Control (eess.SY); Computation (stat.CO)
Street intersection counts and densities are ubiquitous measures in transport geography and planning. However, typical street network data and typical street network analysis tools can substantially overcount them. This article explains the three main reasons why this happens and presents solutions to each. It contributes algorithms to automatically simplify spatial graphs of urban street networks -- via edge simplification and node consolidation -- resulting in faster parsimonious models and more accurate network measures like intersection counts and densities, street segment lengths, and node degrees. These algorithms' information compression improves downstream graph analytics' memory and runtime efficiency, boosting analytical tractability without loss of model fidelity. Finally, this article validates these algorithms and empirically assesses intersection count biases worldwide to demonstrate the problem's widespread prevalence. Without consolidation, traditional methods would overestimate the median urban area intersection count by 14%. However, this bias varies drastically across regions, underscoring these algorithms' importance for consistent comparative empirical analyses.
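To make the edge simplification idea concrete, the sketch below removes interstitial degree-2 nodes from a small undirected graph and merges the lengths of the segments they joined. It is a simplified illustration under assumptions chosen here (a simple undirected graph stored as nested dictionaries, no parallel edges or self-loops, and a deliberately naive count that treats every non-dead-end node as an intersection); it is not the article's algorithm and does not cover node consolidation.

```python
def simplify_edges(adj):
    """Remove degree-2 nodes, joining their two neighbours directly.

    adj maps node -> {neighbour: segment_length}. A degree-2 node is kept
    if removing it would create a parallel edge, since this toy structure
    cannot represent one.
    """
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            nbrs = adj[v]
            if len(nbrs) != 2:
                continue
            a, b = nbrs
            if b in adj[a]:
                continue
            merged_length = nbrs[a] + nbrs[b]
            del adj[a][v], adj[b][v], adj[v]
            adj[a][b] = adj[b][a] = merged_length
            changed = True
    return adj


def naive_intersection_count(adj):
    """Count every node that is not a dead end (degree >= 2)."""
    return sum(1 for nbrs in adj.values() if len(nbrs) >= 2)


def add_edge(adj, u, v, length):
    adj.setdefault(u, {})[v] = length
    adj.setdefault(v, {})[u] = length


# A four-way crossing whose arms each contain one interstitial node.
raw = {}
for arm in "nesw":
    add_edge(raw, "c", arm + "1", 10.0)
    add_edge(raw, arm + "1", arm + "2", 10.0)

simplified = simplify_edges(raw)
print(naive_intersection_count(raw), naive_intersection_count(simplified))
```

In this toy network the naive count falls from five to the single true crossing, and each arm becomes one segment of length 20.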
- [763] arXiv:2407.11828 (replaced) [pdf, html, other]
-
Title: Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors
Julien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poirée, Véronique Zimpfer, Éric Bavu
Comments: 23 pages, 42 figures
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 hours per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.
- [764] arXiv:2408.07254 (replaced) [pdf, html, other]
-
Title: Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics
Comments: 36 pages, 2 figures. To appear in the International Conference on Learning Representations (ICLR), 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the problem of learning multi-index models in high-dimensions using a two-layer neural network trained with the mean-field Langevin algorithm. Under mild distributional assumptions on the data, we characterize the effective dimension $d_{\mathrm{eff}}$ that controls both sample and computational complexity by utilizing the adaptivity of neural networks to latent low-dimensional structures. When the data exhibit such a structure, $d_{\mathrm{eff}}$ can be significantly smaller than the ambient dimension. We prove that the sample complexity grows almost linearly with $d_{\mathrm{eff}}$, bypassing the limitations of the information and generative exponents that appeared in recent analyses of gradient-based feature learning. On the other hand, the computational complexity may inevitably grow exponentially with $d_{\mathrm{eff}}$ in the worst-case scenario. Motivated by improving computational complexity, we take the first steps towards polynomial time convergence of the mean-field Langevin algorithm by investigating a setting where the weights are constrained to be on a compact manifold with positive Ricci curvature, such as the hypersphere. There, we study assumptions under which polynomial time convergence is achievable, whereas similar assumptions in the Euclidean setting lead to exponential time complexity.
- [765] arXiv:2408.12691 (replaced) [pdf, html, other]
-
Title: Quantization-aware Matrix Factorization for Low Bit Rate Image Compression
Pooya Ashtari, Pourya Behmandpoor, Fateme Nateghi Haredasht, Jonathan H. Chen, Panagiotis Patrinos, Sabine Van Huffel
Comments: 22 pages, 6 figures, 1 table, 1 algorithm
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Lossy image compression is essential for efficient transmission and storage. Traditional compression methods mainly rely on discrete cosine transform (DCT) or singular value decomposition (SVD), both of which represent image data in continuous domains and, therefore, necessitate carefully designed quantizers. Notably, these methods consider quantization as a separate step, where quantization errors cannot be incorporated into the compression process. The sensitivity of these methods, especially SVD-based ones, to quantization errors significantly degrades reconstruction quality. To address this issue, we introduce a quantization-aware matrix factorization (QMF) to develop a novel lossy image compression method. QMF provides a low-rank representation of the image data as a product of two smaller factor matrices, with elements constrained to bounded integer values, thereby effectively integrating quantization with low-rank approximation. We propose an efficient, provably convergent iterative algorithm for QMF using a block coordinate descent (BCD) scheme, with subproblems having closed-form solutions. Our experiments on the Kodak and CLIC 2024 datasets demonstrate that our QMF compression method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates. We also assessed our method's capability to preserve visual semantics by evaluating an ImageNet pre-trained classifier on compressed images. Remarkably, our method improved top-1 accuracy by over 5 percentage points compared to JPEG at bit rates under 0.25 bpp. The project is available at this https URL .
- [766] arXiv:2410.00068 (replaced) [pdf, other]
-
Title: Denoising VAE as an Explainable Feature Reduction and Diagnostic Pipeline for Autism Based on Resting state fMRI
Xinyuan Zheng, Orren Ravid, Robert A.J. Barry, Yoojean Kim, Qian Wang, Young-geun Kim, Xi Zhu, Xiaofu He
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Applications (stat.AP)
Autism spectrum disorders (ASDs) are developmental conditions characterized by restricted interests and difficulties in communication. The complexity of ASD has resulted in a deficiency of objective diagnostic biomarkers. Deep learning methods have gained recognition for addressing these challenges in neuroimaging analysis, but finding and interpreting such diagnostic biomarkers are still challenging computationally. Here, we propose a feature reduction pipeline using resting-state fMRI data. We used the Craddock atlas and the Power atlas to extract functional connectivity data from rs-fMRI, resulting in over 30 thousand features. By using a denoising variational autoencoder, our proposed pipeline further compresses the connectivity features into 5 latent Gaussian distributions, providing a low-dimensional representation of the data to promote computational efficiency and interpretability. To test the method, we employed the extracted latent representations to classify ASD using traditional classifiers such as SVM on a large multi-site dataset. The 95% confidence interval for the prediction accuracy of SVM is [0.63, 0.76] after site harmonization using the extracted latent distributions. Without using DVAE for dimensionality reduction, the prediction accuracy is 0.70, which falls within the interval. The DVAE successfully encoded the diagnostic information from rs-fMRI data without sacrificing prediction performance. The runtime for training the DVAE and obtaining classification results from its extracted latent features was 7 times shorter compared to training classifiers directly on the raw data. Our findings suggest that the Power atlas provides more effective brain connectivity insights for diagnosing ASD than the Craddock atlas. Additionally, we visualized the latent representations to gain insights into the brain networks contributing to the differences between ASD and neurotypical brains.
- [767] arXiv:2410.16449 (replaced) [pdf, html, other]
-
Title: Robust Feature Learning for Multi-Index Models in High Dimensions
Comments: 41 pages, 1 figure. To appear in the International Conference on Learning Representations (ICLR), 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Recently, there have been numerous studies on feature learning with neural networks, specifically on learning single- and multi-index models where the target is a function of a low-dimensional projection of the input. Prior works have shown that in high dimensions, the majority of the compute and data resources are spent on recovering the low-dimensional projection; once this subspace is recovered, the remainder of the target can be learned independently of the ambient dimension. However, implications of feature learning in adversarial settings remain unexplored. In this work, we take the first steps towards understanding adversarially robust feature learning with neural networks. Specifically, we prove that the hidden directions of a multi-index model offer a Bayes optimal low-dimensional projection for robustness against $\ell_2$-bounded adversarial perturbations under the squared loss, assuming that the multi-index coordinates are statistically independent from the rest of the coordinates. Therefore, robust learning can be achieved by first performing standard feature learning, then robustly tuning a linear readout layer on top of the standard representations. In particular, we show that adversarially robust learning is just as easy as standard learning. Specifically, the additional number of samples needed to robustly learn multi-index models when compared to standard learning does not depend on dimensionality.
- [768] arXiv:2410.17887 (replaced) [pdf, html, other]
-
Title: Average-case matrix discrepancy: satisfiability bounds
Comments: 37 pages, 2 figures; v2: corrections of small typos and error estimates, move of parts of the proof of the first moment method to appendix, and addition of the failure of the second moment method
Subjects: Probability (math.PR); Disordered Systems and Neural Networks (cond-mat.dis-nn); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Given a sequence of $d \times d$ symmetric matrices $\{\mathbf{W}_i\}_{i=1}^n$, and a margin $\Delta > 0$, we investigate whether it is possible to find signs $(\epsilon_1, \dots, \epsilon_n) \in \{\pm 1\}^n$ such that the operator norm of the signed sum satisfies $\|\sum_{i=1}^n \epsilon_i \mathbf{W}_i\|_{\rm op} \leq \Delta$. Kunisky and Zhang (2023) recently introduced a random version of this problem, where the matrices $\{\mathbf{W}_i\}_{i=1}^n$ are drawn from the Gaussian orthogonal ensemble. This model can be seen as a random variant of the celebrated Matrix Spencer conjecture and as a matrix-valued analog of the symmetric binary perceptron in statistical physics. In this work, we establish a satisfiability transition in this problem as $n, d \to \infty$ with $n / d^2 \to \tau > 0$. First, we prove that the expected number of solutions with margin $\Delta=\kappa \sqrt{n}$ has a sharp threshold at a critical $\tau_1(\kappa)$: for $\tau < \tau_1(\kappa)$ the problem is typically unsatisfiable, while for $\tau > \tau_1(\kappa)$ the average number of solutions is exponentially large. Second, combining a second-moment method with recent results from Altschuler (2023) on margin concentration in perceptron-type problems, we identify a second threshold $\tau_2(\kappa)$, such that for $\tau>\tau_2(\kappa)$ the problem admits solutions with high probability. In particular, we establish that a system of $n = \Theta(d^2)$ Gaussian random matrices can be balanced so that the spectrum of the resulting matrix macroscopically shrinks compared to the semicircle law. Finally, under a technical assumption, we show that there exist values of $(\tau,\kappa)$ for which the number of solutions has large variance, implying the failure of the second moment method. Our proofs rely on establishing concentration and large deviation properties of correlated Gaussian matrices under spectral norm constraints.
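The decision problem stated at the start of this abstract can be made concrete for tiny instances by brute force over all sign vectors. The sketch below is only an illustration of the problem statement, not of the paper's proof techniques: the helper names are hypothetical, the GOE normalization used in `sample_goe` is one common convention assumed here, and the exhaustive search is feasible only for small n.

```python
import itertools

import numpy as np


def sample_goe(d, rng):
    """Symmetric Gaussian matrix; the 1/sqrt(2d) scaling is an assumed convention."""
    g = rng.normal(size=(d, d))
    return (g + g.T) / np.sqrt(2 * d)


def min_discrepancy(mats):
    """Smallest operator norm of sum_i eps_i * W_i over all eps in {+1, -1}^n."""
    best_norm, best_signs = np.inf, None
    for signs in itertools.product((1, -1), repeat=len(mats)):
        total = sum(s * w for s, w in zip(signs, mats))
        norm = np.linalg.norm(total, ord=2)  # largest singular value
        if norm < best_norm:
            best_norm, best_signs = norm, signs
    return best_norm, best_signs


rng = np.random.default_rng(0)
d, n, kappa = 3, 9, 0.5  # small illustrative values with n = d^2
mats = [sample_goe(d, rng) for _ in range(n)]
norm, signs = min_discrepancy(mats)
print(f"minimum operator norm: {norm:.3f}")
print("satisfiable at margin kappa * sqrt(n):", norm <= kappa * np.sqrt(n))
```

An instance is satisfiable at margin $\Delta$ exactly when the returned minimum is at most $\Delta$; the thresholds in the abstract describe how this behaves as $n$ and $d$ grow with $n / d^2$ fixed.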
- [769] arXiv:2410.21212 (replaced) [pdf, html, other]
-
Title: On learning higher-order cumulants in diffusion modelsComments: 21 pages, many figures. Extended version of contribution awarded "best 'physics for AI' paper award" in the NeurIPS 2024 workshop "Machine Learning and the Physical Sciences"; v2: references and minor clarifications added, version to appear in Machine Learning: Science and TechnologySubjects: High Energy Physics - Lattice (hep-lat); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
To analyse how diffusion models learn correlations beyond Gaussian ones, we study the behaviour of higher-order cumulants, or connected n-point functions, under both the forward and backward process. We derive explicit expressions for the moment- and cumulant-generating functionals, in terms of the distribution of the initial data and properties of the forward process. It is shown analytically that during the forward process higher-order cumulants are conserved in models without a drift, such as the variance-expanding scheme, and that therefore the endpoint of the forward process maintains nontrivial correlations. We demonstrate that since these correlations are encoded in the score function, higher-order cumulants are learnt in the backward process, even when starting from a normal prior. We confirm our analytical results in an exactly solvable toy model with nonzero cumulants and in scalar lattice field theory.
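The conservation statement has a short finite-dimensional analogue, sketched here under simplifying assumptions that are not taken from the paper: a single scalar variable, and a driftless forward process written as $x_t = x_0 + \sigma_t \eta$ with $\eta \sim \mathcal{N}(0,1)$ independent of $x_0$. Cumulants of independent variables add, and a Gaussian has no cumulants beyond second order, so for $n \geq 2$:

```latex
\kappa_n(x_t) = \kappa_n(x_0) + \sigma_t^n \, \kappa_n(\eta)
=
\begin{cases}
\kappa_2(x_0) + \sigma_t^2, & n = 2, \\
\kappa_n(x_0), & n \geq 3.
\end{cases}
```

Only the variance grows with the noise level, while every cumulant of order three and above stays at its initial value.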
- [770] arXiv:2410.21858 (replaced) [pdf, html, other]
-
Title: Joint Estimation of Conditional Mean and Covariance for Unbalanced PanelsSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
We develop a nonparametric, kernel-based joint estimator for conditional mean and covariance matrices in large and unbalanced panels. The estimator is supported by rigorous consistency results and finite-sample guarantees, ensuring its reliability for empirical applications. We apply it to an extensive panel of monthly US stock excess returns from 1962 to 2021, using macroeconomic and firm-specific covariates as conditioning variables. The estimator effectively captures time-varying cross-sectional dependencies, demonstrating robust statistical and economic performance. We find that idiosyncratic risk explains, on average, more than 75% of the cross-sectional variance.
- [771] arXiv:2411.02087 (replaced) [pdf, html, other]
-
Title: An Exponential Separation Between Quantum and Quantum-Inspired Classical Algorithms for Linear SystemsSubjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Achieving a provable exponential quantum speedup for an important machine learning task has been a central research goal since the seminal HHL quantum algorithm for solving linear systems and the subsequent quantum recommender systems algorithm by Kerenidis and Prakash. These algorithms were initially believed to be strong candidates for exponential speedups, but a lower bound ruling out similar classical improvements remained absent. In breakthrough work by Tang, it was demonstrated that this lack of progress in classical lower bounds was for good reasons. Concretely, she gave a classical counterpart of the quantum recommender systems algorithm, reducing the quantum advantage to a mere polynomial. Her approach is quite general and was named quantum-inspired classical algorithms. Since then, almost all the initially exponential quantum machine learning speedups have been reduced to polynomial via new quantum-inspired classical algorithms. From the current state-of-affairs, it is unclear whether we can hope for exponential quantum speedups for any natural machine learning task.
In this work, we present the first such provable exponential separation between quantum and quantum-inspired classical algorithms for the basic problem of solving a linear system when the input matrix is well-conditioned and has sparse rows and columns.
- [772] arXiv:2411.04844 (replaced) [pdf, html, other]
-
Title: Discretized Gaussian Representation for Tomographic ReconstructionShaokai Wu, Yuxiang Lu, Wei Ji, Suizhi Huang, Fengyu Yang, Shalayiding Sirejiding, Qichen He, Jing Tong, Yanbiao Ji, Yue Ding, Hongtao LuSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Computed Tomography (CT) is a widely used imaging technique that provides detailed cross-sectional views of objects. Over the past decade, Deep Learning-based Reconstruction (DLR) methods have led efforts to enhance image quality and reduce noise, yet they often require large amounts of data and are computationally intensive. Inspired by recent advancements in scene reconstruction, some approaches have adapted NeRF and 3D Gaussian Splatting (3DGS) techniques for CT reconstruction. However, these methods are not ideal for direct 3D volume reconstruction. In this paper, we propose a novel Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs the 3D volume using a set of discretized Gaussian functions in an end-to-end manner. To further enhance computational efficiency, we introduce a Fast Volume Reconstruction technique that aggregates the contributions of these Gaussians into a discretized volume in a highly parallelized fashion. Our extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and significantly improved computational efficiency compared to existing DLR and instance reconstruction methods. Our code has been provided for review purposes and will be made publicly available upon publication.
- [773] arXiv:2411.08077 (replaced) [pdf, html, other]
-
Title: DBgDel: Database-Enhanced Gene Deletion Framework for Growth-Coupled Production in Genome-Scale Metabolic ModelsSubjects: Quantitative Methods (q-bio.QM); Databases (cs.DB)
When simulating metabolite production with genome-scale constraint-based metabolic models, gene deletion strategies are necessary to achieve growth-coupled production, which means cell growth and target metabolite production occur simultaneously. Since computing gene deletion strategies for large genome-scale models is computationally expensive, it is necessary to develop methods to mitigate this computational burden. In this study, we introduce a novel framework for computing gene deletion strategies. The proposed framework first mines related databases to extract prior information about gene deletions for growth-coupled production. It then integrates the extracted information with downstream algorithms to narrow down the algorithmic search space, resulting in highly efficient calculations on genome-scale models. Computational experiments demonstrate that our framework can compute stoichiometrically feasible gene deletion strategies for numerous target metabolites, showcasing a noteworthy improvement in computational efficiency. Specifically, our framework achieves an average 6.1-fold acceleration in computational speed compared to existing methods while maintaining a respectable success rate. The source code of DBgDel, with examples, is available at this https URL.
- [774] arXiv:2411.08909 (replaced) [pdf, html, other]
-
Title: Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection LayersYingheng Wang, Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa RangwalaComments: model weights open-sourced at this https URLSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics, i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models, to learn high-quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction (PPI) graphs for a second stage of training. LC-PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and a 7% to 34% improvement on protein downstream tasks over Transformer-based ESM-2. LC-PLM-G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g. structured state space models) in learning universal protein representations and incorporating molecular interaction context contained in biological graphs.
- [775] arXiv:2412.04584 (replaced) [pdf, html, other]
-
Title: The relevance of higher-order tiesSubjects: Physics and Society (physics.soc-ph); Social and Information Networks (cs.SI)
Higher-order networks effectively represent complex systems with group interactions. Existing methods usually overlook the relative contribution of group interactions (hyperlinks) of different sizes to the overall network structure. Yet, this has many important applications, especially when the network has meaningful node labels. In this work, we propose a comprehensive methodology to precisely measure the contribution of different orders to the overall network structure. First, we propose the order contribution measure, which quantifies the contribution of hyperlinks of different orders to the link weights (local scale), number of triangles (mesoscale) and size of the largest connected component (global scale) of the pairwise weighted network. Second, we propose the measure of order relevance, which gives insight into how hyperlinks of different orders contribute to the considered network property. Most interestingly, it enables an assessment of whether this contribution is synergistic or redundant with respect to that of hyperlinks of other orders. Third, to account for labels, we propose a metric of label group balance to assess how hyperlinks of different orders connect label-induced groups of nodes. We applied these metrics to a large-scale board interlock network and scientific collaboration network, in which node labels correspond to the geographical location of the nodes. Experiments including a comparison with randomized null models reveal that, at the global level, orders contribute synergistically in the board interlock network, whereas the collaboration network shows more redundancy. The findings shed new light on social scientific debates on the role of busy directors in global business networks and the connective effects of large author teams in scientific collaboration networks.
- [776] arXiv:2412.07428 (replaced) [pdf, html, other]
-
Title: Latency Minimization for UAV-Enabled Federated Learning: Trajectory Design and Resource AllocationComments: This manuscript has been submitted to IEEESubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Federated learning (FL) has become a transformative paradigm for distributed machine learning across wireless networks. However, the performance of FL is often hindered by the unreliable communication links between resource-constrained Internet of Things (IoT) devices and the central server. To overcome this challenge, we propose a novel framework that employs an unmanned aerial vehicle (UAV) as a mobile server to enhance the FL training process. By capitalizing on the UAV's mobility, we establish strong line-of-sight connections with IoT devices, thereby enhancing communication reliability and capacity. To maximize training efficiency, we formulate a latency minimization problem that jointly optimizes bandwidth allocation, computing frequencies, transmit power for both the UAV and IoT devices, and the UAV's flight trajectory. Subsequently, we analyze the number of rounds of IoT device training and UAV aggregation required for FL convergence. Based on the convergence constraint, we transform the problem into three subproblems and develop an efficient alternating optimization algorithm to solve it. Additionally, we provide a thorough analysis of the algorithm's convergence and computational complexity. Extensive numerical results demonstrate that our proposed scheme not only surpasses existing benchmark schemes, reducing latency by up to 15.29%, but also achieves training efficiency that nearly matches the ideal scenario.
- [777] arXiv:2412.08453 (replaced) [pdf, html, other]
-
Title: On best approximation by multivariate ridge functions with applications to generalized translation networksSubjects: Functional Analysis (math.FA); Machine Learning (cs.LG); Machine Learning (stat.ML)
We prove sharp upper and lower bounds for the approximation of Sobolev functions by sums of multivariate ridge functions, i.e., functions of the form $\mathbb{R}^d \ni x \mapsto \sum_{k=1}^n h_k(A_k x) \in \mathbb{R}$ with $h_k : \mathbb{R}^\ell \to \mathbb{R}$ and $A_k \in \mathbb{R}^{\ell \times d}$. We show that the order of approximation asymptotically behaves as $n^{-r/(d-\ell)}$, where $r$ is the regularity of the Sobolev functions to be approximated. Our lower bound even holds when approximating $L^\infty$-Sobolev functions of regularity $r$ with error measured in $L^1$, while our upper bound applies to the approximation of $L^p$-Sobolev functions in $L^p$ for any $1 \leq p \leq \infty$. These bounds generalize well-known results about the approximation properties of univariate ridge functions to the multivariate case. Moreover, we use these bounds to obtain sharp asymptotic bounds for the approximation of Sobolev functions using generalized translation networks and complex-valued neural networks.
- [778] arXiv:2412.10863 (replaced) [pdf, html, other]
-
Title: The structure of rough sets defined by reflexive relationsSubjects: Rings and Algebras (math.RA); Logic in Computer Science (cs.LO)
For several types of information relations, the induced rough sets system RS does not form a lattice but only a partially ordered set. However, by studying its Dedekind-MacNeille completion DM(RS), one may reveal new important properties of rough set structures. Building upon D. Umadevi's work on describing joins and meets in DM(RS), we previously investigated pseudo-Kleene algebras defined on DM(RS) for reflexive relations. This paper delves deeper into the order-theoretic properties of DM(RS) in the context of reflexive relations. We describe the completely join-irreducible elements of DM(RS) and characterize when DM(RS) is a spatial completely distributive lattice. We show that even in the case of a non-transitive reflexive relation, DM(RS) can form a Nelson algebra, a property generally associated with quasiorders. We introduce a novel concept, the core of a relational neighborhood, and use it to provide a necessary and sufficient condition for DM(RS) to determine a Nelson algebra.
- [779] arXiv:2501.05001 (replaced) [pdf, html, other]
-
Title: 40 Years of Interdisciplinary Research: Phases, Origins, and Key Turning Points (1981-2020)Comments: 16 pages, 3 figuresSubjects: Applications (stat.AP); Digital Libraries (cs.DL); Physics and Society (physics.soc-ph)
This study examines the historical evolution of interdisciplinary research (IDR) over a 40-year period, focusing on its dynamic trends, phases, and key turning points. We apply time series analysis to identify critical years for interdisciplinary citations (CYICs) and categorize IDR into three distinct phases based on these trends: Period I (1981-2002), marked by sporadic and limited interdisciplinary activity; Period II (2003-2016), characterized by the emergence of large-scale IDR led primarily by Medicine, with significant breakthroughs in cloning and medical technology; and Period III (2017-present), where IDR became a widely adopted research paradigm. Our findings indicate that IDR has been predominantly concentrated within the Natural Sciences, with Medicine consistently at the forefront, and highlight increasing contributions from Engineering and Environmental disciplines as a new trend. These insights enhance the understanding of the evolution of IDR, its driving factors, and the shifts in the focus of interdisciplinary collaborations.
- [780] arXiv:2501.15128 (replaced) [pdf, html, other]
-
Title: MAP-based Problem-Agnostic diffusion model for Inverse ProblemsComments: 17 pages, 10 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Diffusion models have indeed shown great promise in solving inverse problems in image processing. In this paper, we propose a novel, problem-agnostic diffusion model called the maximum a posteriori (MAP)-based guided term estimation method for inverse problems. To leverage unconditionally pretrained diffusion models to address conditional generation tasks, we divide the conditional score function into two terms according to Bayes' rule: an unconditional score function (approximated by a pretrained score network) and a guided term, which is estimated using a novel MAP-based method that incorporates a Gaussian-type prior of natural images. This innovation allows us to better capture the intrinsic properties of the data, leading to improved performance. Numerical results demonstrate that our method preserves contents more effectively compared to state-of-the-art methods--for example, maintaining the structure of glasses in super-resolution tasks and producing more coherent results in the neighborhood of masked regions during inpainting.
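The Bayes' rule split mentioned above can be written out explicitly. Here $x$ denotes the image and $y$ the measurement; these symbols are notational assumptions rather than the paper's own, and the dependence on the diffusion noise level is suppressed. Because $\log p(y)$ does not depend on $x$, its gradient vanishes and the conditional score separates into two terms:

```latex
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x)
```

The first term is the unconditional score that a pretrained network approximates, and the second is the guided term that the abstract says is estimated with the MAP-based method.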
- [781] arXiv:2502.00471 (replaced) [pdf, html, other]
-
Title: Evolution of Society Caused by Collective and Individual DecisionsComments: 15 pages, 9 figures, a conference submissionSubjects: Physics and Society (physics.soc-ph); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
Decision-making societies may vary in their level of cooperation and degree of conservatism, both of which influence their overall performance. Moreover, these factors are not fixed -- they can change based on the decisions agents in the society make in their interests. But can these changes lead to cyclical patterns in societal evolution? To explore this question, we use the ViSE (Voting in Stochastic Environment) model. In this framework, the level of cooperation can be measured by group size, while the degree of conservatism is determined by the voting threshold. Agents can adopt either individualistic or group-oriented strategies when voting on stochastically generated external proposals. For Gaussian proposal generators, the expected capital gain (ECG) -- a measure of agents' performance -- can be expressed in standard mathematical functions. Our findings show that in neutral environments, societal evolution with open or democratic groups can follow cyclic patterns. We also find that highly conservative societies or conservative societies with low levels of cooperation can evolve into liberal (less conservative than majoritarian) societies and that mafia groups never let their members go when they want to.
- [782] arXiv:2502.18924 (replaced) [pdf, html, other]
-
Title: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech SynthesisZiyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou ZhaoSubjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at this https URL.
- [783] arXiv:2503.02892 (replaced) [pdf, html, other]
-
Title: Segmenting Bi-Atrial Structures Using ResNext Based FrameworkSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Atrial fibrillation (AF) is the most common cardiac arrhythmia, significantly contributing to mortality, particularly in older populations. While pulmonary vein isolation is a standard treatment, its effectiveness is limited in patients with persistent AF. Recent research highlights the importance of targeting additional atrial regions, particularly fibrotic areas identified via late gadolinium-enhanced MRI (LGE-MRI). However, existing manual segmentation methods are time-consuming and prone to variability. Deep learning techniques, particularly convolutional neural networks (CNNs), have shown promise in automating segmentation. However, most studies focus solely on the left atrium (LA) and rely on small datasets, limiting generalizability. In this paper, we propose a novel two-stage framework incorporating ResNeXt encoders and a cyclic learning rate to segment both the right atrium (RA) and LA walls and cavities in LGE-MRIs. Our method aims to improve the segmentation of challenging small structures, such as atrial walls, while maintaining high performance in larger regions like the atrial cavities. The results demonstrate that our approach offers superior segmentation accuracy and robustness compared to traditional architectures, particularly for imbalanced class structures.
- [784] arXiv:2503.10158 (replaced) [pdf, html, other]
-
Title: Solving Modular Linear Systems with a Constraint by parallel decomposition of the Smith form and extended Euclidean division modulo powers of primes divisorsComments: 17 pagesSubjects: Number Theory (math.NT); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
Integral linear systems $Ax=b$, in which the matrices $A$, $b$ and the solutions $x$ are all required to be integral, can be solved using the invariant factors of $A$ (by computing the Smith canonical form of $A$). This paper explores a new problem which arises in applications, that of obtaining conditions for solving the modular linear system $Ax = b \bmod n$, given $A, b$ over $\mathbb{Z}_n$, for $x$ over $\mathbb{Z}_n$, along with the constraint that the value of the linear function $\phi(x)=\langle w,x\rangle$ is coprime to $n$ for some solution $x$. In this paper we develop a decomposition of the system into coprime moduli $p^{r(p)}$ which are divisors of $n$ and show how such a decomposition simplifies the computation of the Smith form. This extends the well-known index calculus method of computing the discrete logarithm, where the moduli over which the linear system is reduced were assumed to be prime (to solve the reduced systems over prime fields), to the case when the factors of the modulus are prime powers $p^{r(p)}$. It is shown how this problem can be addressed efficiently using the invariant factors and Smith form of the augmented matrix $[A,-p^{r(p)}I]$ and conditions modulo $p$ satisfied by $w$, where $p^{r(p)}$ vary over all divisors of $n$ with $p$ prime.
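To make the constrained problem concrete, the following sketch enumerates every candidate $x$ for a small modulus and keeps those that satisfy both the congruence and the coprimality condition. It is a brute-force illustration only, not the Smith-form method of the paper, and the function name `constrained_solutions` is hypothetical.

```python
import itertools
import math


def constrained_solutions(A, b, w, n):
    """Return all x in (Z_n)^k with A x = b (mod n) and gcd(<w, x> mod n, n) == 1.

    A is a list of rows, b a list of right-hand sides, w the weight vector
    of the linear function phi(x) = <w, x>. Exhaustive search over n**k vectors.
    """
    k = len(w)
    found = []
    for x in itertools.product(range(n), repeat=k):
        rows_ok = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) % n == b_i % n
            for row, b_i in zip(A, b)
        )
        if not rows_ok:
            continue
        phi = sum(w_j * x_j for w_j, x_j in zip(w, x)) % n
        if math.gcd(phi, n) == 1:
            found.append(x)
    return found


if __name__ == "__main__":
    # x1 + x2 = 1 (mod 6) with phi(x) = x1 required to be coprime to 6.
    print(constrained_solutions([[1, 1]], [1], [1, 0], 6))
```

The condition $\gcd(\phi(x), n) = 1$ holds exactly when $\phi(x) \not\equiv 0 \pmod p$ for every prime $p$ dividing $n$, which is why the problem can be examined one prime power at a time and the pieces recombined with the Chinese remainder theorem.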
- [785] arXiv:2503.13379 (replaced) [pdf, html, other]
-
Title: Error bounds for composite quantum hypothesis testing and a new characterization of the weighted Kubo-Ando geometric meansComments: 32 pages. v2: Minor typos correctedSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Mathematical Physics (math-ph); Functional Analysis (math.FA)
The optimal error exponents of binary composite i.i.d. state discrimination are trivially bounded by the worst-case pairwise exponents of discriminating individual elements of the sets representing the two hypotheses, and in the finite-dimensional classical case, these bounds in fact give exact single-copy expressions for the error exponents. In contrast, in the non-commutative case, the optimal exponents are only known to be expressible in terms of regularized divergences, resulting in formulas that, while conceptually relevant, are practically not very useful. In this paper, we develop further an approach initiated in [Mosonyi, Szilágyi, Weiner, IEEE Trans. Inf. Th. 68(2):1032--1067, 2022] to give improved single-copy bounds on the error exponents by comparing not only individual states from the two hypotheses, but also various unnormalized positive semi-definite operators associated to them. Here, we show a number of equivalent characterizations of such operators giving valid bounds, and show that in the commutative case, considering weighted geometric means of the states, and in the case of two states per hypothesis, considering weighted Kubo-Ando geometric means, are optimal for this approach. As a result, we give a new characterization of the weighted Kubo-Ando geometric means as the only $2$-variable operator geometric means that are block additive, tensor multiplicative, and satisfy the arithmetic-geometric mean inequality. We also extend our results to composite quantum channel discrimination, and show an analogous optimality property of the weighted Kubo-Ando geometric means of two quantum channels, a notion that seems to be new. We extend this concept to defining the notion of superoperator perspective function and establish some of its basic properties, which may be of independent interest.
- [786] arXiv:2503.13388 (replaced) [pdf, html, other]
-
Title: A mathematical model for a universal digital quantum computer with an application to the Grover-Rudolph algorithmSubjects: Quantum Physics (quant-ph); Numerical Analysis (math.NA)
In this work, we develop a novel mathematical framework for universal digital quantum computation using algebraic probability theory. We rigorously define quantum circuits as finite sequences of elementary quantum gates and establish their role in implementing unitary transformations. A key result demonstrates that every unitary matrix in \(\mathrm{U}(N)\) can be expressed as a product of elementary quantum gates, leading to the concept of a universal dictionary for quantum computation. We apply this framework to the construction of quantum circuits that encode probability distributions, focusing on the Grover-Rudolph algorithm. By leveraging controlled quantum gates and rotation matrices, we design a quantum circuit that approximates a given probability density function. Numerical simulations, conducted using Qiskit, confirm the theoretical predictions and validate the effectiveness of our approach. These results provide a rigorous foundation for quantum circuit synthesis within an algebraic probability framework and offer new insights into the encoding of probability distributions in quantum algorithms. Potential applications include quantum machine learning, circuit optimization, and experimental implementations on real quantum hardware.
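As a rough illustration of the Grover-Rudolph idea referred to above, the sketch below computes the rotation angle at each node of the binary splitting tree for a discretized distribution, then rebuilds the bin probabilities classically to check them. It assumes the usual convention $R_y(\theta)|0\rangle = \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$, does not call Qiskit or any simulator, and is not the circuit construction of the paper; the function names are hypothetical.

```python
import math


def grover_rudolph_angles(probs):
    """Rotation angles, level by level, for a distribution over 2**m bins.

    At each tree node the angle theta satisfies
    cos(theta / 2)**2 = P(left half of node) / P(node).
    """
    m = len(probs).bit_length() - 1
    if len(probs) != 2 ** m:
        raise ValueError("number of bins must be a power of two")
    levels = []
    for level in range(m):
        width = len(probs) >> level  # number of bins covered by each node here
        angles = []
        for start in range(0, len(probs), width):
            total = sum(probs[start:start + width])
            left = sum(probs[start:start + width // 2])
            # Convention for empty nodes (an assumption): put all mass on the left.
            frac = min(1.0, left / total) if total > 0 else 1.0
            angles.append(2.0 * math.acos(math.sqrt(frac)))
        levels.append(angles)
    return levels


def probabilities_from_angles(levels):
    """Rebuild the bin probabilities implied by the angles (classical check)."""
    probs = [1.0]
    for angles in levels:
        split = []
        for p, theta in zip(probs, angles):
            split.append(p * math.cos(theta / 2.0) ** 2)  # branch for qubit value 0
            split.append(p * math.sin(theta / 2.0) ** 2)  # branch for qubit value 1
        probs = split
    return probs


if __name__ == "__main__":
    target = [0.1, 0.2, 0.3, 0.4]
    angles = grover_rudolph_angles(target)
    print(angles)
    print(probabilities_from_angles(angles))
```

The angle at level $j$ would be applied as a rotation on qubit $j$ controlled on the previous $j$ qubits, so a distribution over $2^m$ bins needs $2^m - 1$ angles in total.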
- [787] arXiv:2503.16678 (replaced) [pdf, html, other]
-
Title: QCPINN: Quantum Classical Physics-Informed Neural Networks for Solving PDEsSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Physics-informed neural networks (PINNs) have emerged as promising methods for solving partial differential equations (PDEs) by embedding physical laws into neural architectures. However, these classical approaches often require a large number of parameters for solving complex problems or achieving reasonable accuracy. We investigate whether quantum-enhanced architectures can achieve comparable performance while significantly reducing model complexity. We propose a quantum-classical physics-informed neural network (QCPINN) combining quantum and classical components to solve PDEs with fewer parameters while maintaining comparable accuracy and training convergence. Our approach systematically evaluates implementations of two quantum circuit paradigms (continuous-variable (CV) and discrete-variable (DV)) with four circuit topologies (alternate, cascade, cross-mesh, and layered) and two embedding schemes (amplitude and angle) on five benchmark PDEs (Helmholtz, lid-driven cavity, wave, Klein-Gordon, and convection-diffusion equations). Results demonstrate that QCPINNs achieve comparable accuracy to classical PINNs while requiring approximately 10% of the trainable parameters across different PDEs, and resulting in a further 40% reduction in relative L2 error for the convection-diffusion equation. DV-based circuits with angle embedding and cascade configurations consistently exhibited enhanced convergence stability across all problem types. Our findings establish parameter efficiency as a quantifiable quantum advantage in physics-informed machine learning. By significantly reducing model complexity while maintaining solution quality, QCPINNs represent a potential direction for overcoming computational bottlenecks in scientific computing applications where traditional approaches require large parameter spaces.
- [788] arXiv:2503.19923 (replaced) [pdf, html, other]
-
Title: Mapping fMRI Signal and Image Stimuli in an Artificial Neural Network Latent Space: Bringing Artificial and Natural Minds TogetherComments: 4 pages, 3 figuresSubjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
The goal of this study is to investigate whether latent space representations of visual stimuli and fMRI data share common information. Decoding and reconstructing stimuli from fMRI data remains a challenge in AI and neuroscience, with significant implications for understanding neural representations and improving the interpretability of Artificial Neural Networks (ANNs). In this preliminary study, we investigate the feasibility of such reconstruction by examining the similarity between the latent spaces of one autoencoder (AE) and one vision transformer (ViT) trained on fMRI and image data, respectively. Using representational similarity analysis (RSA), we found that the latent spaces of the two domains appear different. However, these initial findings are inconclusive, and further research is needed to explore this relationship more thoroughly.
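Representational similarity analysis compares two latent spaces through their pairwise-distance structure rather than coordinate by coordinate, which is what lets it relate spaces of different dimension. The sketch below is a minimal version under stated assumptions: Euclidean distance for the dissimilarity matrices and Pearson correlation between their upper triangles (rank correlations are also common). These choices and the function names are illustrative, not taken from the study.

```python
import itertools
import math


def rdm_upper(latents):
    """Upper triangle of the representational dissimilarity matrix.

    latents: one vector per stimulus, all of the same dimension.
    Dissimilarity is Euclidean distance (an assumed choice).
    """
    return [math.dist(u, v) for u, v in itertools.combinations(latents, 2)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rsa_score(latents_a, latents_b):
    """Correlate the dissimilarity structure of two spaces over the same stimuli."""
    return pearson(rdm_upper(latents_a), rdm_upper(latents_b))


if __name__ == "__main__":
    # Toy data: space B is space A rotated, scaled by 2, and embedded in 3D.
    space_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
    space_b = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (-4.0, 0.0, 0.0), (-6.0, 6.0, 0.0)]
    print(rsa_score(space_a, space_b))
```

A score near 1 means the two spaces arrange the stimuli in the same relative geometry; a low score, as reported in the abstract for the AE and ViT latents, means the geometries differ under the chosen distance.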
- [789] arXiv:2503.20711 (replaced) [pdf, html, other]
-
Title: Demand Estimation with Text and Image DataSubjects: General Economics (econ.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We propose a demand estimation method that leverages unstructured text and image data to infer substitution patterns. Using pre-trained deep learning models, we extract embeddings from product images and textual descriptions and incorporate them into a random coefficients logit model. This approach enables researchers to estimate demand even when they lack data on product attributes or when consumers value hard-to-quantify attributes, such as visual design or functional benefits. Using data from a choice experiment, we show that our approach outperforms standard attribute-based models in counterfactual predictions of consumers' second choices. We also apply it across 40 product categories on Amazon and consistently find that text and image data help identify close substitutes within each category.