Statistics
Showing new listings for Friday, 25 April 2025
- [1] arXiv:2504.17013 [pdf, html, other]
Title: A Weighted-likelihood framework for class imbalance in Bayesian prediction models
Subjects: Applications (stat.AP); Machine Learning (stat.ML)
Class imbalance occurs when data used for training classification models has a different number of observations or samples within each category or class. Models built on such data can be biased towards the majority class and have poor predictive performance and generalisation for the minority class. We propose a Bayesian weighted-likelihood (power-likelihood) approach to deal with class imbalance: each observation's likelihood is raised to a weight inversely proportional to its class proportion, with weights normalized to sum to the number of samples. This embeds cost-sensitive learning directly into Bayesian updating and is applicable to binary, multinomial and ordered logistic prediction models. Example models are implemented in Stan, PyMC, and this http URL, and all code and reproducible scripts are archived on GitHub: this https URL. This approach is simple to implement and extends naturally to arbitrary error-cost matrices.
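As a rough illustration of the weighting scheme this abstract describes (not the authors' code), the sketch below computes per-observation weights inversely proportional to class proportion, rescales them to sum to the number of samples, and evaluates a weighted Bernoulli log-likelihood. The function names and the toy labels are assumptions made for illustration only.

```python
import math
from collections import Counter


def class_weights(labels):
    """Weights inversely proportional to class proportion, rescaled to sum to n."""
    n = len(labels)
    counts = Counter(labels)
    raw = [n / counts[y] for y in labels]  # 1 / (class proportion)
    scale = n / sum(raw)                   # normalise so the weights sum to n
    return [w * scale for w in raw]


def weighted_bernoulli_loglik(labels, probs, weights):
    """Sum of w_i * log p(y_i), i.e. each likelihood term raised to the power w_i."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        total += w * math.log(p if y == 1 else 1.0 - p)
    return total


# Hypothetical toy data: 8 majority-class and 2 minority-class observations.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = class_weights(labels)
loglik = weighted_bernoulli_loglik(labels, [0.2] * 10, weights)
```

With these toy labels each class receives the same total weight, so the minority class contributes as much to the weighted log-likelihood as the majority class.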
- [2] arXiv:2504.17043 [pdf, html, other]
Title: A Sensitivity Analysis Framework for Quantifying Confidence in Decisions in the Presence of Data Uncertainty
Comments: 17 pages, 3 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Nearly all statistical analyses that inform policy-making are based on imperfect data. As examples, the data may suffer from measurement errors, missing values, sample selection bias, or record linkage errors. Analysts have to decide how to handle such data imperfections, e.g., analyze only the complete cases or impute values for the missing items via some posited model. Their choices can influence estimates and hence, ultimately, policy decisions. Thus, it is prudent for analysts to evaluate the sensitivity of estimates and policy decisions to the assumptions underlying their choices. To facilitate this goal, we propose that analysts define metrics and visualizations that target the sensitivity of the ultimate decision to the assumptions underlying their approach to handling the data imperfections. Using these visualizations, the analyst can assess their confidence in the policy decision under their chosen analysis. We illustrate metrics and corresponding visualizations with two examples, namely considering possible measurement error in the inputs of predictive models of presidential vote share and imputing missing values when evaluating the percentage of children exposed to high levels of lead.
- [3] arXiv:2504.17089 [pdf, html, other]
Title: Conditional-Marginal Nonparametric Estimation for Stage Waiting Times from Multi-Stage Models under Dependent Right Censoring
Comments: 54 pages, 22 figures, 8 tables
Subjects: Methodology (stat.ME)
We investigate two population-level quantities (corresponding to complete data) related to uncensored stage waiting times in a progressive multi-stage model, conditional on a prior stage visit. We show how to estimate these quantities consistently using right-censored data. The first quantity is the stage waiting time distribution (survival function), representing the proportion of individuals who remain in stage j within time t after entering stage j. The second quantity is the cumulative incidence function, representing the proportion of individuals who transition from stage j to stage j' within time t after entering stage j. To estimate these quantities, we present two nonparametric approaches. The first uses an inverse probability of censoring weighting (IPCW) method, which reweights the counting processes and the number of individuals at risk (the at-risk set) to address dependent right censoring. The second method utilizes the notion of fractional observations (FRE) that modifies the at-risk set by incorporating probabilities of individuals (who might have been censored in a prior stage) eventually entering the stage of interest in the uncensored or full data experiment. Neither approach is limited to the assumption of independent censoring or Markovian multi-stage frameworks. Simulation studies demonstrate satisfactory performance for both sets of estimators, though the IPCW estimator generally outperforms the FRE estimator in the setups considered in our simulations. These estimations are further illustrated through applications to two real-world datasets: one from patients undergoing bone marrow transplants and the other from patients diagnosed with breast cancer.
- [4] arXiv:2504.17101 [pdf, html, other]
Title: MOOSE ProbML: Parallelized Probabilistic Machine Learning and Uncertainty Quantification for Computational Energy Applications
Authors: Somayajulu L. N. Dhulipala, Peter German, Yifeng Che, Zachary M. Prince, Pierre-Clement A. Simon, Xianjian Xie, Vincent M. Laboure, Hao Yan
Subjects: Applications (stat.AP)
This paper presents the development and demonstration of massively parallel probabilistic machine learning (ML) and uncertainty quantification (UQ) capabilities within the Multiphysics Object-Oriented Simulation Environment (MOOSE), an open-source computational platform for parallel finite element and finite volume analyses. In addressing the computational expense and uncertainties inherent in complex multiphysics simulations, this paper integrates Gaussian process (GP) variants, active learning, Bayesian inverse UQ, adaptive forward UQ, Bayesian optimization, evolutionary optimization, and Markov chain Monte Carlo (MCMC) within MOOSE. It also elaborates on the interaction among key MOOSE systems -- Sampler, MultiApp, Reporter, and Surrogate -- in enabling these capabilities. The modularity offered by these systems enables development of a multitude of probabilistic ML and UQ algorithms in MOOSE. Example code demonstrations include parallel active learning and parallel Bayesian inference via active learning. The impact of these developments is illustrated through five applications relevant to computational energy applications: UQ of nuclear fuel fission product release, using parallel active learning Bayesian inference; very rare events analysis in nuclear microreactors using active learning; advanced manufacturing process modeling using multi-output GPs (MOGPs) and dimensionality reduction; fluid flow using deep GPs (DGPs); and tritium transport model parameter optimization for fusion energy, using batch Bayesian optimization.
- [5] arXiv:2504.17104 [pdf, html, other]
Title: Target trial emulation without matching: a more efficient approach for evaluating vaccine effectiveness using observational data
Comments: 24 pages, 5 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Real-world vaccine effectiveness has increasingly been studied using matching-based approaches, particularly in observational cohort studies following the target trial emulation framework. Although matching is appealing in its simplicity, it suffers from important limitations in terms of clarity of the target estimand and the efficiency or precision with which it is estimated. Scientifically justified causal estimands of vaccine effectiveness may be difficult to define owing to the fact that vaccine uptake varies over calendar time when infection dynamics may also be rapidly changing. We propose a causal estimand of vaccine effectiveness that summarizes vaccine effectiveness over calendar time, similar to how vaccine efficacy is summarized in a randomized controlled trial. We describe the identification of our estimand, including its underlying assumptions, and propose simple-to-implement estimators based on two hazard regression models. We apply our proposed estimator in simulations and in a study to assess the effectiveness of the Pfizer-BioNTech COVID-19 vaccine to prevent infections with SARS-CoV-2 in children 5-11 years old. In both settings, we find that our proposed estimator yields similar scientific inferences while providing significant efficiency gains over commonly used matching-based estimators.
- [6] arXiv:2504.17112 [pdf, html, other]
Title: Physics-informed features in supervised machine learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Supervised machine learning involves approximating an unknown functional relationship from a limited dataset of features and corresponding labels. The classical approach to feature-based machine learning typically relies on applying linear regression to standardized features, without considering their physical meaning. This may limit model explainability, particularly in scientific applications. This study proposes a physics-informed approach to feature-based machine learning that constructs non-linear feature maps informed by physical laws and dimensional analysis. These maps enhance model interpretability and, when physical laws are unknown, allow for the identification of relevant mechanisms through feature ranking. The method aims to improve both predictive performance in regression tasks and classification skill scores by integrating domain knowledge into the learning process, while also enabling the potential discovery of new physical equations within the context of explainable machine learning.
- [7] arXiv:2504.17126 [pdf, html, other]
Title: Estimation and Inference for the Average Treatment Effect in a Score-Explained Heterogeneous Treatment Effect Model
Comments: 44 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In many practical situations, randomly assigning treatments to subjects is uncommon due to feasibility constraints. For example, economic aid programs and merit-based scholarships are often restricted to those meeting specific income or exam score thresholds. In these scenarios, traditional approaches to estimating treatment effects typically focus solely on observations near the cutoff point, thereby excluding a significant portion of the sample and potentially leading to information loss. Moreover, these methods generally achieve a non-parametric convergence rate. While some approaches, e.g., Mukherjee et al. (2021), attempt to tackle these issues, they commonly assume that treatment effects are constant across individuals, an assumption that is often unrealistic in practice. In this study, we propose a differencing and matching-based estimator of the average treatment effect on the treated (ATT) in the presence of heterogeneous treatment effects, utilizing all available observations. We establish the asymptotic normality of our estimator and illustrate its effectiveness through various synthetic and real data analyses. Additionally, we demonstrate that our method yields non-parametric estimates of the conditional average treatment effect (CATE) and individual treatment effect (ITE) as a byproduct.
- [8] arXiv:2504.17147 [pdf, html, other]
Title: A Delayed Acceptance Auxiliary Variable MCMC for Spatial Models with Intractable Likelihood Function
Subjects: Methodology (stat.ME); Computation (stat.CO)
A large class of spatial models contains intractable normalizing functions, such as spatial lattice models, interaction spatial point processes, and social network models. Bayesian inference for such models is challenging since the resulting posterior distribution is doubly intractable. Although auxiliary variable MCMC (AVM) algorithms are known to be the most practical, they are computationally expensive due to the repeated auxiliary variable simulations. To address this, we propose delayed-acceptance AVM (DA-AVM) methods, which can reduce the number of auxiliary variable simulations. The first stage of the kernel uses a cheap surrogate to decide whether to accept or reject the proposed parameter value. The second stage guarantees detailed balance with respect to the posterior. The auxiliary variable simulation is performed only on the parameters accepted in the first stage. We construct various surrogates specifically tailored for doubly intractable problems, including a subsampling strategy, Gaussian process emulation, and a frequentist estimator-based approximation. We validate our method through simulated and real data applications, demonstrating its practicality for complex spatial models.
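The two-stage kernel described above can be sketched with a generic delayed-acceptance Metropolis-Hastings sampler. This is not the DA-AVM algorithm of the paper: it assumes a tractable target as a stand-in for the expensive auxiliary-variable step, a symmetric random-walk proposal, and hypothetical function names and toy densities chosen only for illustration.

```python
import math
import random


def delayed_acceptance_mh(log_target, log_surrogate, theta0, n_iter, step, rng):
    """Two-stage Metropolis-Hastings with a symmetric Gaussian random-walk proposal."""
    theta = theta0
    lt, ls = log_target(theta), log_surrogate(theta)
    samples, expensive_evals = [], 0
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        ls_prop = log_surrogate(prop)
        # Stage 1: screen the proposal using only the cheap surrogate.
        if math.log(1.0 - rng.random()) < ls_prop - ls:
            # Stage 2: correct with the expensive target so that detailed
            # balance holds with respect to the target, not the surrogate.
            lt_prop = log_target(prop)
            expensive_evals += 1
            log_alpha2 = (lt_prop - lt) - (ls_prop - ls)
            if math.log(1.0 - rng.random()) < log_alpha2:
                theta, lt, ls = prop, lt_prop, ls_prop
        samples.append(theta)
    return samples, expensive_evals


# Toy stand-ins (assumptions): standard normal target, wider normal surrogate.
rng = random.Random(0)
samples, expensive_evals = delayed_acceptance_mh(
    log_target=lambda x: -0.5 * x * x,
    log_surrogate=lambda x: -0.5 * (x / 1.3) ** 2,
    theta0=0.0, n_iter=20000, step=1.0, rng=rng,
)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The expensive evaluation runs only for proposals that survive the first stage, which is where the computational saving comes from; the second-stage ratio divides out the surrogate so the chain still targets the exact distribution.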
- [9] arXiv:2504.17166 [pdf, html, other]
Title: Causal rule ensemble approach for multi-arm data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, which can provide statistical evidence for precision medicine. While most existing methods focus on binary treatment situations, real-world applications often involve multiple interventions. However, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limit their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensemble, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, the performance of our method was evaluated against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieved lower bias and higher estimation accuracy compared with those of existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.
- [10] arXiv:2504.17195 [pdf, html, other]
Title: A general approach to modeling environmental mixtures with multivariate outcomes
Subjects: Methodology (stat.ME)
An important goal of environmental health research is to assess the health risks posed by mixtures of multiple environmental exposures. In these mixtures analyses, flexible models like Bayesian kernel machine regression and multiple index models are appealing because they allow for arbitrary non-linear exposure-outcome relationships. However, this flexibility comes at the cost of low power, particularly when exposures are highly correlated and the health effects are weak, as is typical in environmental health studies. We propose an adaptive index modelling strategy that borrows strength across exposures and outcomes by exploiting similar mixture component weights and exposure-response relationships. In the special case of distributed lag models, in which exposures are measured repeatedly over time, we jointly encourage co-clustering of lag profiles and exposure-response curves to more efficiently identify critical windows of vulnerability and characterize important exposure effects. We then extend the proposed approach to the multivariate index model setting where the true index structure -- the number of indices and their composition -- is unknown, and introduce variable importance measures to quantify component contributions to mixture effects. Using time series data from the National Morbidity, Mortality and Air Pollution Study, we demonstrate the proposed methods by jointly modelling three mortality outcomes and two cumulative air pollution measurements with a maximum lag of 14 days.
- [11] arXiv:2504.17202 [pdf, html, other]
Title: Graph Quasirandomness for Hypothesis Testing of Stochastic Block Models
Subjects: Statistics Theory (math.ST); Combinatorics (math.CO); Probability (math.PR)
The celebrated theorem of Chung, Graham, and Wilson on quasirandom graphs implies that if the 4-cycle and edge counts in a graph $G$ are both close to their typical number in $\mathbb{G}(n,1/2),$ then this also holds for the counts of subgraphs isomorphic to $H$ for any $H$ of constant size. We aim to prove a similar statement where the notion of close is whether the given (signed) subgraph count can be used as a test between $\mathbb{G}(n,1/2)$ and a stochastic block model $\mathbb{SBM}.$
Quantitatively, this is related to approximately maximizing $H \longrightarrow |\Phi(H)|^{\frac{1}{|\mathsf{V}(H)|}},$ where $\Phi(H)$ is the Fourier coefficient of $\mathbb{SBM}$, indexed by subgraph $H.$ This formulation turns out to be equivalent to approximately maximizing the partition function of a spin model over alphabet equal to the community labels in $\mathbb{SBM}.$
We resolve the approximate maximization when $\mathbb{SBM}$ satisfies one of four conditions: 1) the probability of an edge between any two vertices in different communities is exactly $1/2$; 2) the probability of an edge between two vertices from any two communities is at least $1/2$ (this case is also covered in a recent work of Yu, Zadik, and Zhang); 3) the probability of belonging to any given community is at least $c$ for some universal constant $c>0$; 4) $\mathbb{SBM}$ has two communities. In each of these cases, we show that there is an approximate maximizer of $|\Phi(H)|^{\frac{1}{|\mathsf{V}(H)|}}$ in the set $\mathsf{A} = \{\text{stars, 4-cycle}\}.$ This implies that if there exists a constant-degree polynomial test distinguishing $\mathbb{G}(n,1/2)$ and $\mathbb{SBM},$ then the two distributions can also be distinguished via the signed count of some graph in $\mathsf{A}.$ We conjecture that the same holds true for distinguishing $\mathbb{G}(n,1/2)$ and any graphon if we also add triangles to $\mathsf{A}.$
- [12] arXiv:2504.17205 [pdf, html, other]
Title: A New Look at the Odds Ratio in Logistic Regression
Comments: 23 pages
Subjects: Methodology (stat.ME)
The standard odds ratio of logistic regression is foundational but limited to individual explanatory variables. This work derives a multivariable odds ratio that applies to all the explanatory variables in all their combinations.
- [13] arXiv:2504.17322 [pdf, other]
Title: Testing Conditional Independence via Density Ratio Regression
Subjects: Methodology (stat.ME)
This paper develops a conditional independence (CI) test from a conditional density ratio (CDR) for weakly dependent data. The main contribution is presenting a closed-form expression for the estimated conditional density ratio function with good finite-sample performance. The key idea is exploiting the linear sieve combined with the quadratic norm. Matsushita et al. (2022) exploited the linear sieve to estimate the unconditional density ratio. We must exploit the linear sieve twice to estimate the conditional density ratio. First, we estimate an unconditional density ratio with an unweighted sieve least-squares regression, as done in Matsushita et al. (2022), and then the conditional density ratio with a weighted sieve least-squares regression, where the weights are the estimated unconditional density ratio. The proposed test has several advantages over existing alternatives. First, the test statistic is invariant to monotone transformations of the data distribution and has a closed-form expression that enhances computational speed and efficiency. Second, the conditional density ratio satisfies the moment restrictions, and the estimated ratio satisfies the empirical analog of those moment restrictions. As a result, the estimated density ratio is unlikely to have extreme values. Third, the proposed test can detect all deviations from conditional independence at rates arbitrarily close to $n^{-1/2}$, and the local power loss is independent of the data dimension. A small-scale simulation study indicates that the proposed test outperforms the alternatives in various dependence structures.
- [14] arXiv:2504.17451 [pdf, html, other]
Title: Functional $K$ Sample Problem via Multivariate Optimal Measure Transport-Based Permutation Test
Subjects: Statistics Theory (math.ST)
The null hypothesis of equality of distributions of functional data coming from $K$ samples is considered. The proposed test statistic is multivariate and its components are based on pairwise Cramér-von Mises comparisons of empirical characteristic functionals. The significance of the test statistic is evaluated via a novel multivariate permutation test, where the final single $p$-value is computed using discrete optimal measure transport. The methodology is illustrated by real data on cumulative intraday returns of Bitcoin.
- [15] arXiv:2504.17479 [pdf, html, other]
Title: Probabilistic modeling of delays for train journeys with transfers
Authors: Nikolaus Stratil-Sauer (1), Nils Breyer (1) ((1) Linköping University)
Comments: 25 pages, submitted to Journal of Public Transportation
Subjects: Applications (stat.AP)
Reliability plays a key role in the experience of a rail traveler. The reliability of journeys involving transfers is affected by the reliability of the transfers and the consequences of missing a transfer, as well as the possible delay of the train used to reach the destination. In this paper, we propose a flexible method to model the reliability of train journeys with any number of transfers. The method combines a transfer reliability model based on gradient boosting, which predicts the reliability of transfers between trains, and a delay model based on probabilistic Bayesian regression, which models train arrival delays. The models are trained on delay data from four Swedish train stations and evaluated on delay data from another two stations to assess their generalization performance. We show that the probabilistic delay model, which assumes train delays follow a mixture distribution with two lognormal components, captures the distribution of actual train delays much more realistically than a standard lognormal model. Finally, we show how these models can be used together to sample the arrival delay at the final destination of the entire journey. The results indicate that the method accurately predicts the reliability for nine out of ten tested journeys. The method could be used to improve journey planners by providing reliability information to travelers. Further applications include timetable planning and transport modeling.
- [16] arXiv:2504.17546 [pdf, html, other]
Title: An introduction to R package `mvs`
Comments: 15 pages, 4 figures. Package vignette corresponding to this https URL
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
In biomedical science, a set of objects or persons can often be described by multiple distinct sets of features obtained from different data sources or modalities (called "multi-view data"). Classical machine learning methods ignore the multi-view structure of such data, limiting model interpretability and performance. The R package `mvs` provides methods that were designed specifically for dealing with multi-view data, based on the multi-view stacking (MVS) framework. MVS is a form of supervised (machine) learning used to train multi-view classification or prediction models. MVS works by training a learning algorithm on each view separately, estimating the predictive power of each view-specific model through cross-validation, and then using another learning algorithm to assign weights to the view-specific models based on their estimated predictions. MVS is a form of ensemble learning, dividing the large multi-view learning problem into smaller sub-problems. Most of these sub-problems can be solved in parallel, making it computationally attractive. Additionally, the number of features of the sub-problems is greatly reduced compared with the full multi-view learning problem. This makes MVS especially useful when the total number of features is larger than the number of observations (i.e., high-dimensional data). MVS can still be applied even if the sub-problems are themselves high-dimensional by adding suitable penalty terms to the learning algorithms. Furthermore, MVS can be used to automatically select the views which are most important for prediction. The R package `mvs` makes fitting MVS models, including such penalty terms, easily and openly accessible. `mvs` allows for the fitting of stacked models with any number of levels, with different penalty terms, different outcome distributions, and provides several options for missing data handling.
- [17] arXiv:2504.17559 [pdf, html, other]
Title: Concentration inequalities and cut-off phenomena for penalized model selection within a basic Rademacher framework
Subjects: Statistics Theory (math.ST)
This article exists first and foremost to contribute to a tribute to Patrick Cattiaux. One of the two authors has known Patrick Cattiaux for a very long time, and owes him a great deal. If we are to illustrate the adage that life is made up of chance, then what could be better than the meeting of two young people in the 80s, both of whom fell in love with the mathematics of randomness, and one of whom changed the other's life by letting him in on a secret: if you really believe in it, you can turn this passion into a profession. By another happy coincidence, this tribute comes at just the right time, as Michel Talagrand has been awarded the Abel prize. The temptation was therefore great to do a double. Following one of the many galleries opened up by mathematics, we shall first draw a link between the mathematics of Patrick Cattiaux and that of Michel Talagrand. Then we shall show how the abstract probabilistic material on the concentration of product measures thus revisited can be used to shed light on cut-off phenomena in our field of expertise, mathematical statistics. Nothing revolutionary here, as everyone knows the impact that Talagrand's work has had on the development of mathematical statistics since the late 90s, but we've chosen a very simple framework in which everything can be explained with minimal technicality, leaving the main ideas to the fore.
- [18] arXiv:2504.17611 [pdf, html, other]
Title: Some Results on Generalized Familywise Error Rate Controlling Procedures under Dependence
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
The topic of multiple hypotheses testing now has a potpourri of novel theories and ubiquitous applications in diverse scientific fields. However, the universal utility of this field often hinders the possibility of having a generalized theory that accommodates every scenario. This tradeoff is better reflected through the lens of dependence, a central piece behind the theoretical and applied developments of multiple testing. Although omnipresent in many scientific avenues, the nature and extent of dependence vary substantially with the context and complexity of the particular scenario. Positive dependence is the norm in testing many treatments versus a single control or in spatial statistics. In contrast, negative dependence arises naturally in tests based on split samples and in cyclical, ordered comparisons. In GWAS, the SNP markers are generally considered to be weakly dependent. Generalized familywise error rate (k-FWER) control has been one of the prominent frequentist approaches in simultaneous inference. However, the performance of k-FWER controlling procedures is as yet unexplored under different dependencies. This paper revisits the classical testing problem of normal means in different correlated frameworks. We establish upper bounds on the generalized familywise error rates under each dependence, consequently giving rise to improved testing procedures. Towards this, we present improved probability inequalities, which are of independent theoretical interest.
- [19] arXiv:2504.17622 [pdf, html, other]
Title: Likelihood-Free Variational Autoencoders
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Variational Autoencoders (VAEs) typically rely on a probabilistic decoder with a predefined likelihood, most commonly an isotropic Gaussian, to model the data conditional on latent variables. While convenient for optimization, this choice often leads to likelihood misspecification, resulting in blurry reconstructions and poor data fidelity, especially for high-dimensional data such as images. In this work, we propose EnVAE, a novel likelihood-free generative framework that has a deterministic decoder and employs the energy score -- a proper scoring rule -- to build the reconstruction loss. This enables likelihood-free inference without requiring explicit parametric density functions. To address the computational inefficiency of the energy score, we introduce a fast variant, FEnVAE, based on the local smoothness of the decoder and the sharpness of the posterior distribution of latent variables. This yields an efficient single-sample training objective that integrates seamlessly into existing VAE pipelines with minimal overhead. Empirical results on standard benchmarks demonstrate that EnVAE achieves superior reconstruction and generation quality compared to likelihood-based baselines. Our framework offers a general, scalable, and statistically principled alternative for flexible and nonparametric distribution learning in generative modeling.
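For readers unfamiliar with the energy score mentioned in this abstract, the sketch below computes its standard Monte Carlo estimate, ES(P, y) = E||X - y|| - 0.5 E||X - X'||, from a set of samples. It illustrates the scoring rule only; how the paper turns it into a reconstruction loss is not shown here, and the function name and toy points are assumptions.

```python
import math


def energy_score(samples, y):
    """Sample estimate of ES(P, y) = E||X - y|| - 0.5 * E||X - X'|| (lower is better)."""
    m = len(samples)
    # Average distance between the samples and the observation.
    term1 = sum(math.dist(x, y) for x in samples) / m
    # Average pairwise distance among the samples (rewards spread).
    term2 = sum(math.dist(a, b) for a in samples for b in samples) / (m * m)
    return term1 - 0.5 * term2


# Hypothetical toy example: two predictive samples on a line.
samples = [(0.0, 0.0), (2.0, 0.0)]
score_near = energy_score(samples, (1.0, 0.0))  # observation between the samples
score_far = energy_score(samples, (5.0, 0.0))   # observation far from the samples
```

The score needs only samples from the model, not a density, which is why it suits a deterministic decoder; the double sum over sample pairs is the quadratic cost that a fast variant would aim to avoid.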
- [20] arXiv:2504.17651 [pdf, html, other]
Title: Practical aspects of the virtual noise convex optimum design approach for correlated responses
Comments: 33 pages
Subjects: Methodology (stat.ME); Computation (stat.CO)
In this paper we present several practically-oriented extensions and considerations for the virtual noise method in optimal design under correlation. First we introduce a slightly modified virtual noise representation which further illuminates the parallels to the classical design approach for uncorrelated observations. We suggest more efficient algorithms to obtain the design measures. Furthermore, we show that various convex relaxation methods used for sensor selection are special cases of our approach and can be solved within our framework. Finally, we provide practical guidelines on how to generally approach a design problem with correlated observations and demonstrate how to utilize the virtual noise method in this context in a meaningful way.
- [21] arXiv:2504.17719 [pdf, html, other]
Title: Evaluating Uncertainty in Deep Gaussian Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at this https URL.
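As a small illustration of one of the calibration metrics named above, the sketch below computes the Expected Calibration Error with equal-width confidence bins. The binning scheme, the default of 10 bins, and the toy inputs are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy predictions: 90% confidence but only 3 of 4 correct.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
ece_perfect = expected_calibration_error([1.0, 1.0], [1, 1])
```

A model that is overconfident in this way has a gap of 0.15 between stated confidence and observed accuracy, while predictions whose confidence matches their accuracy score zero.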
- [22] arXiv:2504.17733 [pdf, html, other]
Title: Fuzzy clustering and community detection: an integrated approach
Comments: 38 pages, 12 figures
Subjects: Computation (stat.CO)
This paper addresses the ambitious goal of merging two different approaches to group detection in complex domains: one based on fuzzy clustering and the other on community detection theory. To achieve this, two clustering algorithms are proposed: Fuzzy C-Medoids Clustering with Modularity Spatial Correction and Fuzzy C-Modes Clustering with Modularity Spatial Correction. The former is designed for quantitative data, while the latter is intended for qualitative data. The concept of fuzzy modularity is introduced into the standard objective function of fuzzy clustering algorithms as a spatial regularization term, whose contribution to the clustering criterion based on attributes is controlled by an exogenous parameter. An extensive simulation study is conducted to support the theoretical framework, complemented by two applications to real-world data related to the theme of sustainability. The first application involves data from the 2030 Agenda for Sustainable Development, while the second focuses on urban green spaces in Italian provincial capitals and metropolitan cities. Both the simulation results and the applications demonstrate the advantages of this new methodological proposal.
New submissions (showing 22 of 22 entries)
- [23] arXiv:2504.08824 (cross-list from cs.LG) [pdf, html, other]
-
Title: ColonScopeX: Leveraging Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer
Comments: Published to AAAI-25 Bridge Program
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Applications (stat.AP)
Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95% for Stage I and a stark decline to 10% for Stage IV. Unfortunately, in the UK, only 14.4% of cases are diagnosed at the earliest stage (Stage I).
In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model's decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or a screening tool of the general population.
This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics.
- [24] arXiv:2504.17004 (cross-list from cs.LG) [pdf, html, other]
-
Title: (Im)possibility of Automated Hallucination Detection in Large Language Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations.
First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language.
Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections.
These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.
- [25] arXiv:2504.17008 (cross-list from cs.IT) [pdf, html, other]
-
Title: Relationship between Hölder Divergence and Functional Density Power Divergence: Intersection and Generalization
Comments: 20 pages, 1 figure
Subjects: Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this study, we discuss the relationship between two families of density-power-based divergences with functional degrees of freedom -- the Hölder divergence and the functional density power divergence (FDPD) -- based on their intersection and generalization. These divergence families include the density power divergence and the $\gamma$-divergence as special cases. First, we prove that the intersection of the Hölder divergence and the FDPD is limited to a general divergence family introduced by Jones et al. (Biometrika, 2001). Subsequently, motivated by the fact that Hölder's inequality is used in the proofs of nonnegativity for both the Hölder divergence and the FDPD, we define a generalized divergence family, referred to as the $\xi$-Hölder divergence. The nonnegativity of the $\xi$-Hölder divergence is established through a combination of the inequalities used to prove the nonnegativity of the Hölder divergence and the FDPD. Furthermore, we derive an inequality between the composite scoring rules corresponding to different FDPDs based on the $\xi$-Hölder divergence. Finally, we prove that imposing the mathematical structure of the Hölder score on a composite scoring rule results in the $\xi$-Hölder divergence.
- [26] arXiv:2504.17066 (cross-list from cs.LG) [pdf, html, other]
-
Title: Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE); Machine Learning (stat.ML)
Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
- [27] arXiv:2504.17079 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Novel Hybrid Approach Using an Attention-Based Transformer + GRU Model for Predicting Cryptocurrency Prices
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
In this article, we introduce a novel deep learning hybrid model that integrates attention-based Transformer and Gated Recurrent Unit (GRU) architectures to improve the accuracy of cryptocurrency price predictions. By combining the Transformer's strength in capturing long-range patterns with the GRU's ability to model short-term and sequential trends, the hybrid model provides a well-rounded approach to time series forecasting. We apply the model to predict the daily closing prices of Bitcoin and Ethereum based on historical data that include past prices, trading volumes, and the Fear and Greed index. We evaluate the performance of our proposed model by comparing it with four other machine learning models: two non-sequential feedforward models, the Radial Basis Function Network (RBFN) and the General Regression Neural Network (GRNN), and two bidirectional sequential memory-based models, the Bidirectional Long Short-Term Memory (BiLSTM) and the Bidirectional Gated Recurrent Unit (BiGRU). The performance of the model is assessed using several metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), along with statistical validation through the nonparametric Friedman test followed by a post hoc Wilcoxon signed-rank test. The results demonstrate that our hybrid model consistently achieves superior accuracy, highlighting its effectiveness for financial prediction tasks. These findings provide valuable insights for improving real-time decision making in cryptocurrency markets and support the growing use of hybrid deep learning models in financial analytics.
- [28] arXiv:2504.17160 (cross-list from cs.LG) [pdf, html, other]
-
Title: OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection
Comments: 10 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at this https URL.
- [29] arXiv:2504.17175 (cross-list from math.PR) [pdf, html, other]
-
Title: Asymptotics of Yule's nonsense correlation for Ornstein-Uhlenbeck paths: The correlated case
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We study the continuous-time version of the empirical correlation coefficient between the paths of two possibly correlated Ornstein-Uhlenbeck processes, known as Yule's nonsense correlation for these paths. Using sharp tools from the analysis on Wiener chaos, we establish the asymptotic normality of the fluctuations of this correlation coefficient around its long-time limit, which is the mathematical correlation coefficient between the two processes. This asymptotic normality is quantified in Kolmogorov distance, which allows us to establish speeds of convergence in the Type-II error for two simple tests of independence of the paths, based on the empirical correlation, and based on its numerator. An application to independence of two observations of solutions to the stochastic heat equation is given, with excellent asymptotic power properties using merely a small number of the solutions' Fourier modes.
- [30] arXiv:2504.17274 (cross-list from cs.LG) [pdf, other]
-
Title: Signal Recovery from Random Dot-Product Graphs Under Local Differential Privacy
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the problem of recovering latent information from graphs under $\varepsilon$-edge local differential privacy where the presence of relationships/edges between two users/vertices remains confidential, even from the data curator. For the class of generalized random dot-product graphs, we show that a standard local differential privacy mechanism induces a specific geometric distortion in the latent positions. Leveraging this insight, we show that consistent recovery of the latent positions is achievable by appropriately adjusting the statistical inference procedure for the privatized graph. Furthermore, we prove that our procedure is nearly minimax-optimal under local edge differential privacy constraints. Lastly, we show that this framework allows for consistent recovery of geometric and topological information underlying the latent positions, as encoded in their persistence diagrams. Our results extend previous work from the private community detection literature to a substantially richer class of models and inferential tasks.
- [31] arXiv:2504.17655 (cross-list from cs.LG) [pdf, html, other]
-
Title: Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction
Comments: 17 pages, 5 figures, and 2 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
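The conformal prediction step described in this abstract can be sketched as split conformal classification. The sketch below assumes the simple nonconformity score 1 minus the softmax probability of the true class, and all function names are hypothetical; the abstract does not state which scores the authors used.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data (split conformal).

    cal_probs:  (n, K) class probabilities from any classifier.
    cal_labels: (n,) integer true labels.
    Score assumed here: 1 - probability assigned to the true class.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    # With too few calibration points the only valid threshold is +inf,
    # which makes every prediction set contain all labels.
    return float("inf") if k > n else float(scores[k - 1])

def prediction_sets(test_probs, qhat):
    """Boolean (m, K) mask: label k is in the set if its score is <= qhat."""
    return (1.0 - np.asarray(test_probs, dtype=float)) <= qhat

def coverage_and_size(sets, labels):
    """The two metrics named in the abstract: empirical coverage, mean set size."""
    labels = np.asarray(labels)
    covered = sets[np.arange(len(labels)), labels]
    return covered.mean(), sets.sum(axis=1).mean()
```

Temperature scaling, if used, would be applied to the logits before computing `cal_probs` and `test_probs`; the conformal step itself is unchanged.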
Cross submissions (showing 9 of 9 entries)
- [32] arXiv:2102.09552 (replaced) [pdf, html, other]
-
Title: Linear Functions to the Extended Reals
Comments: 23 pages
Subjects: Statistics Theory (math.ST); Computer Science and Game Theory (cs.GT)
This paper investigates functions from $\mathbb{R}^d$ to $\mathbb{R} \cup \{\pm \infty\}$ that satisfy axioms of linearity wherever allowed by extended-value arithmetic. They have a nontrivial structure defined inductively on $d$, and unlike finite linear functions, they require $\Omega(d^2)$ parameters to uniquely identify. In particular they can capture vertical tangent planes to epigraphs: a function (never $-\infty$) is convex if and only if it has an extended-valued subgradient at every point in its effective domain, if and only if it is the supremum of a family of "affine extended" functions. These results are applied to the well-known characterization of proper scoring rules, for the finite-dimensional case: it is carefully and rigorously extended here to a more constructive form. In particular it is investigated when proper scoring rules can be constructed from a given convex function.
- [33] arXiv:2108.12827 (replaced) [pdf, html, other]
-
Title: Survival Analysis with Graph-Based Regularization for Predictors
Subjects: Statistics Theory (math.ST)
We study the variable selection problem in survival analysis to identify the most important factors affecting survival time. Our method incorporates prior knowledge of mutual correlations among variables, represented through a graph. We utilize the Cox proportional hazards model with a graph-based regularizer for variable selection. We present a computationally efficient algorithm developed to solve the graph regularized maximum likelihood problem by establishing connections with the group lasso, and provide theoretical guarantees about the recovery error and asymptotic distribution of the proposed estimators. The improved performance of the proposed approach compared with existing methods is demonstrated on both synthetic and real organ transplantation datasets.
- [34] arXiv:2302.09034 (replaced) [pdf, html, other]
-
Title: Bayesian Mixtures Models with Repulsive and Attractive Atoms
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)
The study of almost surely discrete random probability measures is an active line of research in Bayesian nonparametrics. The idea of assuming interaction across the atoms of the random probability measure has recently spurred significant interest in the context of Bayesian mixture models. This allows the definition of priors that encourage well-separated and interpretable clusters. In this work, we provide a unified framework for the construction and the Bayesian analysis of random probability measures with interacting atoms, encompassing both repulsive and attractive behaviours. Specifically, we derive closed-form expressions for the posterior distribution, the marginal and predictive distributions, which were not previously available except for the case of measures with i.i.d. atoms. We show how these quantities are fundamental both for prior elicitation and to develop new posterior simulation algorithms for hierarchical mixture models. Our results are obtained without any assumption on the finite point process that governs the atoms of the random measure. Their proofs rely on analytical tools borrowed from the Palm calculus theory, which might be of independent interest. We specialise our treatment to the classes of Poisson, Gibbs, and determinantal point processes, as well as in the case of shot-noise Cox processes. Finally, we illustrate the performance of different modelling strategies on simulated and real datasets.
- [35] arXiv:2303.16843 (replaced) [pdf, html, other]
-
Title: An Optimal Design Framework for Lasso Sign Recovery
Subjects: Statistics Theory (math.ST)
Supersaturated designs investigate more factors than there are runs, and are often constructed under a criterion measuring a design's proximity to an unattainable orthogonal design. The most popular analysis identifies active factors by inspecting the solution path of a penalized estimator, such as the lasso. Recent criteria encouraging positive correlations between factors have been shown to produce designs with more definitive solution paths so long as the active factors have positive effects. Two open problems affecting the understanding and practicality of supersaturated designs are: (1) do optimal designs under existing criteria maximize support recovery probability across an estimator's solution path, and (2) why do designs with positively correlated columns produce more definitive solution paths when the active factors have positive sign effects? To answer these questions, we develop criteria maximizing the lasso's sign recovery probability. We prove that an orthogonal design is an ideal structure when the signs of the active factors are unknown, and a design with constant small, positive correlations is ideal when the signs are assumed known. A computationally-efficient design search algorithm is proposed that first filters through optimal designs under new heuristic criteria to select the one that maximizes the lasso sign recovery probability.
- [36] arXiv:2310.16975 (replaced) [pdf, html, other]
-
Title: Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference
Comments: 26 pages, 7 tables, 8 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems. Both approaches enable conditional sampling and conditional density estimation, which are core tasks in Bayesian inference, particularly in the simulation-based (“likelihood-free”) setting. Our methods represent the target conditional distribution as a transformation of a tractable reference distribution. Obtaining such a transformation, chosen here to be an approximation of the COT map, is computationally challenging even in moderate dimensions. To improve scalability, our numerical algorithms use neural networks to parameterize candidate maps and further exploit the structure of the COT problem. Our static approach approximates the map as the gradient of a partially input-convex neural network. It uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. Our dynamic approach approximates the conditional optimal transport via the flow map of a regularized neural ODE; compared to the static approach, it is slower to train but offers more modeling choices and can lead to faster sampling. We demonstrate both algorithms numerically, comparing them with competing state-of-the-art approaches, using benchmark datasets and simulation-based Bayesian inverse problems.
- [37] arXiv:2311.00289 (replaced) [pdf, other]
-
Title: Precise Error Rates for Computationally Efficient Testing
Comments: v2 is the journal version; v3 contains an update on the status of the main conjecture
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We revisit the fundamental question of simple-versus-simple hypothesis testing with an eye towards computational complexity, as the statistically optimal likelihood ratio test is often computationally intractable in high-dimensional settings. In the classical spiked Wigner model with a general i.i.d. spike prior we show (conditional on a conjecture) that an existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates among all computationally efficient tests, even though there are exponential-time tests that do better. This result is conditional on an appropriate complexity-theoretic conjecture, namely a natural strengthening of the well-established low-degree conjecture. Our result shows that the spectrum is a sufficient statistic for computationally bounded tests (but not for all tests).
To our knowledge, our approach gives the first tool for reasoning about the precise asymptotic testing error achievable with efficient computation. The main ingredients required for our hardness result are a sharp bound on the norm of the low-degree likelihood ratio along with (counterintuitively) a positive result on achievability of testing. This strategy appears to be new even in the setting of unbounded computation, in which case it gives an alternate way to analyze the fundamental statistical limits of testing.
- [38] arXiv:2311.04084 (replaced) [pdf, html, other]
-
Title: Minimax Sequential Testing for Poisson Processes
Subjects: Statistics Theory (math.ST); Optimization and Control (math.OC)
Suppose we observe a Poisson process in real time for which the intensity may take on two possible values $\lambda_0$ and $\lambda_1$. Suppose further that the prior probability of the true intensity is not given. We solve a minimax version of the Bayesian problem of sequential testing of two simple hypotheses, minimizing a linear combination of the probability of wrong detection and the expected waiting time in the worst case over all possible prior distributions. An equivalent characterization of the least favorable distributions is derived and a sufficient condition for their existence is established.
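For orientation, the statistic underlying sequential tests of this kind is the log-likelihood ratio of the observed path: with $N_t$ events by time $t$ it equals $N_t \log(\lambda_1/\lambda_0) - (\lambda_1 - \lambda_0)t$. The sketch below runs a classical sequential probability ratio test with fixed thresholds on that statistic. It is not the minimax rule derived in the paper; the thresholds, the horizon, the assumption $\lambda_1 > \lambda_0$, and the function names are all choices made for this illustration.

```python
import math

def poisson_log_lr(n_events, t, lam0, lam1):
    """Log-likelihood ratio (lam1 vs lam0) after n_events in [0, t]."""
    return n_events * math.log(lam1 / lam0) - (lam1 - lam0) * t

def sprt_poisson(event_times, lam0, lam1, lower, upper, horizon):
    """Fixed-threshold SPRT, assuming lam1 > lam0 and lower < 0 < upper.

    event_times: sorted event times, all strictly less than horizon.
    Between events the statistic drifts down linearly, so the lower
    boundary can only be hit between events; the upper boundary can
    only be crossed by the upward jump at an event.
    Returns (decision, stopping_time).
    """
    assert lam1 > lam0 > 0 and lower < 0 < upper
    jump = math.log(lam1 / lam0)
    drift = lam1 - lam0
    n = 0
    for t_event in list(event_times) + [horizon]:
        # Time at which the path with n events reaches the lower boundary.
        t_hit = (n * jump - lower) / drift
        if t_hit <= t_event:
            return "accept H0", t_hit
        if t_event >= horizon:
            break  # sentinel reached: no decision within the horizon
        n += 1
        if poisson_log_lr(n, t_event, lam0, lam1) >= upper:
            return "accept H1", t_event
    return "undecided", horizon
```

In the minimax setting of the abstract the stopping boundaries would come from the least favorable prior rather than being fixed by hand as they are here.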
- [39] arXiv:2401.08175 (replaced) [pdf, html, other]
-
Title: Bayesian Function-on-Function Regression for Spatial Functional Data
Subjects: Methodology (stat.ME)
Spatial functional data arise in many settings, such as particulate matter curves observed at monitoring stations and age population curves at each areal unit. Most existing functional regression models have limited applicability because they do not consider spatial correlations. Although functional kriging methods can predict the curves at unobserved spatial locations, they are based on variogram fittings rather than constructing hierarchical statistical models. In this manuscript, we propose a Bayesian framework for spatial function-on-function regression that can carry out parameter estimations and predictions. However, the proposed model has computational and inferential challenges because the model needs to account for within and between-curve dependencies. Furthermore, high-dimensional and spatially correlated parameters can lead to the slow mixing of Markov chain Monte Carlo algorithms. To address these issues, we first utilize a basis transformation approach to simplify the covariance and apply projection methods for dimension reduction. We also develop a simultaneous band score for the proposed model to detect the significant region in the regression function. We apply our method to both areal and point-level spatial functional data, showing the proposed method is computationally efficient and provides accurate estimations and predictions.
- [40] arXiv:2403.10671 (replaced) [pdf, html, other]
-
Title: Variation Due to Regularization Tractably Recovers Bayesian Deep Learning
Comments: 16 pages, 9 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Uncertainty quantification in deep learning is crucial for safe and reliable decision-making in downstream tasks. Existing methods quantify uncertainty at the last layer or other approximations of the network which may miss some sources of uncertainty in the model. To address this gap, we propose an uncertainty quantification method for large networks based on variation due to regularization. Essentially, predictions that are more (less) sensitive to the regularization of network parameters are less (more, respectively) certain. This principle can be implemented by deterministically tweaking the training loss during the fine-tuning phase and reflects confidence in the output as a function of all layers of the network. We show that regularization variation (RegVar) provides rigorous uncertainty estimates that, in the infinitesimal limit, exactly recover the Laplace approximation in Bayesian deep learning. We demonstrate its success in several deep learning architectures, showing it can scale tractably with the network size while maintaining or improving uncertainty quantification quality. Our experiments across multiple datasets show that RegVar not only identifies uncertain predictions effectively but also provides insights into the stability of learned representations.
- [41] arXiv:2405.16780 (replaced) [pdf, html, other]
-
Title: Analysis of Broken Randomized Experiments by Principal Stratification
Subjects: Methodology (stat.ME)
Although randomized controlled trials have long been regarded as the “gold standard” for evaluating treatment effects, randomization offers no natural protection against post-treatment events. For example, non-compliance makes the actual treatment different from the assigned treatment, truncation-by-death renders the outcome undefined or ill-defined, and missingness prevents the outcomes from being measured. In this paper, we develop a statistical analysis framework using principal stratification to investigate the treatment effect in broken randomized experiments. The average treatment effect in compliers and always-survivors is adopted as the target causal estimand. We establish the asymptotic properties of the estimator. To relax the identification assumptions, we also propose an interventionist estimand defined in compliers by adjusting for baseline covariates. We apply the framework to study the effect of training on earnings in the Job Corps study and find that the training program improves employment and earnings in the long term.
- [42] arXiv:2409.01444 (replaced) [pdf, html, other]
-
Title: A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Prediction models need reliable predictive performance as they inform clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. Changes in the distribution of the data impact model performance and there may be important changes between a model's current application and when and where its performance was last evaluated. In health-care, a typical change is a shift in case-mix. For example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital.
This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not.
A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. The causal case-mix framework provides insights for developing, evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task.
- [43] arXiv:2409.18804 (replaced) [pdf, other]
-
Title: Convergence of Diffusion Models Under the Manifold Hypothesis in High-Dimensions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Denoising Diffusion Probabilistic Models (DDPM) are powerful state-of-the-art methods used to generate synthetic data from high-dimensional data distributions and are widely used for image, audio, and video generation as well as many more applications in science and beyond. The \textit{manifold hypothesis} states that high-dimensional data often lie on lower-dimensional manifolds within the ambient space, and is widely believed to hold in provided examples. While recent results have provided invaluable insight into how diffusion models adapt to the manifold hypothesis, they do not capture the great empirical success of these models, making this a very fruitful research direction.
In this work, we study DDPMs under the manifold hypothesis and prove that they achieve rates independent of the ambient dimension in terms of score learning. In terms of sampling complexity, we obtain rates independent of the ambient dimension w.r.t. the Kullback-Leibler divergence, and $O(\sqrt{D})$ w.r.t. the Wasserstein distance. We do this by developing a new framework connecting diffusion models to the well-studied theory of extrema of Gaussian Processes.
- [44] arXiv:2409.19729 (replaced) [pdf, html, other]
-
Title: Prior Sensitivity Analysis without Model Re-fit
Comments: 19 pages
Subjects: Methodology (stat.ME); Computation (stat.CO)
Prior sensitivity analysis is a fundamental method to check the effects of prior distributions on the posterior distribution in Bayesian inference. Exploring the posteriors under several alternative priors can be computationally intensive, particularly for complex latent variable models. To address this issue, we propose a novel method for quantifying the prior sensitivity that does not require model re-fit. Specifically, we present a method to compute the Hellinger and Kullback-Leibler distances between two posterior distributions with base and alternative priors, using Monte Carlo integration based only on the base posterior distribution, through novel integral expressions of the two distances. We also extend the above approach for assessing the influence of hyperpriors in general latent variable models. We demonstrate the proposed method through examples of a simple normal distribution model, hierarchical binomial-beta model, and Gaussian process regression model.
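The idea of measuring prior sensitivity from base-posterior draws alone can be illustrated with a minimal sketch. It is not the paper's integral expressions; it uses the standard importance-weight identities, with $w(\theta) = \pi_1(\theta)/\pi_0(\theta)$ the ratio of alternative to base prior, $H^2 = 1 - E_0[\sqrt{w}]/\sqrt{E_0[w]}$ (in the convention $H^2 = 1 - \int \sqrt{p_0 p_1}$) and $\mathrm{KL}(p_0 \| p_1) = \log E_0[w] - E_0[\log w]$. The function names and the conjugate normal example (unit observation variance) are assumptions made for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def normal_posterior(sum_y, n_obs, prior_mean, prior_sd):
    """Conjugate posterior (mean, sd) for y_i ~ N(theta, 1), theta ~ N(prior_mean, prior_sd^2)."""
    prec = n_obs + 1.0 / prior_sd ** 2
    mean = (sum_y + prior_mean / prior_sd ** 2) / prec
    return mean, math.sqrt(1.0 / prec)


def prior_sensitivity(draws, log_prior_base, log_prior_alt):
    """Estimate (squared Hellinger, KL(base || alt)) between the two posteriors,
    using only draws from the base posterior (no re-fit under the alternative prior)."""
    log_w = [log_prior_alt(t) - log_prior_base(t) for t in draws]
    shift = max(log_w)  # subtract the maximum to stabilise the exponentials
    w = [math.exp(lw - shift) for lw in log_w]
    n = len(draws)
    mean_w = sum(w) / n
    mean_sqrt_w = sum(math.sqrt(v) for v in w) / n
    mean_log_w = sum(log_w) / n - shift  # same shift, so it cancels in both estimates
    hellinger_sq = 1.0 - mean_sqrt_w / math.sqrt(mean_w)
    kl = math.log(mean_w) - mean_log_w
    return hellinger_sq, kl


if __name__ == "__main__":
    rng = random.Random(0)
    m0, s0 = normal_posterior(sum_y=10.0, n_obs=20, prior_mean=0.0, prior_sd=10.0)
    draws = [rng.gauss(m0, s0) for _ in range(20000)]
    h2, kl = prior_sensitivity(
        draws,
        lambda t: log_normal_pdf(t, 0.0, 10.0),  # base prior
        lambda t: log_normal_pdf(t, 2.0, 1.0),   # alternative prior
    )
    print(h2, kl)
```

In the normal example both posteriors are available in closed form, so the Monte Carlo estimates can be checked against the exact distances; for latent variable models only the base-posterior draws would be available.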
- [45] arXiv:2410.09046 (replaced) [pdf, html, other]
-
Title: Linear Convergence of Diffusion Models Under the Manifold Hypothesis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler~(KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.
- [46] arXiv:2411.10400 (replaced) [pdf, html, other]
-
Title: The Loser's Curse and the Critical Role of the Utility Function
Subjects: Applications (stat.AP)
A longstanding question in the judgment and decision making literature is whether experts, even in high-stakes environments, exhibit the same cognitive biases observed in controlled experiments with inexperienced participants. Massey and Thaler (2013) claim to have found an example of bias and irrationality in expert decision making: general managers' behavior in the National Football League draft pick trade market. They argue that general managers systematically overvalue top draft picks, which generate less surplus value on average than later first-round picks, a phenomenon known as the loser's curse. Their conclusion hinges on the assumption that general managers should use expected surplus value as their utility function for evaluating draft picks. This assumption, however, is neither explicitly justified nor necessarily aligned with the strategic complexities of constructing a National Football League roster. In this paper, we challenge their framework by considering alternative utility functions, particularly those that emphasize the acquisition of transformational players--those capable of dramatically increasing a team's chances of winning the Super Bowl. Under a decision rule that prioritizes the probability of acquiring elite players, which we construct from a novel Bayesian hierarchical Beta regression model, general managers' draft trade behavior appears rational rather than systematically flawed. More broadly, our findings highlight the critical role of carefully specifying a utility function when evaluating the quality of decisions.
- [47] arXiv:2411.13922 (replaced) [pdf, other]
-
Title: Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from {\em unknown} distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
- [48] arXiv:2412.07735 (replaced) [pdf, other]
-
Title: Theoretical and Practical Limits of Signal Strength Estimate Precision for Kolmogorov-Zurbenko Periodograms with Dynamic Smoothing
Comments: 32 pages, 8 figures, this article draws from arXiv:2007.03031v3; typos corrected
Subjects: Applications (stat.AP); Computation (stat.CO)
This investigation establishes the theoretical and practical limits of signal strength estimate precision for Kolmogorov-Zurbenko periodograms with dynamic smoothing and compares them to those of standard log-periodograms with static smoothing. Previous research has established the sensitivity, accuracy, resolution, and robustness of Kolmogorov-Zurbenko periodograms with dynamic smoothing in estimating signal frequencies. However, the precision with which they estimate signal strength has never been evaluated. To this point, the width of the confidence interval for a signal strength estimate can serve as a criterion for assessing the precision of such estimates: the narrower the confidence interval, the more precise the estimate. The statistical background for confidence intervals of periodograms is presented, followed by candidate functions to compute and plot them when using Kolmogorov-Zurbenko periodograms with dynamic smoothing. Given an identified signal frequency, a static smoothing window and its smoothing window width can be selected such that its confidence interval is narrower and, thus, its signal strength estimate more precise, than that of dynamic smoothing windows, all while maintaining a level of frequency resolution as good as or better than that of a dynamic smoothing window. These findings suggest the need for a two-step protocol in spectral analysis: computation of a Kolmogorov-Zurbenko periodogram with dynamic smoothing to detect, identify, and separate signal frequencies, followed by computation of a Kolmogorov-Zurbenko periodogram with static smoothing to precisely estimate signal strength and compute its confidence intervals.
- [49] arXiv:2501.12314 (replaced) [pdf, html, other]
-
Title: Uncertainty Quantification With Noise Injection in Neural Networks: A Bayesian Perspective
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Model uncertainty quantification involves measuring and evaluating the uncertainty linked to a model's predictions, helping assess their reliability and confidence. Noise injection is a technique used to enhance the robustness of neural networks by introducing randomness. In this paper, we establish a connection between noise injection and uncertainty quantification from a Bayesian standpoint. We theoretically demonstrate that injecting noise into the weights of a neural network is equivalent to Bayesian inference on a deep Gaussian process. Consequently, we introduce a Monte Carlo Noise Injection (MCNI) method, which involves injecting noise into the parameters during training and performing multiple forward propagations during inference to estimate the uncertainty of the prediction. Through simulation and experiments on regression and classification tasks, our method demonstrates superior performance compared to the baseline model.
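The inference step described above (several forward passes with noise injected into the parameters, then summarising the spread of the outputs) can be sketched as follows. This is only an illustration under stated assumptions: the tiny one-hidden-layer network, the additive Gaussian noise on the weights, and the names `forward`, `perturb`, and `mc_noise_predict` are hypothetical, and the noisy training phase mentioned in the abstract is omitted.

```python
import math
import random


def forward(x, w1, b1, w2, b2):
    """One-hidden-layer tanh network with a scalar output."""
    hidden = [
        math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def perturb(values, sd, rng):
    """Add independent N(0, sd^2) noise to each weight (assumed noise model)."""
    return [v + rng.gauss(0.0, sd) for v in values]


def mc_noise_predict(x, params, noise_sd, n_passes=200, seed=0):
    """Run n_passes noisy forward passes; return (mean prediction, standard deviation)."""
    rng = random.Random(seed)
    w1, b1, w2, b2 = params
    outputs = []
    for _ in range(n_passes):
        noisy_w1 = [perturb(row, noise_sd, rng) for row in w1]
        noisy_w2 = perturb(w2, noise_sd, rng)
        outputs.append(forward(x, noisy_w1, b1, noisy_w2, b2))
    mean = sum(outputs) / n_passes
    var = sum((o - mean) ** 2 for o in outputs) / (n_passes - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # Hypothetical fixed weights standing in for a trained network.
    params = ([[0.5, -0.3], [0.8, 0.1]], [0.0, 0.1], [1.0, -1.0], 0.2)
    print(mc_noise_predict([1.0, 2.0], params, noise_sd=0.1))
```

The standard deviation across passes serves as the uncertainty estimate; with `noise_sd=0` every pass is identical and the spread collapses to zero.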
- [50] arXiv:2501.13304 (replaced) [pdf, html, other]
-
Title: Model selection tests for truncated vine copulas under nested hypotheses
Subjects: Methodology (stat.ME)
Vine copulas, constructed using bivariate copulas as building blocks, provide a flexible framework for modeling multi-dimensional dependencies. However, this flexibility is accompanied by rapidly increasing complexity as dimensionality grows, necessitating appropriate truncation to manage this challenge. While use of Vuong's model selection test has been proposed as a method to determine the optimal truncation level, its application to vine copulas has been heuristic, assuming only strictly non-nested hypotheses. This assumption conflicts with the inherent nesting within truncated vine copula structures. In this paper, we systematically apply Vuong's model selection tests to distinguish competing models of truncated vine copulas under both nested and strictly non-nested hypotheses. Through extensive simulation studies, we characterize the conditions under which the nested hypotheses provide improved discernibility and demonstrate that the strictly non-nested framework can still yield valid distinctions in certain settings. This broader perspective on model comparison contributes to both methodological clarity and practical guidance for vine copula truncation.
- [51] arXiv:2501.18577 (replaced) [pdf, html, other]
-
Title: Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
- [52] arXiv:2502.04654 (replaced) [pdf, html, other]
-
Title: A sliced Wasserstein and diffusion approach to random coefficient models
Comments: This version added a new section relating the proposed approach to treatment effect distribution estimation
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM)
We propose a new minimum-distance estimator for linear random coefficient models. This estimator integrates the recently advanced sliced Wasserstein distance with the nearest neighbor methods, both of which enhance computational efficiency. We demonstrate that the proposed method is consistent in approximating the true distribution. Moreover, our formulation naturally leads to a diffusion process-based algorithm and is closely connected to treatment effect distribution estimation -- both of which are of independent interest and hold promise for broader applications.
- [53] arXiv:2502.05730 (replaced) [pdf, html, other]
-
Title: Attainability of Two-Point Testing Rates for Finite-Sample Location Estimation
Subjects: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
LeCam's two-point testing method yields perhaps the simplest lower bound for estimating the mean of a distribution: roughly, if it is impossible to well-distinguish a distribution centered at $\mu$ from the same distribution centered at $\mu+\Delta$, then it is impossible to estimate the mean by better than $\Delta/2$. It is setting-dependent whether or not a nearly matching upper bound is attainable. We study the conditions under which the two-point testing lower bound can be attained for univariate mean estimation; both in the setting of location estimation (where the distribution is known up to translation) and adaptive location estimation (unknown distribution). Roughly, we will say an estimate nearly attains the two-point testing lower bound if it incurs error that is at most polylogarithmically larger than the Hellinger modulus of continuity for $\tilde{\Omega}(n)$ samples.
Adaptive location estimation is particularly interesting as some distributions admit much better guarantees than sub-Gaussian rates (e.g. $\operatorname{Unif}(\mu-1,\mu+1)$ permits error $\Theta(\frac{1}{n})$, while the sub-Gaussian rate is $\Theta(\frac{1}{\sqrt{n}})$), yet it is not obvious whether these rates may be adaptively attained by one unified approach. Our main result designs an algorithm that nearly attains the two-point testing rate for mixtures of symmetric, log-concave distributions with a common mean. Moreover, this algorithm runs in near-linear time and is parameter-free. In contrast, we show the two-point testing rate is not nearly attainable even for symmetric, unimodal distributions.
We complement this with results for location estimation, showing the two-point testing rate is nearly attainable for unimodal distributions, but unattainable for symmetric distributions.
- [54] arXiv:2502.07991 (replaced) [pdf, html, other]
-
Title: Exact Simulation of Longitudinal Data from Marginal Structural Models
Subjects: Methodology (stat.ME)
Simulating longitudinal data from specified marginal structural models is a crucial but challenging task for evaluating causal inference methods and informing study design. While data generation typically proceeds in a fully conditional manner using structural equations according to a temporal ordering, it is difficult to ensure alignment between conditional distributions and the target marginal causal effects, which presents a fundamental challenge. To address this, we propose a flexible and efficient algorithm for simulating longitudinal data that adheres exactly to a specified marginal structural model. Our approach accommodates time-to-event outcomes and extends naturally to survival settings, which are prevalent in applied research. Compared to existing approaches, it offers several advantages: it enables exact simulation from a known causal model rather than relying on approximations; avoids restrictive assumptions about the data-generating process; and remains computationally efficient by requiring only the evaluation of analytical expressions, rather than Monte Carlo methods or numerical integration. Through simulation studies replicating realistic scenarios, we validate the method's accuracy and utility. Our method will facilitate researchers in effectively simulating data with target causal structures for their specific scenarios.
- [55] arXiv:2503.03659 (replaced) [pdf, html, other]
-
Title: Conformal prediction of future insurance claims in the regression problem
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
In the current insurance literature, prediction of insurance claims in the regression problem is often performed with a statistical model. This model-based approach may potentially suffer from several drawbacks: (i) model misspecification, (ii) selection effect, and (iii) lack of finite-sample validity. This article addresses these three issues simultaneously by employing conformal prediction -- a general machine learning strategy for valid predictions. The proposed method is both model-free and tuning-parameter-free. It also guarantees finite-sample validity at a pre-assigned coverage probability level. Examples, based on both simulated and real data, are provided to demonstrate the excellent performance of the proposed method and its applications in insurance, especially regarding meeting the solvency capital requirement of European insurance regulation, Solvency II.
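For readers unfamiliar with conformal prediction, the finite-sample coverage idea can be illustrated with split conformal prediction, a simpler variant that is swapped in here for illustration and is not claimed to be the paper's procedure. The least-squares line, the absolute-residual score, and all function names are assumptions of this sketch.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def conformal_quantile(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest score; infinite if that rank exceeds n."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(train, calib, x_new, alpha=0.1):
    """Fit on `train`, calibrate absolute residuals on `calib`, return (lower, upper) at x_new."""
    xs, ys = zip(*train)
    slope, intercept = fit_line(xs, ys)
    scores = [abs(y - (slope * x + intercept)) for x, y in calib]
    q = conformal_quantile(scores, alpha)
    pred = slope * x_new + intercept
    return pred - q, pred + q
```

When the calibration points and the new observation are exchangeable, the returned interval covers the new response with probability at least $1 - \alpha$, whatever the quality of the fitted line; a poor fit only makes the interval wider.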
- [56] arXiv:2503.05297 (replaced) [pdf, html, other]
-
Title: regMMD: An R package for parametric estimation and regression with maximum mean discrepancy
Comments: 21 pages, 3 figures
Subjects: Computation (stat.CO); Methodology (stat.ME)
The Maximum Mean Discrepancy (MMD) is a kernel-based metric widely used for nonparametric tests and estimation. Recently, it has also been studied as an objective function for parametric estimation, as it has been shown to yield robust estimators. We have implemented MMD minimization for parameter inference in a wide range of statistical models, including various regression models, within an R package called regMMD. This paper provides an introduction to the regMMD package. We describe the available kernels and optimization procedures, as well as the default settings. Detailed applications to simulated and real data are provided.
- [57] arXiv:2504.01650 (replaced) [pdf, html, other]
-
Title: Sparse Gaussian Neural Processes
Comments: Proceedings of the 7th Symposium on Advances in Approximate Bayesian Inference, PMLR, 2025. 25 pages, 6 figures, 5 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Despite significant recent advances in probabilistic meta-learning, it is common for practitioners to avoid using deep learning models due to a comparative lack of interpretability. Instead, many practitioners simply use non-meta-models such as Gaussian processes with interpretable priors, and conduct the tedious procedure of training their model from scratch for each task they encounter. While this is justifiable for tasks with a limited number of data points, the cubic computational cost of exact Gaussian process inference renders this prohibitive when each task has many observations. To remedy this, we introduce a family of models that meta-learn sparse Gaussian process inference. Not only does this enable rapid prediction on new tasks with sparse Gaussian processes, but since our models have clear interpretations as members of the neural process family, it also allows manual elicitation of priors in a neural process for the first time. In meta-learning regimes for which the number of observed tasks is small or for which expert domain knowledge is available, this offers a crucial advantage.
- [58] arXiv:2504.01949 (replaced) [pdf, other]
-
Title: Comparison of Bayesian methods for extrapolation of treatment effects: a large scale simulation study
Subjects: Methodology (stat.ME); Applications (stat.AP)
Extrapolating treatment effects from related studies is a promising strategy for designing and analyzing clinical trials in situations where achieving an adequate sample size is challenging. Bayesian methods are well-suited for this purpose, as they enable the synthesis of prior information through the use of prior distributions. While the operating characteristics of Bayesian approaches for borrowing data from control arms have been extensively studied, methods that borrow treatment effects -- quantities derived from the comparison between two arms -- remain less well understood. In this paper, we present the findings of an extensive simulation study designed to address this gap. We evaluate the frequentist operating characteristics of these methods, including the probability of success, mean squared error, bias, precision, and credible interval coverage. Our results provide insights into the strengths and limitations of existing methods in the context of confirmatory trials. In particular, we show that the Conditional Power Prior and the Robust Mixture Prior perform better overall, while the test-then-pool variants and the p-value-based power prior display suboptimal performance.
- [59] arXiv:2504.07347 (replaced) [pdf, html, other]
-
Title: Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have focused on system-level engineering, little has been explored from a mathematical modeling and queueing perspective.
In this paper, we aim to develop the queueing fundamentals for large language model (LLM) inference, bridging the gap between the queueing theory and LLM system communities. In particular, we study the throughput aspect in LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for an individual LLM inference engine, highlighting 'work-conserving' as a key design principle in practice. In a network of LLM agents, work-conserving scheduling alone is insufficient, particularly when facing specific workload structures and multi-class workflows that require more sophisticated scheduling strategies. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits that the queueing community can offer in improving LLM inference systems and call for more interdisciplinary development.
- [60] arXiv:2504.14555 (replaced) [pdf, html, other]
-
Title: Nonparametric Estimation in Uniform Deconvolution and Interval Censoring
Comments: 16 pages, 4 figures
Subjects: Statistics Theory (math.ST)
In the uniform deconvolution problem one is interested in estimating the distribution function $F_0$ of a nonnegative random variable, based on a sample with additive uniform noise. A peculiar and not well understood phenomenon of the nonparametric maximum likelihood estimator in this setting is the dichotomy between the situations where $F_0(1)=1$ and $F_0(1)<1$. If $F_0(1)=1$, the MLE can be computed in a straightforward way and its asymptotic pointwise behavior can be derived using the connection to the so-called current status problem. However, if $F_0(1)<1$, one needs an iterative procedure to compute it and the asymptotic pointwise behavior of the nonparametric maximum likelihood estimator is not known. In this paper we describe the problem, connect it to interval censoring problems and a more general model studied in Groeneboom (2024) to state two competing naturally occurring conjectures for the case $F_0(1)<1$. Asymptotic arguments related to smooth functional theory and extensive simulations lead us to bet on one of these two conjectures.
- [61] arXiv:2402.14781 (replaced) [pdf, html, other]
-
Title: Effective Bayesian Causal Inference via Structural Marginalisation and Autoregressive Orders
Comments: 9 pages + references + appendices (37 pages total)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
The traditional two-stage approach to causal inference first identifies a single causal model (or equivalence class of models), which is then used to answer causal queries. However, this neglects any epistemic model uncertainty. In contrast, Bayesian causal inference does incorporate epistemic uncertainty into query estimates via Bayesian marginalisation (posterior averaging) over all causal models. While principled, this marginalisation over entire causal models, i.e., both causal structures (graphs) and mechanisms, poses a tremendous computational challenge. In this work, we address this challenge by decomposing structure marginalisation into the marginalisation over (i) causal orders and (ii) directed acyclic graphs (DAGs) given an order. We can marginalise the latter in closed form by limiting the number of parents per variable and utilising Gaussian processes to model mechanisms. To marginalise over orders, we use a sampling-based approximation, for which we devise a novel auto-regressive distribution over causal orders (ARCO). Our method outperforms state-of-the-art in structure learning on simulated non-linear additive noise benchmarks, and yields competitive results on real-world data. Furthermore, we can accurately infer interventional distributions and average causal effects.
- [62] arXiv:2403.11743 (replaced) [pdf, html, other]
-
Title: PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks
Philip Matthias Winter, Maria Wimmer, David Major, Dimitrios Lenis, Astrid Berg, Theresa Neubauer, Gaia Romana De Paolis, Johannes Novotny, Sophia Ulonska, Katja Bühler
Comments: This is the author's accepted manuscript of a paper published in Lecture Notes in Computer Science (LNCS), volume 15297, Proceedings of DAGM GCPR 2024. 25 pages, 7 figures
Journal-ref: LNCS, volume 15297, 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This work addresses flexibility in deep learning by means of transductive reasoning. For adaptation to new data and tasks, e.g., in continual learning, existing methods typically involve tuning learnable parameters or complete re-training from scratch, rendering such approaches inflexible in practice. We argue that the notion of separating computation from memory by means of transduction can act as a stepping stone for solving these issues. We therefore propose PARMESAN (parameter-free memory search and transduction), a scalable method which leverages a memory module for solving dense prediction tasks. At inference, hidden representations in memory are searched to find corresponding patterns. In contrast to other methods that rely on continuous training of learnable parameters, PARMESAN learns via memory consolidation simply by modifying stored contents. Our method is compatible with commonly used architectures and canonically transfers to 1D, 2D, and 3D grid-based data. The capabilities of our approach are demonstrated on the complex task of continual learning. PARMESAN learns 3-4 orders of magnitude faster than established baselines while being on par in terms of predictive performance, hardware-efficiency, and knowledge retention.
- [63] arXiv:2411.18218 (replaced) [pdf, other]
-
Title: Exponential speed up in Monte Carlo sampling through Radial Updates
Comments: 16 + 12 pages, 5 figures, 1 table, 2 algorithms; v2: revised, published
Subjects: Computational Physics (physics.comp-ph); High Energy Physics - Lattice (hep-lat); Numerical Analysis (math.NA); Computation (stat.CO)
Recently, it has been shown that the hybrid Monte Carlo (HMC) algorithm is guaranteed to converge exponentially to a given target probability distribution $p(x)\propto e^{-V(x)}$ on non-compact spaces if augmented by an appropriate radial update. In this work we present a simple way to derive efficient radial updates meeting the necessary requirements for any potential $V$. We reduce the problem to finding a substitution for the radial direction $||x||=f(z)$ so that the effective potential $V(f(z))$ grows exponentially with $z\rightarrow\pm\infty$. Any additive update of $z$ then leads to the desired convergence. We show that choosing this update from a normal distribution with standard deviation $\sigma\approx 1/\sqrt{d}$ in $d$ dimensions yields very good results. We further generalise the previous results on radial updates to a wide class of Markov chain Monte Carlo (MCMC) algorithms beyond the HMC and we quantify the convergence behaviour of MCMC algorithms with badly chosen radial update. Finally, we apply the radial update to the sampling of heavy-tailed distributions and achieve a speed up of many orders of magnitude.
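The mechanics of an additive update in the substituted radial variable can be sketched with a plain Metropolis step. The sketch assumes the substitution $f(z) = e^z$, chosen here only for illustration and not claimed to satisfy the abstract's growth requirement for every potential $V$; the function names are hypothetical. Adding $\gamma$ to $z = \log\|x\|$ rescales $x$ by $e^{\gamma}$, and the volume element contributes a factor $e^{d\gamma}$ to the acceptance ratio.

```python
import math
import random


def radial_update(x, potential, sigma, rng):
    """One Metropolis step that rescales x by exp(gamma), gamma ~ N(0, sigma^2).

    Targets p(x) proportional to exp(-potential(x)); the d * gamma term is the
    Jacobian of the change of variables from radius to log-radius.
    Returns (new_x, accepted).
    """
    d = len(x)
    gamma = rng.gauss(0.0, sigma)
    proposal = [math.exp(gamma) * xi for xi in x]
    log_accept = potential(x) - potential(proposal) + d * gamma
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal, True
    return x, False


def gaussian_potential(x):
    """V(x) = |x|^2 / 2, i.e. a standard normal target (illustrative choice)."""
    return 0.5 * sum(xi * xi for xi in x)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 0.0]
    sigma = 1.0 / math.sqrt(len(x))  # step size suggested in the abstract
    for _ in range(1000):
        x, _ = radial_update(x, gaussian_potential, sigma, rng)
    print(x)
```

The step changes only the radius, so in practice it would be interleaved with another sampler (HMC in the abstract) that moves the direction of $x$.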
- [64] arXiv:2412.11003 (replaced) [pdf, html, other]
-
Title: Optimal Rates for Robust Stochastic Convex Optimization
Comments: The 6th annual Symposium on Foundations of Responsible Computing (FORC 2025)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the $\epsilon$-contamination model, where an adversary can inspect and replace up to an $\epsilon$-fraction of the samples, a fundamental open problem is determining the optimal rates for robust stochastic convex optimization (SCO) under such contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the $\epsilon$-contamination model. Our approach improves over existing algorithms, which are not only suboptimal but also require stringent assumptions, including Lipschitz continuity and smoothness of individual sample functions. By contrast, our optimal algorithms do not require these stringent assumptions, assuming only population-level smoothness of the loss. Moreover, our algorithms can be adapted to handle the case in which the covariance parameter is unknown, and can be extended to nonsmooth population risks via convolutional smoothing. We complement our algorithmic developments with a tight information-theoretic lower bound for robust SCO.
- [65] arXiv:2501.10117 (replaced) [pdf, html, other]
-
Title: Prediction Sets and Conformal Inference with Interval Outcomes
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
Given data on a scalar random variable $Y$, a prediction set for $Y$ with miscoverage level $\alpha$ is a set of values for $Y$ that contains a randomly drawn $Y$ with probability $1 - \alpha$, where $\alpha \in (0,1)$. Among all prediction sets that satisfy this coverage property, the oracle prediction set is the one with the smallest volume. This paper provides estimation methods for such prediction sets given observed conditioning covariates when $Y$ is censored or measured in intervals. We first characterise the oracle prediction set under interval censoring and develop a consistent estimator for the shortest prediction interval that satisfies this coverage property. These consistency results are extended to accommodate cases where the prediction set consists of multiple disjoint intervals. We use conformal inference to construct a prediction set that achieves finite-sample validity under censoring and maintains consistency as sample size increases, using a conformity score function designed for interval data. The procedure accommodates the prediction uncertainty that is irreducible (due to the stochastic nature of outcomes), the modelling uncertainty due to partial identification, and the sampling uncertainty that shrinks as samples get larger. We conduct a set of Monte Carlo simulations and an application to data from the Current Population Survey. The results highlight the robustness and efficiency of the proposed methods.
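To make the coverage property concrete, here is a minimal sketch of standard split conformal prediction for a fully observed $Y$. It is not the paper's procedure: the conformity score designed for interval data is not given in the abstract, so this example substitutes the usual absolute-residual score, and the function name `split_conformal_interval` is hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_residuals, prediction, alpha):
    """Symmetric interval around `prediction` with marginal coverage >= 1 - alpha.

    Uses absolute residuals from a held-out calibration set as conformity scores.
    """
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = len(scores)
    # Rank of the conformal quantile, with the (n + 1) finite-sample correction.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return -np.inf, np.inf            # too few calibration points for this alpha
    q = scores[k - 1]
    return prediction - q, prediction + q

# Usage with illustrative calibration residuals (y minus a fitted prediction).
lo, hi = split_conformal_interval([1, -2, 3, -4, 5, -6, 7, -8, 9], 0.0, alpha=0.5)
```

The finite-sample guarantee holds on average over calibration and test draws under exchangeability; it says nothing about the interval being the shortest one, which is the role of the oracle set discussed above.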
- [66] arXiv:2501.18374 (replaced) [pdf, html, other]
-
Title: Proofs for Folklore Theorems on the Radon-Nikodym Derivative
Comments: Submitted to the IEEE Information Theory Workshop 2025, 6 pages
Subjects: Information Theory (cs.IT); History and Overview (math.HO); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper, rigorous statements and formal proofs are presented for both foundational and advanced folklore theorems on the Radon-Nikodym derivative. The cases of conditional and marginal probability measures are carefully considered, which leads to an identity involving the sum of mutual and lautum information suggesting a new interpretation for such a sum.
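For reference, the two quantities whose sum is mentioned above have the following standard definitions in terms of Radon-Nikodym derivatives. The notation $D(\cdot\,\|\,\cdot)$ for relative entropy is an assumption here, since the abstract does not fix notation, and each expression requires the corresponding absolute continuity to hold.

```latex
\begin{align*}
D(P \,\|\, Q) &= \int \log\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}P
  && \text{(relative entropy, for } P \ll Q\text{)} \\
I(X;Y) &= D\!\left(P_{XY} \,\|\, P_X P_Y\right)
  && \text{(mutual information)} \\
L(X;Y) &= D\!\left(P_X P_Y \,\|\, P_{XY}\right)
  && \text{(lautum information)} \\
I(X;Y) + L(X;Y) &= D\!\left(P_{XY} \,\|\, P_X P_Y\right) + D\!\left(P_X P_Y \,\|\, P_{XY}\right)
\end{align*}
```

The sum is thus the symmetrised relative entropy between the joint measure and the product of its marginals; the identity and interpretation the paper proposes for it are not stated in the abstract.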
- [67] arXiv:2502.17060 (replaced) [pdf, html, other]
-
Title: Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The massive increase in data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting the highest-quality data for analysis greatly increases task accuracy and efficiency, it remains a hard task, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Dataset similarity is assessed by projecting each dataset to a vector embedding representation. The vectorization is performed using our proposed deep learning model NumTabData2Vec, which takes a whole dataset and projects it into a lower-dimensional vector embedding space. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios to a lower-dimensional vector embedding representation and distinguish between them.
- [68] arXiv:2503.21495 (replaced) [pdf, html, other]
-
Title: Adaptive Resampling with Bootstrap for Noisy Multi-Objective Optimization Problems
Comments: 14 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The challenge of noisy multi-objective optimization lies in the constant trade-off between exploring new decision points and improving the precision of known points through resampling. This decision should take into account both the variability of the objective functions and the current estimate of a point in relation to the Pareto front. Since the amount and distribution of noise are generally unknown, it is desirable for a decision function to be highly adaptive to the properties of the optimization problem. This paper presents a resampling decision function that incorporates the stochastic nature of the optimization problem by using bootstrapping and the probability of dominance. The distribution-free estimation of the probability of dominance is achieved using bootstrap estimates of the means. To make the procedure applicable even with very few observations, we transfer the distribution observed at other decision points. The efficiency of this resampling approach is demonstrated by applying it in the NSGA-II algorithm with a sequential resampling procedure under multiple noise variations.
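The distribution-free estimate of the probability of dominance can be sketched as follows. This is an illustrative reading of the abstract, assuming minimisation of all objectives and independent resampling of each decision point's observations; the function name `prob_dominance` is hypothetical, and the transfer of distributions from other decision points is not implemented.

```python
import numpy as np

def prob_dominance(obs_a, obs_b, n_boot=2000, rng=None):
    """Bootstrap estimate of P(mean of A Pareto-dominates mean of B), minimisation.

    obs_a, obs_b: arrays of shape (n_observations, n_objectives).
    """
    rng = np.random.default_rng() if rng is None else rng
    obs_a = np.asarray(obs_a, dtype=float)
    obs_b = np.asarray(obs_b, dtype=float)
    # Resample observation indices with replacement for each bootstrap replicate.
    idx_a = rng.integers(0, len(obs_a), size=(n_boot, len(obs_a)))
    idx_b = rng.integers(0, len(obs_b), size=(n_boot, len(obs_b)))
    mean_a = obs_a[idx_a].mean(axis=1)    # shape (n_boot, n_objectives)
    mean_b = obs_b[idx_b].mean(axis=1)
    # A dominates B if it is no worse in every objective and better in at least one.
    dominates = np.all(mean_a <= mean_b, axis=1) & np.any(mean_a < mean_b, axis=1)
    return float(dominates.mean())

# Usage with synthetic noisy evaluations of two decision points (two objectives).
rng = np.random.default_rng(0)
a = rng.normal([0.0, 0.0], 0.1, size=(10, 2))
b = rng.normal([1.0, 1.0], 0.1, size=(10, 2))
p_ab = prob_dominance(a, b, rng=rng)
```

A resampling rule could then, for example, request further evaluations of a point only while such probabilities remain far from 0 and 1; the actual decision function used in the paper is not specified in the abstract.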
- [69] arXiv:2504.16450 (replaced) [pdf, html, other]
-
Title: An Effective Gram Matrix Characterizes Generalization in Deep Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities: a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for their being trained on different datasets. We analyze this differential equation to compute an "effective Gram matrix" that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial "residual". Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.
- [70] arXiv:2504.16580 (replaced) [pdf, html, other]
-
Title: Hyper-Transforming Latent Diffusion Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming, a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.