Statistics
Showing new listings for Thursday, 13 March 2025
- [1] arXiv:2503.08808 [pdf, html, other]
Title: Distribution and Moments of a Normalized Dissimilarity Ratio for two Correlated Gamma Variables
Subjects: Statistics Theory (math.ST); Mathematical Physics (math-ph); Instrumentation and Detectors (physics.ins-det); Optics (physics.optics)
We consider two random variables $X$ and $Y$ following correlated Gamma distributions, characterized by identical scale and shape parameters and a linear correlation coefficient $\rho$. Our focus is on the parameter \[ D(X,Y) = \frac{|X - Y|}{X + Y}, \] which appears in applied contexts such as dynamic speckle imaging, where it is known as the \textit{Fujii index}. In this work, we derive a closed-form expression for the probability density function of $D(X,Y)$ as well as analytical formulas for its moments of order $k$. Our derivation starts by representing $X$ and $Y$ as two correlated exponential random variables, obtained from the squared magnitudes of circular complex Gaussian variables. By considering the sum of $k$ independent exponential variables, we then derive the joint density of $(X,Y)$ when $X$ and $Y$ are two correlated Gamma variables. Through appropriate variable transformations, we obtain the theoretical distribution of $D(X,Y)$ and evaluate its moments analytically. These theoretical findings are validated through numerical simulations, with particular attention to two specific cases: zero correlation and unit shape parameter.
- [2] arXiv:2503.08821 [pdf, html, other]
Title: Questioning Normality: A study of wavelet leaders distribution
Comments: 44 pages
Subjects: Applications (stat.AP); Methodology (stat.ME)
The motivation of this article is to estimate multifractality classification and model selection parameters: the first-order scaling exponent $c_1$ and the second-order scaling exponent (or intermittency coefficient) $c_2$. These exponents are built on wavelet leaders, which therefore constitute fundamental tools in applied multifractal analysis. While most estimation methods, particularly Bayesian approaches, rely on the assumption of log-normality, we challenge this hypothesis by statistically testing the normality of log-leaders. Upon rejecting this common assumption, we propose instead a novel model based on log-concave distributions. We validate this new model on well-known stochastic processes, including fractional Brownian motion, the multifractal random walk, and the canonical Mandelbrot cascade, as well as on real-world marathon runner data. Furthermore, we revisit the estimation procedure for $c_1$, providing confidence intervals, and for $c_2$, applying it to fractional Brownian motions with various Hurst indices as well as to the multifractal random walk. Finally, we establish several theoretical results on the distribution of log-leaders in random wavelet series, which are consistent with our numerical findings.
- [3] arXiv:2503.08849 [pdf, html, other]
Title: Learning Pareto manifolds in high dimensions: How can regularization help?
Comments: Published in Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization when the data exhibits low-dimensional structure like sparsity. However, it is largely unexplored how to leverage this structure in the context of multi-objective learning (MOL) with multiple competing objectives. In this work, we discuss how the application of vanilla regularization approaches can fail, and propose a two-stage MOL framework that can successfully leverage low-dimensional structure. We demonstrate its effectiveness experimentally for multi-distribution learning and fairness-risk trade-offs.
- [4] arXiv:2503.08881 [pdf, html, other]
Title: Bayesian local clustering of functional data via semi-Markovian random partitions
Subjects: Methodology (stat.ME)
We introduce a Bayesian framework for indirect local clustering of functional data, leveraging B-spline basis expansions and a novel dependent random partition model. By exploiting the local support properties of B-splines, our approach allows partially coincident functional behaviors, achieved when shared basis coefficients span sufficiently contiguous regions. This is accomplished through a cutting-edge dependent random partition model that enforces semi-Markovian dependence across a sequence of partitions. By matching the order of the B-spline basis with the semi-Markovian dependence structure, the proposed model serves as a highly flexible prior, enabling efficient modeling of localized features in functional data. Furthermore, we extend the utility of the dependent random partition model beyond functional data, demonstrating its applicability to a broad class of problems where sequences of dependent partitions are central, and standard Markovian assumptions prove overly restrictive. Empirical illustrations, including analyses of simulated data and tide level measurements from the Venice Lagoon, showcase the effectiveness and versatility of the proposed methodology.
- [5] arXiv:2503.08896 [pdf, html, other]
Title: Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms
Comments: AISTATS 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper introduces a general framework for risk-sensitive bandits that integrates the notions of risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms since there are uncountably many mixture possibilities. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) it identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) it designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when a mixture is optimal) or the solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales according to $O((\log T/T )^{\nu})$, where $T$ is the horizon, and $\nu>0$ is a riskmetric-specific constant.
- [6] arXiv:2503.08902 [pdf, html, other]
Title: A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Mutual Information (MI) is a crucial measure for capturing dependencies between variables, but exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI estimator; however, methods based on the empirical distribution function (EDF) can introduce sharp fluctuations in the MI loss due to poor out-of-sample performance, destabilizing convergence. We present a Bayesian nonparametric (BNP) solution for training an MI estimator by constructing the MI loss with a finite representation of the Dirichlet process posterior to incorporate regularization in the training process. With this regularization, the MI loss integrates both prior knowledge and empirical data to reduce the loss sensitivity to fluctuations and outliers in the sample data, especially in small sample settings like mini-batches. This approach addresses the challenge of balancing accuracy and low variance by effectively reducing variance, leading to stabilized and robust MI loss gradients during training and enhancing the convergence of the MI approximation while offering stronger theoretical guarantees for convergence. We explore the application of our estimator in maximizing MI between the data space and the latent space of a variational autoencoder. Experimental results demonstrate significant improvements in convergence over EDF-based methods, with applications across synthetic and real datasets, notably in 3D CT image generation, yielding enhanced structure discovery and reduced overfitting in data synthesis. While this paper focuses on generative models in application, the proposed estimator is not restricted to this setting and can be applied more broadly in various BNP learning procedures.
- [7] arXiv:2503.08971 [pdf, html, other]
Title: Data-Driven Adjustment for Multiple Treatments
Comments: 17 pages, 6 figures
Subjects: Methodology (stat.ME)
Covariate adjustment is one method of causal effect identification in non-experimental settings. Prior research provides routes for finding appropriate adjustment sets, but much of this research assumes knowledge of the underlying causal graph. In this paper, we present two routes for finding adjustment sets that do not require knowledge of a graph -- and instead rely on dependencies and independencies in the data directly. We consider a setting where the adjustment set is unaffected by treatment or outcome. Our first route shows how to extend prior research in this area using a concept known as c-equivalence. Our second route provides sufficient criteria for finding adjustment sets in the setting of multiple treatments.
- [8] arXiv:2503.08987 [pdf, html, other]
Title: Multilevel Primary Aim Analyses of Clustered SMARTs: With Applications in Health Policy
Gabriel Durham, Anil Battalahalli, Amy Kilbourne, Andrew Quanbeck, Wenchu Pan, Tim Lycurgus, Daniel Almirall
Comments: 55 pages, 8 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
In many health policy settings, adaptive interventions target a population of clusters (e.g., schools), with the ultimate intent of impacting outcomes at the level of individuals within the clusters. Health policy researchers can use clustered, sequential, multiple assignment, randomized trials (SMARTs) to answer important scientific questions concerning clustered adaptive interventions. A common primary aim is to compare the mean of a nested, end-of-study outcome between two clustered adaptive interventions. However, existing methods are not suitable when the primary outcome in a clustered SMART is nested and longitudinal (e.g., repeated outcome measures nested within mental healthcare providers, and mental healthcare providers nested within schools). This manuscript proposes a three-level marginal mean modeling and estimation approach for comparing adaptive interventions in a clustered SMART. The proposed method enables policy analysts to answer a wider array of scientific questions in the marginal comparison of clustered adaptive interventions. Further, relative to using an existing two-level method with a nested end-of-study outcome, the proposed method benefits from improved statistical efficiency. With this approach, we examine longitudinal comparisons of adaptive interventions for improving school-based mental healthcare and contrast its performance with existing approaches for studying static end-of-study outcomes. Methods were motivated by the Adaptive School-Based Implementation of CBT (ASIC) study, a clustered SMART designed to construct an adaptive health policy to improve the adoption of evidence-based CBT by mental healthcare professionals in high schools across Michigan.
- [9] arXiv:2503.09026 [pdf, html, other]
Title: A Sparse Linear Model for Positive Definite Estimation of Covariance Matrices
Subjects: Methodology (stat.ME)
Sparse covariance matrices play crucial roles by encoding the interdependencies between variables in numerous fields such as genetics and neuroscience. Despite substantial studies on sparse covariance matrices, existing methods face several challenges such as the correlation among the elements in the sample covariance matrix, positive definiteness and unbiased estimation of the diagonal elements. To address these challenges, we formulate a linear covariance model for estimating sparse covariance matrices and propose a penalized regression. This method is general enough to encompass existing sparse covariance estimators and can additionally consider correlation among the elements in the sample covariance matrix while preserving positive definiteness and fixing the diagonal elements to the sample variance, hence avoiding unnecessary bias in the diagonal elements. We apply our estimator to simulated data and real data from neuroscience and genetics to describe the efficacy of our proposed method.
- [10] arXiv:2503.09065 [pdf, html, other]
Title: WOMBAT v2.S: A Bayesian inversion framework for attributing global CO$_2$ flux components from multiprocess data
Comments: 25 pages, 5 figures, 1 table
Subjects: Applications (stat.AP); Atmospheric and Oceanic Physics (physics.ao-ph)
Contributions from photosynthesis and other natural components of the carbon cycle present the largest uncertainties in our understanding of carbon dioxide (CO$_2$) sources and sinks. While the global spatiotemporal distribution of the net flux (the sum of all contributions) can be inferred from atmospheric CO$_2$ concentrations through flux inversion, attributing the net flux to its individual components remains challenging. The advent of solar-induced fluorescence (SIF) satellite observations provides an opportunity to isolate natural components by anchoring gross primary productivity (GPP), the photosynthetic component of the net flux. Here, we introduce a novel statistical flux-inversion framework that simultaneously assimilates observations of SIF and CO$_2$ concentration, extending WOMBAT v2.0 (WOllongong Methodology for Bayesian Assimilation of Trace-gases, version 2.0) with a hierarchical model of spatiotemporal dependence between GPP and SIF processes. We call the new framework WOMBAT v2.S, and we apply it to SIF and CO$_2$ data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite and other instruments to estimate natural fluxes over the globe during a recent six-year period. In a simulation experiment that matches OCO-2's retrieval characteristics, the inclusion of SIF improves accuracy and uncertainty quantification of component flux estimates. Comparing estimates from WOMBAT v2.S, v2.0, and the independent FLUXCOM initiative, we observe that linking GPP to SIF has little effect on net flux, as expected, but leads to spatial redistribution and more realistic seasonal structure in natural flux components.
- [11] arXiv:2503.09072 [pdf, html, other]
Title: High-dimensional covariance matrix regularization using informative targets
Subjects: Methodology (stat.ME)
The sample covariance matrix becomes non-invertible in high-dimensional settings, making classical multivariate statistical methods inapplicable. Various regularization techniques address this issue by imposing a structured target matrix to improve stability and invertibility. While diagonal matrices are commonly used as targets due to their simplicity, more informative target matrices can enhance performance. This paper explores the use of such targets and estimates the underlying correlation parameter using maximum likelihood. The proposed method is analytically straightforward, computationally efficient, and more accurate than recent regularization techniques when targets are correctly specified. Its effectiveness is demonstrated through extensive simulations and a real-world application.
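The idea of regularizing a singular sample covariance matrix with a structured target can be illustrated with a small sketch. This is not the paper's estimator: the equicorrelation target, the fixed correlation `rho`, the fixed weight `lam`, and the function names are all assumptions made for illustration, whereas the abstract describes estimating the correlation parameter by maximum likelihood.

```python
import numpy as np

def shrink_to_target(S, target, lam):
    """Convex combination (1 - lam) * S + lam * target, with 0 <= lam <= 1."""
    return (1.0 - lam) * S + lam * target

def equicorrelation_target(S, rho):
    """Hypothetical target: sample variances on the diagonal, common correlation rho."""
    sd = np.sqrt(np.diag(S))
    R = np.full((len(sd), len(sd)), rho)
    np.fill_diagonal(R, 1.0)
    return np.outer(sd, sd) * R

rng = np.random.default_rng(0)
n, p = 10, 30                      # fewer observations than variables
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)        # rank at most n - 1, hence singular

T = equicorrelation_target(S, rho=0.2)   # rho is fixed here, not estimated
S_reg = shrink_to_target(S, T, lam=0.3)  # lam is fixed here, not tuned

rank_S = np.linalg.matrix_rank(S)
min_eig_reg = np.linalg.eigvalsh(S_reg).min()
print("rank of S:", rank_S, "of", p)
print("smallest eigenvalue after shrinkage:", min_eig_reg)
```

Because the target is positive definite and the sample covariance is positive semidefinite, any weight `lam` in (0, 1] yields an invertible matrix; the diagonal is left equal to the sample variances by construction of this particular target.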
- [12] arXiv:2503.09097 [pdf, html, other]
Title: Self-Consistent Equation-guided Neural Networks for Censored Time-to-Event Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In survival analysis, estimating the conditional survival function given predictors is often of interest. There is a growing trend in the development of deep learning methods for analyzing censored time-to-event data, especially when dealing with high-dimensional predictors that are complexly interrelated. Many existing deep learning approaches for estimating the conditional survival functions extend the Cox regression models by replacing the linear function of predictor effects by a shallow feed-forward neural network while maintaining the proportional hazards assumption. Their implementation can be computationally intensive due to the use of the full dataset at each iteration because the use of batch data may distort the at-risk set of the partial likelihood function. To overcome these limitations, we propose a novel deep learning approach to non-parametric estimation of the conditional survival functions using the generative adversarial networks leveraging self-consistent equations. The proposed method is model-free and does not require any parametric assumptions on the structure of the conditional survival function. We establish the convergence rate of our proposed estimator of the conditional survival function. In addition, we evaluate the performance of the proposed method through simulation studies and demonstrate its application on a real-world dataset.
- [13] arXiv:2503.09156 [pdf, html, other]
Title: Spectral Clustering on Multilayer Networks with Covariates
Comments: 20 pages, 1 figure
Subjects: Methodology (stat.ME)
The community detection problem on multilayer networks has drawn much interest. When nodal covariates are also present, little work has been done to integrate information from both sources. To leverage the multilayer networks and the covariates, we propose two new algorithms: the spectral clustering on aggregated networks with covariates (SCANC), and the spectral clustering on aggregated Laplacian with covariates (SCALC). These two algorithms are easy to implement, computationally fast, and feature a data-driven approach for tuning parameter selection.
We establish theoretical guarantees for both methods under the Multilayer Stochastic Blockmodel with Covariates (MSBM-C), demonstrating their consistency in recovering community structure. Our analysis reveals that increasing the number of layers, incorporating covariate information, and enhancing network density all contribute to improved clustering accuracy. Notably, SCANC is most effective when all layers exhibit similar assortativity, whereas SCALC performs better when both assortative and disassortative layers are present. In simulation studies and a primary school contact data analysis, our methods outperform other methods. Our results highlight the advantages of spectral-based aggregation techniques in leveraging both network structure and nodal attributes for robust community detection.
- [14] arXiv:2503.09194 [pdf, html, other]
Title: Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Unbiased data synthesis is crucial for evaluating causal discovery algorithms in the presence of unobserved confounding, given the scarcity of real-world datasets. A common approach, implicit parameterization, encodes unobserved confounding by modifying the off-diagonal entries of the idiosyncratic covariance matrix while preserving positive definiteness. Within this approach, state-of-the-art protocols have two distinct issues that hinder unbiased sampling from the complete space of causal models: first, the use of diagonally dominant constructions, which restrict the spectrum of partial correlation matrices; and second, the restriction of possible graphical structures when sampling bidirected edges, unnecessarily ruling out valid causal models. To address these limitations, we propose an improved explicit modeling approach for unobserved confounding, leveraging block-hierarchical ancestral generation of ground truth causal graphs. Algorithms for converting the ground truth DAG into an ancestral graph are provided so that the output of causal discovery algorithms can be compared against it. We prove that our approach fully covers the space of causal models, including those generated by the implicit parameterization, thus enabling more robust evaluation of methods for causal discovery and inference.
- [15] arXiv:2503.09226 [pdf, html, other]
Title: Towards Regulatory-Confirmed Adaptive Clinical Trials: Machine Learning Opportunities and Solutions
Comments: AISTATS 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Randomized Controlled Trials (RCTs) are the gold standard for evaluating the effect of new medical treatments. Treatments must pass stringent regulatory conditions in order to be approved for widespread use, yet even after the regulatory barriers are crossed, real-world challenges might arise: Who should get the treatment? What is its true clinical utility? Are there discrepancies in the treatment effectiveness across diverse and under-served populations? We introduce two new objectives for future clinical trials that integrate regulatory constraints and treatment policy value for both the entire population and under-served populations, thus answering some of the questions above in advance. Designed to meet these objectives, we formulate Randomize First Augment Next (RFAN), a new framework for designing Phase III clinical trials. Our framework consists of a standard randomized component followed by an adaptive one, jointly meant to efficiently and safely acquire and assign patients into treatment arms during the trial. Then, we propose strategies for implementing RFAN based on causal, deep Bayesian active learning. Finally, we empirically evaluate the performance of our framework using synthetic and real-world semi-synthetic datasets.
- [16] arXiv:2503.09299 [pdf, html, other]
Title: Low-Rank Graphon Estimation: Theory and Applications to Graphon Games
Subjects: Statistics Theory (math.ST)
This paper tackles the challenge of estimating a low-rank graphon from sampled network data, employing a singular value thresholding (SVT) estimator to create a piecewise-constant graphon based on the network's adjacency matrix. Under certain assumptions about the graphon's structural properties, we establish bounds on the operator norm distance between the true graphon and its estimator, as well as on the rank of the estimated graphon. In the second part of the paper, we apply our estimator to graphon games. We derive bounds on the suboptimality of interventions in the social welfare problem in graphon games when the intervention is based on the estimated graphon. These bounds are expressed in terms of the operator norm of the difference between the true and estimated graphons. We also emphasize the computational benefits of using the low-rank estimated graphon to solve these problems.
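A minimal sketch of singular value thresholding on an adjacency matrix may help make the estimator concrete. Everything specific here is an assumption for illustration: the rank-one edge-probability matrix, the threshold `1.5 * sqrt(n)`, the clipping to [0, 1], and the helper names are not taken from the paper, whose threshold and assumptions may differ.

```python
import numpy as np

def svt_estimate(A, tau):
    """Keep singular values of A above tau, then clip entries to [0, 1].

    Returns the estimate and the number of singular values retained."""
    U, s, Vt = np.linalg.svd(A)
    keep = s > tau
    P_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return np.clip(P_hat, 0.0, 1.0), int(keep.sum())

def step_graphon(P_hat, x, y):
    """Piecewise-constant function on [0, 1]^2 built from an n x n matrix."""
    n = P_hat.shape[0]
    i = min(int(x * n), n - 1)
    j = min(int(y * n), n - 1)
    return P_hat[i, j]

rng = np.random.default_rng(1)
n = 200
u = rng.uniform(0.3, 0.9, size=n)
P = np.outer(u, u)                        # hypothetical rank-one edge probabilities

# Symmetric adjacency matrix with independent edges and an empty diagonal.
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)

tau = 1.5 * np.sqrt(n)                    # assumed threshold, for illustration only
P_hat, n_kept = svt_estimate(A, tau)

err_hat = np.linalg.norm(P_hat - P)       # Frobenius error of the SVT estimate
err_raw = np.linalg.norm(A - P)           # Frobenius error of the raw adjacency matrix
print("singular values kept:", n_kept)
print("error of SVT estimate:", err_hat, "vs raw adjacency:", err_raw)
print("step graphon at (0.5, 0.5):", step_graphon(P_hat, 0.5, 0.5))
```

The low retained rank is also what makes downstream computations cheaper: the estimate can be stored and applied through a handful of singular vectors rather than a full n x n matrix.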
- [17] arXiv:2503.09310 [pdf, html, other]
Title: Competing-risk Weibull survival model with multiple causes
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The failure of a system can result from the simultaneous effects of multiple causes, where assigning a specific cause may be inappropriate or unavailable. Examples include contributing causes of death in epidemiology and the aetiology of neurodegenerative diseases like Alzheimer's. We propose a parametric Weibull accelerated failure time model for multiple causes, incorporating a data-driven, individualized, and time-varying winning probability (relative importance) matrix. Using maximum likelihood estimation and the expectation-maximization (EM) algorithm, our approach enables simultaneous estimation of regression coefficients and relative cause importance, ensuring consistency and asymptotic normality. A simulation study and an application to Alzheimer's disease demonstrate its effectiveness in addressing cause-mixture problems and identifying informative biomarker combinations, with comparisons to Weibull and Cox proportional hazards models.
- [18] arXiv:2503.09451 [pdf, other]
Title: Bayesian nonparametric modeling of mixed-type bounded data
Subjects: Methodology (stat.ME)
We propose a Bayesian nonparametric model for mixed-type bounded data, where some variables are compositional and others are interval-bounded. Compositional variables are non-negative and sum to a given constant, such as the proportion of time an individual spends on different activities during the day or the fraction of different types of nutrients in a person's diet. Interval-bounded variables, on the other hand, are real numbers constrained by both a lower and an upper bound. Our approach relies on a novel class of random multivariate Bernstein polynomials, which induce a Dirichlet process mixture model of products of Dirichlet and beta densities. We study the theoretical properties of the model, including its topological support and posterior consistency. The model can be used for density and conditional density estimation, where both the response and predictors take values in the simplex space and/or hypercube. We illustrate the model's behavior through the analysis of simulated data and data from the 2005-2006 cycle of the U.S. National Health and Nutrition Examination Survey.
- [19] arXiv:2503.09507 [pdf, html, other]
Title: Parameter estimation for the stochastic Burgers equation driven by white noise from local measurements
Subjects: Statistics Theory (math.ST); Probability (math.PR)
For the one-dimensional stochastic Burgers equation driven by space-time white noise, we consider the problem of estimating the diffusivity parameter in front of the second-order spatial derivative. Based on local observations in space, we study the estimator derived in [Altmeyer, Reiß, Ann. Appl. Probab. (2021)] for the linear stochastic heat equation, which has also been used in [Altmeyer, Cialenco, Pasemann, Bernoulli (2023)] to cover a large class of semilinear SPDEs and has been examined for the stochastic Burgers equation driven by trace class noise. We extend the achieved results by considering the space-time white noise case, which also has relevant physical motivations. After establishing new regularity results for the solution, we are able to show that our proposed estimator is strongly consistent and asymptotically normal.
- [20] arXiv:2503.09541 [pdf, html, other]
Title: Neural Network-Based Change Point Detection for Large-Scale Time-Evolving Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
The paper studies the problem of detecting and locating change points in multivariate time-evolving data. The problem has a long history in statistics and signal processing and various algorithms have been developed primarily for simple parametric models. In this work, we focus on modeling the data through feed-forward neural networks and develop a detection strategy based on the following two-step procedure. In the first step, the neural network is trained over a prespecified window of the data, and its test error function is calibrated over another prespecified window. Then, the test error function is used over a moving window to identify the change point. Once a change point is detected, the procedure involving these two steps is repeated until all change points are identified. The proposed strategy yields consistent estimates for both the number and the locations of the change points under temporal dependence of the data-generating process. The effectiveness of the proposed strategy is illustrated on synthetic data sets that provide insights on how to select in practice tuning parameters of the algorithm and in real data sets. Finally, we note that although the detection strategy is general and can work with different neural network architectures, the theoretical guarantees provided are specific to feed-forward neural architectures.
New submissions (showing 20 of 20 entries)
- [21] arXiv:2503.08743 (cross-list from cs.SI) [pdf, html, other]
Title: Hard negative sampling in hyperedge prediction
Comments: 24 pages, 8 figures
Subjects: Social and Information Networks (cs.SI); Other Statistics (stat.OT)
A hypergraph, which allows each hyperedge to encompass an arbitrary number of nodes, is a powerful tool for modeling multi-entity interactions. Hyperedge prediction is a fundamental task that aims to predict future hyperedges or identify existent but unobserved hyperedges based on those observed. In link prediction for simple graphs, most observed links are treated as positive samples, while all unobserved links are considered as negative samples. However, this full-sampling strategy is impractical for hyperedge prediction, because the number of unobserved hyperedges in a hypergraph significantly exceeds the number of observed ones. Therefore, one has to utilize some negative sampling method to generate negative samples, ensuring their quantity is comparable to that of positive samples. In current hyperedge prediction, randomly selecting negative samples is a routine practice. However, through experimental analysis, we discover a critical limitation of random selection: the generated negative samples are too easily distinguishable from positive samples. This leads to premature convergence of the model and reduces the accuracy of prediction. To overcome this issue, we propose a novel method to generate negative samples, named hard negative sampling (HNS). Unlike traditional methods that construct negative hyperedges by selecting node sets from the original hypergraph, HNS directly synthesizes negative samples in the hyperedge embedding space, thereby generating more challenging and informative negative samples. Our results demonstrate that HNS significantly enhances both accuracy and robustness of the prediction. Moreover, as a plug-and-play technique, HNS can be easily applied in the training of various hyperedge prediction models based on representation learning.
- [22] arXiv:2503.08756 (cross-list from eess.IV) [pdf, other]
Title: Frequency selection for the diagnostic characterization of human brain tumours
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
The diagnosis of brain tumours is an extremely sensitive and complex clinical task that must rely upon information gathered through non-invasive techniques. One such technique is magnetic resonance, in the modalities of imaging or spectroscopy. The latter provides plenty of metabolic information about the tumour tissue, but its high dimensionality makes resorting to pattern recognition techniques advisable. In this brief paper, an international database of brain tumours is analyzed resorting to an ad hoc spectral frequency selection procedure combined with nonlinear classification.
- [23] arXiv:2503.08760 (cross-list from cs.LG) [pdf, html, other]
Title: Heterogeneous Graph Structure Learning through the Lens of Data-generating Processes
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Inferring the graph structure from observed data is a key task in graph machine learning to capture the intrinsic relationship between data entities. While significant advancements have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns where nodes and edges have multiple types. This paper fills this gap by introducing the first approach for heterogeneous graph structure learning (HGSL). To this end, we first propose a novel statistical model for the data-generating process (DGP) of heterogeneous graph data, namely hidden Markov networks for heterogeneous graphs (H2MN). Then we formalize HGSL as a maximum a posteriori estimation problem parameterized by such a DGP and derive an alternating optimization method to obtain a solution together with a theoretical justification of the optimization conditions. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate that our proposed method excels in learning structure on heterogeneous graphs in terms of edge type identification and edge weight recovery.
- [24] arXiv:2503.08801 (cross-list from cs.LG) [pdf, html, other]
Title: Enhanced Estimation Techniques for Certified Radii in Randomized Smoothing
Comments: IEEE The 8th International Conference on Artificial Intelligence and Big Data (ICAIBD 2025)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper presents novel methods for estimating certified radii in randomized smoothing, a technique crucial for certifying the robustness of neural networks against adversarial perturbations. Our proposed techniques significantly improve the accuracy of certified test-set accuracy by providing tighter bounds on the certified radii. We introduce advanced algorithms for both discrete and continuous domains, demonstrating their effectiveness on CIFAR-10 and ImageNet datasets. The new methods show considerable improvements over existing approaches, particularly in reducing discrepancies in certified radii estimates. We also explore the impact of various hyperparameters, including sample size, standard deviation, and temperature, on the performance of these methods. Our findings highlight the potential for more efficient certification processes and pave the way for future research on tighter confidence sequences and improved theoretical frameworks. The study concludes with a discussion of potential future directions, including enhanced estimation techniques for discrete domains and further theoretical advancements to bridge the gap between empirical and theoretical performance in randomized smoothing.
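For orientation, the quantity being estimated can be sketched with the standard Gaussian-smoothing certificate $R = \sigma\,\Phi^{-1}(\underline{p_A})$, where $\underline{p_A}$ is a lower confidence bound on the top-class probability under noise. This is a minimal sketch of that textbook certificate, not the estimators proposed in the paper; the function name `certified_radius` and its arguments are hypothetical.

```python
from statistics import NormalDist


def certified_radius(p_lower: float, sigma: float) -> float:
    """Return sigma * Phi^{-1}(p_lower), the standard L2 certified radius.

    p_lower: lower confidence bound on the top-class probability (assumed given).
    sigma:   standard deviation of the Gaussian smoothing noise.
    A return value of 0.0 means no radius can be certified (abstain).
    """
    if not 0.0 < p_lower < 1.0:
        raise ValueError("p_lower must lie strictly between 0 and 1")
    if p_lower <= 0.5:
        # The top class is not confidently in the majority, so nothing is certified.
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)
```

A tighter lower bound `p_lower` translates directly into a larger radius, which is why the quality of the confidence bound matters for certified accuracy.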
- [25] arXiv:2503.08870 (cross-list from cs.LG) [pdf, other]
Title: Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study
Authors: Rafael R. Oexner, Robin Schmitt, Hyunchan Ahn, Ravi A. Shah, Anna Zoccarato, Konstantinos Theofilatos, Ajay M. Shah
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Predictive modelling is vital to guide preventive efforts. Whilst large-scale prospective cohort studies and a diverse toolkit of available machine learning (ML) algorithms have facilitated such survival task efforts, choosing the best-performing algorithm remains challenging. Benchmarking studies to date focus on relatively small-scale datasets and it is unclear how well such findings translate to large datasets that combine omics and clinical features. We sought to benchmark eight distinct survival task implementations, ranging from linear to deep learning (DL) models, within the large-scale prospective cohort study UK Biobank (UKB). We compared discrimination and computational requirements across heterogeneous predictor matrices and endpoints. Finally, we assessed how well different architectures scale with sample sizes ranging from n = 5,000 to n = 250,000 individuals. Our results show that discriminative performance across a multitude of metrics is dependent on endpoint frequency and predictor matrix properties, with very robust performance of (penalised) Cox Proportional Hazards (Cox-PH) models. Of note, there are certain scenarios which favour more complex frameworks, specifically if working with larger numbers of observations and relatively simple predictor matrices. The observed computational requirements were vastly different, and we provide solutions in cases where current implementations were impracticable. In conclusion, this work delineates how optimal model choice is dependent on a variety of factors, including sample size, endpoint frequency and predictor matrix properties, thus constituting an informative resource for researchers working on similar datasets. Furthermore, we showcase how linear models still display a highly effective and scalable platform to perform risk modelling at scale and suggest that those are reported alongside non-linear ML models.
- [26] arXiv:2503.08918 (cross-list from cs.LG) [pdf, html, other]
Title: Multilevel Generative Samplers for Investigating Critical Phenomena
Comments: 10 pages, 4 figures (main text); 13th International Conference on Learning Representations (ICLR 2025)
Subjects: Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); Machine Learning (stat.ML)
Investigating critical phenomena or phase transitions is of high interest in physics and chemistry, for which Monte Carlo (MC) simulations, a crucial tool for numerically analyzing macroscopic properties of given systems, are often hindered by an emerging divergence of correlation length -- known as scale invariance at criticality (SIC) in the renormalization group theory. SIC causes the system to behave the same at any length scale, from which many existing sampling methods suffer: long-range correlations cause critical slowing down in Markov chain Monte Carlo (MCMC), and require intractably large receptive fields for generative samplers. In this paper, we propose a Renormalization-informed Generative Critical Sampler (RiGCS) -- a novel sampler specialized for near-critical systems, where SIC is leveraged as an advantage rather than a nuisance. Specifically, RiGCS builds on MultiLevel Monte Carlo (MLMC) with Heat Bath (HB) algorithms, which perform ancestral sampling from low-resolution to high-resolution lattice configurations with site-wise-independent conditional HB sampling. Although MLMC-HB is highly efficient under exact SIC, it suffers from a low acceptance rate under slight SIC violation. Notably, SIC violation always occurs in finite-size systems, and may induce long-range and higher-order interactions in the renormalized distributions, which are not considered by independent HB samplers. RiGCS enhances MLMC-HB by replacing a part of the conditional HB sampler with generative models that capture those residual interactions and improve the sampling efficiency. Our experiments show that the effective sample size of RiGCS is a few orders of magnitude higher than state-of-the-art generative model baselines in sampling configurations for 128x128 two-dimensional Ising systems.
- [27] arXiv:2503.08984 (cross-list from math.PR) [pdf, html, other]
Title: "All-Something-Nothing" Phase Transitions in Planted k-Factor Recovery
Comments: 43 pages, 5 figures
Subjects: Probability (math.PR); Statistics Theory (math.ST)
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdős–Rényi random graph $G(n,\lambda/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.
- [28] arXiv:2503.09069 (cross-list from cs.LG) [pdf, html, other]
Title: Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows
Comments: arXiv admin note: text overlap with arXiv:2410.11261
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Flow matching has emerged as a powerful framework for generative modeling, offering computational advantages over diffusion models by leveraging deterministic Ordinary Differential Equations (ODEs) instead of stochastic dynamics. While prior work established the worst case optimality of standard flow matching under Wasserstein distances, the theoretical guarantees for higher-order flow matching - which incorporates acceleration terms to refine sample trajectories - remain unexplored. In this paper, we bridge this gap by proving that higher-order flow matching preserves worst case optimality as a distribution estimator. We derive upper bounds on the estimation error for second-order flow matching, demonstrating that the convergence rates depend polynomially on the smoothness of the target distribution (quantified via Besov spaces) and key parameters of the ODE dynamics. Our analysis employs neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors across both small and large time intervals, ultimately unifying these results into a general worst case optimal bound for all time steps.
- [29] arXiv:2503.09134 (cross-list from cs.LG) [pdf, html, other]
Title: Clustering by Nonparametric Smoothing
Comments: Under submission for possible publication by IEEE
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from this https URL CNS
- [30] arXiv:2503.09199 (cross-list from cs.LG) [pdf, html, other]
Title: GENEOnet: Statistical analysis supporting explainability and trustworthiness
Authors: Giovanni Bocchi, Patrizio Frosini, Alessandra Micheletti, Alessandro Pedretti, Carmen Gratteri, Filippo Lunghini, Andrea Rosario Beccari, Carmine Talarico
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Applications (stat.AP)
Group Equivariant Non-Expansive Operators (GENEOs) have emerged as mathematical tools for constructing networks for Machine Learning and Artificial Intelligence. Recent findings suggest that such models can be inserted within the domain of eXplainable Artificial Intelligence (XAI) due to their inherent interpretability. In this study, we aim to verify this claim with respect to GENEOnet, a GENEO network developed for an application in computational biochemistry by employing various statistical analyses and experiments. Such experiments first allow us to perform a sensitivity analysis on GENEOnet's parameters to test their significance. Subsequently, we show that GENEOnet exhibits a significantly higher proportion of equivariance compared to other methods. Lastly, we demonstrate that GENEOnet is on average robust to perturbations arising from molecular dynamics. These results collectively serve as proof of the explainability, trustworthiness, and robustness of GENEOnet and confirm the beneficial use of GENEOs in the context of Trustworthy Artificial Intelligence.
- [31] arXiv:2503.09244 (cross-list from cs.CV) [pdf, other]
Title: How To Make Your Cell Tracker Say "I dunno!"
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Cell tracking is a key computational task in live-cell microscopy, but fully automated analysis of high-throughput imaging requires reliable and, thus, uncertainty-aware data analysis tools, as the amount of data recorded within a single experiment exceeds what humans are able to review. We here propose and benchmark various methods to reason about and quantify uncertainty in linear assignment-based cell tracking algorithms. Our methods take inspiration from statistics and machine learning, leveraging two perspectives on the cell tracking problem explored throughout this work: considering it as a Bayesian inference problem and as a classification problem. Our methods admit a framework-like character in that they equip any frame-to-frame tracking method with uncertainty quantification. We demonstrate this by applying it to various existing tracking algorithms including the recently presented Transformer-based trackers. We demonstrate empirically that our methods yield useful and well-calibrated tracking uncertainties.
- [32] arXiv:2503.09287 (cross-list from econ.EM) [pdf, html, other]
Title: On the Wisdom of Crowds (of Economists)
Subjects: Econometrics (econ.EM); Applications (stat.AP)
We study the properties of macroeconomic survey forecast response averages as the number of survey respondents grows. Such averages are "portfolios" of forecasts. We characterize the speed and pattern of the gains from diversification and their eventual decrease with portfolio size (the number of survey respondents) in both (1) the key real-world data-based environment of the U.S. Survey of Professional Forecasters (SPF), and (2) the theoretical model-based environment of equicorrelated forecast errors. We proceed by proposing and comparing various direct and model-based "crowd size signature plots," which summarize the forecasting performance of k-average forecasts as a function of k, where k is the number of forecasts in the average. We then estimate the equicorrelation model for growth and inflation forecast errors by choosing model parameters to minimize the divergence between direct and model-based signature plots. The results indicate near-perfect equicorrelation model fit for both growth and inflation, which we explicate by showing analytically that, under conditions, the direct and fitted equicorrelation model-based signature plots are identical at a particular model parameter configuration, which we characterize. We find that the gains from diversification are greater for inflation forecasts than for growth forecasts, but that both gains nevertheless decrease quite quickly, so that fewer SPF respondents than currently used may be adequate.
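The shape of a model-based crowd size signature plot can be illustrated with the textbook variance of an equal-weight average. This is a minimal sketch assuming zero-mean forecast errors with common variance `sigma2` and common pairwise correlation `rho`; the function name `k_average_mse` and the parameter values are hypothetical, and the paper's exact parametrization may differ.

```python
def k_average_mse(k: int, sigma2: float, rho: float) -> float:
    """MSE of the average of k equicorrelated, zero-mean forecast errors.

    Var(mean) = sigma2 / k + (k - 1) / k * rho * sigma2
              = sigma2 * (1 + (k - 1) * rho) / k
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return sigma2 * (1.0 + (k - 1) * rho) / k


if __name__ == "__main__":
    # Illustrative values only; they are not estimates from the SPF.
    for k in (1, 2, 5, 10, 40):
        print(k, round(k_average_mse(k, sigma2=1.0, rho=0.5), 4))
```

Under these assumptions the MSE falls from `sigma2` at k = 1 toward the floor `rho * sigma2`, so most of the diversification gain is realized at small k when `rho` is sizeable.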
- [33] arXiv:2503.09309 (cross-list from cs.LG) [pdf, html, other]
Title: Steering No-Regret Agents in MFGs under Model Uncertainty
Comments: AISTATS 2025; 34 Pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Incentive design is a popular framework for guiding agents' learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents' behaviors in large-population systems under uncertainty.
- [34] arXiv:2503.09411 (cross-list from cs.LG) [pdf, html, other]
Title: Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization
Comments: 22 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
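The two kinds of annealing schedule mentioned in the abstract can be sketched as follows. This is a minimal illustration assuming the common forms $\eta_t = \tfrac{1}{2}\eta_0(1+\cos(\pi t/T))$ and $\eta_t = \eta_0(1-t/T)^p$; the function names are hypothetical and the paper's exact definitions may differ.

```python
import math


def cosine_lr(t: int, T: int, eta0: float) -> float:
    """Cosine annealing: starts at eta0 and decays to zero at step T."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))


def polynomial_lr(t: int, T: int, eta0: float, p: float) -> float:
    """Polynomial annealing of degree p: eta0 * (1 - t/T)^p."""
    return eta0 * (1.0 - t / T) ** p


if __name__ == "__main__":
    T, eta0 = 1000, 0.1
    for t in (0, 500, 900, 990, 1000):
        print(t, cosine_lr(t, T, eta0), polynomial_lr(t, T, eta0, p=2))
```

Near the end of training the cosine schedule behaves like $\eta_0 (\pi^2/4)(1-t/T)^2$, which follows from a second-order Taylor expansion of the cosine around $\pi$, so under this reading it corresponds to quadratic decay.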
- [35] arXiv:2503.09485 (cross-list from cs.LG) [pdf, html, other]
Title: A Novel Approach for Intrinsic Dimension Estimation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Real-life data have a complex and non-linear structure due to their nature. These non-linearities and the large number of features can usually cause problems such as the empty-space phenomenon and the well-known curse of dimensionality. Finding the nearly optimal representation of the dataset in a lower-dimensional space (i.e. dimensionality reduction) offers an applicable mechanism for improving the success of machine learning tasks. However, estimating the required data dimension for the nearly optimal representation (intrinsic dimension) can be very costly, particularly if one deals with big data. We propose a highly efficient and robust intrinsic dimension estimation approach that only relies on matrix-vector products for dimensionality reduction methods. An experimental study is also conducted to compare the performance of the proposed method with state-of-the-art approaches.
- [36] arXiv:2503.09494 (cross-list from cs.LG) [pdf, html, other]
Title: Representation Retrieval Learning for Heterogeneous Data Integration
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
In the era of big data, large-scale, multi-modal datasets are increasingly ubiquitous, offering unprecedented opportunities for predictive modeling and scientific discovery. However, these datasets often exhibit complex heterogeneity, such as covariate shift, posterior drift, and missing modalities, that can hinder the accuracy of existing prediction algorithms. To address these challenges, we propose a novel Representation Retrieval ($R^2$) framework, which integrates a representation learning module (the representer) with a sparsity-induced machine learning model (the learner). Moreover, we introduce the notion of "integrativeness" for representers, characterized by the effective data sources used in learning representers, and propose a Selective Integration Penalty (SIP) to explicitly improve the property. Theoretically, we demonstrate that the $R^2$ framework relaxes the conventional full-sharing assumption in multi-task learning, allowing for partially shared structures, and that SIP can improve the convergence rate of the excess risk bound. Extensive simulation studies validate the empirical performance of our framework, and applications to two real-world datasets further confirm its superiority over existing approaches.
- [37] arXiv:2503.09565 (cross-list from cs.LG) [pdf, html, other]
Title: Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization
Comments: 29 pages, 5 figures, 2 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
- [38] arXiv:2503.09583 (cross-list from cs.LG) [pdf, other]
Title: Minimax Optimality of the Probability Flow ODE for Diffusion Models
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Score-based diffusion models have become a foundational paradigm for modern generative modeling, demonstrating exceptional capability in generating samples from complex high-dimensional distributions. Despite the dominant adoption of probability flow ODE-based samplers in practice due to their superior sampling efficiency and precision, rigorous statistical guarantees for these methods have remained elusive in the literature. This work develops the first end-to-end theoretical framework for deterministic ODE-based samplers that establishes near-minimax optimal guarantees under mild assumptions on target data distributions. Specifically, focusing on subgaussian distributions with $\beta$-Hölder smooth densities for $\beta\leq 2$, we propose a smooth regularized score estimator that simultaneously controls both the $L^2$ score error and the associated mean Jacobian error. Leveraging this estimator within a refined convergence analysis of the ODE-based sampling process, we demonstrate that the resulting sampler achieves the minimax rate in total variation distance, modulo logarithmic factors. Notably, our theory comprehensively accounts for all sources of error in the sampling process and does not require strong structural conditions such as density lower bounds or Lipschitz/smooth scores on target distributions, thereby covering a broad range of practical data distributions.
Cross submissions (showing 18 of 18 entries)
- [39] arXiv:1810.01683 (replaced) [pdf, html, other]
Title: Safe RuleFit: Learning Optimal Sparse Rule Model by Meta Safe Screening
Journal-ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 2 (2023), pp. 2330-2343
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider the problem of learning a sparse rule model, a prediction model in the form of a sparse linear combination of rules, where a rule is an indicator function defined over a hyper-rectangle in the input space. Since the number of all possible such rules is extremely large, it has been computationally intractable to select the optimal set of active rules. In this paper, to solve this difficulty for learning the optimal sparse rule model, we propose Safe RuleFit (SRF). Our basic idea is to develop meta safe screening (mSS), which is a non-trivial extension of well-known safe screening (SS) techniques. While SS is used for screening out one feature, mSS can be used for screening out multiple features by exploiting the inclusion-relations of hyper-rectangles in the input space. SRF provides a general framework for fitting sparse rule models for regression and classification, and it can be extended to handle more general sparse regularizations such as group regularization. We demonstrate the advantages of SRF through intensive numerical experiments.
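The model class described above, a sparse linear combination of hyper-rectangle indicator rules, can be sketched as follows. This is a minimal illustration assuming closed intervals on every coordinate; `make_rule`, `predict`, and the example weights are hypothetical, and the sketch shows only the prediction side, not the safe screening procedure.

```python
from typing import Callable, Sequence

Rule = Callable[[Sequence[float]], float]


def make_rule(lower: Sequence[float], upper: Sequence[float]) -> Rule:
    """Indicator of the hyper-rectangle [lower_1, upper_1] x ... x [lower_d, upper_d]."""
    def rule(x: Sequence[float]) -> float:
        inside = all(lo <= xi <= hi for lo, xi, hi in zip(lower, x, upper))
        return 1.0 if inside else 0.0
    return rule


def predict(x: Sequence[float], rules: Sequence[Rule],
            weights: Sequence[float], bias: float = 0.0) -> float:
    """Rule model output: bias plus the weights of all rules that fire on x."""
    return bias + sum(w * r(x) for w, r in zip(weights, rules))


if __name__ == "__main__":
    big = make_rule([0.0, 0.0], [1.0, 1.0])
    small = make_rule([0.2, 0.2], [0.5, 0.5])  # contained in `big`
    print(predict([0.3, 0.3], [big, small], [1.0, -0.5], bias=0.1))
```

Because `small` is contained in `big`, `small` can only fire where `big` fires; this is the kind of inclusion relation the abstract says mSS exploits to screen out several rules at once.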
- [40] arXiv:2211.14897 (replaced) [pdf, html, other]
Title: Characterization and Greedy Learning of Gaussian Structural Causal Models under Unknown Interventions
Comments: 60 pages, 13 figures
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
We consider the problem of recovering the causal structure underlying observations from different experimental conditions when the targets of the interventions in each experiment are unknown. We assume a linear structural causal model with additive Gaussian noise and consider interventions that perturb their targets while maintaining the causal relationships in the system. Different models may entail the same distributions, offering competing causal explanations for the given observations. We fully characterize this equivalence class and offer identifiability results, which we use to derive a greedy algorithm called GnIES to recover the equivalence class of the data-generating model without knowledge of the intervention targets. In addition, we develop a novel procedure to generate semi-synthetic data sets with known causal ground truth but distributions closely resembling those of a real data set of choice. We leverage this procedure and evaluate the performance of GnIES on an array of synthetic and semi-synthetic data sets, and real data from a biological system and a tightly controlled physical system. We provide, in the Python packages gnies and sempler, implementations of GnIES and our semi-synthetic data generation procedure.
- [41] arXiv:2211.15072 (replaced) [pdf, html, other]
Title: FaiREE: Fair Classification with Finite-Sample and Distribution-Free Guarantee
Comments: Accepted at ICLR 2023
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Algorithmic fairness plays an increasingly critical role in machine learning research. Several group fairness notions and algorithms have been proposed. However, the fairness guarantee of existing fair classification methods mainly depends on specific data distributional assumptions, often requiring large sample sizes, and fairness could be violated when there is a modest number of samples, which is often the case in practice. In this paper, we propose FaiREE, a fair classification algorithm that can satisfy group fairness constraints with finite-sample and distribution-free theoretical guarantees. FaiREE can be adapted to satisfy various group fairness notions (e.g., Equality of Opportunity, Equalized Odds, Demographic Parity, etc.) and achieve the optimal accuracy. These theoretical guarantees are further supported by experiments on both synthetic and real data. FaiREE is shown to have favorable performance over state-of-the-art algorithms.
- [42] arXiv:2310.15877 (replaced) [pdf, html, other]
Title: Regression analysis of multiplicative hazards model with time-dependent coefficient for sparse longitudinal covariates
Subjects: Methodology (stat.ME)
We study the multiplicative hazards model with intermittently observed longitudinal covariates and time-varying coefficients. For such models, the existing ad hoc approach, such as the last value carried forward, is biased. We propose a kernel weighting approach to get an unbiased estimation of the non-parametric coefficient function and establish asymptotic normality for any fixed time point. Furthermore, we construct the simultaneous confidence band to examine the overall magnitude of the variation. Simulation studies support our theoretical predictions and show favorable performance of the proposed method. A data set from Alzheimer's Disease Neuroimaging Initiative study is used to illustrate our methodology.
- [43] arXiv:2311.18613 (replaced) [pdf, other]
Title: Wasserstein GANs are Minimax Optimal Distribution Estimators
Subjects: Statistics Theory (math.ST)
We provide non-asymptotic rates of convergence of the Wasserstein Generative Adversarial Networks (WGAN) estimator. We build neural network classes representing the generators and discriminators which yield a GAN that achieves the minimax optimal rate for estimating a certain probability measure $\mu$ with support in $\mathbb{R}^p$. The probability $\mu$ is considered to be the push forward of the Lebesgue measure on the $d$-dimensional torus $\mathbb{T}^d$ by a map $g^\star:\mathbb{T}^d\rightarrow \mathbb{R}^p$ of smoothness $\beta+1$. Measuring the error with the $\gamma$-Hölder Integral Probability Metric (IPM), we obtain, up to logarithmic factors, the minimax optimal rate $O(n^{-\frac{\beta+\gamma}{2\beta +d}}\vee n^{-\frac{1}{2}})$ where $n$ is the sample size, $\beta$ determines the smoothness of the target measure $\mu$, $\gamma$ is the smoothness of the IPM ($\gamma=1$ is the Wasserstein case) and $d\leq p$ is the intrinsic dimension of $\mu$. In the process, we derive a sharp interpolation inequality between Hölder IPMs. This novel result in the theory of function spaces generalizes classical interpolation inequalities to the case where the measures involved have densities on different manifolds.
- [44] arXiv:2402.01900 (replaced) [pdf, other]
Title: Distributional Off-policy Evaluation with Bellman Residual Minimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
- [45] arXiv:2404.08717 (replaced) [pdf, other]
Title: State-space systems as dynamic generative models
Journal-ref: Proc. R. Soc. A. 481: 20240308 (2025)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR); Statistics Theory (math.ST)
A probabilistic framework to study the dependence structure induced by deterministic discrete-time state-space systems between input and output processes is introduced. General sufficient conditions are formulated under which output processes exist and are unique once an input process has been fixed, a property that in the deterministic state-space literature is known as the echo state property. When those conditions are satisfied, the given state-space system becomes a generative model for probabilistic dependences between two sequence spaces. Moreover, those conditions guarantee that the output depends continuously on the input when using the Wasserstein metric. The output processes whose existence is proved are shown to be causal in a specific sense and to generalize those studied in purely deterministic situations. The results in this paper constitute a significant stochastic generalization of sufficient conditions for the deterministic echo state property to hold, in the sense that the stochastic echo state property can be satisfied under contractivity conditions that are strictly weaker than those in deterministic situations. This means that state-space systems can induce a purely probabilistic dependence structure between input and output sequence spaces even when there is no functional relation between those two spaces.
- [46] arXiv:2405.20086 (replaced) [pdf, html, other]
-
Title: Analysis of a multi-target linear shrinkage covariance estimator
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Multi-target linear shrinkage is an extension of the standard single-target linear shrinkage for covariance estimation. We combine several constant matrices - the targets - with the sample covariance matrix. We derive the oracle and a \textit{bona fide} multi-target linear shrinkage estimator with exact and empirical mean. In both settings, we prove its convergence towards the oracle under Kolmogorov asymptotics. Finally, we show empirically that it outperforms other standard estimators in various situations.
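The combination step can be sketched as follows. This is a minimal illustration that assumes the estimator is a linear combination in which the sample covariance receives weight one minus the sum of the target weights; the weights are supplied by hand here, whereas the paper derives oracle and bona fide choices. The name `multi_target_shrinkage` is hypothetical.

```python
def multi_target_shrinkage(sample_cov, targets, weights):
    """Combine a sample covariance matrix with several fixed target matrices.

    Returns (1 - sum(weights)) * sample_cov + sum_i weights[i] * targets[i].
    Matrices are nested lists of equal size. The weights are user-chosen
    (an assumption for illustration), not estimated from data.
    """
    if len(targets) != len(weights):
        raise ValueError("need exactly one weight per target")
    p = len(sample_cov)
    w0 = 1.0 - sum(weights)
    estimate = [[w0 * sample_cov[i][j] for j in range(p)] for i in range(p)]
    for w, target in zip(weights, targets):
        for i in range(p):
            for j in range(p):
                estimate[i][j] += w * target[i][j]
    return estimate
```

With a single identity target and weight 0.5, the off-diagonal entries of the sample covariance are halved while the diagonal is pulled towards one.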
- [47] arXiv:2406.05714 (replaced) [pdf, html, other]
-
Title: A conversion theorem and minimax optimality for continuum contextual bandits
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study the contextual continuum bandits problem, where the learner sequentially receives a side information vector and has to choose an action in a convex set, minimizing a function associated with the context. The goal is to minimize all the underlying functions for the received contexts, leading to the contextual notion of regret, which is stronger than the standard static regret. Assuming that the objective functions are $\gamma$-Hölder with respect to the contexts, $0<\gamma\le 1,$ we demonstrate that any algorithm achieving a sub-linear static regret can be extended to achieve a sub-linear contextual regret. We prove a static-to-contextual regret conversion theorem that provides an upper bound for the contextual regret of the output algorithm as a function of the static regret of the input algorithm. We further study the implications of this general result for three fundamental cases of dependency of the objective function on the action variable: (a) Lipschitz bandits, (b) convex bandits, (c) strongly convex and smooth bandits. For Lipschitz bandits and $\gamma=1,$ combining our results with the lower bound of Slivkins (2014), we prove that the minimax optimal contextual regret for the noise-free adversarial setting is achieved. Then, we prove that in the presence of noise, the contextual regret rate as a function of the number of queries is the same for convex bandits as it is for strongly convex and smooth bandits. Lastly, we present a minimax lower bound, implying two key facts. First, obtaining a sub-linear contextual regret may be impossible over functions that are not continuous with respect to the context. Second, for convex bandits and strongly convex and smooth bandits, the algorithms that we propose achieve, up to a logarithmic factor, the minimax optimal rate of contextual regret as a function of the number of queries.
- [48] arXiv:2407.06835 (replaced) [pdf, other]
-
Title: A flexible model for Record Linkage
Comments: Published in JRSSSC in February 2025
Subjects: Methodology (stat.ME); Applications (stat.AP)
Combining data from various sources empowers researchers to explore innovative questions, for example those raised by conducting healthcare monitoring studies. However, the lack of a unique identifier often poses challenges. Record linkage procedures determine whether pairs of observations collected on different occasions belong to the same individual using partially identifying variables (e.g. birth year, postal code). Existing methodologies typically involve a compromise between computational efficiency and accuracy. Traditional approaches simplify this task by condensing information, yet they neglect dependencies among linkage decisions and disregard the one-to-one relationship required to establish coherent links. Modern approaches offer a comprehensive representation of the data generation process, at the expense of computational overhead and reduced flexibility. We propose a flexible method that adapts to varying data complexities, addressing registration errors and accommodating changes of the identifying information over time. Our approach balances accuracy and scalability, estimating the linkage using a Stochastic Expectation Maximisation algorithm on a latent variable model. We illustrate the ability of our methodology to connect observations using large real data applications and demonstrate the robustness of our model to the quality of the linking variables in a simulation study. The proposed algorithm FlexRL is implemented and available in an open source R package.
- [49] arXiv:2410.02951 (replaced) [pdf, html, other]
-
Title: Non-Asymptotic Analysis of Classical Spectrum Estimators with $L$-mixing Time-series Data
Comments: 6 pages, 2 figures, accepted by American Control Conference
Subjects: Statistics Theory (math.ST)
Spectral estimation is a fundamental problem for time series analysis, which is widely applied in economics, speech analysis, seismology, and control systems. The asymptotic convergence theory for classical non-parametric estimators is well understood, but the non-asymptotic theory is still rather limited. Our recent work gave the first non-asymptotic error bounds on the well-known Bartlett and Welch methods, but under restrictive assumptions. In this paper, we derive non-asymptotic error bounds for a class of non-parametric spectral estimators, which includes the classical Bartlett and Welch methods, under the assumption that the data is an $L$-mixing stochastic process. A broad range of processes arising in time-series analysis, such as autoregressive processes and measurements of geometrically ergodic Markov chains, can be shown to be $L$-mixing. In particular, $L$-mixing processes can model a variety of nonlinear phenomena which do not satisfy the assumptions of our prior work. Our new error bounds for $L$-mixing processes match the error bounds in the restrictive settings from prior work up to logarithmic factors.
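For readers unfamiliar with the classical estimators named above, here is a minimal sketch of Bartlett's method: split the series into non-overlapping segments, compute a periodogram for each, and average. The normalisation (squared DFT magnitude divided by segment length, no sampling-frequency factor) is one common convention and is an assumption here; the function names are hypothetical.

```python
import cmath
import math


def periodogram(segment):
    """Raw periodogram |DFT|^2 / n of one segment, via a direct O(n^2) DFT."""
    n = len(segment)
    values = []
    for k in range(n):
        dft_k = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(segment))
        values.append(abs(dft_k) ** 2 / n)
    return values


def bartlett_psd(signal, segment_length):
    """Bartlett estimate: average the periodograms of non-overlapping segments.

    Trailing samples that do not fill a whole segment are dropped.
    """
    n_segments = len(signal) // segment_length
    if n_segments == 0:
        raise ValueError("signal is shorter than one segment")
    total = [0.0] * segment_length
    for m in range(n_segments):
        segment = signal[m * segment_length:(m + 1) * segment_length]
        for k, value in enumerate(periodogram(segment)):
            total[k] += value
    return [value / n_segments for value in total]
```

Welch's method follows the same pattern but uses overlapping, windowed segments; averaging more segments lowers the variance at the cost of frequency resolution.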
- [50] arXiv:2410.22333 (replaced) [pdf, html, other]
-
Title: Hypothesis tests and model parameter estimation on data sets with missing correlation information
Subjects: Methodology (stat.ME); High Energy Physics - Phenomenology (hep-ph); Applications (stat.AP)
Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix between all data points is not always available, either because a result was published without a covariance matrix or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that will behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor to ensure that things remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits, goodness-of-fit tests, and composite hypothesis tests. It then presents some example applications of the methods to real neutrino interaction data and model comparisons.
- [51] arXiv:2411.10153 (replaced) [pdf, html, other]
-
Title: A unifying framework for generalised Bayesian online learning in non-stationary environments
Comments: Published in Transactions on Machine Learning Research (03/2025)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose a unifying framework for methods that perform probabilistic online learning in non-stationary environments. We call the framework BONE, which stands for generalised (B)ayesian (O)nline learning in (N)on-stationary (E)nvironments. BONE provides a common structure to tackle a variety of problems, including online continual learning, prequential forecasting, and contextual bandits. The framework requires specifying three modelling choices: (i) a model for measurements (e.g., a neural network), (ii) an auxiliary process to model non-stationarity (e.g., the time since the last changepoint), and (iii) a conditional prior over model parameters (e.g., a multivariate Gaussian). The framework also requires two algorithmic choices, which we use to carry out approximate inference under this framework: (i) an algorithm to estimate beliefs (posterior distribution) about the model parameters given the auxiliary variable, and (ii) an algorithm to estimate beliefs about the auxiliary variable. We show how the modularity of our framework allows for many existing methods to be reinterpreted as instances of BONE, and it allows us to propose new methods. We experimentally compare existing methods with our proposed new method on several datasets, providing insights into the situations that make each method more suitable for a specific task. We provide a JAX open source library to facilitate the adoption of this framework.
- [52] arXiv:2412.05673 (replaced) [pdf, other]
-
Title: A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO)
This paper introduces a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, effectively balancing quadratic ($\ell_2$) and absolute linear ($\ell_1$) loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data efficiently. The generalized Bayesian approach constructs a working likelihood based on the SPH loss, facilitating efficient and stable estimation while providing rigorous uncertainty quantification for all model parameters. Notably, this approach allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. By specifying appropriate prior distributions for the regression coefficients--such as ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings--the framework ensures principled inference. We establish rigorous theoretical guarantees for accurate parameter estimation and correct predictor selection under sparsity assumptions for a wide range of data generating setups. Extensive simulation studies demonstrate the superior performance of our approach compared to traditional Bayesian regression methods based on $\ell_2$ and $\ell_1$-loss functions. The results highlight its flexibility and robustness, particularly in challenging high-dimensional settings characterized by data contamination.
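To make the quadratic-to-linear behaviour concrete, the sketch below implements the standard pseudo-Huber loss. The paper's scaled variant (SPH) is not specified in the abstract, so the exact scaling used there may differ; the parameter name `delta` and the function name are assumptions.

```python
import math


def pseudo_huber(residual, delta=1.0):
    """Standard pseudo-Huber loss: delta^2 * (sqrt(1 + (r / delta)^2) - 1).

    Behaves like r^2 / 2 for |r| much smaller than delta and like
    delta * (|r| - delta) for |r| much larger than delta.
    Assumption: this is the textbook form, not the paper's scaled version.
    """
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)
```

Because the loss is smooth everywhere, it avoids the kink of the Huber loss while still growing only linearly for large residuals, which is what limits the influence of outliers.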
- [53] arXiv:2412.08934 (replaced) [pdf, html, other]
-
Title: A cheat sheet for probability distributions of orientational data
Comments: Added section 7, improved the experiments description (Sec. 8), fixed typos
Subjects: Methodology (stat.ME); Robotics (cs.RO)
The need for statistical models of orientations arises in many applications in engineering and computer science. Orientational data appear as sets of angles, unit vectors, rotation matrices or quaternions. In the field of directional statistics, a lot of advances have been made in modelling such types of data. However, only a few of these tools are used in engineering and computer science applications. Hence, this paper aims to serve as a cheat sheet for those probability distributions of orientations. Models for 1-DOF, 2-DOF and 3-DOF orientations are discussed. For each of them, expressions for the density function, fitting to data, and sampling are presented. The paper is written with a compromise between engineering and statistics in terms of notation and terminology. A Python library with functions for some of these models is provided. Using this library, two examples of applications to real data are presented.
- [54] arXiv:2412.09539 (replaced) [pdf, html, other]
-
Title: Bayesian nonparametric mixtures of Archimedean copulas
Subjects: Methodology (stat.ME)
Copula-based dependence modeling often relies on parametric formulations. This is mathematically convenient, but can be statistically inefficient when the parametric families are not suitable for the data and model in focus. A Bayesian nonparametric mixture of Archimedean copulas is introduced to increase the flexibility of copula-based dependence modeling. Specifically, the Poisson-Dirichlet process is used as a mixing distribution over the Archimedean copulas' parameter. Properties of the mixture model are studied for the main Archimedean families, and posterior distributions are sampled via their full conditional distributions. Performance of the model is shown via numerical experiments involving simulated and real data.
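As background on the building blocks being mixed, the following sketch draws a pair from a single Clayton copula, one of the main Archimedean families, by conditional inversion. It illustrates one fixed parameter value only; the Poisson-Dirichlet mixing over the parameter described above is not implemented, and the function name `sample_clayton` is hypothetical.

```python
import random


def sample_clayton(theta, rng=random):
    """Draw one pair (u, v) from a bivariate Clayton copula with theta > 0.

    Conditional inversion: draw u and w uniformly, then solve
    dC(u, v)/du = w for v, where C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    # 1 - random() lies in (0, 1], which avoids raising 0 to a negative power.
    u = 1.0 - rng.random()
    w = 1.0 - rng.random()
    inner = u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0
    v = inner ** (-1.0 / theta)
    return u, v
```

For the Clayton family, Kendall's tau equals theta / (theta + 2), so larger values of theta produce more strongly dependent pairs.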
- [55] arXiv:2501.04615 (replaced) [pdf, other]
-
Title: Doubly Robust and Efficient Calibration of Prediction Sets for Censored Time-to-Event Outcomes
Comments: 39 pages, 6 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Our objective is to construct well-calibrated prediction sets for a time-to-event outcome subject to right-censoring with guaranteed coverage. Our approach is inspired by modern conformal inference literature in that, unlike classical frameworks, we obviate the need for a well-specified parametric or semiparametric survival model to accomplish our goal. In contrast to existing conformal prediction methods for survival data, which restrict censoring to be of Type I, whereby potential censoring times are assumed to be fully observed on all units in both training and validation samples, we consider the more common right-censoring setting in which either only the censoring time or only the event time of primary interest is directly observed, whichever comes first. Under a standard conditional independence assumption between the potential survival and censoring times given covariates, we propose and analyze two methods to construct valid and efficient lower predictive bounds for the survival time of a future observation. The proposed methods build upon modern semiparametric efficiency theory for censored data, in that the first approach incorporates inverse-probability-of-censoring weighting to account for censoring, while the second approach is based on augmenting this method with an additional correction term. For both methods, we formally establish asymptotic coverage guarantees and demonstrate, both theoretically and through empirical experiments, that the augmented approach substantially improves efficiency over the inverse-probability-of-censoring weighting method. Specifically, its coverage error bound is of second-order mixed bias type, that is doubly robust, and therefore guaranteed to be asymptotically negligible relative to the coverage error of the non-augmented method.
- [56] arXiv:2502.07641 (replaced) [pdf, html, other]
-
Title: Distributional Instrumental Variable Method
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
The instrumental variable (IV) approach is commonly used to infer causal effects in the presence of unmeasured confounding. Existing methods typically aim to estimate the mean causal effects, whereas a few other methods focus on quantile treatment effects. The aim of this work is to estimate the entire interventional distribution. We propose a method called Distributional Instrumental Variable (DIV), which uses generative modelling in a nonlinear IV setting. We establish identifiability of the interventional distribution under general assumptions and demonstrate an 'under-identified' case, where DIV can identify the causal effects while two-step least squares fails to. Our empirical results show that the DIV method performs well for a broad range of simulated data, exhibiting advantages over existing IV approaches in terms of the identifiability and estimation error of the mean or quantile treatment effects. Furthermore, we apply DIV to an economic data set to examine the causal relation between institutional quality and economic development and our results align well with the original study. We also apply DIV to a single-cell data set, where we study the generalizability and stability in predicting gene expression under unseen interventions. The software implementations of DIV are available in R and Python.
- [57] arXiv:2502.20206 (replaced) [pdf, html, other]
-
Title: On the Glivenko-Cantelli theorem for real-valued empirical functions of stationary $α$-mixing and $β$-mixing sequences
Subjects: Statistics Theory (math.ST)
In this paper we extend the classical Glivenko-Cantelli theorem to real-valued empirical functions under dependence structures characterised by $\alpha$-mixing and $\beta$-mixing conditions. We investigate sufficient conditions ensuring that families of real-valued functions exhibit the Glivenko-Cantelli (GC) property in these dependence settings. Our analysis focuses on function classes satisfying uniform entropy conditions, and we establish conditions on mixing coefficients to obtain GC theorems.
- [58] arXiv:2503.05961 (replaced) [pdf, html, other]
-
Title: Model-based bi-clustering using multivariate Poisson-lognormal with general block-diagonal covariance matrix and its applications
Comments: 39 pages, 15 figures, submitted to The Classification Society Annual Meeting and International Federation of Classification Societies
Subjects: Methodology (stat.ME); Applications (stat.AP)
While several Gaussian mixture model-based biclustering approaches currently exist in the literature for continuous data, approaches to handle discrete data have not been well researched. A multivariate Poisson-lognormal (MPLN) model-based bi-clustering approach that utilizes a block-diagonal covariance structure is introduced to allow for a more flexible structure of the covariance matrix. Two variations of the algorithm are developed where the number of column clusters 1) is assumed equal across groups or 2) can vary across groups. Variational Gaussian approximation is utilized for parameter estimation, and information criteria are used for model selection. The proposed models are investigated in the context of clustering multivariate count data. On simulated data, the models display strong accuracy and computational efficiency; they are also applied to breast cancer RNA-sequence data from The Cancer Genome Atlas.
- [59] arXiv:2503.06001 (replaced) [pdf, html, other]
-
Title: Analyzing the Role of Permutation Invariance in Linear Mode Connectivity
Comments: Accepted at AISTATS 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions -- a phenomenon known as linear mode connectivity (LMC) modulo permutation. This phenomenon has sparked significant attention due to both its theoretical interest and practical relevance in applications such as model merging. In this paper, we provide a fine-grained analysis of this phenomenon for two-layer ReLU networks under a teacher-student setup. We show that as the student network width $m$ increases, the LMC loss barrier modulo permutation exhibits a double descent behavior. Particularly, when $m$ is sufficiently large, the barrier decreases to zero at a rate $O(m^{-1/2})$. Notably, this rate does not suffer from the curse of dimensionality and demonstrates how substantial permutation can reduce the LMC loss barrier. Moreover, we observe a sharp transition in the sparsity of GD/SGD solutions when increasing the learning rate and investigate how this sparsity preference affects the LMC loss barrier modulo permutation. Experiments on both synthetic and MNIST datasets corroborate our theoretical predictions and reveal a similar trend for more complex network architectures.
- [60] arXiv:2503.06770 (replaced) [pdf, html, other]
-
Title: Unique Rashomon Sets for Robust Active Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those appearing uncertain primarily due to noise. Ensemble methods like random forests are a powerful approach to quantifying this uncertainty but do so by aggregating all models indiscriminately. This includes poorly performing models and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, which is the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.
- [61] arXiv:1905.09884 (replaced) [pdf, html, other]
-
Title: Naive Feature Selection: a Nearly Tight Convex Relaxation for Sparse Naive Bayes
Comments: Fixed some cosmetic issues
Journal-ref: Mathematics of Operations Research 49 (1), 278-296, 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our convex relaxation bound becomes tight as the marginal contribution of additional features decreases, using a priori duality gap bounds derived from the Shapley-Folkman theorem. We show how to produce primal solutions satisfying these bounds. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to the classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, $l_1$-penalized logistic regression and LASSO, while being orders of magnitude faster.
- [62] arXiv:2311.01797 (replaced) [pdf, html, other]
-
Title: On the Generalization Properties of Diffusion Models
Comments: Accepted at NeurIPS 2023
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.
- [63] arXiv:2404.17008 (replaced) [pdf, html, other]
-
Title: The TruEnd-procedure: Treating trailing zero-valued balances in credit data
Comments: 21 pages, 7545 words, 10 Figures
Subjects: Risk Management (q-fin.RM); Statistical Finance (q-fin.ST); Applications (stat.AP)
A novel procedure is presented for finding the true but latent endpoints within the repayment histories of individual loans. The monthly observations beyond these true endpoints are false, largely due to operational failures that delay account closure, thereby corrupting some loans. Detecting these false observations is difficult at scale since each affected loan history might have a different sequence of trailing zero (or very small) month-end balances. Identifying these trailing balances requires an exact definition of a "small balance", which our method informs. We demonstrate this procedure and isolate the ideal small-balance definition using South African residential mortgages. Evidently, corrupted loans are remarkably prevalent and have excess histories that are surprisingly long, which ruin the timing of risk events and compromise any subsequent time-to-event model, e.g., survival analysis. Having discarded these excess histories, we demonstrably improve the accuracy of both the predicted timing and severity of risk events, without materially impacting the portfolio. The resulting estimates of credit losses are lower and less biased, which augurs well for raising accurate credit impairments under IFRS 9. Our work therefore addresses a pernicious data error, which highlights the pivotal role of data preparation in producing credible forecasts of credit risk.
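The basic trimming operation that such a procedure relies on can be sketched as below. This is only an illustration of removing trailing small balances for a given threshold; the TruEnd-procedure itself is what informs the choice of the "small balance" definition, and the function name and `threshold` parameter are hypothetical.

```python
def trim_trailing_small_balances(balances, threshold=0.0):
    """Drop the trailing run of month-end balances whose magnitude is <= threshold.

    Small balances that occur before the last larger balance are kept,
    since only the tail of the history is treated as excess.
    Assumption: the threshold is supplied by the user, not estimated.
    """
    end = len(balances)
    while end > 0 and abs(balances[end - 1]) <= threshold:
        end -= 1
    return balances[:end]
```

For a history such as `[100, 50, 0, 20, 0, 0]`, a threshold of 0 keeps the first four observations, while a threshold of 25 keeps only the first two.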
- [64] arXiv:2405.08971 (replaced) [pdf, other]
-
Title: Computation-Aware Kalman Filtering and Smoothing
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Kalman filtering and smoothing are the foundational mechanisms for efficient inference in Gauss-Markov models. However, their time and memory complexities scale prohibitively with the size of the state space. This is particularly problematic in spatiotemporal regression problems, where the state dimension scales with the number of spatial observations. Existing approximate frameworks leverage low-rank approximations of the covariance matrix. But since they do not model the error introduced by the computational approximation, their predictive uncertainty estimates can be overly optimistic. In this work, we propose a probabilistic numerical method for inference in high-dimensional Gauss-Markov models which mitigates these scaling issues. Our matrix-free iterative algorithm leverages GPU acceleration and crucially enables a tunable trade-off between computational cost and predictive uncertainty. Finally, we demonstrate the scalability of our method on a large-scale climate dataset.
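For orientation, here is a minimal sketch of the classical Kalman filter recursion in the simplest possible setting: a scalar random-walk state observed with Gaussian noise. It shows the standard predict and update steps only and is not the computation-aware, matrix-free algorithm proposed in the paper; the model and all names are assumptions chosen for illustration.

```python
def kalman_filter_1d(observations, process_var, obs_var, mean0=0.0, var0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    w_t has variance process_var and v_t has variance obs_var.
    Returns the list of filtered (mean, variance) pairs, one per observation.
    """
    mean, var = mean0, var0
    filtered = []
    for y in observations:
        # Predict: the random walk keeps the mean and inflates the variance.
        var = var + process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered
```

In the multivariate case the variance becomes a covariance matrix of the same dimension as the state, which is the source of the time and memory costs discussed above.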
- [65] arXiv:2405.11284 (replaced) [pdf, html, other]
-
Title: The Logic of Counterfactuals and the Epistemology of Causal Inference
Subjects: Artificial Intelligence (cs.AI); Econometrics (econ.EM); Methodology (stat.ME); Other Statistics (stat.OT)
The 2021 Nobel Prize in Economics recognized an epistemology of causal inference based on the Rubin causal model (Rubin 1974), which merits broader attention in philosophy. This model, in fact, presupposes a logical principle of counterfactuals, Conditional Excluded Middle (CEM), the locus of a pivotal debate between Stalnaker (1968) and Lewis (1973) on the semantics of counterfactuals. Proponents of CEM should recognize that this connection points to a new argument for CEM -- a Quine-Putnam indispensability argument grounded in the Nobel-winning applications of the Rubin model in health and social sciences. To advance the dialectic, I challenge this argument with an updated Rubin causal model that retains its successes while dispensing with CEM. This novel approach combines the strengths of the Rubin causal model and a causal model familiar in philosophy, the causal Bayes net. The takeaway: deductive logic and inductive inference, often studied in isolation, are deeply interconnected.
- [66] arXiv:2408.07533 (replaced) [pdf, html, other]
-
Title: Information-Theoretic Measures on Lattices for High-Order Interactions
Comments: 22 pages, 13 figures, 3 tables
Subjects: Information Theory (cs.IT); Machine Learning (stat.ML)
Traditional measures based solely on pairwise associations often fail to capture the complex statistical structure of multivariate data. Existing approaches for identifying information shared among $d>3$ variables are frequently computationally intractable, asymmetric with respect to a target variable, or unable to account for all the ways in which the joint probability distribution can be factorised. Here we present a systematic framework based on lattice theory to derive higher-order information-theoretic measures for multivariate data. Our construction uses lattice and operator function pairs, whereby an operator function is applied over a lattice that represents the algebraic relationships among variables. We show that many commonly used measures can be derived within this framework, yet they fail to capture all interactions for $d>3$, either because they are defined on restricted sublattices, or because the use of the KL divergence as an operator function, a typical choice, leads to undesired disregard of groups of interactions. To fully characterise all interactions among $d$ variables, we introduce the Streitberg Information, which is defined over the full partition lattice and uses generalised divergences (beyond KL) as operator functions. We validate the Streitberg Information on synthetic data, and illustrate its application in detecting complex interactions among stocks, decoding neural signals, and performing feature selection in machine learning.
- [67] arXiv:2411.01126 (replaced) [pdf, html, other]
-
Title: Axiomatic Explainer Globalness via Optimal Transport
Comments: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Explainability methods are often challenging to evaluate and compare. With a multitude of explainers available, practitioners must often compare and select explainers based on quantitative evaluation metrics. One particular differentiator between explainers is the diversity of explanations for a given dataset; i.e. whether all explanations are identical, unique and uniformly distributed, or somewhere between these two extremes. In this work, we define a complexity measure for explainers, globalness, which enables deeper understanding of the distribution of explanations produced by feature attribution and feature selection methods for a given dataset. We establish the axiomatic properties that any such measure should possess and prove that our proposed measure, Wasserstein Globalness, meets these criteria. We validate the utility of Wasserstein Globalness using image, tabular, and synthetic datasets, empirically showing that it both facilitates meaningful comparison between explainers and improves the selection process for explainability methods.
- [68] arXiv:2411.16370 (replaced) [pdf, html, other]
-
Title: A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image SegmentationComments: 20 pages, revisedSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Advancements in image segmentation play an integral role within the broad scope of Deep Learning-based Computer Vision. Furthermore, their widespread applicability in critical real-world tasks has resulted in challenges related to the reliability of such algorithms. Hence, uncertainty quantification has been extensively studied within this context, enabling the expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision-making. Due to the rapid adoption of Convolutional Neural Network (CNN)-based segmentation models in high-stakes applications, a substantial body of research has been published on this very topic, causing its swift expansion into a distinct field. This work provides a comprehensive overview of probabilistic segmentation by discussing fundamental concepts of uncertainty quantification, governing advancements in the field, and the application to various tasks. Moreover, the literature on both types of uncertainty traces back to four key applications: (1) quantifying statistical inconsistencies in the annotation process due to ambiguous images, (2) correlating prediction error with uncertainty, (3) expanding the model hypothesis space for better generalization, and (4) Active Learning. An extensive discussion follows that includes an overview of the datasets used for each of the applications and an evaluation of the available methods. We also highlight challenges related to architectures, uncertainty quantification methods, standardization and benchmarking, and finally end with recommendations for future work such as methods based on single forward passes and models that appropriately leverage volumetric data.
- [69] arXiv:2411.18752 (replaced) [pdf, html, other]
Title: Locally Differentially Private Online Federated Learning With Correlated Noise
Comments: arXiv admin note: text overlap with arXiv:2403.16542
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
We introduce a locally differentially private (LDP) algorithm for online federated learning that employs temporally correlated noise to improve utility while preserving privacy. To address challenges posed by the correlated noise and local updates with streaming non-IID data, we develop a perturbed iterate analysis that controls the impact of the noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed for several classes of nonconvex loss functions. Subject to an $(\epsilon,\delta)$-LDP budget, we establish a dynamic regret bound that quantifies the impact of key parameters and the intensity of changes in the dynamic environment on the learning performance. Numerical experiments confirm the efficacy of the proposed algorithm.
- [70] arXiv:2501.07964 (replaced) [pdf, html, other]
Title: Derivation of Output Correlation Inferences for Multi-Output (aka Multi-Task) Gaussian Process
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The Gaussian process (GP) is arguably one of the most widely used machine learning algorithms in practice. One of its prominent applications is Bayesian optimization (BO). Although the vanilla GP is already a powerful tool for BO, it is often beneficial to account for the dependencies among multiple outputs. The multi-task GP (MTGP) was formulated for this purpose, but the derivations of its formulations and their gradients are not easy to follow from the previous literature. This paper provides accessible derivations of the MTGP formulations and their gradients.
- [71] arXiv:2503.07565 (replaced) [pdf, html, other]
Title: Inductive Moment Matching
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256x256 with 1.99 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.