Statistics
- [1] arXiv:2405.12984 [pdf, ps, html, other]
Title: Approximation of the Gompertz function with a multilogistic function
Comments: 9 pages, 12 figures
Subjects: Statistics Theory (math.ST)
The paper deals with the comparison of the Gompertz function and the logistic function. We show that the Gompertz function can be approximated with high accuracy by a sum of three logistic functions (multilogistic function). Two of them are increasing and one is decreasing. We use second-order logistic wavelets to estimate the parameters of the multilogistic function.
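As a minimal sketch of the objects being compared, the snippet below defines a Gompertz curve and a "multilogistic" sum of three logistic curves, two increasing and one decreasing. The parameter names and the numeric values in `components` are illustrative placeholders, not fitted values from the paper, and modeling the decreasing component with a negative amplitude is an assumption. The wavelet-based parameter estimation mentioned in the abstract is not reproduced.

```python
import math


def gompertz(t, a=1.0, b=1.0, c=1.0):
    """Gompertz curve a * exp(-b * exp(-c * t))."""
    return a * math.exp(-b * math.exp(-c * t))


def logistic(t, amplitude, rate, midpoint):
    """Logistic curve amplitude / (1 + exp(-rate * (t - midpoint)))."""
    return amplitude / (1.0 + math.exp(-rate * (t - midpoint)))


def multilogistic(t, components):
    """Sum of logistic curves; each component is (amplitude, rate, midpoint)."""
    return sum(logistic(t, *comp) for comp in components)


# Placeholder parameters (hypothetical, NOT taken from the paper):
# two increasing components and one decreasing component (negative amplitude).
components = [(0.7, 1.2, 0.0), (0.4, 0.8, 1.5), (-0.1, 2.0, 0.5)]

# Largest pointwise gap between the two curves on a moderate grid of t values.
grid = [k / 10.0 for k in range(-50, 101)]
max_gap = max(abs(gompertz(t) - multilogistic(t, components)) for t in grid)
```

With fitted rather than placeholder parameters, `max_gap` is the kind of quantity one would inspect to judge the accuracy of the approximation. The amplitudes above are chosen only so that both curves share the same upper asymptote.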
- [2] arXiv:2405.13073 [pdf, ps, html, other]
Title: A graph-structured distance for heterogeneous datasets with meta variables
Comments: 25 pages (without references), 5 figures, data and scripts available at this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Heterogeneous datasets emerge in various machine learning or optimization applications that feature different data sources, various data types and complex relationships between variables. In practice, heterogeneous datasets are often partitioned into smaller well-behaved ones that are easier to process. However, some applications involve expensive-to-generate or limited size datasets, which motivates methods based on the whole dataset. The first main contribution of this work is a graph-structured modeling framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks. This framework models domains that involve heterogeneous datasets in which variables may be continuous, integer, or categorical, with some identified as meta if their values determine the inclusion/exclusion or affect the bounds of other so-called decreed variables. Excluded variables are introduced to manage variables that are either included or excluded depending on the given points. The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables: any pair of points can be compared, which makes it possible to work directly with heterogeneous datasets with meta variables. The contributions are illustrated with some regression experiments, in which the performance of a multilayer perceptron with respect to its hyperparameters is modeled with inverse distance weighting and $K$-nearest neighbors models.
- [3] arXiv:2405.13100 [pdf, ps, html, other]
Title: Better Simulations for Validating Causal Discovery with the DAG-Adaptation of the Onion Method
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI)
The number of artificial intelligence algorithms for learning causal models from data is growing rapidly. Most ``causal discovery'' or ``causal structure learning'' algorithms are primarily validated through simulation studies. However, no widely accepted simulation standards exist and publications often report conflicting performance statistics -- even when only considering publications that simulate data from linear models. In response, several manuscripts have criticized a popular simulation design for validating algorithms in the linear case.
We propose a new simulation design for generating linear models for directed acyclic graphs (DAGs): the DAG-adaptation of the Onion (DaO) method. DaO simulations are fundamentally different from existing simulations because they prioritize the distribution of correlation matrices rather than the distribution of linear effects. Specifically, the DaO method uniformly samples the space of all correlation matrices consistent with (i.e. Markov to) a DAG. We also discuss how to sample DAGs and present methods for generating DAGs with scale-free in-degree or out-degree. We compare the DaO method against two alternative simulation designs and provide implementations of the DaO method in Python and R: this https URL. We advocate for others to adopt DaO simulations as a fair universal benchmark.
- [4] arXiv:2405.13140 [pdf, ps, html, other]
Title: On Convergence of the Alternating Directions SGHMC Algorithm
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR)
We study convergence rates of Hamiltonian Monte Carlo (HMC) algorithms with leapfrog integration under mild conditions on the stochastic gradient oracle for the target distribution (SGHMC). Our method extends standard HMC by allowing the use of general auxiliary distributions, which is achieved by a novel procedure of Alternating Directions.
The convergence analysis is based on the investigation of the Dirichlet forms associated with the underlying Markov chain driving the algorithms. For this purpose, we provide a detailed analysis of the error of the leapfrog integrator for Hamiltonian motions with both the kinetic and potential energy functions in general form. We characterize the explicit dependence of the convergence rates on key parameters such as the problem dimension, functional properties of both the target and auxiliary distributions, and the quality of the oracle.
- [5] arXiv:2405.13149 [pdf, ps, html, other]
Title: Gaussian Measures Conditioned on Nonlinear Observations: Consistency, MAP Estimators, and Simulation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)
The article presents a systematic study of the problem of conditioning a Gaussian random variable $\xi$ on nonlinear observations of the form $F \circ \phi(\xi)$ where $\phi: \mathcal{X} \to \mathbb{R}^N$ is a bounded linear operator and $F$ is nonlinear. Such problems arise in the context of Bayesian inference and recent machine learning-inspired PDE solvers. We give a representer theorem for the conditioned random variable $\xi \mid F\circ \phi(\xi)$, stating that it decomposes as the sum of an infinite-dimensional Gaussian (which is identified analytically) as well as a finite-dimensional non-Gaussian measure. We also introduce a novel notion of the mode of a conditional measure by taking the limit of the natural relaxation of the problem, to which we can apply the existing notion of maximum a posteriori estimators of posterior measures. Finally, we introduce a variant of the Laplace approximation for the efficient simulation of the aforementioned conditioned Gaussian random variables towards uncertainty quantification.
- [6] arXiv:2405.13153 [pdf, ps, html, other]
Title: Max-sliced Wasserstein concentration and uniform ratio bounds of empirical measures on RKHS
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Optimal transport and the Wasserstein distance $\mathcal{W}_p$ have recently seen a number of applications in the fields of statistics, machine learning, data science, and the physical sciences. These applications are however severely restricted by the curse of dimensionality, meaning that the number of data points needed to estimate these problems accurately increases exponentially in the dimension. To alleviate this problem, a number of variants of $\mathcal{W}_p$ have been introduced. We focus here on one of these variants, namely the max-sliced Wasserstein metric $\overline{\mathcal{W}}_p$. This metric reduces the high-dimensional minimization problem given by $\mathcal{W}_p$ to a maximum of one-dimensional measurements in an effort to overcome the curse of dimensionality. In this note we derive concentration results and upper bounds on the expectation of $\overline{\mathcal{W}}_p$ between the true and empirical measure on unbounded reproducing kernel Hilbert spaces. We show that, under quite generic assumptions, probability measures concentrate uniformly fast in one-dimensional subspaces, at (nearly) parametric rates. Our results rely on an improvement of currently known bounds for $\overline{\mathcal{W}}_p$ in the finite-dimensional case.
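The idea of reducing a high-dimensional transport problem to a maximum over one-dimensional measurements can be sketched as follows, in finite-dimensional Euclidean space rather than the RKHS setting of the paper. The function names are hypothetical. The sketch assumes two empirical samples of equal size, for which the one-dimensional $\mathcal{W}_p$ reduces to matching sorted values. It searches only over a user-supplied finite set of unit directions, so it returns a lower bound on the true max-sliced distance.

```python
import math
import random


def wasserstein_1d(u, v, p=2):
    """W_p between two equal-size 1-D empirical samples via sorted matching."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal sample sizes")
    total = sum(abs(a - b) ** p for a, b in zip(sorted(u), sorted(v)))
    return (total / len(u)) ** (1.0 / p)


def project(points, direction):
    """Inner product of every point with a fixed direction."""
    return [sum(x * d for x, d in zip(pt, direction)) for pt in points]


def random_directions(dim, count, seed=0):
    """Unit vectors drawn uniformly on the sphere (normalized Gaussians)."""
    rng = random.Random(seed)
    out = []
    while len(out) < count:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0.0:
            out.append([x / norm for x in g])
    return out


def max_sliced_wasserstein(xs, ys, directions, p=2):
    """Maximum of the 1-D W_p over the given finite set of unit directions."""
    return max(
        wasserstein_1d(project(xs, d), project(ys, d), p) for d in directions
    )
```

For a pure translation of a sample by a vector $v$, every projection is shifted by $\langle v, \theta \rangle$, so the maximum over all unit directions equals $\|v\|$ and is attained at $\theta = v / \|v\|$. This gives a convenient sanity check for the sketch.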
- [7] arXiv:2405.13160 [pdf, ps, html, other]
Title: Borrowing Strength in Distributionally Robust Optimization via Hierarchical Dirichlet Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper presents a novel optimization framework to address key challenges presented by modern machine learning applications: High dimensionality, distributional uncertainty, and data heterogeneity. Our approach unifies regularized estimation, distributionally robust optimization (DRO), and hierarchical Bayesian modeling in a single data-driven criterion. By employing a hierarchical Dirichlet process (HDP) prior, the method effectively handles multi-source data, achieving regularization, distributional robustness, and borrowing strength across diverse yet related data-generating processes. We demonstrate the method's advantages by establishing theoretical performance guarantees and tractable Monte Carlo approximations based on Dirichlet process (DP) theory. Numerical experiments validate the framework's efficacy in improving and stabilizing both prediction and parameter estimation accuracy, showcasing its potential for application in complex data environments.
- [8] arXiv:2405.13266 [pdf, ps, html, other]
Title: Nonparametric estimation of FBSDEs with random terminal time
Subjects: Statistics Theory (math.ST)
This paper investigates the nonparametric estimation of the functional coefficients of the FBSDEs with random terminal time, including the local constant and local linear estimators. We provide complete two-dimensional asymptotics in both the time span and the sampling interval, allowing for the precise characterization of their distribution. Moreover, the empirical likelihood (EL) method to construct the data-driven confidence intervals for these estimators is provided. Some numerical simulations investigate the finite-sample properties of the estimators and compare the performance of the EL method and the conventional method in constructing confidence intervals based on asymptotic normality.
- [9] arXiv:2405.13302 [pdf, ps, html, other]
Title: Accelerated Evaluation of Ollivier-Ricci Curvature Lower Bounds: Bridging Theory and Computation
Subjects: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Optimization and Control (math.OC)
Curvature serves as a potent and descriptive invariant, with its efficacy validated both theoretically and practically within graph theory. We employ a definition of generalized Ricci curvature proposed by Ollivier, which Lin and Yau later adapted to graph theory, known as Ollivier-Ricci curvature (ORC). ORC measures curvature using the Wasserstein distance, thereby integrating geometric concepts with probability theory and optimal transport. Jost and Liu previously discussed the lower bound of ORC by showing the upper bound of the Wasserstein distance. We extend the applicability of these bounds to discrete spaces with metrics on integers, specifically hypergraphs. Compared to prior work on ORC in hypergraphs by Coupette, Dalleiger, and Rieck, which faced computational challenges, our method introduces a simplified approach with linear computational complexity, making it particularly suitable for analyzing large-scale networks. Through extensive simulations and application to synthetic and real-world datasets, we demonstrate the significant improvements our method offers in evaluating ORC.
- [10] arXiv:2405.13342 [pdf, ps, html, other]
Title: Scalable Bayesian inference for heat kernel Gaussian processes on manifolds
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We develop scalable manifold learning methods and theory, motivated by the problem of estimating the manifold of fMRI activation in the Human Connectome Project (HCP). We propose the Fast Graph Laplacian Estimation for Heat Kernel Gaussian Processes (FLGP) in the natural exponential family model. FLGP handles large sample sizes $n$, preserves the intrinsic geometry of data, and significantly reduces computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$ via a novel reduced-rank approximation of the graph Laplacian's transition matrix and truncated Singular Value Decomposition for eigenpair computation. Our numerical experiments demonstrate FLGP's scalability and improved accuracy for manifold learning from large-scale complex data.
- [11] arXiv:2405.13353 [pdf, ps, html, other]
Title: Adaptive Bayesian Multivariate Spline Knot Inference with Prior Specifications on Model Complexity
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
In multivariate spline regression, the number and locations of knots influence the performance and interpretability significantly. However, due to non-differentiability and varying dimensions, there is no desirable frequentist method to make inference on knots. In this article, we propose a fully Bayesian approach for knot inference in multivariate spline regression. The existing Bayesian method often uses BIC to calculate the posterior, but BIC is too liberal and it will heavily overestimate the knot number when the candidate model space is large. We specify a new prior on the knot number to take into account the complexity of the model space and derive an analytic formula in the normal model. In the non-normal cases, we utilize the extended Bayesian information criterion to approximate the posterior density. The samples are simulated in the space with differing dimensions via reversible jump Markov chain Monte Carlo. We apply the proposed method in knot inference and manifold denoising. Experiments demonstrate the splendid capability of the algorithm, especially in function fitting with jumping discontinuity.
- [12] arXiv:2405.13399 [pdf, ps, html, other]
Title: Scalable Bayesian Inference for Bradley--Terry Models with Ties: An Application to Honour Based Abuse
Comments: 17 pages, 6 figures, submitted
Subjects: Applications (stat.AP)
Honour based abuse covers a wide range of family abuse including female genital mutilation and forced marriage. Safeguarding professionals need to identify where abuses are happening in their local community to best support those at risk of these crimes and take preventative action. However, there is little local data about these kinds of crime. To tackle this problem, we ran comparative judgement surveys to map abuses at local level. In previous comparative judgement studies, participants reported fatigue associated with comparisons between areas with similar levels of abuse. Allowing for ties reduces fatigue, but increases the computational complexity when fitting the model. We designed an efficient Markov Chain Monte Carlo algorithm to fit the model, allowing for a wide range of prior distributions on the model parameters. Working with South Yorkshire Police and Oxford Against Cutting, we mapped the risk of honour based abuse at community level in two counties in the UK.
- [13] arXiv:2405.13400 [pdf, ps, html, other]
Title: Ensemble size dependence of the logarithmic score for forecasts issued as multivariate normal distributions
Comments: 24 pages; 7 figures; 4 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
Multivariate probabilistic verification is concerned with the evaluation of joint probability distributions of vector quantities such as a weather variable at multiple locations or a wind vector. The logarithmic score is a proper score that is useful in this context. In order to apply this score to ensemble forecasts, a choice for the density is required. Here, we are interested in the specific case when the density is multivariate normal with mean and covariance given by the ensemble mean and ensemble covariance, respectively. Under the assumptions of multivariate normality and exchangeability of the ensemble members, a relationship is derived which describes how the logarithmic score depends on ensemble size. It makes it possible to estimate the score in the limit of infinite ensemble size from a small ensemble and thus produces a fair logarithmic score for multivariate ensemble forecasts under the assumption of normality. This generalises a study from 2018 which derived the ensemble size adjustment of the logarithmic score in the univariate case.
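For illustration, the unadjusted score described above can be computed from an ensemble as in the following sketch. It assumes the negatively oriented convention (the score is minus the log density, so lower is better) and a sample covariance with divisor $n - 1$. Both are assumptions, as are all function names. The ensemble-size adjustment derived in the paper is not reproduced here.

```python
import math


def ensemble_mean_cov(members):
    """Ensemble mean and sample covariance (divisor n - 1, an assumption)."""
    n, d = len(members), len(members[0])
    mean = [sum(m[j] for m in members) / n for j in range(d)]
    cov = [
        [
            sum((m[a] - mean[a]) * (m[b] - mean[b]) for m in members) / (n - 1)
            for b in range(d)
        ]
        for a in range(d)
    ]
    return mean, cov


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - s
                if diag <= 0.0:
                    raise ValueError("covariance is not positive definite")
                low[i][j] = math.sqrt(diag)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def log_score(members, obs):
    """Minus the log density of obs under N(ensemble mean, ensemble covariance)."""
    mean, cov = ensemble_mean_cov(members)
    low = cholesky(cov)
    d = len(obs)
    resid = [obs[j] - mean[j] for j in range(d)]
    z = []  # forward substitution: solve low @ z = resid
    for i in range(d):
        z.append((resid[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    quad = sum(v * v for v in z)
    return 0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)
```

The Cholesky factorization fails when the ensemble has too few members relative to the dimension, because the sample covariance is then singular. In that case this plug-in density is not defined.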
An application to medium-range forecasts examines the usefulness of the ensemble size adjustments when multivariate normality is only an approximation. Predictions of vectors consisting of several different combinations of upper air variables are considered. Logarithmic scores are calculated for these vectors using ECMWF's daily extended-range forecasts which consist of a 100-member ensemble. The probabilistic forecasts of these vectors are verified against operational ECMWF analyses in the Northern mid-latitudes in autumn 2023. Scores are computed for ensemble sizes from 8 to 100. The fair logarithmic scores of ensembles with different cardinalities are very close, in contrast to the unadjusted scores which decrease considerably with ensemble size. This provides evidence for the practical usefulness of the derived relationships.
- [14] arXiv:2405.13456 [pdf, ps, html, other]
Title: Deep linear networks for regression are implicitly regularized towards flat minima
Comments: 46 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for overdetermined univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
- [15] arXiv:2405.13481 [pdf, ps, html, other]
Title: Locally Private Estimation with Public Features
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We initiate the study of locally differentially private (LDP) learning with public features. We define semi-feature LDP, where some features are publicly available while the remaining ones, along with the label, require protection under local differential privacy. Under semi-feature LDP, we demonstrate that the mini-max convergence rate for non-parametric regression is significantly reduced compared to that of classical LDP. Then we propose HistOfTree, an estimator that fully leverages the information contained in both public and private features. Theoretically, HistOfTree reaches the mini-max optimal convergence rate. Empirically, HistOfTree achieves superior performance on both synthetic and real data. We also explore scenarios where users have the flexibility to select features for protection manually. In such cases, we propose an estimator and a data-driven parameter tuning strategy, leading to analogous theoretical and empirical results.
- [16] arXiv:2405.13531 [pdf, ps, html, other]
Title: A stereographic test of spherical uniformity
Comments: 12 pages, 5 figures, 1 table
Subjects: Statistics Theory (math.ST)
We introduce a test of uniformity for (hyper)spherical data motivated by the stereographic projection. The closed-form expression of the test statistic and its null asymptotic distribution are derived using Gegenbauer polynomials. The power against rotationally symmetric local alternatives is provided, and simulations illustrate the non-null asymptotic results. The stereographic test outperforms other tests in a testing scenario with antipodal dependence.
- [17] arXiv:2405.13537 [pdf, ps, html, other]
Title: Sequential Bayesian inference for stochastic epidemic models of cumulative incidence
Comments: 27 pages
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
Epidemics are inherently stochastic, and stochastic models provide an appropriate way to describe and analyse such phenomena. Given temporal incidence data consisting of, for example, the number of new infections or removals in a given time window, a continuous-time discrete-valued Markov process provides a natural description of the dynamics of each model component, typically taken to be the number of susceptible, exposed, infected or removed individuals. Fitting the SEIR model to time-course data is a challenging problem due to incomplete observations and, consequently, the intractability of the observed data likelihood. Whilst sampling based inference schemes such as Markov chain Monte Carlo are routinely applied, their computational cost typically restricts analysis to data sets of no more than a few thousand infective cases. Instead, we develop a sequential inference scheme that makes use of a computationally cheap approximation of the most natural Markov process model. Crucially, the resulting model allows a tractable conditional parameter posterior which can be summarised in terms of a set of low dimensional statistics. This is used to rejuvenate parameter samples in conjunction with a novel bridge construct for propagating state trajectories conditional on the next observation of cumulative incidence. The resulting inference framework also allows for stochastic infection and reporting rates. We illustrate our approach using synthetic and real data applications.
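The continuous-time, discrete-valued Markov process underlying an SEIR model can be simulated exactly with the Gillespie algorithm, as in the sketch below. This is the standard forward simulator for such a model, not the approximate model or the inference scheme developed in the paper. The rate parameterization (`beta`, `sigma`, `gamma`), the frequency-dependent infection rate, and the choice to count S-to-E transitions as cumulative incidence are all assumptions made for illustration.

```python
import random


def simulate_seir(s, e, i, r, beta, sigma, gamma, t_end, seed=0):
    """Gillespie simulation of an SEIR Markov jump process.

    Returns a list of (time, S, E, I, R, cumulative_infections) tuples,
    one per event, starting with the initial state at time 0.
    """
    rng = random.Random(seed)
    n = s + e + i + r
    t = 0.0
    cumulative = 0
    path = [(t, s, e, i, r, cumulative)]
    while True:
        # Hazards of the three possible transitions in the current state.
        rates = (beta * s * i / n, sigma * e, gamma * i)
        total = sum(rates)
        if total == 0.0:
            break  # no further events are possible
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t > t_end:
            break
        u = rng.random() * total  # pick an event proportionally to its hazard
        if u < rates[0]:
            s, e, cumulative = s - 1, e + 1, cumulative + 1  # infection: S -> E
        elif u < rates[0] + rates[1]:
            e, i = e - 1, i + 1  # onset of infectiousness: E -> I
        else:
            i, r = i - 1, r + 1  # removal: I -> R
        path.append((t, s, e, i, r, cumulative))
    return path


path = simulate_seir(990, 0, 10, 0, beta=0.5, sigma=0.2, gamma=0.1, t_end=50.0, seed=1)
```

Observing only the last component of each tuple at fixed reporting times would give cumulative incidence data of the kind the abstract refers to, with the compartment counts left unobserved.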
- [18] arXiv:2405.13553 [pdf, ps, html, other]
Title: Hidden semi-Markov models with inhomogeneous state dwell-time distributions
Comments: 35 pages, 12 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
The well-established methodology for the estimation of hidden semi-Markov models (HSMMs) as hidden Markov models (HMMs) with extended state spaces is further developed to incorporate covariate influences across all aspects of the state process model, in particular, regarding the distributions governing the state dwell time. The special case of periodically varying covariate effects on the state dwell-time distributions - and possibly the conditional transition probabilities - is examined in detail to derive important properties of such models, namely the periodically varying unconditional state distribution as well as the overall state dwell-time distribution. Through simulation studies, we ascertain key properties of these models and develop recommendations for hyperparameter settings. Furthermore, we provide a case study involving an HSMM with periodically varying dwell-time distributions to analyse the movement trajectory of an arctic muskox, demonstrating the practical relevance of the developed methodology.
- [19] arXiv:2405.13574 [pdf, ps, html, other]
Title: Reinforcement Learning for Adaptive MCMC
Subjects: Computation (stat.CO); Machine Learning (cs.LG)
An informal observation, made by several authors, is that the adaptive design of a Markov transition kernel has the flavour of a reinforcement learning task. Yet, to date it has remained unclear how to actually exploit modern reinforcement learning technologies for adaptive MCMC. The aim of this paper is to set out a general framework, called Reinforcement Learning Metropolis--Hastings, that is theoretically supported and empirically validated. Our principal focus is on learning fast-mixing Metropolis--Hastings transition kernels, which we cast as deterministic policies and optimise via a policy gradient. Control of the learning rate provably ensures conditions for ergodicity are satisfied. The methodology is used to construct a gradient-free sampler that out-performs a popular gradient-free adaptive Metropolis--Hastings algorithm on $\approx 90 \%$ of tasks in the PosteriorDB benchmark.
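For context, the kind of transition kernel that adaptive schemes tune can be sketched as a plain random-walk Metropolis--Hastings sampler with a fixed proposal scale. This is only the non-adaptive baseline in one dimension. The reinforcement learning policy and policy-gradient update of the paper are not reproduced, and the function name and the `step` parameter are hypothetical.

```python
import math
import random


def random_walk_metropolis(log_target, x0, step, n_steps, seed=0):
    """Random-walk Metropolis--Hastings with a symmetric Gaussian proposal.

    Returns the list of chain states and the empirical acceptance rate.
    """
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = x + step * rng.gauss(0.0, 1.0)
        lp_proposal = log_target(proposal)
        # Symmetric proposal: acceptance probability is min(1, pi(y) / pi(x)).
        if rng.random() < math.exp(min(0.0, lp_proposal - lp)):
            x, lp = proposal, lp_proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


# Example target: standard normal, specified up to an additive constant.
samples, rate = random_walk_metropolis(lambda x: -0.5 * x * x, 0.0, 2.4, 20000)
```

The proposal scale `step` is the quantity an adaptive method would adjust during sampling. Here it stays fixed, so the usual ergodicity guarantees of Metropolis--Hastings apply without further conditions.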
- [20] arXiv:2405.13587 [pdf, ps, html, other]
Title: Exact Gradients for Stochastic Spiking Neural Networks Driven by Rough Signals
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
We introduce a mathematically rigorous framework based on rough path theory to model stochastic spiking neural networks (SSNNs) as stochastic differential equations with event discontinuities (Event SDEs) and driven by càdlàg rough paths. Our formalism is general enough to allow for potential jumps to be present both in the solution trajectories as well as in the driving noise. We then identify a set of sufficient conditions ensuring the existence of pathwise gradients of solution trajectories and event times with respect to the network's parameters and show how these gradients satisfy a recursive relation. Furthermore, we introduce a general-purpose loss function defined by means of a new class of signature kernels indexed on càdlàg rough paths and use it to train SSNNs as generative models. We provide an end-to-end autodifferentiable solver for Event SDEs and make its implementation available as part of the $\texttt{diffrax}$ library. Our framework is, to our knowledge, the first enabling gradient-based training of SSNNs with noise affecting both the spike timing and the network's dynamics.
- [21] arXiv:2405.13591 [pdf, ps, html, other]
Title: Running in circles: is practical application feasible for data fission and data thinning in post-clustering differential analysis?
Subjects: Methodology (stat.ME)
The standard pipeline to analyse single-cell RNA sequencing (scRNA-seq) data often involves two steps: clustering and Differential Expression Analysis (DEA) to annotate cell populations based on gene expression. However, using clustering results for data-driven hypothesis formulation compromises statistical properties, especially Type I error control. Data fission was introduced to split the information contained in each observation into two independent parts that can be used for clustering and testing. However, data fission was originally designed for non-mixture distributions, and adapting it for mixtures requires knowledge of the unknown clustering structure to estimate component-specific scale parameters. As components are typically unavailable in practice, scale parameter estimators often exhibit bias. We explicitly quantify how this bias affects the Type I error rate of the subsequent post-clustering differential analysis despite employing data fission. In response, we propose a novel approach that involves modeling each observation as a realization of its distribution, with scale parameters estimated non-parametrically. Simulation studies showcase the efficacy of our method when components are clearly separated. However, the level of separability required to reach good performance presents complexities in its application to real scRNA-seq data.
- [22] arXiv:2405.13621 [pdf, ps, html, other]
Title: Interval identification of natural effects in the presence of outcome-related unmeasured confounding
Comments: 14 pages, 2 figures, 2 tables
Subjects: Methodology (stat.ME)
With reference to a binary outcome and a binary mediator, we derive identification bounds for natural effects under a reduced set of assumptions. Specifically, no assumptions about confounding are made that involve the outcome; we only assume no unobserved exposure-mediator confounding as well as a condition termed partially constant cross-world dependence (PC-CWD), which poses fewer constraints on the counterfactual probabilities than the usual cross-world independence assumption. The proposed strategy can be used also to achieve interval identification of the total effect, which is no longer point identified under the considered set of assumptions. Our derivations are based on postulating a logistic regression model for the mediator as well as for the outcome. However, in both cases the functional form governing the dependence on the explanatory variables is allowed to be arbitrary, thereby resulting in a semi-parametric approach. To account for sampling variability, we provide delta-method approximations of standard errors in order to build uncertainty intervals from identification bounds. The proposed method is applied to a dataset gathered from a Spanish prospective cohort study. The aim is to evaluate whether the effect of smoking on lung cancer risk is mediated by the onset of pulmonary emphysema.
- [23] arXiv:2405.13690 [pdf, ps, html, other]
Title: The effect of regularization in high dimensional Cox regression
Subjects: Statistics Theory (math.ST); Disordered Systems and Neural Networks (cond-mat.dis-nn)
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high dimensional regime, where the Maximum Partial Likelihood estimator (no regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is $O(n^{\alpha})$, with $\alpha \in (0,1)$ and $n$ the sample size. Here we accurately characterize the behaviour of the PMPLE when $s$ is proportional to $n$ via the solution of a system of six non-linear equations that can be easily obtained by fixed point iteration. These equations are derived by means of the replica method and under the assumption that the covariates $\mathbf{X}\in \mathbb{R}^p$ follow a multivariate Gaussian law with covariance $\mathbf{I}_p/p$.
The solution of the previous equations allows us to investigate how various metrics of interest depend on the ratio $\zeta = p/n$, the fraction of true active components $\nu = s/p$, and the regularization strength. We validate our results by extensive numerical simulations.
- [24] arXiv:2405.13731 [pdf, ps, html, other]
Title: Control, Transport and Sampling: Towards Better Loss Design
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Leveraging connections between diffusion-based sampling, optimal transport, and optimal stochastic control through their shared links to the Schrödinger bridge problem, we propose novel objective functions that can be used to transport $\nu$ to $\mu$, and consequently to sample from the target $\mu$, via optimally controlled dynamics. We highlight the importance of the pathwise perspective and the role various optimality conditions on the path measure can play for the design of valid training losses, the careful choice of which offers numerical advantages in practical implementation.
- [25] arXiv:2405.13767 [pdf, ps, other]
Title: Enhancing Dose Selection in Phase I Cancer Trials: Extending the Bayesian Logistic Regression Model with Non-DLT Adverse Events Integration
Comments: 15 pages, 2 tables, 3 figures; submitted to "Statistics in Medicine" journal
Subjects: Methodology (stat.ME)
This paper presents the Burdened Bayesian Logistic Regression Model (BBLRM), an enhancement to the Bayesian Logistic Regression Model (BLRM) for dose-finding in phase I oncology trials. Traditionally, the BLRM determines the maximum tolerated dose (MTD) based on dose-limiting toxicities (DLTs). However, clinicians often perceive model-based designs like BLRM as complex and less conservative than rule-based designs, such as the widely used 3+3 method. To address these concerns, the BBLRM incorporates non-DLT adverse events (nDLTAEs) into the model. These events, although not severe enough to qualify as DLTs, provide additional information suggesting that higher doses might result in DLTs. In the BBLRM, an additional parameter $\delta$ is introduced to account for nDLTAEs. This parameter adjusts the toxicity probability estimates, making the model more conservative in dose escalation. The $\delta$ parameter is derived from the proportion of patients experiencing nDLTAEs within each cohort and is tuned to balance the model's conservatism. This approach aims to reduce the likelihood of assigning toxic doses as MTD while involving clinicians more directly in the decision-making process. The paper includes a simulation study comparing BBLRM with the traditional BLRM across various scenarios. The simulations demonstrate that BBLRM significantly reduces the selection of toxic doses as MTD without compromising, and sometimes even increasing, the accuracy of MTD identification. These results suggest that integrating nDLTAEs into the dose-finding process can enhance the safety and acceptance of model-based designs in phase I oncology trials.
- [26] arXiv:2405.13783 [pdf, ps, html, other]
-
Title: Nonparametric quantile regression for spatio-temporal processes
Comments: 33 pages, 2 figures and accompanying supplementary documentation
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In this paper, we develop a new and effective approach to nonparametric quantile regression that accommodates ultrahigh-dimensional data arising from spatio-temporal processes. This approach proves advantageous in staving off computational challenges that constitute known hindrances to existing nonparametric quantile regression methods when the number of predictors is much larger than the available sample size. We investigate conditions under which estimation is feasible and of good overall quality and obtain sharp approximations that we use to devise statistical inference methodology. These include simultaneous confidence intervals and tests of hypotheses, whose asymptotics is borne by a non-trivial functional central limit theorem tailored to martingale differences. Additionally, we provide finite-sample results through various simulations which, accompanied by an illustrative application to real-worldesque data (on electricity demand), offer guarantees on the performance of the proposed methodology.
- [27] arXiv:2405.13794 [pdf, ps, html, other]
-
Title: Conditioning diffusion models by explicit forward-backward bridging
Comments: 24 pages, 12 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Given an unconditional diffusion model $\pi(x, y)$, using it to perform conditional simulation $\pi(x \mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express conditional simulation as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\pi(x \mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples.
- [28] arXiv:2405.13799 [pdf, ps, html, other]
-
Title: Extending Kernel Testing To General Designs
Comments: 9 pages, 2 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Kernel-based testing has revolutionized the field of non-parametric tests through the embedding of distributions in an RKHS. This strategy has proven to be powerful and flexible, yet its applicability has been limited to the standard two-sample case, while practical situations often involve more complex experimental designs. To extend kernel testing to any design, we propose a linear model in the RKHS that allows for the decomposition of mean embeddings into additive functional effects. We then introduce a truncated kernel Hotelling-Lawley statistic to test the effects of the model, demonstrating that its asymptotic distribution is chi-square, which remains valid with its Nyström approximation. We discuss a homoscedasticity assumption that, although absent in the standard two-sample case, is necessary for general designs. Finally, we illustrate our framework using a single-cell RNA sequencing dataset and provide kernel-based generalizations of classical diagnostic and exploration tools to broaden the scope of kernel testing in any experimental design.
- [29] arXiv:2405.13801 [pdf, ps, html, other]
-
Title: Bayesian Inference Under Differential Privacy: Prior Selection Considerations with Application to Univariate Gaussian Data and Regression
Comments: 9-page main document with 5 figures and a 12-page appendix with 4 figures
Subjects: Methodology (stat.ME); Cryptography and Security (cs.CR)
We describe Bayesian inference for the mean and variance of bounded data protected by differential privacy and modeled as Gaussian. Using this setting, we demonstrate that analysts can and should take the constraints imposed by the bounds into account when specifying prior distributions. Additionally, we provide theoretical and empirical results regarding what classes of default priors produce valid inference for a differentially private release in settings where substantial prior information is not available. We discuss how these results can be applied to Bayesian inference for regression with differentially private data.
- [30] arXiv:2405.13821 [pdf, ps, html, other]
-
Title: Normalizing Basis Functions: Approximate Stationary Models for Large Spatial Data
Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Applications (stat.AP)
In geostatistics, traditional spatial models often rely on the Gaussian Process (GP) to fit stationary covariances to data. It is well known that this approach becomes computationally infeasible when dealing with large data volumes, necessitating the use of approximate methods. A powerful class of methods approximates the GP as a sum of basis functions with random coefficients. Although this technique offers computational efficiency, it does not inherently guarantee a stationary covariance. To mitigate this issue, the basis functions can be "normalized" to maintain a constant marginal variance, avoiding unwanted artifacts and edge effects. This allows for the fitting of nearly stationary models to large, potentially non-stationary datasets, providing a rigorous base to extend to more complex problems. Unfortunately, the process of normalizing these basis functions is computationally demanding. To address this, we introduce two fast and accurate algorithms for the normalization step, allowing for efficient prediction on fine grids. The practical value of these algorithms is showcased in the context of a spatial analysis on a large dataset, where significant computational speedups are achieved. While implementation and testing are done specifically within the LatticeKrig framework, these algorithms can be adapted to other basis function methods operating on regular grids.
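The normalization idea can be illustrated with a minimal sketch. It assumes Gaussian bump basis functions on a 1-D regular grid and independent unit-variance coefficients; both are simplifying assumptions for illustration and not the LatticeKrig construction or the paper's fast algorithms. All function names are hypothetical.

```python
import math

def bump(s, center, width):
    """Gaussian bump basis function (an illustrative choice of basis)."""
    return math.exp(-0.5 * ((s - center) / width) ** 2)

def marginal_variance(s, centers, width):
    # With independent unit-variance coefficients c_j (an assumption),
    # Var(sum_j c_j * phi_j(s)) = sum_j phi_j(s)^2.
    return sum(bump(s, c, width) ** 2 for c in centers)

def normalized_basis(s, centers, width):
    # Divide every basis function by the marginal standard deviation at s,
    # so the resulting process has variance 1 at every location.
    sd = math.sqrt(marginal_variance(s, centers, width))
    return [bump(s, c, width) / sd for c in centers]

centers = [i / 10 for i in range(11)]  # basis centers on a regular grid in [0, 1]
width = 0.1

# Before normalization the variance dips near the boundary (an edge effect).
var_edge = marginal_variance(0.0, centers, width)
var_mid = marginal_variance(0.5, centers, width)

# After normalization the variance is constant across locations.
norm_var_edge = sum(v * v for v in normalized_basis(0.0, centers, width))
norm_var_mid = sum(v * v for v in normalized_basis(0.5, centers, width))
```

In this toy setting the normalization costs one pass over the basis per location; with correlated coefficients the variance becomes a quadratic form in the coefficient covariance, which is presumably where the computational burden described in the abstract comes from.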
- [31] arXiv:2405.13844 [pdf, ps, html, other]
-
Title: Causal Inference with Cocycles
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Many interventions in causal inference can be represented as transformations. We identify a local symmetry property satisfied by a large class of causal models under such interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional and counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show they achieve semiparametric efficiency under typical conditions. Since (infinitely) many distributions can share the same cocycle, these estimators make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.
- [32] arXiv:2405.13846 [pdf, ps, html, other]
-
Title: Regression Trees Know Calculus
Comments: Comments very welcome!
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Regression trees have emerged as a preeminent tool for solving real-world regression problems due to their ability to deal with nonlinearities, interaction effects and sharp discontinuities. In this article, we instead study regression trees applied to well-behaved, differentiable functions, and determine the relationship between node parameters and the local gradient of the function being approximated. We find a simple estimate of the gradient which can be efficiently computed using quantities exposed by popular tree learning libraries. This allows the tools developed in the context of differentiable algorithms, like neural nets and Gaussian processes, to be deployed to tree-based models. To demonstrate this, we study measures of model sensitivity defined in terms of integrals of gradients and demonstrate how to compute them for regression trees using the proposed gradient estimates. Quantitative and qualitative numerical experiments reveal the capability of gradients estimated by regression trees to improve predictive analysis, solve tasks in uncertainty quantification, and provide interpretation of model behavior.
- [33] arXiv:2405.13899 [pdf, ps, html, other]
-
Title: Symmetric Linear Bandits with Hidden Symmetry
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
High-dimensional linear bandits with low-dimensional structure have received considerable attention in recent studies due to their practical significance. The most common structure in the literature is sparsity. However, it may not be available in practice. Symmetry, where the reward is invariant under certain groups of transformations on the set of arms, is another important inductive bias in the high-dimensional case that covers many standard structures, including sparsity. In this work, we study high-dimensional symmetric linear bandits where the symmetry is hidden from the learner, and the correct symmetry needs to be learned in an online setting. We examine the structure of a collection of hidden symmetries and provide a method based on model selection within the collection of low-dimensional subspaces. Our algorithm achieves a regret bound of $O(d_0^{1/3} T^{2/3} \log(d))$, where $d$ is the ambient dimension which is potentially very large, and $d_0$ is the dimension of the true low-dimensional subspace such that $d_0 \ll d$. With an extra assumption on well-separated models, we can further improve the regret to $O(d_0\sqrt{T\log(d)})$.
- [34] arXiv:2405.13912 [pdf, ps, html, other]
-
Title: Matrix Denoising with Doubly Heteroscedastic Noise: Fundamental Limits and Optimal Spectral Methods
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
We study the matrix denoising problem of estimating the singular vectors of a rank-$1$ signal corrupted by noise with both column and row correlations. Existing works are either unable to pinpoint the exact asymptotic estimation error or, when they do so, the resulting approaches (e.g., based on whitening or singular value shrinkage) remain vastly suboptimal. On top of this, most of the literature has focused on the special case of estimating the left singular vector of the signal when the noise only possesses row correlation (one-sided heteroscedasticity). In contrast, our work establishes the information-theoretic and algorithmic limits of matrix denoising with doubly heteroscedastic noise. We characterize the exact asymptotic minimum mean square error, and design a novel spectral estimator with rigorous optimality guarantees: under a technical condition, it attains positive correlation with the signals whenever information-theoretically possible and, for one-sided heteroscedasticity, it also achieves the Bayes-optimal error. Numerical experiments demonstrate the significant advantage of our theoretically principled method over the state of the art. The proofs draw connections with statistical physics and approximate message passing, departing drastically from standard random matrix theory techniques.
- [35] arXiv:2405.13926 [pdf, ps, html, other]
-
Title: Some models are useful, but for how long?: A decision theoretic approach to choosing when to refit large-scale prediction models
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
Large-scale prediction models (typically using tools from artificial intelligence, AI, or machine learning, ML) are increasingly ubiquitous across a variety of industries and scientific domains. Such methods are often paired with detailed data from sources such as electronic health records, wearable sensors, and omics data (high-throughput technology used to understand biology). Despite their utility, implementing AI and ML tools at the scale necessary to work with this data introduces two major challenges. First, it can cost tens of thousands of dollars to train a modern AI/ML model at scale. Second, once the model is trained, its predictions may become less relevant as patient and provider behavior change, and predictions made for one geographical area may be less accurate for another. These two challenges raise a fundamental question: how often should you refit the AI/ML model to optimally trade off between cost and relevance? Our work provides a framework for making decisions about when to refit AI/ML models when the goal is to maintain valid statistical inference (e.g. estimating a treatment effect in a clinical trial). Drawing on portfolio optimization theory, we treat the decision of recalibrating versus refitting the model as a choice between "investing" in one of two "assets." One asset, recalibrating the model based on another model, is quick and relatively inexpensive but bears uncertainty from sampling and the possibility that the other model is not relevant to current circumstances. The other asset, refitting the model, is costly but removes the irrelevance concern (though not the risk of sampling error). We explore the balancing act between these two potential investments in this paper.
- [36] arXiv:2405.13940 [pdf, ps, html, other]
-
Title: High-dimensional (Group) Adversarial Training in Linear Regression
Subjects: Statistics Theory (math.ST)
Adversarial training can achieve robustness against adversarial perturbations and has been widely used in machine learning models. This paper delivers a non-asymptotic consistency analysis of the adversarial training procedure under $\ell_\infty$-perturbation in high-dimensional linear regression. It will be shown that the associated convergence rate of prediction error can achieve the minimax rate up to a logarithmic factor in the high-dimensional linear regression on the class of sparse parameters. Additionally, the group adversarial training procedure is analyzed. Compared with classic adversarial training, it will be proved that the group adversarial training procedure enjoys a better prediction error upper bound under certain group-sparsity patterns.
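As a worked step, the inner maximization in adversarial training for linear regression under an $\ell_\infty$-perturbation has a well-known closed form. The notation below (a single observation $(x, y)$, coefficient vector $\beta$, perturbation $\delta$, radius $\varepsilon$) is assumed for illustration and is not taken from the abstract.

```latex
% Worst-case squared loss for one observation under an l_inf-bounded perturbation.
% By Holder duality, \delta^\top \beta ranges over
% [-\varepsilon \|\beta\|_1, \varepsilon \|\beta\|_1] when \|\delta\|_\infty \le \varepsilon.
\[
  \max_{\|\delta\|_\infty \le \varepsilon}
  \bigl( y - (x + \delta)^\top \beta \bigr)^2
  = \bigl( |y - x^\top \beta| + \varepsilon \|\beta\|_1 \bigr)^2 .
\]
```

The $\ell_1$ norm appearing on the right-hand side is one way to see why a connection to sparse estimation is plausible in this setting.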
- [37] arXiv:2405.13950 [pdf, ps, html, other]
-
Title: Actor-critic algorithms for fiber sampling problems
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose an actor-critic algorithm for a family of complex problems arising in algebraic statistics and discrete optimization. The core task is to produce a sample from a finite subset of the non-negative integer lattice defined by a high-dimensional polytope. We translate the problem into a Markov decision process and devise an actor-critic reinforcement learning (RL) algorithm to learn a set of good moves that can be used for sampling. We prove that the actor-critic algorithm converges to an approximately optimal sampling policy.
To tackle complexity issues that typically arise in these sampling problems, and to allow the RL to function at scale, our solution strategy takes three steps: decomposing the starting point of the sample, using RL on each induced subproblem, and reconstructing to obtain a sample in the original polytope. In this setup, the proof of convergence applies to each subproblem in the decomposition.
We test the method in two regimes. In statistical applications, a high-dimensional polytope arises as the support set for the reference distribution in a model/data fit test for a broad family of statistical models for categorical data. We demonstrate how RL can be used for model fit testing problems for data sets for which traditional MCMC samplers converge too slowly due to problem size and sparsity structure. To test the robustness of the algorithm and explore its generalization properties, we apply it to synthetically generated data of various sizes and sparsity levels.
- [38] arXiv:2405.13962 [pdf, ps, html, other]
-
Title: Learning heavy-tailed distributions with Wasserstein-proximal-regularized $\alpha$-divergences
Comments: 23 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we propose Wasserstein proximals of $\alpha$-divergences as suitable objective functionals for learning heavy-tailed distributions in a stable manner. First, we provide sufficient, and in some cases necessary, relations among data dimension, $\alpha$, and the decay rate of data distributions for the Wasserstein-proximal-regularized divergence to be finite. Finite-sample convergence rates for the estimation in the case of the Wasserstein-1 proximal divergences are then provided under certain tail conditions. Numerical experiments demonstrate stable learning of heavy-tailed distributions -- even those without first or second moment -- without any explicit knowledge of the tail behavior, using suitable generative models such as GANs and flow-based models related to our proposed Wasserstein-proximal-regularized $\alpha$-divergences. Heuristically, $\alpha$-divergences handle the heavy tails and Wasserstein proximals allow non-absolute continuity between distributions and control the velocities of flow-based algorithms as they learn the target distribution deep into the tails.
- [39] arXiv:2405.13970 [pdf, ps, html, other]
-
Title: Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces
Subjects: Methodology (stat.ME)
Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees in the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different functional objects, such as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics.
- [40] arXiv:2405.13997 [pdf, ps, html, other]
-
Title: Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Comments: 31 pages, 2 figures. arXiv admin note: text overlap with arXiv:2402.02952
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that sigmoid gating, in fact, enjoys a higher sample efficiency than softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case in which the number of experts fitted is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate identifiability conditions for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under sigmoid gating than softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
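The "competition among experts" under softmax gating can be seen in a small sketch. The function names and score values below are hypothetical, and the example only contrasts the two gating maps; it does not reproduce the paper's estimators or rates.

```python
import math

def softmax_gates(scores):
    """Softmax gating: weights are coupled and always sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_gates(scores):
    """Sigmoid gating: each expert's weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [2.0, 1.0, -1.0]
boosted = [2.0, 4.0, -1.0]  # only the second expert's score is raised

soft_before, soft_after = softmax_gates(scores), softmax_gates(boosted)
sig_before, sig_after = sigmoid_gates(scores), sigmoid_gates(boosted)
```

Raising one expert's score lowers the first expert's softmax weight even though its own score is unchanged, whereas its sigmoid weight stays the same.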
- [41] arXiv:2405.14038 [pdf, ps, html, other]
-
Title: FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear Bandits
Comments: 28 pages, 1 figure
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, ForgetfuL Iterative Private HArd Thresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of the Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret up to logarithmic factors. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of independent interest.
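A rough sketch of a noisy iterative hard thresholding loop for sparse least squares follows. It is a generic illustration, not FLIPHAT or the paper's N-IHT variant: the function names are hypothetical, and `noise_sd` is a placeholder that is not calibrated to any privacy budget (a differentially private version would also need clipping and noise scaled to the sensitivity).

```python
import random

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:s])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def noisy_iht(X, y, s, step, n_iter, noise_sd, rng):
    """Gradient step on the squared loss, plus Gaussian noise, then top-s projection."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(n_iter):
        resid = [sum(X[i][j] * theta[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        proposal = [theta[j] - step * grad[j] + rng.gauss(0.0, noise_sd)
                    for j in range(d)]
        theta = hard_threshold(proposal, s)
    return theta

# Toy design: X = 2 * identity, so X^T X / n is the identity matrix.
X = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [6.0, 0.2, -4.0, 0.1]
theta_hat = noisy_iht(X, y, s=2, step=1.0, n_iter=5, noise_sd=0.0,
                      rng=random.Random(0))
```

With the noise switched off, the toy run keeps the two dominant coordinates and zeroes the two small ones.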
- [42] arXiv:2405.14048 [pdf, ps, html, other]
-
Title: fsemipar: an R package for SoF semiparametric regression
Comments: 41 pages, 7 figures, 11 tables
Subjects: Methodology (stat.ME); Computation (stat.CO)
Functional data analysis has become a tool of interest in applied areas such as economics, medicine, and chemistry. Among the techniques developed in recent literature, functional semiparametric regression stands out for its balance between flexible modelling and output interpretation. Despite the large variety of research papers dealing with scalar-on-function (SoF) semiparametric models, there is a notable gap in software tools for their implementation. This article introduces the R package fsemipar, tailored for these models. fsemipar not only estimates functional single-index models using kernel smoothing techniques but also estimates and selects relevant scalar variables in semi-functional models with multivariate linear components. A standout feature is its ability to identify impact points of a curve on the response, even in models with multiple functional covariates, and to integrate both continuous and pointwise effects of functional predictors within a single model. In addition, it allows the use of location-adaptive estimators based on the $k$-nearest-neighbours approach for all the semiparametric models included. Its flexible interface empowers users to customise a wide range of input parameters and includes the standard S3 methods for prediction, statistical analysis, and estimate visualization (predict, summary, print, and plot), enhancing clear result interpretation. Throughout the article, we illustrate the functionalities and the practicality of fsemipar using two chemometric datasets.
- [43] arXiv:2405.14064 [pdf, ps, html, other]
-
Title: Building a stable classifier with the inflated argmax
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
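The instability of taking the maximizer can be shown with a toy example. The set-valued rule below is a simple margin-based stand-in chosen for illustration; it is not the paper's inflated argmax, whose definition the abstract does not spell out. The names `margin_argmax` and `eps` are hypothetical.

```python
def argmax(scores):
    """Index of the highest score (ties broken by the lowest index)."""
    return max(range(len(scores)), key=lambda k: scores[k])

def margin_argmax(scores, eps):
    """Illustrative set-valued relaxation: every label within eps of the top score."""
    top = max(scores)
    return [k for k, s in enumerate(scores) if s >= top - eps]

# Two score vectors that differ only by a slight perturbation.
scores_a = [0.50, 0.49, 0.01]
scores_b = [0.49, 0.50, 0.01]

label_a, label_b = argmax(scores_a), argmax(scores_b)
set_a = margin_argmax(scores_a, eps=0.05)
set_b = margin_argmax(scores_b, eps=0.05)
```

The single-label output flips between the two inputs, while the candidate sets agree, which is the kind of stability a set-valued relaxation is meant to provide.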
- [44] arXiv:2405.14131 [pdf, ps, html, other]
-
Title: Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of Experts
Comments: 44 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The cosine router in sparse Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in sparse MoE has been lacking. Considering the least squares estimation of the cosine routing sparse MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$ where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by a technique widely used in practice to stabilize the cosine router -- simply adding noise to the $\mathbb{L}_{2}$ norms in the cosine router, which we refer to as the perturbed cosine router. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing sparse MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
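A minimal sketch of a cosine routing score with perturbed norms is given below. The exact form of the perturbation is an assumption (constants `tau_x` and `tau_w` added to the two norms, both hypothetical names), and the scale-invariance check is one plausible illustration of parameter interaction, not a statement of the paper's argument.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_score(x, w, tau_x=0.0, tau_w=0.0):
    """Routing score <x, w> / ((||x|| + tau_x) * (||w|| + tau_w)).

    tau_x = tau_w = 0 gives the plain cosine router; positive values
    perturb the L2 norms (an assumed form of the perturbation).
    """
    dot = sum(a * b for a, b in zip(x, w))
    return dot / ((l2_norm(x) + tau_x) * (l2_norm(w) + tau_w))

x = [1.0, 2.0]
w = [3.0, 4.0]
w_scaled = [3.0 * a for a in w]

# Plain cosine score cannot distinguish w from a rescaled copy of w.
plain = cosine_score(x, w)
plain_scaled = cosine_score(x, w_scaled)

# With perturbed norms the rescaled parameter yields a different score.
pert = cosine_score(x, w, tau_x=1.0, tau_w=1.0)
pert_scaled = cosine_score(x, w_scaled, tau_x=1.0, tau_w=1.0)
```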
- [45] arXiv:2405.14145 [pdf, ps, html, other]
-
Title: Generalised Bayes Linear Inference
Comments: Submitted to the Journal of the Royal Statistical Society: Series B
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Motivated by big data and the vast parameter spaces in modern machine learning models, optimisation approaches to Bayesian inference have seen a surge in popularity in recent years. In this paper, we address the connection between the popular new methods termed generalised Bayesian inference and Bayes linear methods. We propose a further generalisation to Bayesian inference that unifies these and other recent approaches by considering the Bayesian inference problem as one of finding the closest point in a particular solution space to a data generating process, where these notions differ depending on user-specified geometries and foundational belief systems. Motivated by this framework, we propose a generalisation to Bayes linear approaches that enables fast and principled inferences that obey the coherence requirements implied by domain restrictions on random quantities. We demonstrate the efficacy of generalised Bayes linear inference on a number of examples, including monotonic regression and inference for spatial counts. This paper is accompanied by an R package available at this http URL.
- [46] arXiv:2405.14149 [pdf, ps, html, other]
-
Title: A Direct Importance Sampling-based Framework for Rare Event Uncertainty Quantification in Non-Gaussian Spaces
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
This work introduces a novel framework for precisely and efficiently estimating rare event probabilities in complex, high-dimensional non-Gaussian spaces, building on our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach. An unnormalized sampling target is first constructed and sampled, relaxing the optimal importance sampling distribution and appropriately designed for non-Gaussian spaces. Post-sampling, its normalizing constant is estimated using a stable inverse importance sampling procedure, employing an importance sampling density based on the already available samples. The sought probability is then computed based on the estimates evaluated in these two stages. The proposed estimator is theoretically analyzed, proving its unbiasedness and deriving its analytical coefficient of variation. To sample the constructed target, we resort to our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) and we prove that it converges to the correct stationary target distribution. To avoid the challenging task of tuning the trajectory length in complex spaces, QNp-HMCMC is effectively utilized in this work with a single-step integration. We thus show the equivalence of QNp-HMCMC with single-step implementation to a unique and efficient preconditioned Metropolis-adjusted Langevin algorithm (MALA). An optimization approach is also leveraged to initiate QNp-HMCMC effectively, and the implementation of the developed framework in bounded spaces is eventually discussed. A series of diverse problems involving high dimensionality (several hundred inputs), strong nonlinearity, and non-Gaussianity is presented, showcasing the capabilities and efficiency of the suggested framework and demonstrating its advantages compared to relevant state-of-the-art sampling methods.
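For readers unfamiliar with importance sampling for rare events, the basic idea can be sketched in a textbook setting. The example below estimates a Gaussian tail probability with a mean-shifted proposal; it is a generic illustration and not the ASTPA framework, the inverse importance sampling step, or QNp-HMCMC. The function name is hypothetical.

```python
import math
import random

def rare_event_is(threshold, n_samples, rng):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by sampling from N(threshold, 1)."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Likelihood ratio N(x; 0, 1) / N(x; threshold, 1).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples

estimate = rare_event_is(4.0, 20000, random.Random(0))
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form Gaussian tail
```

Plain Monte Carlo with the same number of draws would typically see no exceedances at all, since the target probability is on the order of 1e-5.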
- [47] arXiv:2405.14166 [pdf, ps, html, other]
-
Title: Optimal Bayesian predictive probability for delayed response in single-arm clinical trials with binary efficacy outcome
Comments: 20 pages, 2 tables, 1 figure
Subjects: Methodology (stat.ME)
In oncology, phase II or multiple expansion cohort trials are crucial for clinical development plans. This is because they aid in identifying potent agents with sufficient activity to continue development and confirm the proof of concept. Typically, these clinical trials are single-arm trials, with the primary endpoint being short-term treatment efficacy. Despite the development of several well-designed methodologies, there may be a practical impediment in that the endpoints may not be observed quickly enough for adaptive go/no-go decisions to be made in a timely manner at each interim monitoring. Specifically, the Response Evaluation Criteria in Solid Tumors (RECIST) guideline defines a confirmed response and necessitates it in non-randomized trials, where the response is the primary endpoint. However, obtaining the confirmed outcome from all participants entered at interim monitoring may be time-consuming as non-responders should be followed up until the disease progresses. Thus, this study proposed an approach to accelerate the decision-making process that incorporated the outcome without confirmation by discounting its contribution to the decision-making framework using the generalized Bayes' theorem. Further, the behavior of the proposed approach was evaluated through a simple simulation study. The results demonstrated that the proposed approach made appropriate interim go/no-go decisions.
- [48] arXiv:2405.14208 [pdf, ps, html, other]
-
Title: An Empirical Comparison of Methods to Produce Business Statistics Using Non-Probability Data
Subjects: Methodology (stat.ME)
There is a growing trend among statistical agencies to explore non-probability data sources for producing more timely and detailed statistics, while reducing costs and respondent burden. Coverage and measurement error are two issues that may be present in such data. The imperfections may be corrected using available information relating to the population of interest, such as a census or a reference probability sample.
In this paper, we compare a wide range of existing methods for producing population estimates using a non-probability dataset through a simulation study based on a realistic business population. The study was conducted to examine the performance of the methods under different missingness and data quality assumptions. The results confirm the ability of the methods examined to address selection bias. When no measurement error is present in the non-probability dataset, a screening dual-frame approach for the probability sample tends to yield lower sample size and mean squared error results. The presence of measurement error and/or nonignorable missingness increases mean squared errors for estimators that depend heavily on the non-probability data. In this case, the best approach tends to be to fall back to a model-assisted estimator based on the probability sample.
- [49] arXiv:2405.14285 [pdf, ps, html, other]
-
Title: Computing the Bias of Constant-step Stochastic Approximation with Markovian Noise
Comments: Preprint
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We study stochastic approximation algorithms with Markovian noise and constant step-size $\alpha$. We develop a method based on infinitesimal generator comparisons to study the bias of the algorithm, which is the expected difference between $\theta_n$ -- the value at iteration $n$ -- and $\theta^*$ -- the unique equilibrium of the corresponding ODE. We show that, under some smoothness conditions, this bias is of order $O(\alpha)$. Furthermore, we show that the time-averaged bias is equal to $\alpha V + O(\alpha^2)$, where $V$ is a constant characterized by a Lyapunov equation, showing that $\mathbb{E}[\bar{\theta}_n] \approx \theta^*+V\alpha + O(\alpha^2)$, where $\bar{\theta}_n=(1/n)\sum_{k=1}^n\theta_k$ is the Polyak-Ruppert average. We also show that $\bar{\theta}_n$ converges with high probability around $\theta^*+\alpha V$. We illustrate how to combine this with Richardson-Romberg extrapolation to derive an iterative scheme with a bias of order $O(\alpha^2)$.
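To see why the extrapolation removes the first-order bias, the following worked step may help. It assumes two runs with step sizes $\alpha$ and $2\alpha$ whose averages are written $\bar{\theta}_n^{(\alpha)}$ and $\bar{\theta}_n^{(2\alpha)}$ (notation introduced here, not taken from the abstract).

```latex
\begin{aligned}
\mathbb{E}\big[\bar{\theta}_n^{(\alpha)}\big]  &\approx \theta^* + \alpha V + O(\alpha^2), \\
\mathbb{E}\big[\bar{\theta}_n^{(2\alpha)}\big] &\approx \theta^* + 2\alpha V + O(\alpha^2), \\
\mathbb{E}\big[\,2\bar{\theta}_n^{(\alpha)} - \bar{\theta}_n^{(2\alpha)}\big]
  &\approx 2\theta^* + 2\alpha V - \theta^* - 2\alpha V + O(\alpha^2)
   = \theta^* + O(\alpha^2).
\end{aligned}
```

The linear term $\alpha V$ cancels because the same constant $V$ appears in both expansions.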
- [50] arXiv:2405.14335 [pdf, ps, html, other]
-
Title: Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The bound for LS is provably tighter than all its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.
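As an illustration of what logarithmically smoothing large importance weights can look like, here is a minimal sketch. It assumes costs in $[-1, 0]$ and a smoothing parameter `lam`; the function names and the exact form are assumptions made for this example and should be checked against the paper before use.

```python
import math

def ips_risk(weights, costs):
    """Plain importance-weighted (IPS) risk estimate: mean of w_i * c_i."""
    return sum(w * c for w, c in zip(weights, costs)) / len(weights)

def ls_risk(weights, costs, lam):
    """Log-smoothed risk: -(1 / (n * lam)) * sum(log(1 - lam * w_i * c_i)).

    Assumes w_i >= 0 and c_i in [-1, 0], so each log argument is >= 1.
    As lam -> 0 the estimate approaches the IPS estimate.
    """
    if lam <= 0:
        return ips_risk(weights, costs)
    total = sum(math.log1p(-lam * w * c) for w, c in zip(weights, costs))
    return -total / (lam * len(weights))

# Hypothetical logged data: one very large importance weight.
weights = [1.0, 50.0, 0.5]
costs = [-1.0, -1.0, 0.0]
smoothed = ls_risk(weights, costs, lam=0.1)
plain = ips_risk(weights, costs)
```

Because $\log(1+u) \le u$ for $u \ge 0$, the smoothed estimate is never below the IPS estimate under these assumptions, which is the pessimistic direction for negative costs.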
- [51] arXiv:2405.14374 [pdf, ps, html, other]
-
Title: State-Constrained Offline Reinforcement Learning
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Traditional offline reinforcement learning methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the algorithm greatly. In this paper, we alleviate this limitation by introducing a novel framework named \emph{state-constrained} offline reinforcement learning. By exclusively focusing on the dataset's state distribution, our framework significantly enhances learning potential and reduces previous limitations. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline reinforcement learning. Our research is underpinned by solid theoretical findings that pave the way for subsequent advancements in this domain. Additionally, we introduce StaCQ, a deep learning algorithm that is both performance-driven on the D4RL benchmark datasets and closely aligned with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in state-constrained offline reinforcement learning.
- [52] arXiv:2405.14392 [pdf, ps, html, other]
-
Title: Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing Flows
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Continuous normalizing flows (CNFs) learn the probability path between a reference and a target density by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we re-purpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective and using the learned probability path to improve Monte Carlo sampling. We propose a sequential method, which uses samples from a Markov chain to fix the probability path defining the FM objective. We augment this scheme with an adaptive tempering mechanism that allows the discovery of multiple modes in the target. Under mild assumptions, we establish convergence to a local optimum of the FM objective, discuss improvements in the convergence rate, and illustrate our methods on synthetic and real-world examples.
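For orientation, a commonly used conditional form of the flow matching objective is sketched below. It assumes a linear interpolation path between a reference draw $x_0 \sim p_0$ and a target draw $x_1 \sim p_1$ and a neural vector field $v_\theta$; the paper's precise objective, in which the target draws come from a Markov chain, may differ.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1], \\
\mathcal{L}_{\mathrm{FM}}(\theta)
  &= \mathbb{E}_{t,\,x_0,\,x_1}
     \big\| v_\theta(t, x_t) - (x_1 - x_0) \big\|^{2}.
\end{aligned}
```

Under this assumed path, the regression target $x_1 - x_0$ is the time derivative of $x_t$, so no ODE has to be simulated during training.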
- [53] arXiv:2405.14403 [pdf, ps, html, other]
-
Title: Representative electricity price profiles for European day-ahead and intraday spot markets
Comments: Supplementary information (SI) included; Manuscript: 27 pages, 9 figures, 4 tables; SI: 7 pages, 5 figures, 2 tables
Subjects: Applications (stat.AP); Computational Engineering, Finance, and Science (cs.CE); Physics and Society (physics.soc-ph)
We propose a method to construct representative price profiles of the day-ahead (DA) and the intraday (ID) electricity spot markets and use this method to provide examples of ready-to-use price data sets. In contrast to common scenario generation approaches, the method is deterministic and relies on a small number of degrees of freedom, with the aim to be well defined and easy to use. We thereby target an enhanced comparability of future research studies on demand-side management and energy cost optimization. We construct the price profiles based on historical time series from the spot markets of interest, e.g., European Power Exchange (EPEX) spot. To this end, we extract key price components from the data while also accounting for known dominant mechanisms in the price variation. Further, the method is able to preserve key statistical features of the historical data (e.g., mean and standard deviation) when constructing the benchmark profile. Finally, our approach ensures comparability of ID and DA price profiles by design, as their cumulative (integral) price can be made identical if needed.
- [54] arXiv:2405.14456 [pdf, ps, html, other]
-
Title: Cumulant-based approximation for fast and efficient prediction for species distribution
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Species distribution modeling plays an important role in estimating the habitat suitability of species using environmental variables. For this purpose, Maxent and the Poisson point process are popular and powerful methods extensively employed across various ecological and biological sciences. However, the computational speed becomes prohibitively slow when using huge background datasets, which is often the case with fine-resolution data or global-scale estimations. To address this problem, we propose a computationally efficient species distribution model using a cumulant-based approximation (CBA) applied to the loss function of $\gamma$-divergence. Additionally, we introduce a sequential estimating algorithm with an $L_1$ penalty to select important environmental variables closely associated with species distribution. The regularized geometric-mean method, derived from the CBA, demonstrates high computational efficiency and estimation accuracy. Moreover, by applying CBA to Maxent, we establish that Maxent and Fisher linear discriminant analysis are equivalent under a normality assumption. This equivalence leads to a highly efficient computational method for estimating species distribution. The effectiveness of our proposed methods is illustrated through simulation studies and by analyzing data on 226 species from the National Centre for Ecological Analysis and Synthesis and 709 Japanese vascular plant species. The computational efficiency of the proposed methods is significantly improved compared to Maxent, while maintaining comparable estimation accuracy. An R package CBA is also prepared to provide all programming codes used in simulation studies and real data analysis.
- [55] arXiv:2405.14459 [pdf, ps, html, other]
-
Title: Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization
Authors: Ferdinand Genans-Boiteux (LPSM (UMR_8001)), Antoine Godichon-Baggioni (LPSM (UMR_8001)), François-Xavier Vialard (LIGM), Olivier Wintenberger (LPSM (UMR_8001))
Subjects: Machine Learning (stat.ML)
Optimal Transport (OT) based distances are powerful tools for machine learning to compare probability measures and manipulate them using OT maps. In this field, a setting of interest is semi-discrete OT, where the source measure $\mu$ is continuous, while the target $\nu$ is discrete. Recent works have shown that the minimax rate for the OT map is $\mathcal{O}(t^{-1/2})$ when using $t$ i.i.d. subsamples from each measure (two-sample setting). An open question is whether a better convergence rate can be achieved when the full information of the discrete measure $\nu$ is known (one-sample setting). In this work, we answer this question positively by (i) proving an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cells estimation and density support estimation, and (ii) proposing a Stochastic Gradient Descent (SGD) algorithm with adaptive entropic regularization and averaging acceleration. To nearly achieve the desired fast rate, characteristic of non-regular parametric problems, we design an entropic regularization scheme decreasing with the number of samples. Another key step in our algorithm is a projection step that makes it possible to leverage the local strong convexity of the regularized OT problem. Our convergence analysis integrates online convex optimization and stochastic gradient techniques, complemented by the specificities of the OT semi-dual. Moreover, while being as computationally and memory efficient as vanilla SGD, our algorithm achieves the unusual fast rates of our theory in numerical experiments.
- [56] arXiv:2405.14482 [pdf, ps, html, other]
-
Title: Quantifying Multivariate Graph Dependencies: Theory and Estimation for Multiplex Graphs
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Combinatorics (math.CO); Probability (math.PR)
Multiplex graphs, characterised by their layered structure, exhibit informative interdependencies within layers that are crucial for understanding complex network dynamics. Quantifying the interaction and shared information among these layers is challenging due to the non-Euclidean structure of graphs. Our paper introduces a comprehensive theory of multivariate information measures for multiplex graphs. We introduce graphon mutual information for pairs of graphs and expand this to graphon interaction information for three or more graphs, including their conditional variants. We then define graphon total correlation and graphon dual total correlation, along with their conditional forms, and introduce graphon $O$-information. We discuss and quantify the concepts of synergy and redundancy in graphs for the first time, introduce consistent nonparametric estimators for these multivariate graphon information-theoretic measures, and provide their convergence rates. We also conduct a simulation study to illustrate our theoretical findings and demonstrate the relationship between the introduced measures, multiplex graph structure, and higher-order interdependencies. Real-world applications further show the utility of our estimators in revealing shared information and dependence structures in real-world multiplex graphs. This work not only answers fundamental questions about information sharing across multiple graphs but also sets the stage for advanced pattern analysis in complex networks.
- [57] arXiv:2405.14492 [pdf, ps, html, other]
-
Title: Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages.
- [58] arXiv:2405.14494 [pdf, ps, html, other]
-
Title: Entrywise error bounds for low-rank approximations of kernel matrices
Comments: 28 pages, 3 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
In this paper, we derive entrywise error bounds for low-rank approximations of kernel matrices obtained using the truncated eigen-decomposition (or singular value decomposition). While this approximation is well-known to be optimal with respect to the spectral and Frobenius norm error, little is known about the statistical behaviour of individual entries. Our error bounds fill this gap. A key technical innovation is a delocalisation result for the eigenvectors of the kernel matrix corresponding to small eigenvalues, which takes inspiration from the field of Random Matrix Theory. Finally, we validate our theory with an empirical study of a collection of synthetic and real-world datasets.
- [59] arXiv:2405.14509 [pdf, ps, html, other]
-
Title: Closed-form estimators for an exponential family derived from likelihood equations
Comments: 13 pages, 4 figures
Subjects: Methodology (stat.ME)
In this paper, we derive closed-form estimators for the parameters of some probability distributions belonging to the exponential family. A bootstrap bias-reduced version of these proposed closed-form estimators is also derived. A Monte Carlo simulation is performed for the assessment of the estimators. The results are seen to be quite favorable to the proposed bootstrap bias-reduced estimators.
- [60] arXiv:2405.14532 [pdf, ps, html, other]
-
Title: Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem
Comments: 28 pages, 1 figure. Comments are most welcome!
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
The Procrustes-Wasserstein problem consists in matching two high-dimensional point clouds in an unsupervised setting, and has many applications in natural language processing and computer vision. We consider a planted model with two datasets $X,Y$ that consist of $n$ datapoints in $\mathbb{R}^d$, where $Y$ is a noisy version of $X$, up to an orthogonal transformation and a relabeling of the data points. This setting is related to the graph alignment problem in geometric models. In this work, we focus on the Euclidean transport cost between the point clouds as a measure of performance for the alignment. We first establish information-theoretic results, in the high ($d \gg \log n$) and low ($d \ll \log n$) dimensional regimes. We then study computational aspects and propose the Ping-Pong algorithm, alternately estimating the orthogonal transformation and the relabeling, initialized via a Frank-Wolfe convex relaxation. We give sufficient conditions for the method to retrieve the planted signal after one single step. We provide experimental results to compare the proposed approach with the state-of-the-art method of Grave et al. (2019).
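The alternating idea can be written down in a generic form. The sketch below assumes data matrices $X, Y \in \mathbb{R}^{n \times d}$ with points as rows, a permutation matrix $\Pi$ and an orthogonal matrix $Q$ with $Y \approx \Pi X Q$; these conventions are introduced here for illustration and need not match the paper's.

```latex
% Step 1: fix \Pi, update Q (orthogonal Procrustes, solved by an SVD)
\hat{Q} = \arg\min_{Q \in O(d)} \|\Pi X Q - Y\|_F^2 = U V^\top,
\qquad \text{where } (\Pi X)^\top Y = U \Sigma V^\top .

% Step 2: fix Q, update \Pi (linear assignment problem)
\hat{\Pi} = \arg\min_{\Pi \in \mathcal{S}_n} \|\Pi X Q - Y\|_F^2
          = \arg\max_{\Pi \in \mathcal{S}_n} \big\langle \Pi,\; Y (X Q)^\top \big\rangle .
```

Each step is solvable exactly (an SVD and a linear assignment), which is why the quality of the initialization matters for the overall scheme.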
- [61] arXiv:2405.14540 [pdf, ps, html, other]
-
Title: This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian Optimization (BO) has proven to be very successful at optimizing a static, noisy, costly-to-evaluate black-box function $f : \mathcal{S} \to \mathbb{R}$. However, optimizing a black-box which is also a function of time (i.e., a dynamic function) $f : \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ remains a challenge, since a dynamic Bayesian Optimization (DBO) algorithm has to keep track of the optimum over time. This changes the nature of the optimization problem in at least three aspects: (i) querying an arbitrary point in $\mathcal{S} \times \mathcal{T}$ is impossible, (ii) past observations become less and less relevant for keeping track of the optimum as time goes by and (iii) the DBO algorithm must have a high sampling frequency so it can collect enough relevant observations to keep track of the optimum through time. In this paper, we design a Wasserstein distance-based criterion able to quantify the relevancy of an observation with respect to future predictions. Then, we leverage this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus maintaining simultaneously a good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon. Numerical experiments establish the superiority of W-DBO, which outperforms state-of-the-art methods by a comfortable margin.
- [62] arXiv:2405.14574 [pdf, ps, html, other]
-
Title: Learning with Fitzpatrick Losses
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Fenchel-Young losses are a family of convex loss functions, encompassing the squared, logistic and sparsemax losses, among others. Each Fenchel-Young loss is implicitly associated with a link function, for mapping model outputs to predictions. For instance, the logistic loss is associated with the soft argmax link function. Can we build new loss functions associated with the same link function as Fenchel-Young losses? In this paper, we introduce Fitzpatrick losses, a new family of convex loss functions based on the Fitzpatrick function. A well-known theoretical tool in maximal monotone operator theory, the Fitzpatrick function naturally leads to a refined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses, while maintaining the same link function for prediction. As an example, we introduce the Fitzpatrick logistic loss and the Fitzpatrick sparsemax loss, counterparts of the logistic and the sparsemax losses. This yields two new tighter losses associated with the soft argmax and the sparse argmax, two of the most ubiquitous output layers used in machine learning. We study in detail the properties of Fitzpatrick losses and in particular, we show that they can be seen as Fenchel-Young losses using a modified, target-dependent generating function. We demonstrate the effectiveness of Fitzpatrick losses for label proportion estimation.
- [63] arXiv:2405.14628 [pdf, ps, html, other]
-
Title: Online robust estimation and bootstrap inference for function-on-scalar regression
Subjects: Methodology (stat.ME); Computation (stat.CO)
We propose a novel and robust online function-on-scalar regression technique via geometric median to learn associations between functional responses and scalar covariates based on massive or streaming datasets. The online estimation procedure, developed using the average stochastic gradient descent algorithm, offers an efficient and cost-effective method for analyzing sequentially augmented datasets, eliminating the need to store large volumes of data in memory. We establish the almost sure consistency, $L_p$ convergence, and asymptotic normality of the online estimator. To enable efficient and fast inference of the parameters of interest, including the derivation of confidence intervals, we also develop an innovative two-step online bootstrap procedure to approximate the limiting error distribution of the robust online estimator. Numerical studies under a variety of scenarios demonstrate the effectiveness and efficiency of the proposed online learning method. A real application analyzing PM$_{2.5}$ air-quality data is also included to exemplify the proposed online approach.
- [64] arXiv:2405.14630 [pdf, ps, html, other]
-
Title: Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension
Comments: 47 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bounds on the smallest eigenvalue of the neural tangent kernel (NTK) are a key ingredient in the analysis of neural network optimization and memorization. However, existing results require distributional assumptions on the data and are limited to a high-dimensional setting, where the input dimension $d_0$ scales at least logarithmically in the number of samples $n$. In this work we remove both of these requirements and instead provide bounds in terms of a measure of the collinearity of the data: notably these bounds hold with high probability even when $d_0$ is held constant versus $n$. We prove our results through a novel application of the hemisphere transform.
- [65] arXiv:2405.14652 [pdf, ps, html, other]
-
Title: Statistical inference for high-dimensional convoluted rank regression
Subjects: Methodology (stat.ME)
High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression was recently proposed, and penalized convoluted rank regression estimators were introduced. However, these estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide a Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods.
- [66] arXiv:2405.14686 [pdf, ps, html, other]
-
Title: Efficient Algorithms for the Sensitivities of the Pearson Correlation Coefficient and Its Statistical Significance to Online Data
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Reliably measuring the collinearity of bivariate data is crucial in statistics, particularly for time-series analysis or ongoing studies in which incoming observations can significantly impact current collinearity estimates. Leveraging identities from Welford's online algorithm for sample variance, we develop a rigorous theoretical framework for analyzing the maximal change to the Pearson correlation coefficient and its p-value that can be induced by additional data. Further, we show that the resulting optimization problems yield elegant closed-form solutions that can be accurately computed by linear- and constant-time algorithms. Our work not only creates new theoretical avenues for robust correlation measures, but also has broad practical implications for disciplines that span econometrics, operations research, clinical trials, climatology, differential privacy, and bioinformatics. Software implementations of our algorithms in Cython-wrapped C are made available at this https URL for reproducibility, practical deployment, and future theoretical development.
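The Welford-style identities the abstract builds on can be sketched as a running Pearson correlation. This is a minimal illustration of the standard online update, not the authors' sensitivity algorithms; the class name `OnlinePearson` is hypothetical.

```python
import math

class OnlinePearson:
    """Running Pearson correlation using Welford-style one-pass updates."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.c_xy = 0.0   # running sum of co-deviations of x and y

    def update(self, x, y):
        """Incorporate one new observation in O(1) time and memory."""
        self.n += 1
        dx = x - self.mean_x            # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each product pairs an old-mean deviation with a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        """Return the current sample correlation, or NaN if undefined."""
        if self.n < 2 or self.m2_x == 0.0 or self.m2_y == 0.0:
            return float("nan")
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)

# Usage: feed observations one at a time.
tracker = OnlinePearson()
for x, y in zip([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]):
    tracker.update(x, y)
r = tracker.correlation()
```

Because the state is just six numbers, the effect of a candidate new point on the correlation can be evaluated without revisiting earlier data.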
- [67] arXiv:2405.14711 [pdf, ps, html, other]
-
Title: Zero-inflation in the Multivariate Poisson Lognormal Family
Comments: 27 pages including appendices. 8 figures, 1 table
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts stem solely from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific or dependent on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing $90.6\%$ of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
- [68] arXiv:2405.14778 [pdf, ps, html, other]
-
Title: Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present an upper bound for the finite sample risk of general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside of the hypothesis space), which is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.
- [69] arXiv:2405.14789 [pdf, ps, html, other]
-
Title: A Bayesian Approach to Estimate Causal Peer Influence Accounting for Latent Network Homophily
Comments: 39 pages, 15 figures, 2 tables
Subjects: Applications (stat.AP)
Researchers have focused on understanding how an individual's behavior is influenced by the behaviors of their peers in observational studies of social networks. Identifying and estimating causal peer influence, however, is challenging due to confounding by homophily, where people tend to connect with those who share similar characteristics with them. Moreover, since the attributes driving homophily are generally not all observed and act as unobserved confounders, identifying and estimating causal peer influence becomes infeasible using standard causal identification assumptions. In this paper, we address this challenge by leveraging latent locations inferred from the network itself to disentangle homophily from causal peer influence, and we extend this approach to multiple networks by adopting a Bayesian hierarchical modeling framework. To accommodate the nonlinear dependency of peer influence on individual behavior, we employ a Bayesian nonparametric method, specifically Bayesian Additive Regression Trees (BART), and we propose a Bayesian framework that accounts for the uncertainty in inferring latent locations. We assess the operating characteristics of the estimator via an extensive simulation study. Finally, we apply our method to estimate causal peer influence in advice-seeking networks of teachers in secondary schools, in order to assess whether the teachers' belief about mathematics education is influenced by the beliefs of their peers from whom they receive advice. Our results suggest that overlooking latent homophily can lead to either underestimation or overestimation of causal peer influence, accompanied by considerable estimation uncertainty.
- [70] arXiv:2405.14840 [pdf, ps, html, other]
-
Title: Differentiable Annealed Importance Sampling Minimizes The Jensen-Shannon Divergence Between Initial and Target Distribution
Comments: 22 pages, including 9 pages of main text and 11 pages of appendix, conference paper at ICML 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Differentiable annealed importance sampling (DAIS), proposed by Geffner & Domke (2021) and Zhang et al. (2021), allows optimizing, among others, over the initial distribution of AIS. In this paper, we show that, in the limit of many transitions, DAIS minimizes the symmetrized KL divergence (Jensen-Shannon divergence) between the initial and target distribution. Thus, DAIS can be seen as a form of variational inference (VI) in that its initial distribution is a parametric fit to an intractable target distribution. We empirically evaluate the usefulness of the initial distribution as a variational distribution on synthetic and real-world data, observing that it often provides more accurate uncertainty estimates than standard VI (optimizing the reverse KL divergence), importance weighted VI, and Markovian score climbing (optimizing the forward KL divergence).
- [71] arXiv:2405.14848 [pdf, ps, html, other]
-
Title: Local Causal Discovery for Structural Evidence of Direct Discrimination
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient causal discovery methods for fairness applications. To this end, we introduce local discovery for direct discrimination (LD3): a polynomial-time algorithm that recovers structural evidence of direct discrimination. LD3 performs a linear number of conditional independence tests with respect to variable set size. Moreover, we propose a graphical criterion for identifying the weighted controlled direct effect (CDE), a qualitative measure of direct discrimination. We prove that this criterion is satisfied by the knowledge returned by LD3, increasing the accessibility of the weighted CDE as a causal fairness measure. Taking liver transplant allocation as a case study, we highlight the potential impact of LD3 for modeling fairness in complex decision systems. Results on real-world data demonstrate more plausible causal relations than baselines, which took 197x to 5870x longer to execute.
New submissions for Friday, 24 May 2024 (showing 71 of 71 entries)
- [72] arXiv:2405.12230 (cross-list from astro-ph.IM) [pdf, ps, html, other]
-
Title: Computing the Instantaneous Collision Probability between Satellites using Characteristic Function Inversion
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Applications (stat.AP)
The probability that two satellites overlap in space at a specified instant of time is called their instantaneous collision probability. Assuming Gaussian uncertainties and spherical satellites, this probability is the integral of a Gaussian distribution over a sphere. This paper shows how to compute the probability using an established numerical procedure called characteristic function inversion. The collision probability in the short-term encounter scenario is also evaluated with this approach, where the instant at which the probability is computed is the time of closest approach between the objects. Python and R code is provided to evaluate the probability in practice. Overall, the approach has been established for over fifty years, is implemented in existing software, does not rely on analytical approximations, and can be used to evaluate two- and three-dimensional collision probabilities.
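As a rough illustration of the quantity described in this abstract (the integral of a Gaussian distribution over a sphere), the sketch below estimates it by plain Monte Carlo sampling. This is not the characteristic function inversion procedure of the paper and is not the code the authors provide; it is only a simple cross-check. The names `collision_probability_mc`, `mean` (mean relative position), `cov` (combined position covariance) and `radius` (combined radius of the two spheres) are assumptions introduced for illustration.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L L^T = cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def collision_probability_mc(mean, cov, radius, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(|x| <= radius) for x ~ N(mean, cov).

    Hypothetical helper: samples the relative position and counts how often
    it falls inside the sphere of the given radius centred at the origin.
    """
    rng = random.Random(seed)
    L = cholesky(cov)
    n = len(mean)
    hits = 0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sq_norm = 0.0
        for i in range(n):
            x_i = mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            sq_norm += x_i * x_i
        if sq_norm <= radius * radius:
            hits += 1
    return hits / n_samples


def isotropic_reference_3d(radius):
    """Exact value for zero mean and identity covariance in 3D
    (the chi-square CDF with 3 degrees of freedom evaluated at radius**2)."""
    return (math.erf(radius / math.sqrt(2.0))
            - math.sqrt(2.0 / math.pi) * radius * math.exp(-radius ** 2 / 2.0))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = collision_probability_mc([0.0, 0.0, 0.0], identity, 1.0)
print(estimate, isotropic_reference_3d(1.0))
```

The sampling error shrinks only as the inverse square root of the number of samples, so very small probabilities need many samples; a deterministic numerical method such as the one the abstract describes avoids that limitation.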
- [73] arXiv:2405.13180 (cross-list from eess.SP) [pdf, ps, html, other]
-
Title: Data Assimilation with Machine Learning Surrogate Models: A Case Study with FourCastNet
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
Modern data-driven surrogate models for weather forecasting provide accurate short-term predictions but inaccurate and nonphysical long-term forecasts. This paper investigates online weather prediction using machine learning surrogates supplemented with partial and noisy observations. We empirically demonstrate and theoretically justify that, despite the long-time instability of the surrogates and the sparsity of the observations, filtering estimates can remain accurate in the long-time horizon. As a case study, we integrate FourCastNet, a state-of-the-art weather surrogate model, within a variational data assimilation framework using partial, noisy ERA5 data. Our results show that filtering estimates remain accurate over a year-long assimilation window and provide effective initial conditions for forecasting tasks, including extreme event prediction.
- [74] arXiv:2405.13251 (cross-list from econ.GN) [pdf, ps, html, other]
-
Title: Valores extremos de inflación en Costa Rica
Comments: in Spanish language. arXiv admin note: text overlap with arXiv:2405.12240
Subjects: General Economics (econ.GN); Applications (stat.AP)
Maintaining low, non-negative and stable inflation levels is a necessary condition for the stability of the economy as a whole, which is why the monetary authorities of most industrialized countries, including the Central Bank of Costa Rica since 2005, have oriented their monetary policy precisely to that task. Even so, both in Costa Rica and internationally, most of the statistical modeling of inflation has been limited to modeling its expectation conditional on different covariates using linear models. This implies a lack of knowledge of the dynamics of the extreme values of the inflation rate and of how these are related to other macroeconomic variables. In Costa Rica this is of particular importance, since negative quarter-on-quarter inflation rates have recently been experienced in several periods, which can be problematic if this becomes a recurring phenomenon. Therefore, in this work we ask what the relationship is between the GDP gap, inflation expectations, the imported inflation rate, and the extreme values of the inflation rate in Costa Rica. That is, the main objective is to determine the relationship between the extreme values of the inflation rate, the GDP gap, inflation expectations and imported inflation.
- [75] arXiv:2405.13346 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Convergence of the Deep Galerkin Method for Mean Field Control Problems
Comments: 27 pages, 6 figures
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
We establish the convergence of the deep Galerkin method (DGM), a deep learning-based scheme for solving high-dimensional nonlinear PDEs, for Hamilton-Jacobi-Bellman (HJB) equations that arise from the study of mean field control problems (MFCPs). Based on a recent characterization of the value function of the MFCP as the unique viscosity solution of an HJB equation on the simplex, we establish both an existence and convergence result for the DGM. First, we show that the loss functional of the DGM can be made arbitrarily small given that the value function of the MFCP possesses sufficient regularity. Then, we show that if the loss functional of the DGM converges to zero, the corresponding neural network approximators must converge uniformly to the true value function on the simplex. We also provide numerical experiments demonstrating the DGM's ability to generalize to high-dimensional HJB equations.
- [76] arXiv:2405.13375 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Adaptive Data Analysis for Growing Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis in the dynamic data setting. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grow with the square root of the number of adaptive queries, matching prior works' improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.
- [77] arXiv:2405.13392 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Local convergence of min-max algorithms to differentiable equilibrium on Riemannian manifold
Comments: under review
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study min-max algorithms to solve zero-sum differentiable games on a Riemannian manifold. The notions of differentiable Stackelberg equilibrium and differentiable Nash equilibrium in Euclidean space are generalized to a Riemannian manifold through an intrinsic definition which does not depend on the choice of local coordinate chart of the manifold. We then provide sufficient conditions for the local convergence of the deterministic simultaneous algorithms $\tau$-GDA and $\tau$-SGA near such equilibria, using a general methodology based on spectral analysis. These algorithms are extended with stochastic gradients and applied to the training of Wasserstein GAN. The discriminator of the GAN is constructed from Lipschitz-continuous functions based on the Stiefel manifold. We show numerically how the insights obtained from the local convergence analysis may lead to an improvement of GAN models.
- [78] arXiv:2405.13396 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Why In-Context Learning Transformers are Tabular Data Classifiers
Comments: 9 pages main body, 22 pages total. Preprint under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. As synthetic data does not share features or labels with real-world data, the underlying mechanism that contributes to the success of this method remains unclear. This study provides an explanation by demonstrating that ICL-transformers acquire the ability to create complex decision boundaries during pretraining. To validate our claim, we develop a novel forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. Our experiments confirm the effectiveness of ICL-transformers pretrained on this data. Furthermore, we create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator. By fine-tuning this model, we reach the current state-of-the-art on tabular data classification. Code is available at this https URL.
- [79] arXiv:2405.13469 (cross-list from astro-ph.EP) [pdf, ps, html, other]
-
Title: Machine Learning for Exoplanet Detection in High-Contrast Spectroscopy: Revealing Exoplanets by Leveraging Hidden Molecular Signatures in Cross-Correlated Spectra with Convolutional Neural Networks
Emily O. Garvin, Markus J. Bonse, Jean Hayoz, Gabriele Cugno, Jonas Spiller, Polychronis A. Patapis, Dominique Petit Dit de la Roche, Rakesh Nath-Ranga, Olivier Absil, Nicolai F. Meinshausen, Sascha P. Quanz
Comments: 27 pages, 24 figures. Submitted for publication in A&A January 2, 2024. After first iteration with the referee, resubmitted May 17, 2024
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Applications (stat.AP)
The new generation of observatories and instruments (VLT/ERIS, JWST, ELT) motivates the development of robust methods to detect and characterise faint and close-in exoplanets. Molecular mapping and cross-correlation for spectroscopy use molecular templates to isolate a planet's spectrum from its host star. However, reliance on signal-to-noise ratio (S/N) metrics can lead to missed discoveries, due to strong assumptions of Gaussian independent and identically distributed noise. We introduce machine learning for cross-correlation spectroscopy (MLCCS); the method aims to leverage weak assumptions on exoplanet characterisation, such as the presence of specific molecules in atmospheres, to improve detection sensitivity for exoplanets. MLCCS methods, including a perceptron and unidimensional convolutional neural networks, operate in the cross-correlated spectral dimension, in which patterns from molecules can be identified. We test on mock datasets of synthetic planets inserted into real noise from SINFONI at K-band. The results from MLCCS show outstanding improvements. The outcome on a grid of faint synthetic gas giants shows that for a false discovery rate up to 5%, a perceptron can detect about 26 times as many planets as an S/N metric. This factor increases up to 77 times with convolutional neural networks, with a statistical sensitivity shift from 0.7% to 55.5%. In addition, MLCCS methods show a drastic improvement in detection confidence and conspicuity on imaging spectroscopy. Once trained, MLCCS methods offer sensitive and rapid detection of exoplanets and their molecular species in the spectral dimension. They handle systematic noise and challenging seeing conditions, can adapt to many spectroscopic instruments and modes, and are versatile regarding atmospheric characteristics, which can enable identification of various planets in archival and future data.
- [80] arXiv:2405.13535 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Generalized Laplace Approximation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In recent years, the inconsistency in Bayesian deep learning has garnered increasing attention. Tempered or generalized posterior distributions often offer a direct and effective solution to this issue. However, understanding the underlying causes and evaluating the effectiveness of generalized posteriors remain active areas of research. In this study, we introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors. We interpret the generalization of the posterior with a temperature factor as a correction for misspecified models through adjustments to the joint probability model, and the recalibration of priors by redistributing probability mass on models within the hypothesis space using data samples. Additionally, we highlight a distinctive feature of Laplace approximation, which ensures that the generalized normalizing constant can be treated as invariant, unlike the typical scenario in general Bayesian learning where this constant varies with model parameters post-generalization. Building on this insight, we propose the generalized Laplace approximation, which involves a simple adjustment to the computation of the Hessian matrix of the regularized loss function. This method offers a flexible and scalable framework for obtaining high-quality posterior distributions. We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.
- [81] arXiv:2405.13655 (cross-list from eess.IV) [pdf, ps, html, other]
-
Title: A Deep Learning Approach to Multi-Fiber Parameter Estimation and Uncertainty Quantification in Diffusion MRI
Subjects: Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM); Applications (stat.AP); Computation (stat.CO)
Diffusion MRI (dMRI) is the primary imaging modality used to study brain microstructure in vivo. Reliable and computationally efficient parameter inference for common dMRI biophysical models is a challenging inverse problem, due to factors such as variable dimensionalities (reflecting the unknown number of distinct white matter fiber populations in a voxel), low signal-to-noise ratios, and non-linear forward models. These challenges have led many existing methods to use biologically implausible simplified models to stabilize estimation, for instance, assuming shared microstructure across all fiber populations within a voxel. In this work, we introduce a novel sequential method for multi-fiber parameter inference that decomposes the task into a series of manageable subproblems. These subproblems are solved using deep neural networks tailored to problem-specific structure and symmetry, and trained via simulation. The resulting inference procedure is largely amortized, enabling scalable parameter estimation and uncertainty quantification across all model parameters. Simulation studies and real imaging data analysis using the Human Connectome Project (HCP) demonstrate the advantages of our method over standard alternatives. In the case of the standard model of diffusion, our results show that under HCP-like acquisition schemes, estimates for extra-cellular parallel diffusivity are highly uncertain, while those for the intra-cellular volume fraction can be estimated with relatively high precision.
- [82] arXiv:2405.13682 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Constructive Universal Approximation Theorems for Deep Joint-Equivariant Networks by Schur's Lemma
Subjects: Machine Learning (cs.LG); Representation Theory (math.RT); Machine Learning (stat.ML)
We present a unified constructive universal approximation theorem covering a wide range of learning machines, including both shallow and deep neural networks, based on group representation theory. Constructive here means that the distribution of parameters is given in a closed-form expression (called the ridgelet transform). Contrary to the case of shallow models, expressive power analysis of deep models has been conducted in a case-by-case manner. Recently, Sonoda et al. (2023a,b) developed a systematic method to show a constructive approximation theorem from scalar-valued joint-group-invariant feature maps, covering a formal deep network. However, each hidden layer was formalized as an abstract group action, so it was not possible to cover real deep networks defined by composites of nonlinear activation functions. In this study, we extend the method to vector-valued joint-group-equivariant feature maps, so as to cover such real networks.
- [83] arXiv:2405.13691 (cross-list from physics.flu-dyn) [pdf, ps, html, other]
-
Title: Neural Networks-based Random Vortex Methods for Modelling Incompressible Flows
Comments: 16 pages, 5 figures
Subjects: Fluid Dynamics (physics.flu-dyn); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
In this paper we introduce a novel Neural Networks-based approach for approximating solutions to the (2D) incompressible Navier-Stokes equations. Our algorithm uses a Physics-informed Neural Network that approximates the vorticity based on a loss function that uses a computationally efficient formulation of the Random Vortex dynamics. The neural vorticity estimator is then combined with traditional numerical PDE solvers for the Poisson equation to compute the velocity field. The main advantage of our method compared to standard Physics-informed Neural Networks is that it strictly enforces physical properties, such as incompressibility or boundary conditions, which might otherwise be hard to guarantee with purely Neural Networks-based approaches.
- [84] arXiv:2405.13712 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Learning Diffusion Priors from Observations by Expectation Maximization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models recently proved to be remarkable priors for Bayesian inverse problems. However, training these models typically requires access to large amounts of clean data, which could prove difficult in some settings. In this work, we present a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only. Unlike previous works, our method leads to proper diffusion models, which is crucial for downstream tasks. As part of our method, we propose and motivate a new posterior sampling scheme for unconditional diffusion models. We present empirical evidence supporting the effectiveness of our method.
- [85] arXiv:2405.13785 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and Subsampling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Gaussian Process Regression (GPR) is widely used in statistics and machine learning for prediction tasks requiring uncertainty measures. Its efficacy depends on the appropriate specification of the mean function, covariance kernel function, and associated hyperparameters. Severe misspecifications can lead to inaccurate results and problematic consequences, especially in safety-critical applications. However, a systematic approach to handle these misspecifications is lacking in the literature. In this work, we propose a general framework to address these issues. Firstly, we introduce a flexible two-stage GPR framework that separates mean prediction and uncertainty quantification (UQ) to prevent mean misspecification, which can introduce bias into the model. Secondly, kernel function misspecification is addressed through a novel automatic kernel search algorithm, supported by theoretical analysis, that selects the optimal kernel from a candidate set. Additionally, we propose a subsampling-based warm-start strategy for hyperparameter initialization to improve efficiency and avoid hyperparameter misspecification. With much lower computational cost, our subsampling-based strategy can yield competitive or better performance than training exclusively on the full dataset. Combining all these components, we recommend two GPR methods, exact and scalable, designed to match available computational resources and specific UQ requirements. Extensive evaluation on real-world datasets, including UCI benchmarks and a safety-critical medical case study, demonstrates the robustness and precision of our methods.
- [86] arXiv:2405.13888 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Marrying Causal Representation Learning with Dynamical Systems for Science
Comments: 21 pages, 8 figures, 6 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal representation learning promises to extend causal models to hidden causal variables from raw entangled measurements. However, most progress has focused on proving identifiability results in different settings, and we are not aware of any successful real-world application. At the same time, the field of dynamical systems benefited from deep learning and scaled to countless applications but does not allow parameter identification. In this paper, we draw a clear connection between the two and their key assumptions, allowing us to apply identifiable methods developed in causal representation learning to dynamical systems. At the same time, we can leverage scalable differentiable solvers developed for differential equations to build models that are both identifiable and practical. Overall, we learn explicitly controllable models that isolate the trajectory-specific parameters for further downstream tasks such as out-of-distribution classification or treatment effect estimation. We experiment with a wind simulator with partially known factors of variation. We also apply the resulting model to real-world climate data and successfully answer downstream causal questions in line with existing literature on climate change.
- [87] arXiv:2405.13897 (cross-list from math.AG) [pdf, ps, html, other]
-
Title: Geometry of rational quasi-independence models as toric fiber products
Subjects: Algebraic Geometry (math.AG); Combinatorics (math.CO); Statistics Theory (math.ST)
We investigate the geometry of a family of log-linear statistical models called quasi-independence models. The toric fiber product is useful for understanding the geometry of parameter inference in these models because the maximum likelihood degree is multiplicative under the TFP. We define the coordinate toric fiber product, or cTFP, and give necessary and sufficient conditions under which a quasi-independence model is a cTFP of lower-order models. We show that the vanishing ideal of every 2-way quasi-independence model with ML-degree 1 can be realized as an iterated toric fiber product of linear ideals. We also classify which Lawrence lifts of 2-way quasi-independence models are cTFPs and give a necessary condition under which a $k$-way model has ML-degree 1 using its facial submodels.
- [88] arXiv:2405.13910 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Learning Latent Space Hierarchical EBM Diffusion Models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
This work studies the learning problem of the energy-based prior model and the multi-layer generator model. The multi-layer generator model, which contains multiple layers of latent variables organized in a top-down hierarchical structure, typically assumes a Gaussian prior model. Such a prior model can be limited in modelling expressivity, which results in a gap between the generator posterior and the prior model, known as the prior hole problem. Recent works have explored learning an energy-based model (EBM) prior as a second-stage, complementary model to bridge the gap. However, the EBM defined on a multi-layer latent space can be highly multi-modal, which makes sampling from such a marginal EBM prior challenging in practice, resulting in an ineffectively learned EBM. To tackle the challenge, we propose to leverage the diffusion probabilistic scheme to mitigate the burden of EBM sampling and thus facilitate EBM learning. Our extensive experiments demonstrate a superior performance of our diffusion-learned EBM prior on various challenging tasks.
- [89] arXiv:2405.13922 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Towards Certification of Uncertainty Calibration under Adversarial Attacks
Comments: 11 pages main paper, appendix included
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier score or the expected calibration error. We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the expected calibration error. Finally, we propose novel calibration attacks and demonstrate how they can improve model calibration through adversarial calibration training.
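For readers unfamiliar with the two calibration measures named in this abstract, the sketch below computes the Brier score and the expected calibration error (ECE) in their commonly used forms. It does not implement the certified bounds or the attacks from the paper. The function names, the equal-width binning and the default of 10 bins are assumptions chosen for illustration; other binning schemes are also in use.

```python
def brier_score(probs, labels):
    """Mean squared distance between predicted probability vectors
    and one-hot encoded true labels (multiclass form)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2
                     for k, p_k in enumerate(p))
    return total / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins: the weighted average of
    |accuracy - mean confidence| over the non-empty bins."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    correct = [0] * n_bins
    for p, y in zip(probs, labels):
        conf = max(p)              # confidence of the predicted class
        pred = p.index(conf)       # predicted class (first maximum)
        b = min(int(conf * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        correct[b] += int(pred == y)
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        if counts[b]:
            accuracy = correct[b] / counts[b]
            mean_conf = conf_sums[b] / counts[b]
            ece += counts[b] / n * abs(accuracy - mean_conf)
    return ece


# Two correct predictions made with confidence 0.8 and 0.6.
probs = [[0.8, 0.2], [0.4, 0.6]]
labels = [0, 1]
print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
```

In this toy example both predictions are correct but under-confident, so both measures are non-zero even though accuracy is perfect.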
- [90] arXiv:2405.13975 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: There is HOPE to Avoid HiPPOs for Long-memory State Space Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. However, these models typically face several challenges: (i) they require specifically designed initializations of the system matrices to achieve state-of-the-art performance, (ii) they require training of state matrices on a logarithmic scale with very small learning rates to prevent instabilities, and (iii) they require the model to have exponentially decaying memory in order to ensure an asymptotically stable LTI system. To address these issues, we view SSMs through the lens of Hankel operator theory, which provides us with a unified theory for the initialization and training of SSMs. Building on this theory, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. This approach allows for random initializations of the LTI systems and helps to improve training stability, while also providing the SSMs with non-decaying memory capabilities. Our model efficiently implements these innovations by nonuniformly sampling the transfer functions of LTI systems, and it requires fewer parameters compared to canonical SSMs. When benchmarked against HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel operators demonstrates improved performance on Long-Range Arena (LRA) tasks. Moreover, we use a sequential CIFAR-10 task with padded noise to empirically corroborate our SSM's long memory capacity.
- [91] arXiv:2405.13977 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Removing Bias from Maximum Likelihood Estimation with Model Autophagy
Comments: 9 Pages, submission for NeurIPS 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose autophagy penalized likelihood estimation (PLE), an unbiased alternative to maximum likelihood estimation (MLE) which is more fair and less susceptible to model autophagy disorder (madness). Model autophagy refers to models trained on their own output; PLE ensures the statistics of these outputs coincide with the data statistics. This enables PLE to be statistically unbiased in certain scenarios where MLE is biased. When biased, MLE unfairly penalizes minority classes in unbalanced datasets and exacerbates the recently discovered issue of self-consuming generative modeling. Theoretical and empirical results show that 1) PLE is more fair to minority classes and 2) PLE is more stable in a self-consumed setting. Furthermore, we provide a scalable and portable implementation of PLE with a hypernetwork framework, allowing existing deep learning architectures to be easily trained with PLE. Finally, we show PLE can bridge the gap between Bayesian and frequentist paradigms in statistics.
- [92] arXiv:2405.13987 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Analysis of Corrected Graph Convolutions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Statistics Theory (math.ST); Machine Learning (stat.ML)
Machine learning for node classification on graphs is a prominent area driven by applications such as recommendation systems. State-of-the-art models often use multiple graph convolutions on the data, as empirical evidence suggests they can enhance performance. However, it has been shown empirically and theoretically that too many graph convolutions can degrade performance significantly, a phenomenon known as oversmoothing. In this paper, we provide a rigorous theoretical analysis, based on the contextual stochastic block model (CSBM), of the performance of vanilla graph convolution from which we remove the principal eigenvector to avoid oversmoothing. We perform a spectral analysis for $k$ rounds of corrected graph convolutions, and we provide results for partial and exact classification. For partial classification, we show that each round of convolution can reduce the misclassification error exponentially up to a saturation level, after which performance does not worsen. For exact classification, we show that the separability threshold can be improved exponentially up to $O({\log{n}}/{\log\log{n}})$ corrected convolutions.
- [93] arXiv:2405.13998 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Bridging Operator Learning and Conditioned Neural Fields: A Unifying Perspective
Comments: 23 pages, 13 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Operator learning is an emerging area of machine learning which aims to learn mappings between infinite dimensional function spaces. Here we uncover a connection between operator learning architectures and conditioned neural fields from computer vision, providing a unified perspective for examining differences between popular operator learning models. We find that many commonly used operator learning models can be viewed as neural fields with conditioning mechanisms restricted to point-wise and/or global information. Motivated by this, we propose the Continuous Vision Transformer (CViT), a novel neural operator architecture that employs a vision transformer encoder and uses cross-attention to modulate a base field constructed with a trainable grid-based positional encoding of query coordinates. Despite its simplicity, CViT achieves state-of-the-art results across challenging benchmarks in climate modeling and fluid dynamics. Our contributions can be viewed as a first step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in physical sciences.
- [94] arXiv:2405.14018 (cross-list from cs.CR) [pdf, ps, html, other]
-
Title: Watermarking Generative Tabular Data
Subjects: Cryptography and Security (cs.CR); Applications (stat.AP)
In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it demonstrates appealing robustness against additive noise attacks. The general idea is to achieve the watermarking through a strategic embedding based on simple data binning. Specifically, the mechanism divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
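The binning idea described in this abstract can be illustrated with a small sketch. It is not the authors' mechanism: the bin count, the keyed choice of half the bins as "green", the nearest-green-bin embedding rule, and the one-sided z-test are all assumptions made here for illustration, and the names `green_bins`, `embed`, and `z_score` are hypothetical.

```python
import math
import random


def green_bins(num_bins, secret):
    """Pick half of the bins as 'green' using a keyed pseudo-random generator (assumed rule)."""
    rng = random.Random(secret)
    return set(rng.sample(range(num_bins), num_bins // 2))


def bin_index(value, num_bins):
    """Index of the equal-width bin on [0, 1) that contains value."""
    return min(int(value * num_bins), num_bins - 1)


def embed(values, num_bins, secret):
    """Shift each value into the nearest green bin, keeping its offset inside the bin."""
    green = sorted(green_bins(num_bins, secret))
    width = 1.0 / num_bins
    marked = []
    for v in values:
        b = bin_index(v, num_bins)
        offset = v - b * width
        nearest = min(green, key=lambda g: abs(g - b))
        marked.append(nearest * width + offset)
    return marked


def z_score(values, num_bins, secret):
    """z-statistic for H0: each value lands in a green bin with probability 1/2."""
    green = green_bins(num_bins, secret)
    hits = sum(bin_index(v, num_bins) in green for v in values)
    n = len(values)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


rng = random.Random(1)
raw = [rng.random() for _ in range(200)]       # unwatermarked column, values in [0, 1)
marked = embed(raw, num_bins=64, secret=42)    # watermarked copy
print(z_score(raw, 64, 42), z_score(marked, 64, 42))
```

With fine bins each value moves only a short distance, which is the intuition behind fidelity being preserved; a large z-score on the marked column is what a detector holding the secret would look for.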
- [95] arXiv:2405.14051 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: A Concentration Inequality for Maximum Mean Discrepancy (MMD)-based Statistics and Its Application in Generative Models
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Maximum Mean Discrepancy (MMD) is a probability metric that has found numerous applications in machine learning. In this work, we focus on its application in generative models, including the minimum MMD estimator, Generative Moment Matching Network (GMMN), and Generative Adversarial Network (GAN). In these cases, MMD is part of an objective function in a minimization or min-max optimization problem. Even if its empirical performance is competitive, the consistency and convergence rate analysis of the corresponding MMD-based estimators has yet to be carried out.
We propose a uniform concentration inequality for a class of Maximum Mean Discrepancy (MMD)-based estimators, that is, a maximum deviation bound of empirical MMD values over a collection of generated distributions and adversarially learned kernels. Here, our inequality serves as an efficient tool in the theoretical analysis for MMD-based generative models. As illustrative examples, we apply our main result to derive generalization error bounds for the MMD-based estimators in the context of the minimum MMD estimator and MMD GAN.
- [96] arXiv:2405.14066 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Online Classification with Predictions
Comments: 24 pages
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worst-case analysis by using machine-learned predictions and distributional assumptions respectively.
- [97] arXiv:2405.14085 (cross-list from quant-ph) [pdf, ps, html, other]
-
Title: Testing Quantumness via Photon Statistics for Time-Bin based Quantum Random Number Generators
Subjects: Quantum Physics (quant-ph); Statistics Theory (math.ST)
Randomness is one of the essential components in many fields including cryptography and simulations. Several Quantum Random Number Generator (QRNG) models have been proposed to produce quantum random numbers, which, owing to quantum theory, are more secure than their classical counterparts. However, QRNGs cannot produce true random numbers without deterministic classical post-processing. If the underlying distribution of the QRNG is close to a uniform distribution, a small amount of post-processing is sufficient to produce good random numbers retaining quantumness. In this work, we address the randomness and quantumness in the random numbers generated by the QRNGs. We consider two models of QRNGs, which ideally produce random numbers following different distributions (exponential and uniform), and show that, in practice, they follow similar distributions. These empirical photon distributions can be used to test the quantumness of a QRNG. In this letter, we suggest the $\chi^2$ goodness-of-fit test to test quantumness, as it is known to be an effective method to test whether sample data follows a known distribution. We derive a relation characterizing when the underlying sampling distributions of the QRNGs will be $\epsilon$-random. Depending on this relation, a suitable post-processing algorithm can be chosen.
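As a rough illustration of the $\chi^2$ goodness-of-fit test mentioned in this abstract, the sketch below computes Pearson's statistic against a uniform reference distribution. The uniform reference, the choice of 8 bins, the synthetic samples, and the function name `chi_square_uniform` are assumptions made for illustration and do not reproduce the authors' procedure or data.

```python
def chi_square_uniform(samples, num_bins):
    """Pearson chi-square statistic for H0: samples are uniform on [0, 1)."""
    counts = [0] * num_bins
    for x in samples:
        counts[min(int(x * num_bins), num_bins - 1)] += 1
    expected = len(samples) / num_bins
    return sum((c - expected) ** 2 / expected for c in counts)


# 95% critical value of the chi-square distribution with 8 - 1 = 7 degrees of freedom.
CRITICAL_95_DF7 = 14.067

# Synthetic stand-ins for generator output (not real QRNG data).
uniform_like = [(i + 0.5) / 800 for i in range(800)]    # evenly spread over [0, 1)
skewed = [((i + 0.5) / 800) ** 2 for i in range(800)]   # mass piled up near 0

stat_uniform = chi_square_uniform(uniform_like, 8)
stat_skewed = chi_square_uniform(skewed, 8)

# H0 is rejected at the 5% level when the statistic exceeds the critical value.
print(stat_uniform > CRITICAL_95_DF7, stat_skewed > CRITICAL_95_DF7)
```

A statistic below the critical value means the test finds no evidence against the reference distribution; it does not by itself certify randomness.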
- [98] arXiv:2405.14088 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: High-dimensional Learning with Noisy Labels
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label-noise-aware loss function, when both the dimension of data $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory by supposing a Gaussian mixture data model, the performance of the linear classifier when $p,n\to \infty$ is shown to converge towards a limit, involving scalar statistics of the data. Importantly, our findings show that low-dimensional intuitions for handling label noise do not hold in high dimensions, in the sense that the classifier that is optimal in low dimensions fails dramatically in high dimensions. Based on our derivations, we design an optimized method that is shown to be provably more efficient in handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where we show that our optimized approach outperforms the considered baselines.
- [99] arXiv:2405.14094 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Attending to Topological Spaces: The Cellular Transformer
Authors: Rubén Ballester, Pablo Hernández-García, Mathilde Papillon, Claudio Battiloro, Nina Miolane, Tolga Birdal, Carles Casacuberta, Sergio Escalera, Mustafa Hajij
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Topological Deep Learning seeks to enhance the predictive performance of neural network models by harnessing topological structures in input data. Topological neural networks operate on spaces such as cell complexes and hypergraphs, which can be seen as generalizations of graphs. In this work, we introduce the Cellular Transformer (CT), a novel architecture that generalizes graph-based transformers to cell complexes. First, we propose a new formulation of the usual self- and cross-attention mechanisms, tailored to leverage incidence relations in cell complexes, e.g., edge-face and node-edge relations. Additionally, we propose a set of topological positional encodings specifically designed for cell complexes. By transforming three graph datasets into cell complex datasets, our experiments reveal that CT not only achieves state-of-the-art performance, but also does so without the need for more complex enhancements such as virtual nodes, in-domain structural encodings, or graph rewiring.
- [100] arXiv:2405.14104 (cross-list from econ.EM) [pdf, ps, html, other]
-
Title: On the Identifying Power of Monotonicity for Average Treatment Effects
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
In the context of a binary outcome, treatment, and instrument, Balke and Pearl (1993, 1997) establish that adding monotonicity to the instrument exogeneity assumption does not decrease the identified sets for average potential outcomes and average treatment effect parameters when those assumptions are consistent with the distribution of the observable data. We show that the same results hold in the broader context of multi-valued outcome, treatment, and instrument. An important example of such a setting is a multi-arm randomized controlled trial with noncompliance.
- [101] arXiv:2405.14373 (cross-list from math.PR) [pdf, ps, html, other]
-
Title: Skew-symmetric schemes for stochastic differential equations with non-Lipschitz drift: an unadjusted Barker algorithm
Comments: 43 pages, 3 figures; Keywords: Skew-symmetric distributions, Stochastic differential equations, Sampling algorithms, Markov Chain Monte Carlo
Subjects: Probability (math.PR); Numerical Analysis (math.NA); Computation (stat.CO)
We propose a new simple and explicit numerical scheme for time-homogeneous stochastic differential equations. The scheme is based on sampling increments at each time step from a skew-symmetric probability distribution, with the level of skewness determined by the drift and volatility of the underlying process. We show that as the step-size decreases the scheme converges weakly to the diffusion of interest. We then consider the problem of simulating from the limiting distribution of an ergodic diffusion process using the numerical scheme with a fixed step-size. We establish conditions under which the numerical scheme converges to equilibrium at a geometric rate, and quantify the bias between the equilibrium distributions of the scheme and of the true diffusion process. Notably, our results do not require a global Lipschitz assumption on the drift, in contrast to those required for the Euler-Maruyama scheme for long-time simulation at fixed step-sizes. Our weak convergence result relies on an extension of the theory of Milstein & Tretyakov to stochastic differential equations with non-Lipschitz drift, which could also be of independent interest. We support our theoretical results with numerical simulations.
- [102] arXiv:2405.14402 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Exact Gauss-Newton Optimization for Training Deep Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently matches or exceeds the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.
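The reason a mini-batch-sized factorization can suffice may be seen from a standard push-through identity, sketched here for a simplified setting that is an assumption of this note rather than the paper's exact update: a plain least-squares loss with mini-batch Jacobian $J \in \mathbb{R}^{b \times d}$, residual vector $r \in \mathbb{R}^{b}$, and damping $\lambda > 0$.

```latex
% Assumed setting: J is b x d (batch size b, parameter dimension d), r is the residual, lambda > 0.
\[
(J^\top J + \lambda I_d)\, J^\top = J^\top (J J^\top + \lambda I_b)
\quad\Longrightarrow\quad
(J^\top J + \lambda I_d)^{-1} J^\top = J^\top (J J^\top + \lambda I_b)^{-1},
\]
\[
p \;=\; -(J^\top J + \lambda I_d)^{-1} J^\top r \;=\; -J^\top (J J^\top + \lambda I_b)^{-1} r .
\]
```

The left-hand form requires solving a $d \times d$ system, while the right-hand form only requires a $b \times b$ system, which is the cheaper route when $b \ll d$.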
- [103] arXiv:2405.14408 (cross-list from math.NA) [pdf, ps, html, other]
-
Title: Adaptive tempering schedules with approximative intermediate measures for filtering problems
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
Data assimilation algorithms integrate prior information from numerical model simulations with observed data. Ensemble-based filters, regarded as state-of-the-art, are widely employed for large-scale estimation tasks in disciplines such as geoscience and meteorology. Despite their inability to produce the true posterior distribution for nonlinear systems, their robustness and capacity for state tracking are noteworthy. In contrast, particle filters yield the correct distribution in the ensemble limit but require substantially larger ensemble sizes than ensemble-based filters to maintain stability in higher-dimensional spaces. It is essential to transcend traditional Gaussian assumptions to achieve realistic quantification of uncertainties. One approach involves the hybridisation of filters, facilitated by tempering, to harness the complementary strengths of different filters. A new adaptive tempering method is proposed to tune the underlying schedule, aiming to systematically surpass the performance previously achieved. Although promising numerical results for certain filter combinations in toy examples exist in the literature, the tuning of hyperparameters presents a considerable challenge. A deeper understanding of these interactions is crucial for practical applications.
- [104] arXiv:2405.14425 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: When predict can also explain: few-shot prediction to select better neural latents
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. In this study, we reveal the limitations of the widely used 'co-smoothing' prediction framework and propose an improved few-shot prediction approach that encourages more accurate latent dynamics. Utilizing a student-teacher setup with Hidden Markov Models, we demonstrate that the high co-smoothing model space can encompass models with arbitrary extraneous dynamics within their latent representations. To address this, we introduce a secondary metric -- a few-shot version of co-smoothing. This involves performing regression from the latent variables to held-out channels in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in few-shot co-smoothing compared to 'minimal' models devoid of such dynamics. We also provide analytical insights into the origin of this phenomenon. We further validate our findings on real neural data using two state-of-the-art methods: LFADS and STNDT. In the absence of ground truth, we suggest a proxy measure to quantify extraneous dynamics. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.
- [105] arXiv:2405.14440 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Bayesian Adaptive Calibration and Optimal Design
Comments: Preprint, currently under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The process of calibrating computer models of natural phenomena is essential for applications in the physical sciences, where plenty of domain knowledge can be embedded into simulations and then calibrated against real observations. Current machine learning approaches, however, mostly rely on rerunning simulations over a fixed set of designs available in the observed data, potentially neglecting informative correlations across the design space and requiring a large number of simulations. Instead, we consider the calibration process from the perspective of Bayesian adaptive experimental design and propose a data-efficient algorithm to run maximally informative simulations within a batch-sequential process. At each round, the algorithm jointly estimates the parameters of the posterior distribution and optimal designs by maximising a variational lower bound of the expected information gain. The simulator is modelled as a sample from a Gaussian process, which allows us to correlate simulations and observed data with the unknown calibration parameters. We show the benefits of our method when compared to related approaches across synthetic and real-data problems.
- [106] arXiv:2405.14468 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Deep neural networks (DNNs) exhibit a surprising structure in their final layer known as neural collapse (NC), and a growing body of work has recently investigated the propagation of neural collapse to earlier layers of DNNs -- a phenomenon called deep neural collapse (DNC). However, existing theoretical results are restricted to special cases: linear models, only two layers, or binary classification. In contrast, we focus on non-linear models of arbitrary depth in multi-class classification and reveal a surprising qualitative shift. As soon as we go beyond two layers or two classes, DNC stops being optimal for the deep unconstrained features model (DUFM) -- the standard theoretical framework for the analysis of collapse. The main culprit is a low-rank bias of multi-layer regularization schemes: this bias leads to optimal solutions of even lower rank than the neural collapse. We support our theoretical findings with experiments on both DUFM and real data, which show the emergence of the low-rank structure in the solution found by gradient descent.
- [107] arXiv:2405.14508 (cross-list from q-bio.QM) [pdf, ps, html, other]
-
Title: Prediction of cancer dynamics under treatment using Bayesian neural networks: A simulated study
Authors: Even Moa Myklebust, Arnoldo Frigessi, Fredrik Schjesvold, Jasmine Foo, Kevin Leder, Alvaro Köhn-Luque
Comments: 22 pages, 10 figures
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Predicting cancer dynamics under treatment is challenging due to high inter-patient heterogeneity, lack of predictive biomarkers, and sparse and noisy longitudinal data. Mathematical models can summarize cancer dynamics by a few interpretable parameters per patient. Machine learning methods can then be trained to predict the model parameters from baseline covariates, but do not account for uncertainty in the parameter estimates. Instead, hierarchical Bayesian modeling can model the relationship between baseline covariates and longitudinal measurements via mechanistic parameters while accounting for uncertainty in every part of the model.
The mapping from baseline covariates to model parameters can be modeled in several ways. A linear mapping simplifies inference but fails to capture nonlinear covariate effects and scales poorly for interaction modeling when the number of covariates is large. In contrast, Bayesian neural networks can potentially discover interactions between covariates automatically, but at a substantial cost in computational complexity.
In this work, we develop a hierarchical Bayesian model of subpopulation dynamics that uses baseline covariate information to predict cancer dynamics under treatment, inspired by cancer dynamics in multiple myeloma (MM), where serum M protein is a well-known proxy of tumor burden. As a working example, we apply the model to a simulated dataset and compare its ability to predict M protein trajectories to that of a model with linear covariate effects. Our results show that the Bayesian neural network covariate effect model predicts cancer dynamics more accurately than a linear covariate effect model when covariate interactions are present. The framework can also be applied to other types of cancer or other time series prediction problems that can be described with a parametric model.
- [108] arXiv:2405.14522 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model's decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.
- [109] arXiv:2405.14544 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Nuclear Norm Regularization for Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Penalizing the nuclear norm of a function's Jacobian encourages it to locally behave like a low-rank linear map. Such functions vary locally along only a handful of directions, making the Jacobian nuclear norm a natural regularizer for machine learning problems. However, this regularizer is intractable for high-dimensional problems, as it requires computing a large Jacobian matrix and taking its singular value decomposition. We show how to efficiently penalize the Jacobian nuclear norm using techniques tailor-made for deep learning. We prove that for functions parametrized as compositions $f = g \circ h$, one may equivalently penalize the average squared Frobenius norm of $Jg$ and $Jh$. We then propose a denoising-style approximation that avoids the Jacobian computations altogether. Our method is simple, efficient, and accurate, enabling Jacobian nuclear norm regularization to scale to high-dimensional deep learning problems. We complement our theory with an empirical study of our regularizer's performance and investigate applications to denoising and representation learning.
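For intuition, the following well-known matrix-level characterization of the nuclear norm has the same shape as the equivalence described in this abstract. It is stated here for a fixed matrix $A$ only; the paper's result for Jacobians of compositions $f = g \circ h$ is its own theorem and is not reproduced.

```latex
% Variational form of the nuclear norm of a matrix A with SVD A = U \Sigma V^\top.
\[
\|A\|_* \;=\; \min_{B,\,C \,:\, A = BC} \; \tfrac{1}{2}\left( \|B\|_F^2 + \|C\|_F^2 \right),
\]
% The minimum is attained at B = U \Sigma^{1/2}, C = \Sigma^{1/2} V^\top,
% since then \|B\|_F^2 = \|C\|_F^2 = \operatorname{tr}(\Sigma) = \|A\|_*.
```

The right-hand side involves only squared Frobenius norms, which can be penalized without computing a singular value decomposition.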
- [110] arXiv:2405.14547 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Causal Effect Identification in a Sub-Population with Latent Variables
Comments: 19 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub-population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by the presence of latent variables in a sub-population, we first extend the classical relevant graphical definitions, such as c-components and Hedges, initially defined for the so-called ID problem (Pearl, 1995; Tian & Pearl, 2002), to their new counterparts. Subsequently, we propose a sound algorithm for the s-ID problem with latent variables.
- [111] arXiv:2405.14657 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Heteroscedastic Preferential Bayesian Optimization with Informative Noise Distributions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Preferential Bayesian optimization (PBO) is a sample-efficient framework for learning human preferences between candidate designs. PBO classically relies on homoscedastic noise models to represent human aleatoric uncertainty. Yet, such noise fails to accurately capture the varying levels of human aleatoric uncertainty, particularly when the user possesses partial knowledge among different pairs of candidates. For instance, a chemist with solid expertise in glucose-related molecules may easily compare two compounds from that family while struggling to compare alcohol-related molecules. Currently, PBO overlooks this uncertainty during the search for a new candidate through the maximization of the acquisition function, consequently underestimating the risk associated with human uncertainty. To address this issue, we propose a heteroscedastic noise model to capture human aleatoric uncertainty. This model adaptively assigns noise levels based on the distance of a specific input to a predefined set of reliable inputs known as anchors provided by the human. Anchors encapsulate partial knowledge and offer insight into the comparative difficulty of evaluating different candidate pairs. Such a model can be seamlessly integrated into the acquisition function, thus leading to candidate design pairs that elegantly trade informativeness and ease of comparison for the human expert. We perform an extensive empirical evaluation of the proposed approach, demonstrating a consistent improvement over homoscedastic PBO.
- [112] arXiv:2405.14681 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Recursive PAC-Bayes: A Frequentist Approach to Sequential Prior Updates with No Information Loss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch.
We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluation, the new procedure significantly outperforms the state of the art.
- [113] arXiv:2405.14690 (cross-list from q-bio.QM) [pdf, ps, html, other]
-
Title: Multilevel functional data analysis modeling of human glucose response to meal intake
Subjects: Quantitative Methods (q-bio.QM); Applications (stat.AP)
Glucose meal response information collected via Continuous Glucose Monitoring (CGM) is relevant to the assessment of individual metabolic status and the support of personalized diet prescriptions. However, the complexity of the data produced by CGM monitors pushes the limits of existing analytic methods. CGM data often exhibits substantial within-person variability and has a natural multilevel structure. This research is motivated by the analysis of CGM data from individuals without diabetes in the AEGIS study. The dataset includes detailed information on meal timing and nutrition for each individual over different days. The primary focus of this study is to examine CGM glucose responses following patients' meals and explore the time-dependent associations with dietary and patient characteristics. Motivated by this problem, we propose a new analytical framework based on multilevel functional models, including a new functional mixed R-square coefficient. The use of these models illustrates three key points: (i) the importance of analyzing glucose responses across the entire functional domain when making diet recommendations; (ii) the differential metabolic responses between normoglycemic and prediabetic patients, particularly with regard to lipid intake; (iii) the importance of including random, person-level effects when modelling this scientific problem.
- [114] arXiv:2405.14780 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Metric Flow Matching for Smooth Interpolations on the Data Manifold
Authors: Kacper Kapusniak, Peter Potaptchik, Teodora Reu, Leo Zhang, Alexander Tong, Michael Bronstein, Avishek Joey Bose, Francesco Di Giovanni
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Matching objectives underpin the success of modern generative models and rely on constructing conditional paths that transform a source distribution into a target distribution. Despite being a fundamental building block, conditional paths have been designed principally under the assumption of Euclidean geometry, resulting in straight interpolations. However, this can be particularly restrictive for tasks such as trajectory inference, where straight paths might lie outside the data manifold, thus failing to capture the underlying dynamics giving rise to the observed marginals. In this paper, we propose Metric Flow Matching (MFM), a novel simulation-free framework for conditional flow matching where interpolants are approximate geodesics learned by minimizing the kinetic energy of a data-induced Riemannian metric. This way, the generative model matches vector fields on the data manifold, which corresponds to lower uncertainty and more meaningful interpolations. We prescribe general metrics to instantiate MFM, independent of the task, and test it on a suite of challenging problems including LiDAR navigation, unpaired image translation, and modeling cellular dynamics. We observe that MFM outperforms the Euclidean baselines, particularly achieving SOTA on single-cell trajectory prediction.
- [115] arXiv:2405.14806 (cross-list from physics.data-an) [pdf, ps, html, other]
-
Title: Lorentz-Equivariant Geometric Algebra Transformers for High-Energy Physics
Comments: 10+12 pages, 5+2 figures, 2 tables
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
Extracting scientific understanding from particle-physics experiments requires solving diverse learning problems with high precision and good data efficiency. We propose the Lorentz Geometric Algebra Transformer (L-GATr), a new multi-purpose architecture for high-energy physics. L-GATr represents high-energy data in a geometric algebra over four-dimensional space-time and is equivariant under Lorentz transformations, the symmetry group of relativistic kinematics. At the same time, the architecture is a Transformer, which makes it versatile and scalable to large systems. L-GATr is first demonstrated on regression and classification tasks from particle physics. We then construct the first Lorentz-equivariant generative model: a continuous normalizing flow based on an L-GATr network, trained with Riemannian flow matching. Across our experiments, L-GATr is on par with or outperforms strong domain-specific baselines.
- [116] arXiv:2405.14822 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. Because the decoder is grown progressively, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion models such as LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
- [117] arXiv:2405.14861 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Adapting to Unknown Low-Dimensional Structures in Score-Based Diffusion Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
This paper investigates score-based diffusion models when the underlying target distribution is concentrated on or near low-dimensional manifolds within the higher-dimensional space in which they formally reside, a common characteristic of natural image distributions. Despite previous efforts to understand the data generation process of diffusion models, existing theoretical support remains highly suboptimal in the presence of low-dimensional structure, which we strengthen in this paper. For the popular Denoising Diffusion Probabilistic Model (DDPM), we find that the dependency of the error incurred within each denoising step on the ambient dimension $d$ is in general unavoidable. We further identify a unique design of coefficients that yields a convergence rate of order $O(k^{2}/\sqrt{T})$ (up to log factors), where $k$ is the intrinsic dimension of the target distribution and $T$ is the number of steps. This represents the first theoretical demonstration that the DDPM sampler can adapt to unknown low-dimensional structures in the target distribution, highlighting the critical importance of coefficient design. All of this is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner.
Cross submissions for Friday, 24 May 2024 (showing 46 of 46 entries)
- [118] arXiv:2202.00563 (replaced) [pdf, ps, html, other]
-
Title: On the Limitations of General Purpose Domain Generalisation Methods
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate the fundamental performance limitations of learning algorithms in several Domain Generalisation (DG) settings. Motivated by the difficulty that previously proposed methods have in reliably outperforming Empirical Risk Minimisation (ERM), we derive upper bounds on the excess risk of ERM, and lower bounds on the minimax excess risk. Our findings show that in all the DG settings we consider, it is not possible to significantly outperform ERM. Our conclusions hold not only in the standard covariate shift setting, but also in two other settings with additional restrictions on how domains can differ. The first constrains all domains to have a non-trivial bound on pairwise distances, as measured by a broad class of integral probability metrics. The second alternate setting considers a restricted class of DG problems where all domains have the same underlying support. Our analysis also suggests how different strategies can be used to optimise the performance of ERM in each of these DG settings. We also experimentally explore hypotheses suggested by our theoretical analysis.
- [119] arXiv:2203.15897 (replaced) [pdf, ps, html, other]
-
Title: Calibrated Model Criticism Using Split Predictive Checks
Comments: v3: updated some discussion of model criticism and predictive checks; improved some figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Checking how well a fitted model explains the data is one of the most fundamental parts of a Bayesian data analysis. However, existing model checking methods suffer from trade-offs between being well-calibrated, automated, and computationally efficient. To overcome these limitations, we propose split predictive checks (SPCs), which combine the ease-of-use and speed of posterior predictive checks with the good calibration properties of predictive checks that rely on model-specific derivations or inference schemes. We develop an asymptotic theory for two types of SPCs: single SPCs and divided SPCs. Our results demonstrate that they offer complementary strengths. Single SPCs work well with smaller datasets and provide excellent power when there is substantial misspecification, such as when the estimate uncertainty in the test statistic is significantly underestimated. When the sample size is large, divided SPCs can provide better power and are able to detect more subtle forms of misspecification. We validate the finite-sample utility of SPCs through extensive simulation experiments in exponential family and hierarchical models, and provide three real-data examples where SPCs offer novel insights and additional flexibility beyond what is available when using posterior predictive checks.
- [120] arXiv:2204.07747 (replaced) [pdf, ps, html, other]
-
Title: A Variational Approach to Bayesian Phylogenetic Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo (MCMC) with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides performance competitive with MCMC, while requiring far fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.
- [121] arXiv:2206.04306 (replaced) [pdf, ps, html, other]
-
Title: Limit results for distributed estimation of invariant subspaces in multiple networks inference and PCA
Subjects: Statistics Theory (math.ST)
We study the problem of distributed estimation of the leading singular vectors for a collection of matrices with shared invariant subspaces. In particular we consider an algorithm that first estimates the projection matrices corresponding to the leading singular vectors for each individual matrix, then computes the average of the projection matrices, and finally returns the leading eigenvectors of the sample average. We show that the algorithm, when applied to (1) parameter estimation for a collection of independent edge random graphs with shared singular vectors but possibly heterogeneous edge probabilities or (2) distributed PCA for independent sub-Gaussian random vectors with spiked covariance structure, yields estimates whose row-wise fluctuations are normally distributed around the rows of the true singular vectors. Leveraging these results we also consider a two-sample test for the null hypothesis that a pair of random graphs have the same edge probabilities and we present a test statistic whose limiting distribution is a central (resp. non-central) $\chi^2$ under the null (resp. local alternative) hypothesis.
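The three steps named in this abstract (per-matrix projection, averaging, leading eigenvectors of the average) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `leading_projection` and `distributed_subspace`, the use of left singular vectors, and the assumption that the subspace dimension `d` is known are all choices made here for the example.

```python
import numpy as np

def leading_projection(M, d):
    """Projection matrix onto the span of the d leading left singular vectors of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    Ud = U[:, :d]                      # singular values are returned in descending order
    return Ud @ Ud.T

def distributed_subspace(matrices, d):
    """Average the per-matrix projections, then return the d leading eigenvectors.

    Assumption for this sketch: d is known and all matrices have the same row count.
    """
    P_bar = np.mean([leading_projection(M, d) for M in matrices], axis=0)
    _, eigvecs = np.linalg.eigh(P_bar)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, -d:][:, ::-1]     # keep the top d, largest eigenvalue first
```

Only the per-matrix projection (or its factor `Ud`) needs to be communicated to the aggregator, which is what makes the scheme suitable for a distributed setting.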
- [122] arXiv:2209.04389 (replaced) [pdf, ps, html, other]
-
Title: Posterior contraction and uncertainty quantification for the multivariate spike-and-slab LASSO
Subjects: Statistics Theory (math.ST)
We study the asymptotic properties of Deshpande et al. (2019)'s multivariate spike-and-slab LASSO (mSSL) procedure for simultaneous variable and covariance selection in the sparse multivariate linear regression problem. In that problem, $q$ correlated responses are regressed onto $p$ covariates and the mSSL works by placing separate spike-and-slab priors on the entries in the matrix of marginal covariate effects and off-diagonal elements in the upper triangle of the residual precision matrix. Under mild assumptions about these matrices, we establish the posterior contraction rate for the mSSL posterior in the asymptotic regime where both $p$ and $q$ diverge with $n$. By "de-biasing" the corresponding MAP estimates, we obtain confidence intervals for each covariate effect and residual partial correlation. In extensive simulation studies, these intervals displayed close-to-nominal frequentist coverage in finite sample settings but tended to be substantially longer than those obtained using a version of the Bayesian bootstrap that randomly re-weights the prior. We further show that the de-biased intervals for individual covariate effects are asymptotically valid.
- [123] arXiv:2210.09560 (replaced) [pdf, ps, html, other]
-
Title: A Bayesian Convolutional Neural Network-based Generalized Linear Model
Comments: 25 pages, 7 figures
Subjects: Methodology (stat.ME)
Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear models (GLMs) framework. We use extracted nodes from the last hidden layer of the CNN with Monte Carlo (MC) dropout as informative covariates in the GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems, which have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm can be broadly applicable to image regressions or correlated data analysis by enabling accurate Bayesian inference quickly.
- [124] arXiv:2302.12565 (replaced) [pdf, ps, html, other]
-
Title: Variational Linearized Laplace Approximation for Bayesian Deep Learning
Comments: 22 pages, 8 figures, ICML 2024
Journal-ref: PMLR 235 (2024)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The Linearized Laplace Approximation (LLA) has been recently used to perform uncertainty estimation on the predictions of pre-trained deep neural networks (DNNs). However, its widespread application is hindered by significant computational costs, particularly in scenarios with a large number of training points or DNN parameters. Consequently, additional approximations of LLA, such as Kronecker-factored or diagonal approximate GGN matrices, are utilized, potentially compromising the model's performance. To address these challenges, we propose a new method for approximating LLA using a variational sparse Gaussian Process (GP). Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN. Furthermore, it allows for efficient stochastic optimization, which results in sub-linear training time in the size of the training dataset. Specifically, its training cost is independent of the number of training points. We compare our proposed method against accelerated LLA (ELLA), which relies on the Nyström approximation, as well as other LLA variants employing the sample-then-optimize principle. Experimental results, both on regression and classification datasets, show that our method outperforms these already existing efficient variants of LLA, both in terms of the quality of the predictive distribution and in terms of total computational time.
- [125] arXiv:2303.05659 (replaced) [pdf, ps, html, other]
-
Title: A marginal structural model for normal tissue complication probability
Subjects: Methodology (stat.ME)
The goal of radiation therapy for cancer is to deliver the prescribed radiation dose to the tumor while minimizing dose to the surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized as dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modelling has centered around making patient-level risk predictions with features extracted from the DVHs, but few have considered adapting a causal framework to evaluate the safety of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, together with estimators based on marginal structural models that impose bivariable monotonicity between dose, volume, and toxicity risk. The properties of these estimators are studied through simulations, and their use is illustrated in the context of radiotherapy treatment of anal canal cancer patients.
- [126] arXiv:2303.09575 (replaced) [pdf, ps, html, other]
-
Title: Sample size determination via learning-type curves
Comments: 22 pages, 4 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
This paper is concerned with sample size determination methodology for prediction models. We propose combining the individual calculations via a learning-type curve. We suggest two distinct ways of doing so, a deterministic skeleton of a learning curve and a Gaussian process centred upon its deterministic counterpart. We employ several learning algorithms for modelling the primary endpoint and distinct measures for trial efficacy. We find that the performance may vary with the sample size, but borrowing information across sample sizes universally improves the performance of such calculations. The Gaussian process-based learning curve appears more robust and statistically efficient, while computational efficiency is comparable. We suggest that anchoring against historical evidence when extrapolating sample sizes should be adopted when such data are available. The methods are illustrated on binary and survival endpoints.
- [127] arXiv:2305.14194 (replaced) [pdf, ps, html, other]
-
Title: A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments
Subjects: Methodology (stat.ME)
We develop new methodology to improve our understanding of the causal effects of multivariate air pollution exposures on public health. Typically, exposure to air pollution for an individual is measured at their home geographic region, though people travel to different regions with potentially different levels of air pollution. To account for this, we incorporate estimates of the mobility of individuals from cell phone mobility data to get an improved estimate of their exposure to air pollution. We treat this as an interference problem, where individuals in one geographic region can be affected by exposures in other regions due to mobility into those areas. We propose policy-relevant estimands and derive expressions showing the extent of bias one would obtain by ignoring this mobility. We additionally highlight the benefits of the proposed interference framework relative to a measurement error framework for accounting for mobility. We develop novel estimation strategies to estimate causal effects that account for this spatial spillover utilizing flexible Bayesian methodology. Lastly, we use the proposed methodology to study the health effects of ambient air pollution on mortality among Medicare enrollees in the United States.
- [128] arXiv:2305.17028 (replaced) [pdf, ps, html, other]
-
Title: Better Batch for Deep Probabilistic Time Series Forecasting
Comments: 11 pages, 3 figures, 3 tables, The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024); We corrected some misleading notations in the published version
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep probabilistic time series forecasting has gained attention for its ability to provide nonlinear approximation and valuable uncertainty quantification for decision-making. However, existing models often oversimplify the problem by assuming a time-independent error process and overlooking serial correlation. To overcome this limitation, we propose an innovative training method that incorporates error autocorrelation to enhance probabilistic forecasting accuracy. Our method constructs a mini-batch as a collection of $D$ consecutive time series segments for model training. It explicitly learns a time-varying covariance matrix over each mini-batch, encoding error correlation among adjacent time steps. The learned covariance matrix can be used to improve prediction accuracy and enhance uncertainty quantification. We evaluate our method on two different neural forecasting models and multiple public datasets. Experimental results confirm the effectiveness of the proposed approach in improving the performance of both models across a range of datasets, resulting in notable improvements in predictive accuracy.
- [129] arXiv:2305.17731 (replaced) [pdf, ps, html, other]
-
Title: Moment-Based Adjustments of Statistical Inference in High-Dimensional Generalized Linear Models
Comments: 33 pages
Subjects: Statistics Theory (math.ST)
We developed a statistical inference method applicable to a broad range of generalized linear models (GLMs) in high-dimensional settings, where the number of unknown coefficients scales proportionally with the sample size. Although a pioneering inference method has been developed for logistic regression, which is a specific instance of GLMs, we cannot apply this method directly to other GLMs because of unknown hyper-parameters. In this study, we addressed this limitation by developing a new inference method designed for a certain class of GLMs. Our method is based on the adjustment of asymptotic normality in high dimensions and is feasible in the sense that it is possible even with unknown hyper-parameters. Specifically, we introduce a novel convex loss-based estimator and its associated system, which are essential components of inference. Next, we devise a moment-based method for estimating the system parameters required by the method. Consequently, we construct confidence intervals for GLMs in a high-dimensional regime. We prove that our proposed method has desirable theoretical properties, such as strong consistency and exact coverage probability. Finally, we experimentally confirmed its validity.
- [130] arXiv:2306.00096 (replaced) [pdf, ps, html, other]
-
Title: Learning the Pareto Front Using Bootstrapped Observation Samples
Comments: 37 pages including appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider Pareto front identification (PFI) for linear bandits (PFILin), i.e., the goal is to identify a set of arms with undominated mean reward vectors when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is optimal up to a logarithmic factor. In addition, the regret incurred by our algorithm during the estimation is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front. Our key contribution is a new estimator that in every round updates the estimate for the unknown parameter along multiple context directions, in contrast to the conventional estimator that only updates the parameter estimate along the chosen context. This allows us to use low-regret arms to collect information about Pareto optimal arms. Our key innovation is to reuse the exploration samples multiple times, in contrast to conventional estimators that use each sample only once. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while controlling the regret.
- [131] arXiv:2306.06844 (replaced) [pdf, ps, html, other]
-
Title: Provably Efficient Bayesian Optimization with Unbiased Gaussian Process Hyperparameter Estimation
Comments: 25 pages, 5 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Gaussian process (GP) based Bayesian optimization (BO) is a powerful method for optimizing black-box functions efficiently. The practical performance and theoretical guarantees of this approach depend on having the correct GP hyperparameter values, which are usually unknown in advance and need to be estimated from the observed data. However, in practice, these estimations could be incorrect due to biased data sampling strategies used in BO. This can lead to degraded performance and break the sub-linear global convergence guarantee of BO. To address this issue, we propose a new BO method that can sub-linearly converge to the objective function's global optimum even when the true GP hyperparameters are unknown in advance and need to be estimated from the observed data. Our method uses a multi-armed bandit technique (EXP3) to add random data points to the BO process, and employs a novel training loss function for the GP hyperparameter estimation process that ensures consistent estimation. We further provide theoretical analysis of our proposed method. Finally, we demonstrate empirically that our method outperforms existing approaches on various synthetic and real-world problems.
- [132] arXiv:2306.11697 (replaced) [pdf, ps, html, other]
-
Title: Treatment Effects in Extreme Regimes
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Understanding treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the unavailability of counterfactual outcomes and the rarity and difficulty of collecting extreme data in practice. To address this issue, we propose a new framework based on extreme value theory for estimating treatment effects in extreme regimes. We quantify these effects using variations in tail decay rates of potential outcomes in the presence and absence of treatments. We establish algorithms for calculating these quantities and develop related theoretical results. We demonstrate the efficacy of our approach on various standard synthetic and semi-synthetic datasets.
- [133] arXiv:2306.11895 (replaced) [pdf, ps, html, other]
-
Title: Learning Elastic Costs to Shape Monge Displacements
Michal Klein, Aram-Alexandre Pooladian, Pierre Ablin, Eugène Ndiaye, Jonathan Niles-Weed, Marco Cuturi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Given a source and a target probability measure supported on $\mathbb{R}^d$, the Monge problem asks to find the most efficient way to map one distribution to the other. This efficiency is quantified by defining a \textit{cost} function between source and target data. Such a cost is often set by default in the machine learning literature to the squared-Euclidean distance, $\ell^2_2(\mathbf{x},\mathbf{y})=\tfrac12|\mathbf{x}-\mathbf{y}|_2^2$. Recently, Cuturi et al. '23 highlighted the benefits of using elastic costs, defined through a regularizer $\tau$ as $c(\mathbf{x},\mathbf{y})=\ell^2_2(\mathbf{x},\mathbf{y})+\tau(\mathbf{x}-\mathbf{y})$. Such costs shape the \textit{displacements} of Monge maps $T$, i.e., the difference between a source point and its image $T(\mathbf{x})-\mathbf{x}$, by giving them a structure that matches that of the proximal operator of $\tau$. In this work, we make two important contributions to the study of elastic costs: (i) For any elastic cost, we propose a numerical method to compute Monge maps that are provably optimal. This provides a much-needed routine to create synthetic problems where the ground truth OT map is known, by analogy to the Brenier theorem, which states that the gradient of any convex potential is always a valid Monge map for the $\ell_2^2$ cost; (ii) We propose a loss to \textit{learn} the parameter $\theta$ of a parameterized regularizer $\tau_\theta$, and apply it in the case where $\tau_{A}(\mathbf{z})=|A^\perp \mathbf{z}|^2_2$. This regularizer promotes displacements that lie on a low-dimensional subspace of $\mathbb{R}^d$, spanned by the $p$ rows of $A\in\mathbb{R}^{p\times d}$.
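The elastic cost with the subspace regularizer $\tau_A$ from this abstract can be evaluated with a few lines of NumPy. This is a hedged sketch rather than the authors' code: the function name `elastic_cost` is hypothetical, and the code assumes (this is not stated in the abstract) that the rows of $A$ are orthonormal, so that $|A^\perp \mathbf{z}|_2^2 = |\mathbf{z}|_2^2 - |A\mathbf{z}|_2^2$.

```python
import numpy as np

def elastic_cost(x, y, A):
    """c(x, y) = 0.5 * ||x - y||^2 + ||A_perp (x - y)||^2.

    Assumption for this sketch: the rows of A are orthonormal, so the squared
    norm of the component of z orthogonal to the row space of A equals
    ||z||^2 - ||A z||^2.
    """
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_norm = float(z @ z)
    Az = A @ z
    tau = sq_norm - float(Az @ Az)     # penalizes displacement outside span of A's rows
    return 0.5 * sq_norm + tau

# Displacement (1, 2, 3) with A selecting the first coordinate:
A = np.array([[1.0, 0.0, 0.0]])
cost = elastic_cost([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], A)  # 0.5 * 14 + (14 - 1) = 20.0
```

A displacement lying entirely in the row space of `A` incurs no regularization penalty, which is how the cost favors displacements on the low-dimensional subspace.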
- [134] arXiv:2306.13580 (replaced) [pdf, ps, html, other]
-
Title: Lower Complexity Adaptation for Empirical Entropic Optimal Transport
Comments: 51 pages, 4 figures, proof of LCA for sub-Gaussian measures (Theorem 3.13) corrected
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Entropic optimal transport (EOT) presents an effective and computationally viable alternative to unregularized optimal transport (OT), offering diverse applications for large-scale data analysis. In this work, we derive novel statistical bounds for empirical plug-in estimators of the EOT cost and show that their statistical performance in the entropy regularization parameter $\epsilon$ and the sample size $n$ only depends on the simpler of the two probability measures. For instance, under sufficiently smooth costs this yields the parametric rate $n^{-1/2}$ with factor $\epsilon^{-d/2}$, where $d$ is the minimum dimension of the two population measures. This confirms that empirical EOT also adheres to the lower complexity adaptation principle, a hallmark feature only recently identified for unregularized OT. As a consequence of our theory, we show that the empirical entropic Gromov-Wasserstein distance and its unregularized version for measures on Euclidean spaces also obey this principle. Additionally, we comment on computational aspects and complement our findings with Monte Carlo simulations. Our techniques employ empirical process theory and rely on a dual formulation of EOT over a single function class. Crucial to our analysis is the observation that the entropic cost-transformation of a function class does not increase its uniform metric entropy by much.
- [135] arXiv:2307.04191 (replaced) [pdf, ps, html, other]
-
Title: On the sample complexity of parameter estimation in logistic regression with normal design
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
The logistic regression model is one of the most popular data generation models in noisy binary classification problems. In this work, we study the sample complexity of estimating the parameters of the logistic regression model up to a given $\ell_2$ error, in terms of the dimension and the inverse temperature, with standard normal covariates. The inverse temperature controls the signal-to-noise ratio of the data generation process. While both generalization bounds and asymptotic performance of the maximum-likelihood estimator for logistic regression are well-studied, the non-asymptotic sample complexity that shows the dependence on error and the inverse temperature for parameter estimation is absent from previous analyses. We show that the sample complexity curve has two change-points in terms of the inverse temperature, clearly separating the low, moderate, and high temperature regimes.
- [136] arXiv:2309.08599 (replaced) [pdf, ps, html, other]
-
Title: An assessment of racial disparities in pretrial decision-making using misclassification models
Comments: 42 pages, 1 figure, 8 tables
Subjects: Applications (stat.AP)
Pretrial risk assessment tools are used in jurisdictions across the country to assess the likelihood of "pretrial failure," the event where defendants either fail to appear for court or reoffend. Judicial officers, in turn, use these assessments to determine whether to release or detain defendants during trial. While algorithmic risk assessment tools were designed to predict pretrial failure with greater accuracy relative to judges, there is still concern that both risk assessment recommendations and pretrial decisions are biased against minority groups. In this paper, we develop methods to investigate the association between risk factors and pretrial failure, while simultaneously estimating misclassification rates of pretrial risk assessments and of judicial decisions as a function of defendant race. This approach adds to a growing literature that makes use of outcome misclassification methods to answer questions about fairness in pretrial decision-making. We give a detailed simulation study for our proposed methodology and apply these methods to data from the Virginia Department of Criminal Justice Services. We estimate that the VPRAI algorithm has near-perfect specificity, but its sensitivity differs by defendant race. Judicial decisions also display evidence of bias; we estimate wrongful detention rates of 39.7% and 51.4% among white and Black defendants, respectively.
- [137] arXiv:2309.09367 (replaced) [pdf, ps, html, other]
-
Title: ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors
Comments: 36 pages, 7 tables, 5 figures
Subjects: Computation (stat.CO); Methodology (stat.ME)
In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs.
- [138] arXiv:2309.15600 (replaced) [pdf, ps, html, other]
-
Title: pencal: an R Package for the Dynamic Prediction of Survival with Many Longitudinal Predictors
Subjects: Methodology (stat.ME); Computation (stat.CO)
In survival analysis, longitudinal information on the health status of a patient can be used to dynamically update the predicted probability that a patient will experience an event of interest. Traditional approaches to dynamic prediction such as joint models become computationally infeasible with more than a handful of longitudinal covariates, warranting the development of methods that can handle a larger number of longitudinal covariates. We introduce the R package pencal, which implements a Penalized Regression Calibration (PRC) approach that makes it possible to handle many longitudinal covariates as predictors of survival. pencal uses mixed-effects models to summarize the trajectories of the longitudinal covariates up to a prespecified landmark time, and a penalized Cox model to predict survival based on both baseline covariates and summary measures of the longitudinal covariates. This article illustrates the structure of the R package, provides a step-by-step example showing how to estimate PRC, compute dynamic predictions of survival and validate performance, and shows how parallelization can be used to significantly reduce computing time.
- [139] arXiv:2310.09818 (replaced) [pdf, ps, html, other]
-
Title: MCMC for Bayesian nonparametric mixture modeling under differential privacy
Subjects: Computation (stat.CO); Methodology (stat.ME)
Estimating the probability density of a population while preserving the privacy of individuals in that population is an important and challenging problem that has received considerable attention in recent years. While the previous literature focused on frequentist approaches, in this paper, we propose a Bayesian nonparametric mixture model under differential privacy (DP) and present two Markov chain Monte Carlo (MCMC) algorithms for posterior inference. One is a marginal approach, resembling Neal's Algorithm 5 with a pseudo-marginal Metropolis-Hastings move, and the other is a conditional approach. Although our focus is primarily on local DP, we show that our MCMC algorithms can be easily extended to deal with global DP mechanisms. Moreover, for some carefully chosen mechanisms and mixture kernels, we show how auxiliary parameters can be analytically marginalized, allowing standard MCMC algorithms (i.e., non-privatized, such as Neal's Algorithm 2) to be efficiently employed. Our approach is general and applicable to any mixture model and privacy mechanism. In several simulations and a real case study, we discuss the performance of our algorithms and evaluate different privacy mechanisms proposed in the frequentist literature.
- [140] arXiv:2310.14711 (replaced) [pdf, ps, html, other]
-
Title: Quasi-Maximum Likelihood Estimation of long-memory linear processes
Authors: Jean-Marc Bardet (SAMM), Yves Gael Tchabo Mbienkeu (UY1)
Subjects: Statistics Theory (math.ST)
The purpose of this paper is to study the convergence of the quasi-maximum likelihood (QML) estimator for long memory linear processes. We first establish a correspondence between the long-memory linear process representation and the long-memory AR$(\infty)$ process representation. We then establish the almost sure consistency and asymptotic normality of the QML estimator. Numerical simulations illustrate the theoretical results and confirm the good performance of the estimator.
- [141] arXiv:2310.18449 (replaced) [pdf, ps, html, other]
-
Title: Conditional Generative Representation for Black-Box Optimization with Implicit Constraints
Subjects: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Black-box optimization (BBO) has become increasingly relevant for tackling complex decision-making problems, especially in public policy domains such as police districting. However, its broader application in public policymaking is hindered by the complexity of defining feasible regions and the high dimensionality of decisions. This paper introduces a novel BBO framework, termed Conditional And Generative Black-box Optimization (CageBO). This approach leverages a conditional variational autoencoder to learn the distribution of feasible decisions, enabling a two-way mapping between the original decision space and a simplified, constraint-free latent space. CageBO efficiently handles the implicit constraints often found in public policy applications, allowing for optimization in the latent space while evaluating objectives in the original space. We validate our method through a case study on large-scale police districting problems in Atlanta, Georgia. Our results reveal that CageBO offers notable improvements in performance and efficiency compared to the baselines.
- [142] arXiv:2310.19245 (replaced) [pdf, ps, html, other]
-
Title: Efficient Shapley Performance Attribution for Least-Squares Regression
Comments: 36 pages, 5 figures
Subjects: Computation (stat.CO)
We consider the performance of a least-squares regression model, as judged by out-of-sample $R^2$. Shapley values give a fair attribution of the performance of a model to its input features, taking into account interdependencies between features. Evaluating the Shapley values exactly requires solving a number of regression problems that is exponential in the number of features, so a Monte Carlo-type approximation is typically used. We focus on the special case of least-squares regression models, where several tricks can be used to compute and evaluate regression models efficiently. These tricks give a substantial speedup, allowing many more Monte Carlo samples to be evaluated and thereby achieving better accuracy. We refer to our method as least-squares Shapley performance attribution (LS-SPA), and describe our open-source implementation.
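The Monte Carlo-type approximation mentioned in this abstract is commonly done by sampling random feature orderings and averaging each feature's marginal gain. The sketch below illustrates that generic idea only; it is not the LS-SPA implementation and uses none of its least-squares speedups. The function `shapley_monte_carlo` and the toy performance function `toy_value` are hypothetical names introduced here. In the abstract's setting, the value of a subset would instead be the out-of-sample $R^2$ of a least-squares fit restricted to those features.

```python
import random
from typing import Callable, FrozenSet, List


def shapley_monte_carlo(
    value: Callable[[FrozenSet[int]], float],
    n_features: int,
    n_permutations: int,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        included = set()
        previous = value(frozenset(included))
        for j in order:
            # Marginal gain of adding feature j to the features before it.
            included.add(j)
            current = value(frozenset(included))
            totals[j] += current - previous
            previous = current
    return [t / n_permutations for t in totals]


def toy_value(subset: FrozenSet[int]) -> float:
    """Hypothetical stand-in for out-of-sample R^2 of a model using `subset`."""
    base = {0: 0.30, 1: 0.20, 2: 0.10}
    v = sum(base[j] for j in subset)
    if 0 in subset and 1 in subset:
        v += 0.10  # interaction that features 0 and 1 share equally
    return v


estimates = shapley_monte_carlo(toy_value, n_features=3, n_permutations=2000)
print(estimates)
```

For this toy function the exact Shapley values are 0.35, 0.25 and 0.10, and the estimates always sum to the value of the full feature set because the marginal gains telescope within each ordering.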
- [143] arXiv:2311.05025 (replaced) [pdf, ps, html, other]
-
Title: Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients
Comments: 99 Pages, 13 Figures
Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Methodology (stat.ME); Machine Learning (stat.ML)
We present an unbiased method for Bayesian posterior means based on kinetic Langevin dynamics that combines advanced splitting methods with enhanced gradient approximations. Our approach avoids Metropolis correction by coupling Markov chains at different discretization levels in a multilevel Monte Carlo approach. Theoretical analysis demonstrates that our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. It can achieve accuracy $\epsilon>0$ for estimating expectations of Lipschitz functions in $d$ dimensions with $\mathcal{O}(d^{1/4}\epsilon^{-2})$ expected gradient evaluations, without assuming warm start. We exhibit similar bounds using both approximate and stochastic gradients, and our method's computational cost is shown to scale independently of the size of the dataset. The proposed method is tested using a multinomial regression problem on the MNIST dataset and a Poisson regression model for soccer scores. Experiments indicate that the number of gradient evaluations per effective sample is independent of dimension, even when using inexact gradients. For product distributions, we give dimension-independent variance bounds. Our results demonstrate that the unbiased algorithm we present can be much more efficient than the "gold-standard" randomized Hamiltonian Monte Carlo.
- [144] arXiv:2311.11153 (replaced) [pdf, ps, html, other]
-
Title: Biarchetype analysis: simultaneous learning of observations and features based on extremes
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. The application of biarchetype analysis across various machine learning challenges underscores its value, and both the source code and examples are readily accessible in R and Python at this https URL.
- [145] arXiv:2311.13572 (replaced) [pdf, ps, html, other]
-
Title: Likelihood Geometry of Reflexive Polytopes
Comments: 31 pages, 6 figures, 5 tables
Subjects: Statistics Theory (math.ST); Algebraic Geometry (math.AG); Combinatorics (math.CO)
We study the problem of maximum likelihood (ML) estimation for statistical models defined by reflexive polytopes. Our focus is on the maximum likelihood degree of these models as an algebraic measure of complexity of the corresponding optimization problem. We compute the ML degrees of all 4319 classes of three-dimensional reflexive polytopes, and observe some surprising behavior in terms of the presence of gaps between ML degrees and degrees of the associated toric varieties. We interpret these drops in the context of discriminants and prove formulas for the ML degree for families of reflexive polytopes, including the hypercube and its dual, the cross polytope, in arbitrary dimension. In particular, we determine a family of embeddings for the $d$-cube that implies ML degree one. Finally, we discuss generalized constructions of families of reflexive polytopes in terms of their ML degrees.
- [146] arXiv:2311.15322 (replaced) [pdf, ps, html, other]
-
Title: False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values
Subjects: Methodology (stat.ME)
The effective utilization of structural information in data while ensuring statistical validity poses a significant challenge in false discovery rate (FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning methods without relying on strong assumptions or highly idealized models. However, existing conformal methods have limitations in handling structured multiple testing. This is because their validity requires the deployment of symmetric rules, which assume the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome these limitations, we introduce the pseudo local index of significance (PLIS) procedure, which is capable of accommodating asymmetric rules and requires only pairwise exchangeability between the null conformity scores. We demonstrate that PLIS offers finite-sample guarantees in FDR control and the ability to assign higher weights to relevant data points. Numerical results confirm the effectiveness and robustness of PLIS and show improvements in power compared to existing model-free methods in various scenarios.
- [147] arXiv:2312.12782 (replaced) [pdf, ps, html, other]
-
Title: Spectral gap bounds for reversible hybrid Gibbs chains
Subjects: Statistics Theory (math.ST); Probability (math.PR)
Hybrid Gibbs samplers represent a prominent class of approximated Gibbs algorithms that utilize Markov chains to approximate conditional distributions, with the Metropolis-within-Gibbs algorithm standing out as a well-known example. Despite their widespread use in both statistical and non-statistical applications, very little is known about their convergence properties. This article introduces novel methods for establishing bounds on the convergence rates of hybrid Gibbs samplers. In particular, we examine the convergence characteristics of hybrid random-scan Gibbs and data augmentation algorithms. Our analysis confirms that the absolute spectral gap of a hybrid chain can be bounded based on the absolute spectral gap of the exact Gibbs chain and the absolute spectral gaps of the Markov chains employed for conditional distribution approximations. For application, we study the convergence properties of four practical hybrid Gibbs algorithms: a random-scan Metropolis-within-Gibbs sampler, a hybrid proximal sampler, random-scan Gibbs samplers with block updates, and a hybrid slice sampler.
- [148] arXiv:2401.04418 (replaced) [pdf, ps, html, other]
-
Title: Rényi entropy, Rényi divergence and Jensen-Rényi information generating functions, and associated properties and estimation
Subjects: Statistics Theory (math.ST)
In this paper, we propose the Rényi information generating function (RIGF) and discuss its various properties. The relation between the RIGF and Shannon entropy of order $q>0$ is established. Several bounds are obtained. The RIGF of the escort distribution is also derived. Furthermore, we introduce the Rényi divergence information generating function (RDIGF) and discuss its effect under monotone transformations. Next, we propose the Jensen-Rényi information generating function (JRIGF) and establish its properties. In addition, we present non-parametric and parametric estimators of the RIGF. For illustrative purposes, a simulation study is carried out and a real data set relating to the failure times of electronic components is analyzed. Finally, a comparison study between the non-parametric and parametric estimators is made in terms of absolute bias and mean square error (MSE).
- [149] arXiv:2401.15461 (replaced) [pdf, ps, html, other]
-
Title: Anytime-Valid Tests of Group Invariance through Conformal Prediction
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We develop anytime-valid tests of invariance under the action of compact groups. The resulting test statistics are optimal in a logarithmic-growth sense. We apply our method to extend recent anytime-valid tests of independence and to construct tests of normality.
- [150] arXiv:2402.00168 (replaced) [pdf, ps, html, other]
-
Title: Continuous Treatment Effects with Surrogate Outcomes
Authors: Zhenghao Zeng, David Arbour, Avi Feller, Raghavendra Addanki, Ryan Rossi, Ritwik Sinha, Edward H. Kennedy
Comments: 30 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.
- [151] arXiv:2403.03850 (replaced) [pdf, ps, html, other]
-
Title: Conformal prediction for multi-dimensional time series by ellipsoidal sets
Comments: Accepted by the Forty-first International Conference on Machine Learning (ICML 2024)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Conformal prediction (CP) has been a popular method for uncertainty quantification because it is distribution-free, model-agnostic, and theoretically sound. For forecasting problems in supervised learning, most CP methods focus on building prediction intervals for univariate responses. In this work, we develop a sequential CP method called $\texttt{MultiDimSPCI}$ that builds prediction $\textit{regions}$ for a multivariate response, especially in the context of multivariate time series, which are not exchangeable. Theoretically, we estimate $\textit{finite-sample}$ high-probability bounds on the conditional coverage gap. Empirically, we demonstrate that $\texttt{MultiDimSPCI}$ maintains valid coverage on a wide range of multivariate time series while producing smaller prediction regions than CP and non-CP baselines.
- [152] arXiv:2403.11429 (replaced) [pdf, ps, html, other]
-
Title: Long-range Ising model for regional-scale seismic risk analysis
Subjects: Applications (stat.AP)
This study introduces the long-range Ising model from statistical mechanics to the Performance-Based Earthquake Engineering (PBEE) framework for regional seismic damage analysis. The application of the PBEE framework at a regional scale involves estimating the damage states of numerous structures, typically performed using fragility function-based stochastic simulations. However, these simulations often assume conditional independence or employ simplistic dependency models among the damage states of structures, leading to significant misrepresentation of regional risk. The Ising model addresses this issue by converting the available information on binary damage states (safe or failure) into a joint probability mass function, leveraging the principle of maximum entropy. The Ising model offers two main benefits: (1) it requires only the first- and second-order cross-moments, enabling seamless integration with the existing PBEE framework, and (2) it provides meaningful physical interpretations of the model parameters, facilitating the uncovering of insights not apparent from data. To demonstrate the proposed method, we applied the Ising model to 156 buildings in Antakya, Turkey, using post-hazard damage evaluation data, and to 182 buildings in Pacific Heights, San Francisco, using simulated data from the Regional Resilience Determination (R2D) tool. In both instances, the Ising model accurately reproduces the provided information and generates meaningful insights into regional damage. The study also investigates the change in Ising model parameters under varying earthquake magnitudes, along with the mean-field approximation, further facilitating the applicability of the proposed approach.
- [153] arXiv:2405.07552 (replaced) [pdf, ps, html, other]
-
Title: Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery
Comments: Forty-first International Conference on Machine Learning (ICML 2024), 27 pages, 4 figures, 14 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative to least squares regression because of its robustness against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses big challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into a least-squares optimization problem. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without the restrictive independence assumption between the error term and covariates. An efficient algorithm is developed, which enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method.
- [154] arXiv:2405.09493 (replaced) [pdf, ps, html, other]
-
Title: C-Learner: Constrained Learning for Causal Inference and Semiparametric Statistics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained learning framework provides a unifying perspective to prominent first-order correction approaches including one-step estimation (a.k.a. augmented inverse probability weighting) and targeting (a.k.a. targeted maximum likelihood estimation). Our semiparametric inference approach, which we call the "C-Learner", can be implemented with modern machine learning methods such as neural networks and tree ensembles, and enjoys standard guarantees like semiparametric efficiency and double robustness. Empirically, we demonstrate our approach on several datasets, including those with text features that require fine-tuning language models. We observe the C-Learner matches or outperforms other asymptotically optimal estimators, with better performance in settings with less estimated overlap.
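For readers unfamiliar with the one-step (augmented inverse probability weighting, AIPW) correction that this abstract uses as a point of comparison, here is a minimal sketch for the average treatment effect. It is not the C-Learner itself. The function name `aipw_ate` and the toy inputs are assumptions made for illustration, and the nuisance estimates (outcome-model predictions `mu1`, `mu0` and propensity scores `e`) are taken as given rather than fitted.

```python
from typing import Sequence


def aipw_ate(
    y: Sequence[float],
    t: Sequence[int],
    mu1: Sequence[float],
    mu0: Sequence[float],
    e: Sequence[float],
) -> float:
    """AIPW estimate of the average treatment effect.

    y: observed outcomes, t: binary treatment indicators,
    mu1/mu0: outcome-model predictions under treatment/control,
    e: estimated propensity scores P(T=1 | X), strictly between 0 and 1.
    """
    total = 0.0
    for yi, ti, m1, m0, ei in zip(y, t, mu1, mu0, e):
        plug_in = m1 - m0
        # First-order correction: inverse-propensity-weighted residuals.
        correction = ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1.0 - ei)
        total += plug_in + correction
    return total / len(y)


# Toy data (hypothetical): outcome models that fit the observed outcomes exactly.
y = [3.0, 1.0, 4.0, 2.0]
t = [1, 0, 1, 0]
mu1 = [3.0, 2.0, 4.0, 3.0]
mu0 = [2.0, 1.0, 3.0, 2.0]
e = [0.5, 0.5, 0.5, 0.5]
print(aipw_ate(y, t, mu1, mu0, e))
```

When the outcome models fit the observed outcomes exactly, the correction term vanishes and the estimate equals the plug-in difference. When the outcome models are set to zero, the same formula reduces to plain inverse probability weighting, which shows where the double robustness mentioned in the abstract comes from.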
- [155] arXiv:2405.09584 (replaced) [pdf, ps, html, other]
-
Title: Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Decision-making under uncertainty is a fundamental problem encountered frequently and can be formulated as a stochastic multi-armed bandit problem. In the problem, the learner interacts with an environment by choosing an action at each round, where a round is an instance of an interaction. In response, the environment reveals a reward, which is sampled from a stochastic process, to the learner. The goal of the learner is to maximize cumulative reward. In this work, we assume that the rewards are the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict the reward for each action, we propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward. We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward, i.e. the reward sampled for action 1 at round $t-1$ can be used for predicting the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter with a matrix representation that can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems and are compared with 2 other well-known stochastic multi-armed bandit algorithms.
- [156] arXiv:2405.09831 (replaced) [pdf, ps, html, other]
-
Title: Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Comments: Preprint. Under review
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
- [157] arXiv:2405.10301 (replaced) [pdf, ps, html, other]
-
Title: Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
- [158] arXiv:2405.10371 (replaced) [pdf, ps, html, other]
-
Title: Causal Discovery in Multivariate Extremes with a Hydrological Analysis of Swiss River Discharges
Subjects: Methodology (stat.ME); Applications (stat.AP)
Causal asymmetry is based on the principle that an event is a cause only if its absence would not have been a cause. From there, uncovering causal effects becomes a matter of comparing a well-defined score in both directions. Motivated by studying causal effects at extreme levels of a multivariate random vector, we propose to construct a model-agnostic causal score relying solely on the assumption of the existence of a max-domain of attraction. Based on a representation of a Generalized Pareto random vector, we construct the causal score as the Wasserstein distance between the margins and a well-specified random variable. The proposed methodology is illustrated on a hydrologically simulated dataset of different characteristics of catchments in Switzerland: discharge, precipitation, and snowmelt.
- [159] arXiv:2006.02643 (replaced) [pdf, ps, html, other]
-
Title: Universal Graph Compression: Stochastic Block Models
Subjects: Information Theory (cs.IT); Databases (cs.DB); Statistics Theory (math.ST)
Motivated by the prevalent data science applications of processing large-scale graph data such as social networks and biological networks, this paper investigates lossless compression of data in the form of a labeled graph. In particular, we consider a widely used random graph model, the stochastic block model (SBM), which captures the clustering effects in social networks. An information-theoretic universal compression framework is applied, in which one aims to design a single compressor that achieves the asymptotically optimal compression rate for every SBM distribution, without knowing the parameters of the SBM. Such a graph compressor is proposed in this paper, which universally achieves the optimal compression rate with polynomial time complexity for a wide class of SBMs. Existing universal compression techniques are developed mostly for stationary ergodic one-dimensional sequences. However, the adjacency matrix of an SBM has complex two-dimensional correlations. The challenge is alleviated through a carefully designed transform that converts two-dimensional correlated data into almost i.i.d. submatrices. The sequence of submatrices is then compressed by a Krichevsky-Trofimov compressor, whose length analysis is generalized to identically distributed but arbitrarily correlated sequences. In four benchmark graph datasets, the compressed files from competing algorithms take 2.4 to 27 times the space needed by the proposed scheme.
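As a rough illustration of the Krichevsky-Trofimov (KT) building block named in this abstract, the sketch below computes the ideal KT code length of a binary sequence. It assumes a binary alphabet for simplicity, whereas the paper applies a KT compressor to a sequence of submatrices, so this is not the proposed graph compressor. The function name `kt_code_length_bits` is introduced here for illustration.

```python
import math
from typing import Iterable


def kt_code_length_bits(bits: Iterable[int]) -> float:
    """Ideal code length, in bits, under the binary KT sequential estimator.

    The KT estimator assigns the next symbol b the probability
    (count[b] + 1/2) / (symbols seen so far + 1).
    """
    counts = [0, 0]
    total_bits = 0.0
    for b in bits:
        p = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        total_bits -= math.log2(p)
        counts[b] += 1
    return total_bits


# The KT probability depends only on the symbol counts, not on their order.
print(kt_code_length_bits([0, 1, 1]))
print(kt_code_length_bits([1, 1, 0]))

# A heavily skewed sequence is coded in far fewer bits than its length.
print(kt_code_length_bits([0] * 64))
```

The estimator needs no knowledge of the source parameter, which is the sense in which such a code is universal. An actual compressor would feed these sequential probabilities to an arithmetic coder.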
- [160] arXiv:2006.13456 (replaced) [pdf, ps, other]
-
Title: Likelihood-Free Gaussian Process for Regression
Comments: There were errors in the proposed method
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian process regression can flexibly represent the posterior distribution of an interest parameter given sufficient information on the likelihood. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the value of the interest parameter can be considered approximately identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, particularly by reducing the assumptions for the probability model and the computational costs for scalable problems.
- [161] arXiv:2007.13804 (replaced) [pdf, ps, html, other]
-
Title: The Spectral Approach to Linear Rational Expectations Models
Comments: JEL Classification: C10, C32, C62, E32
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)
This paper considers linear rational expectations models in the frequency domain. The paper characterizes existence and uniqueness of solutions to particular as well as generic systems. The set of all solutions to a given system is shown to be a finite dimensional affine space in the frequency domain. It is demonstrated that solutions can be discontinuous with respect to the parameters of the models in the context of non-uniqueness, invalidating mainstream frequentist and Bayesian methods. The ill-posedness of the problem motivates regularized solutions with theoretically guaranteed uniqueness, continuity, and even differentiability properties.
- [162] arXiv:2111.00792 (replaced) [pdf, ps, html, other]
-
Title: Shift-invariant homogeneous classes of random fields
Comments: Published J. Mult. Analysis Applications
Subjects: Probability (math.PR); Applications (stat.AP)
Given an $R^d$-valued random field (rf) $Z(t),t\in T$ and an $\alpha$-homogeneous mapping $\kappa$, we define the corresponding equivalence class of rf's (denoted by $K_\alpha$), which includes representers of the same tail measure $\nu_Z$. When $T$ is an additive group, tractable equivalence classes of interest are the shift-invariant ones, which contain in particular all independent random shifts of $Z$. This contribution is mainly concerned with the investigation of the probabilistic properties of shift-invariant $K_\alpha$'s. Important objects introduced in our setting are tail and spectral tail rf's. Further, the class of universal maps $U$ acting on elements of $K_\alpha$ turns out to be crucial for properties of functionals of $Z$. Applications of our findings concern max-stable and symmetric $\alpha$-stable rf's, their maximal indices as well as their random shift-representations.
- [163] arXiv:2202.00977 (replaced) [pdf, ps, html, other]
-
Title: HMC and underdamped Langevin united in the unadjusted convex smooth caseSubjects: Probability (math.PR); Statistics Theory (math.ST)
We consider a family of unadjusted generalized HMC samplers, which includes standard position HMC samplers and discretizations of the underdamped Langevin process. A detailed analysis and optimization of the parameters is conducted in the Gaussian case, which shows an improvement from $1/\kappa$ to $1/\sqrt{\kappa}$ for the convergence rate in terms of the condition number $\kappa$ by using partial velocity refreshment, with respect to classical full refreshments. A similar effect is observed empirically for two related algorithms, namely Metropolis-adjusted gHMC and kinetic piecewise-deterministic Markov processes. Then, a stochastic gradient version of the samplers is considered, for which dimension-free convergence rates are established for log-concave smooth targets over a large range of parameters, gathering in a unified framework previous results on position HMC and underdamped Langevin and extending them to HMC with inertia.
- [164] arXiv:2205.15049 (replaced) [pdf, ps, html, other]
-
Title: Metrizing FairnessSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
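As a rough illustration of the two kinds of unfairness measures contrasted in this abstract, the sketch below computes the Kolmogorov distance and an unbiased estimate of a squared maximum mean discrepancy between the predictions made for two groups. It is not the authors' implementation: the function names, the Gaussian kernel, and the `bandwidth` parameter are assumptions chosen for the example.

```python
import numpy as np


def kolmogorov_distance(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def mmd2_unbiased(a, b, bandwidth=1.0):
    """U-statistic estimate of the squared MMD with a Gaussian kernel.

    The kernel and bandwidth are illustrative assumptions.
    """
    def kernel(x, y):
        return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * bandwidth ** 2))

    m, n = a.size, b.size
    kaa, kbb, kab = kernel(a, a), kernel(b, b), kernel(a, b)
    # Drop the diagonal terms so the within-group averages are unbiased.
    term_a = (kaa.sum() - np.trace(kaa)) / (m * (m - 1))
    term_b = (kbb.sum() - np.trace(kbb)) / (n * (n - 1))
    return float(term_a + term_b - 2.0 * kab.mean())


# Hypothetical predictions for the two demographic groups.
preds_group0 = np.array([0.0, 0.1, 0.2, 0.3])
preds_group1 = np.array([0.6, 0.7, 0.8, 0.9])
print(kolmogorov_distance(preds_group0, preds_group1))
print(mmd2_unbiased(preds_group0, preds_group1))
```

The Kolmogorov distance is a maximum over thresholds, whereas the MMD estimate is a plain average of kernel evaluations over pairs of samples, which makes it easier to estimate from mini-batches.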
- [165] arXiv:2208.09922 (replaced) [pdf, ps, html, other]
-
Title: Efficient Concentration with Gaussian ApproximationSubjects: Probability (math.PR); Statistics Theory (math.ST)
Concentration inequalities for the sample mean, like those due to Bernstein and Hoeffding, are valid for any sample size but overly conservative, yielding confidence intervals that are unnecessarily wide. The central limit theorem (CLT) provides asymptotic confidence intervals with optimal width, but these are invalid for all sample sizes. To resolve this tension, we develop new computable concentration inequalities with asymptotically optimal size, finite-sample validity, and sub-Gaussian decay. These bounds enable the construction of efficient confidence intervals with correct coverage for any sample size and efficient empirical Berry-Esseen bounds that require no prior knowledge of the population variance. We derive our inequalities by tightly bounding non-uniform Kolmogorov and Wasserstein distances to a Gaussian using zero-bias couplings and Stein's method of exchangeable pairs.
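To make the width gap described in this abstract concrete, the sketch below compares the half-width of a two-sided Hoeffding interval with that of a CLT-based interval for the mean of bounded observations. It uses only the textbook formulas, not the new inequalities of the paper; the function names and the example values of `n`, `delta`, and `sigma` are assumptions.

```python
import math
from statistics import NormalDist


def hoeffding_half_width(n, delta, value_range=1.0):
    """Half-width t solving 2 * exp(-2 * n * t**2 / value_range**2) = delta."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def clt_half_width(n, delta, sigma):
    """Asymptotic half-width z_{1 - delta/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return z * sigma / math.sqrt(n)


# Illustrative setting: 1000 Bernoulli(0.1) draws, 95% confidence.
n, delta = 1000, 0.05
sigma = math.sqrt(0.1 * 0.9)
print(hoeffding_half_width(n, delta))
print(clt_half_width(n, delta, sigma))
```

In this setting the Hoeffding interval is more than twice as wide as the CLT interval, because Hoeffding's bound ignores the variance and uses only the range of the data.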
- [166] arXiv:2209.14440 (replaced) [pdf, ps, html, other]
-
Title: GeONet: a neural operator for learning the Wasserstein geodesicSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-specific domain discretization and suffer from the curse-of-dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces that are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves testing accuracy comparable to standard OT solvers on simulation examples and the MNIST dataset while reducing inference-stage computational cost by orders of magnitude.
- [167] arXiv:2211.07482 (replaced) [pdf, ps, html, other]
-
Title: Unifying O(3) Equivariant Neural Networks Design with Tensor-Network FormalismComments: 10 pages + 12-page supplementary materials, many figuresJournal-ref: Mach. Learn.: Sci. Technol. 5 025044, 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Many learning tasks, including learning potential energy surfaces from ab initio calculations, involve global spatial symmetries and permutational symmetry between atoms or general particles. Equivariant graph neural networks are a standard approach to such problems, with one of the most successful methods employing tensor products between various tensors that transform under the spatial group. However, as the number of different tensors and the complexity of relationships between them increase, maintaining parsimony and equivariance becomes increasingly challenging. In this paper, we propose using fusion diagrams, a technique widely employed in simulating SU($2$)-symmetric quantum many-body problems, to design new equivariant components for equivariant neural networks. This results in a diagrammatic approach to constructing novel neural network architectures. When applied to particles within a given local neighborhood, the resulting components, which we term "fusion blocks," serve as universal approximators of any continuous equivariant function defined in the neighborhood. We incorporate a fusion block into pre-existing equivariant architectures (Cormorant and MACE), leading to improved performance with fewer parameters on a range of challenging chemical problems. Furthermore, we apply group-equivariant neural networks to study non-adiabatic molecular dynamics of stilbene cis-trans isomerization. Our approach, which combines tensor networks with equivariant neural networks, suggests a potentially fruitful direction for designing more expressive equivariant neural networks.
- [168] arXiv:2302.07185 (replaced) [pdf, ps, html, other]
-
Title: When mitigating bias is unfair: multiplicity and arbitrariness in algorithmic group fairnessSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Most research on fair machine learning has prioritized optimizing criteria such as Demographic Parity and Equalized Odds. Despite these efforts, there remains a limited understanding of how different bias mitigation strategies affect individual predictions and whether they introduce arbitrariness into the debiasing process. This paper addresses these gaps by exploring whether models that achieve comparable fairness and accuracy metrics impact the same individuals and mitigate bias in a consistent manner. We introduce the FRAME (FaiRness Arbitrariness and Multiplicity Evaluation) framework, which evaluates bias mitigation through five dimensions: Impact Size (how many people were affected), Change Direction (positive versus negative changes), Decision Rates (impact on models' acceptance rates), Affected Subpopulations (who was affected), and Neglected Subpopulations (where unfairness persists). This framework is intended to help practitioners understand the impacts of debiasing processes and make better-informed decisions regarding model selection. Applying FRAME to various bias mitigation approaches across key datasets allows us to exhibit significant differences in the behaviors of debiasing methods. These findings highlight the limitations of current fairness criteria and the inherent arbitrariness in the debiasing process.
- [169] arXiv:2302.10684 (replaced) [pdf, ps, html, other]
-
Title: Contraction and Convergence Rates for Discretized Kinetic Langevin DynamicsComments: 34 pages, 1 figureJournal-ref: SIAM Journal on Numerical Analysis, 62(3):1226-1258, 2024Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
We provide a framework to analyze the convergence of discretized kinetic Langevin dynamics for $M$-$\nabla$Lipschitz, $m$-convex potentials. Our approach gives convergence rates of $\mathcal{O}(m/M)$, with explicit stepsize restrictions, which are of the same order as the stability threshold for Gaussian targets and are valid for a large interval of the friction parameter. We apply this methodology to various integration schemes which are popular in the molecular dynamics and machine learning communities. Further, we introduce the property ``$\gamma$-limit convergent'' (GLC) to characterize underdamped Langevin schemes that converge to overdamped dynamics in the high-friction limit and which have stepsize restrictions that are independent of the friction parameter; we show that this property is not generic by exhibiting methods from both the class and its complement. Finally, we provide asymptotic bias estimates for the BAOAB scheme, which remain accurate in the high-friction limit by comparison to a modified stochastic dynamics which preserves the invariant measure.
- [170] arXiv:2304.07278 (replaced) [pdf, ps, html, other]
-
Title: Minimax-Optimal Reward-Agnostic Exploration in Reinforcement LearningComments: accepted for presentation in COLT 2024Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Systems and Control (eess.SY); Statistics Theory (math.ST); Machine Learning (stat.ML)
This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unaware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting on the order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to log factor) without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.
- [171] arXiv:2304.14606 (replaced) [pdf, ps, html, other]
-
Title: Algorithmic Recourse with Missing ValuesComments: 30 pages, 15 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper proposes a new framework of algorithmic recourse (AR) that works even in the presence of missing values. AR aims to provide a recourse action for altering the undesired prediction result given by a classifier. Existing AR methods assume that we can access complete information on the features of an input instance. However, we often encounter missing values in a given instance (e.g., due to privacy concerns), and previous studies have not discussed such a practical situation. In this paper, we first empirically and theoretically show the risk that a naive approach with a single imputation technique fails to obtain good actions regarding their validity, cost, and features to be changed. To alleviate this risk, we formulate the task of obtaining a valid and low-cost action for a given incomplete instance by incorporating the idea of multiple imputation. Then, we provide some theoretical analyses of our task and propose a practical solution based on mixed-integer linear optimization. Experimental results demonstrated the efficacy of our method in the presence of missing values compared to the baselines.
- [172] arXiv:2305.15912 (replaced) [pdf, ps, html, other]
-
Title: ReLU Characteristic Activation AnalysisComments: code available at: this https URLSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a novel approach for analyzing the training dynamics of ReLU networks by examining the characteristic activation boundaries of individual ReLU neurons. Our proposed analysis reveals a critical instability in common neural network parameterizations and normalizations during stochastic optimization, which impedes fast convergence and hurts generalization performance. Addressing this, we propose Geometric Parameterization (GmP), a novel neural network parameterization technique that effectively separates the radial and angular components of weights in the hyperspherical coordinate system. We show theoretically that GmP resolves the aforementioned instability issue. We report empirical results on various models and benchmarks to verify GmP's theoretical advantages of optimization stability, convergence speed and generalization performance.
- [173] arXiv:2306.13214 (replaced) [pdf, ps, html, other]
-
Title: Prior-itizing Privacy: A Bayesian Approach to Setting the Privacy Budget in Differential PrivacyComments: 9-page main document with 2 figures and a 27-page appendix with 3 figuresSubjects: Cryptography and Security (cs.CR); Methodology (stat.ME)
When releasing outputs from confidential data, agencies need to balance the analytical usefulness of the released data with the obligation to protect data subjects' confidentiality. For releases satisfying differential privacy, this balance is reflected by the privacy budget, $\varepsilon$. We provide a framework for setting $\varepsilon$ based on its relationship with Bayesian posterior probabilities of disclosure. The agency responsible for the data release decides how much posterior risk it is willing to accept at various levels of prior risk, which implies a unique $\varepsilon$. Agencies can evaluate different risk profiles to determine one that leads to an acceptable trade-off in risk and utility.
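One common way to relate $\varepsilon$ to posterior disclosure probabilities is the odds bound for $\varepsilon$-differential privacy: the adversary's posterior odds are at most $e^{\varepsilon}$ times the prior odds. The sketch below inverts that bound to obtain the largest $\varepsilon$ compatible with a risk profile of (prior, acceptable posterior) pairs. This is a simplified illustration that may differ from the paper's exact formulation; the function names and the example profile are hypothetical.

```python
import math


def max_posterior(prior, epsilon):
    """Upper bound on the posterior when posterior odds <= exp(epsilon) * prior odds."""
    odds = math.exp(epsilon) * prior / (1.0 - prior)
    return odds / (1.0 + odds)


def epsilon_for_risk(prior, posterior):
    """Largest epsilon that keeps the bounded posterior at or below `posterior`."""
    if not (0.0 < prior < 1.0 and 0.0 < posterior < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    if posterior < prior:
        raise ValueError("acceptable posterior risk cannot be below the prior risk")
    return math.log(posterior * (1.0 - prior) / (prior * (1.0 - posterior)))


def epsilon_for_profile(profile):
    """Tightest epsilon over a list of (prior, acceptable posterior) pairs."""
    return min(epsilon_for_risk(p, q) for p, q in profile)


# Hypothetical risk profile: the agency tolerates less inflation at low priors.
profile = [(0.01, 0.05), (0.10, 0.30), (0.50, 0.75)]
print(epsilon_for_profile(profile))
```

Taking the minimum over the profile reflects the idea that a single $\varepsilon$ must satisfy the agency's tolerance at every prior level it has specified.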
- [174] arXiv:2306.16564 (replaced) [pdf, ps, html, other]
-
Title: Pareto Optimal Learning for Estimating Large Language Model ErrorsSubjects: Computation and Language (cs.CL); Machine Learning (stat.ML)
Large Language Models (LLMs) have shown impressive abilities in many applications. When a concrete and precise answer is desired, it is important to have a quantitative estimation of the potential error rate. However, this can be challenging due to the text-in-text-out nature of generative models. We present a method based on Pareto optimization that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information. We prove theoretically that the error estimator optimized in our framework aligns with the LLM and the information sources in a Pareto optimal manner. Experimental results show that the risk scores estimated by our method are well correlated with the true LLM error rate, thus facilitating error correction. By dynamically combining with prompting strategies such as self-verification and information retrieval, we demonstrate the proposed method can be utilized to increase the performance of an LLM, surpassing state-of-the-art task specific models.
- [175] arXiv:2307.01198 (replaced) [pdf, ps, html, other]
-
Title: Improved sampling via learned diffusionsComments: Accepted at ICLR 2024Journal-ref: International Conference on Learning Representations, 2024Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Recently, a series of papers proposed deep learning-based approaches to sample from target distributions using controlled diffusion processes, being trained only on the unnormalized target densities without access to samples. Building on previous work, we identify these approaches as special cases of a generalized Schrödinger bridge problem, seeking a stochastic evolution between a given prior distribution and the specified target. We further generalize this framework by introducing a variational formulation based on divergences between path space measures of time-reversed diffusion processes. This abstract perspective leads to practical losses that can be optimized by gradient-based algorithms and includes previous objectives as special cases. At the same time, it allows us to consider divergences other than the reverse Kullback-Leibler divergence that is known to suffer from mode collapse. In particular, we propose the so-called log-variance loss, which exhibits favorable numerical properties and leads to significantly improved performance across all considered approaches.
- [176] arXiv:2308.05564 (replaced) [pdf, ps, html, other]
-
Title: Large Skew-t Copula Models and Asymmetric Dependence in Intraday Equity ReturnsSubjects: Econometrics (econ.EM); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Computation (stat.CO)
Skew-t copula models are attractive for the modeling of financial data because they allow for asymmetric and extreme tail dependence. We show that the copula implicit in the skew-t distribution of Azzalini and Capitanio (2003) allows for a higher level of pairwise asymmetric dependence than two popular alternative skew-t copulas. Estimation of this copula in high dimensions is challenging, and we propose a fast and accurate Bayesian variational inference (VI) approach to do so. The method uses a generative representation of the skew-t distribution to define an augmented posterior that can be approximated accurately. A stochastic gradient ascent algorithm is used to solve the variational optimization. The methodology is used to estimate skew-t factor copula models with up to 15 factors for intraday returns from 2017 to 2021 on 93 U.S. equities. The copula captures substantial heterogeneity in asymmetric dependence over equity pairs, in addition to the variability in pairwise correlations. In a moving window study we show that the asymmetric dependencies also vary over time, and that intraday predictive densities from the skew-t copula are more accurate than those from benchmark copula models. Portfolio selection strategies based on the estimated pairwise asymmetric dependencies improve performance relative to the index.
- [177] arXiv:2310.11011 (replaced) [pdf, ps, html, other]
-
Title: From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative ModelingComments: Published in Transactions on Machine Learning Research (TMLR) (05/2024); 72 pages, 27 figures, 4 tablesJournal-ref: Transactions on Machine Learning Research, 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Deep generative models have shown tremendous capability in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, tendency to induce spurious correlations, and poor out-of-distribution extrapolation. To remedy such challenges, recent work has proposed a shift toward causal generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interpretability. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, methodology, drawbacks, datasets, and metrics. Then, we cover applications of causal generative models in fairness, privacy, out-of-distribution generalization, precision medicine, and biological sciences. Lastly, we discuss open problems and fruitful research directions for future work in the field.
- [178] arXiv:2310.15512 (replaced) [pdf, ps, html, other]
-
Title: Inference for Rank-Rank RegressionsSubjects: Econometrics (econ.EM); Statistics Theory (math.ST)
Slope coefficients in rank-rank regressions are popular measures of intergenerational mobility. In this paper, we first point out two important properties of the OLS estimator in such regressions: commonly used variance estimators do not consistently estimate the asymptotic variance of the OLS estimator and, when the underlying distribution is not continuous, the OLS estimator may be highly sensitive to the way in which ties are handled. Motivated by these findings we derive the asymptotic theory for the OLS estimator in a general rank-rank regression specification without making assumptions about the continuity of the underlying distribution. We then extend the asymptotic theory to other regressions involving ranks that have been used in empirical work. Finally, we apply our new inference methods to three empirical studies. We find that the confidence intervals based on estimators of the correct variance may sometimes be substantially shorter and sometimes substantially longer than those based on commonly used variance estimators. The differences in confidence intervals concern economically meaningful values of mobility and thus may lead to different conclusions when comparing mobility across different regions or countries.
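The sensitivity to ties mentioned in this abstract can be seen in a small sketch. Ranks are computed as empirical CDF values under two conventions (share of observations less than or equal to a value, versus strictly less than it), and the OLS slope of child rank on parent rank is compared across them. The function names and the toy data are assumptions for illustration; this is not the inference procedure derived in the paper.

```python
import numpy as np


def cdf_rank(x, weak=True):
    """Rank each observation by its empirical CDF value.

    weak=True uses the share of observations <= x_i,
    weak=False uses the share of observations < x_i.
    """
    x = np.asarray(x, dtype=float)
    side = "right" if weak else "left"
    return np.searchsorted(np.sort(x), x, side=side) / x.size


def rank_rank_slope(y, x, weak=True):
    """OLS slope from regressing the rank of y on the rank of x."""
    ry, rx = cdf_rank(y, weak), cdf_rank(x, weak)
    rx_centered = rx - rx.mean()
    return float(rx_centered @ (ry - ry.mean()) / (rx_centered @ rx_centered))


# Toy data with ties (hypothetical parent and child outcomes).
parent = [1, 1, 2, 3]
child = [1, 2, 2, 3]
print(rank_rank_slope(child, parent, weak=True))
print(rank_rank_slope(child, parent, weak=False))
```

Without ties the two conventions differ only by the constant $1/n$, so the slope is unchanged; with ties the tied observations are shifted by different amounts and the slope moves.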
- [179] arXiv:2310.17571 (replaced) [pdf, ps, html, other]
-
Title: Inside the black box: Neural network-based real-time prediction of US recessionsSubjects: Econometrics (econ.EM); Machine Learning (stat.ML)
Long short-term memory (LSTM) and gated recurrent unit (GRU) are used to model US recessions from 1967 to 2021. Their predictive performances are compared to those of the traditional linear models. The out-of-sample performance suggests the application of LSTM and GRU in recession forecasting, especially for longer-term forecasts. The Shapley additive explanations (SHAP) method is applied to both groups of models. The SHAP-based different weight assignments imply the capability of these types of neural networks to capture the business cycle asymmetries and nonlinearities. The SHAP method delivers key recession indicators, such as the S&P 500 index for short-term forecasting up to 3 months and the term spread for longer-term forecasting up to 12 months. These findings are robust against other interpretation methods, such as the local interpretable model-agnostic explanations (LIME) and the marginal effects.
- [180] arXiv:2311.00541 (replaced) [pdf, ps, html, other]
-
Title: An Embedded Diachronic Sense Change Model with a Case Study from Ancient GreekSubjects: Computation and Language (cs.CL); Methodology (stat.ME)
Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as ``kosmos'' (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed.
- [181] arXiv:2311.05728 (replaced) [pdf, ps, html, other]
-
Title: A Physics-Informed, Deep Double Reservoir Network for Forecasting Boundary Layer VelocitySubjects: Fluid Dynamics (physics.flu-dyn); Applications (stat.AP)
When a fluid flows over a solid surface, it creates a thin boundary layer where the flow velocity is influenced by the surface through viscosity, and can transition from laminar to turbulent at sufficiently high speeds. Understanding and forecasting the fluid dynamics under these conditions is one of the most challenging scientific problems in fluid dynamics. It is therefore of high interest to formulate models able to capture the nonlinear spatio-temporal velocity structure as well as produce forecasts in a computationally efficient manner. Traditional statistical approaches are limited in their ability to produce timely forecasts of complex, nonlinear spatio-temporal structures which are at the same time able to incorporate the underlying flow physics. In this work, we propose a model to accurately forecast boundary layer velocities with a deep double reservoir computing network which is capable of capturing the complex, nonlinear dynamics of the boundary layer while at the same time incorporating physical constraints via a penalty obtained by a Partial Differential Equation (PDE). Simulation studies on a one-dimensional viscous fluid demonstrate how the proposed model is able to produce accurate forecasts while simultaneously accounting for energy loss. The application focuses on boundary layer data on a water tunnel with a PDE penalty derived from an appropriate simplification of the Navier-Stokes equations, showing forecasts improved by 33.7% and 80.0% in terms of mass conservation and variability of velocity fluctuation, respectively, against non-physics-informed methods.
- [182] arXiv:2311.06840 (replaced) [pdf, ps, html, other]
-
Title: Omitted Labels in Causality: A Study of ParadoxesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Social and Information Networks (cs.SI); Methodology (stat.ME)
We explore what we call ``omitted label contexts,'' in which training data is limited to a subset of the possible labels. This setting is common among specialized human experts or specific focused studies. We lean on well-studied paradoxes (Simpson's and Condorcet) to illustrate the more general difficulties of causal inference in omitted label contexts. Contrary to the fundamental principles on which much of causal inference is built, we show that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. These pitfalls lead us to study networks of conclusions drawn from different contexts and the structures they form, proving an interesting connection between these networks and social choice theory.
- [183] arXiv:2311.07454 (replaced) [pdf, ps, html, other]
-
Title: Causal Discovery under Latent Class ConfoundingSubjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Statistics Theory (math.ST)
An acyclic causal structure can be described using a directed acyclic graph (DAG) with arrows indicating causation. The task of learning this structure from data is known as "causal discovery." Diverse populations or changing environments can sometimes give rise to heterogeneous data. This heterogeneity can be thought of as a mixture model with multiple "sources," each exerting their own distinct signature on the observed variables. From this perspective, the source is a latent common cause for every observed variable. While some methods for causal discovery are able to work around unobserved confounding in special cases, the only known ways to deal with a global confounder (such as a latent class) involve parametric assumptions. Focusing on discrete observables, we demonstrate that globally confounded causal structures can still be identifiable without parametric assumptions, so long as the number of latent classes remains small relative to the size and sparsity of the underlying DAG.
- [184] arXiv:2311.14676 (replaced) [pdf, ps, html, other]
-
Title: Decoding Social Sentiment in DAO: A Comparative Analysis of Blockchain Governance CommunitiesSubjects: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); General Economics (econ.GN); Applications (stat.AP)
Blockchain technology is leading a revolutionary transformation across diverse industries, with effective governance being critical for the success and sustainability of blockchain projects. Community forums, pivotal in engaging decentralized autonomous organizations (DAOs), significantly impact blockchain governance decisions. Concurrently, Natural Language Processing (NLP), particularly sentiment analysis, provides powerful insights from textual data. While prior research has explored the potential of NLP tools in social media sentiment analysis, there is a gap in understanding the sentiment landscape of blockchain governance communities. The evolving discourse and sentiment dynamics on the forums of top DAOs remain largely unknown. This paper delves deep into the evolving discourse and sentiment dynamics on the public forums of leading DeFi projects: Aave, Uniswap, Curve DAO, Yearn.finance, Merit Circle, and Balancer, focusing primarily on discussions related to governance issues. Our study shows that participants in decentralized communities generally express positive sentiments during Discord discussions. Furthermore, there is a potential interaction between discussion intensity and sentiment dynamics; higher discussion volume may contribute to a more stable sentiment from code analysis. The insights gained from this study are valuable for decision-makers in blockchain governance, underscoring the pivotal role of sentiment analysis in interpreting community emotions and its evolving impact on the landscape of blockchain governance. This research significantly contributes to the interdisciplinary exploration of the intersection of blockchain and society, specifically emphasizing the decentralized blockchain governance ecosystem. We provide our data and code for replicability as open access on GitHub.
- [185] arXiv:2311.18672 (replaced) [pdf, ps, html, other]
-
Title: A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural NetworksAuthors: Roy T. Forestano, Marçal Comajoan Cara, Gopal Ramesh Dahale, Zhongtian Dong, Sergei Gleyzer, Daniel Justice, Kyoungchul Kong, Tom Magorsch, Konstantin T. Matchev, Katia Matcheva, Eyup B. UnluComments: 15 pages, 7 figures, 3 appendicesJournal-ref: Axioms 13 (2024) 160Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNN). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.
- [186] arXiv:2312.05134 (replaced) [pdf, ps, html, other]
-
Title: Optimal Multi-Distribution Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient: they access the hypothesis class solely through an empirical risk minimization oracle.
Additionally, we establish the necessity of randomization, revealing a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., Problems 1, 3 and 4 of Awasthi et al., 2023).
- [187] arXiv:2401.06687 (replaced) [pdf, ps, html, other]
-
Title: Proximal Causal Inference With Text Data
Comments: 26 pages
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses multiple instances of pre-treatment text data, infers two proxies from two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove that our text-based proxy method satisfies identification conditions required by the proximal g-formula while other seemingly reasonable proposals do not. We evaluate our method in synthetic and semi-synthetic settings and find that it produces estimates with low bias. To address untestable assumptions associated with the proximal g-formula, we further propose an odds ratio falsification heuristic. This new combination of proximal causal inference and zero-shot classifiers expands the set of text-specific causal methods available to practitioners.
- [188] arXiv:2402.00484 (replaced) [pdf, ps, html, other]
-
Title: Extreme value statistics of nerve transmission delay
Subjects: Neurons and Cognition (q-bio.NC); Mathematical Physics (math-ph); Applications (stat.AP)
Nerve transmission delay is an important topic in neuroscience. Spike signals fired or received at the dendrites of a neuron travel from the axon to the presynaptic cell. The spike signal triggers a chemical reaction at the synapse, wherein a presynaptic cell transfers neurotransmitters to the postsynaptic cell, and regenerates electrical signals by a chemical reaction process through ion channels and transmits it to neighboring neurons. In the context of describing the complex physiological reaction process as a stochastic process, this study aimed to show that the distribution of the maximum time interval of spike signals follows extreme order statistics. By considering the statistical variance in the time constant of the Leaky Integrate-and-Fire model, which is a deterministic time evolution model of spike signals, we enabled randomness in the time interval of spike signals. When the time constant follows an exponential distribution function, the time interval of the spike signal also follows an exponential distribution. In this case, our theory and simulations confirmed that the histogram of the maximum time interval follows the Gumbel distribution, which is one of the three types of extreme value statistics. We also confirmed that the histogram of the maximum time interval follows a Fréchet distribution when the time interval of the spike signal follows a Pareto distribution. These findings confirm that nerve transmission delay can be described using extreme value statistics and could, therefore, be used as a new indicator for transmission delay.
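The exponential-to-Gumbel claim in this abstract can be checked with a short simulation. The sketch below is not the authors' Leaky Integrate-and-Fire model: it simply assumes that spike intervals are i.i.d. exponential draws, takes the maximum of each batch, and compares the centered maxima with the standard Gumbel distribution. The function names, batch size, number of trials and seed are all illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of the standard Gumbel distribution


def centered_maxima(n_intervals, n_trials, rate=1.0, seed=0):
    """Maximum of n_intervals exponential draws, shifted by ln(n)/rate.

    For exponential intervals this shifted maximum converges in
    distribution to a Gumbel law with scale 1/rate as n_intervals grows.
    """
    rng = random.Random(seed)
    shift = math.log(n_intervals) / rate
    return [
        max(rng.expovariate(rate) for _ in range(n_intervals)) - shift
        for _ in range(n_trials)
    ]


def gumbel_cdf(x, scale=1.0):
    """CDF of a Gumbel distribution with location 0 and the given scale."""
    return math.exp(-math.exp(-x / scale))


# Hypothetical experiment sizes, chosen only to keep the run short.
samples = centered_maxima(n_intervals=500, n_trials=2000)
sample_mean = sum(samples) / len(samples)
empirical_cdf_at_0 = sum(s <= 0.0 for s in samples) / len(samples)

print(f"sample mean {sample_mean:.3f} vs Gumbel mean {EULER_GAMMA:.3f}")
print(f"empirical CDF at 0: {empirical_cdf_at_0:.3f} vs Gumbel CDF {gumbel_cdf(0.0):.3f}")
```

Replacing the exponential draws with Pareto draws (for example via `random.paretovariate`) and rescaling instead of shifting would be the analogous check for the Fréchet case, under the same simplifying assumptions.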
- [189] arXiv:2402.00847 (replaced) [pdf, ps, html, other]
-
Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at this https URL
- [190] arXiv:2402.01454 (replaced) [pdf, ps, html, other]
-
Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
Authors: Masayuki Takayama, Tadahisa Okuda, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is significant for creating consistent meaningful causal models, despite the challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through "statistical causal prompting (SCP)" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved, if GPT-4 undergoes SCP. Furthermore, by using an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve SCD on this dataset, even if this dataset has never been included in the training data of the LLM. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.
- [191] arXiv:2402.01929 (replaced) [pdf, ps, html, other]
-
Title: Sample, estimate, aggregate: A recipe for causal discovery foundation models
Comments: Preprint. Under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal discovery, the task of inferring causal structure from data, promises to accelerate scientific research, inform policy making, and more. However, causal discovery algorithms over larger sets of variables tend to be brittle against misspecification or when data are limited. To mitigate these challenges, we train a supervised model that learns to predict a larger causal graph from the outputs of classical causal discovery algorithms run over subsets of variables, along with other statistical hints like inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of classical methods remain comparable across datasets. Theoretically, we show that this model is well-specified, in the sense that it can recover a causal graph consistent with graphs over subsets. Empirically, we train the model to be robust to erroneous estimates using diverse synthetic data. Experiments on real and synthetic data demonstrate that this model maintains high accuracy in the face of misspecification or distribution shift, and can be adapted at low cost to different discovery algorithms or choice of statistics.
- [192] arXiv:2402.02239 (replaced) [pdf, ps, html, other]
-
Title: Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein
Comments: 38 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves using dimensionality reduction (DR) methods to project data onto lower-dimensional spaces or organizing points into meaningful clusters (clustering). In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem. This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem. We empirically demonstrate its relevance to the identification of low-dimensional prototypes representing data at different scales, across multiple image and genomic datasets.
- [193] arXiv:2402.02277 (replaced) [pdf, ps, html, other]
-
Title: Causal Bayesian Optimization via Exogenous Distribution Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Maximizing a target variable as an operational objective in a structural causal model is an important problem. Existing Causal Bayesian Optimization (CBO) methods either rely on hard interventions that alter the causal structure to maximize the reward, or introduce action nodes to endogenous variables so that the data generation mechanisms are adjusted to achieve the objective. In this paper, a novel method is introduced to learn the distribution of exogenous variables, which is typically ignored or marginalized through expectation by existing methods. Exogenous distribution learning improves the approximation accuracy of structural causal models in a surrogate model that is usually trained with limited observational data. Moreover, the learned exogenous distribution extends existing CBO to general causal schemes beyond Additive Noise Models (ANM). The recovery of exogenous variables allows us to use a more flexible prior for noise or unobserved hidden variables. We develop a new CBO method by leveraging the learned exogenous distribution. Experiments on different datasets and applications show the benefits of our proposed method.
- [194] arXiv:2402.02851 (replaced) [pdf, ps, html, other]
-
Title: Enhancing Compositional Generalization via Compositional Feature Alignment
Comments: Published in Transactions on Machine Learning Research (TMLR). The code is released at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads to the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
- [195] arXiv:2402.03587 (replaced) [pdf, ps, html, other]
-
Title: Information-Theoretic Active Correlation Clustering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study correlation clustering where the pairwise similarities are not known in advance. For this purpose, we employ active learning to query pairwise similarities in a cost-efficient way. We propose a number of effective information-theoretic acquisition functions based on entropy and information gain. We extensively investigate the performance of our methods in different settings and demonstrate their superior performance compared to the alternatives.
- [196] arXiv:2402.03687 (replaced) [pdf, ps, html, other]
-
Title: Pard: Permutation-Invariant Autoregressive Diffusion for Graph Generation
Comments: Diffusion Model on Graphs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph generation has been dominated by autoregressive models due to their simplicity and effectiveness, despite their sensitivity to ordering. Yet diffusion models have garnered increasing attention, as they offer comparable performance while being permutation-invariant. Current graph diffusion models generate graphs in a one-shot fashion, but they require extra features and thousands of denoising steps to achieve optimal performance. We introduce PARD, a Permutation-invariant Auto Regressive Diffusion model that integrates diffusion models with autoregressive methods. PARD harnesses the effectiveness and efficiency of the autoregressive model while maintaining permutation invariance without ordering sensitivity. Specifically, we show that contrary to sets, elements in a graph are not entirely unordered and there is a unique partial order for nodes and edges. With this partial order, PARD generates a graph in a block-by-block, autoregressive fashion, where each block's probability is conditionally modeled by a shared diffusion model with an equivariant network. To ensure efficiency while being expressive, we further propose a higher-order graph transformer, which integrates transformer with PPGN. Like GPT, we extend the higher-order graph transformer to support parallel training of all blocks. Without any extra features, PARD achieves state-of-the-art performance on molecular and non-molecular datasets, and scales to large datasets like MOSES containing 1.9M molecules. Pard is open-sourced at this https URL.
- [197] arXiv:2402.03985 (replaced) [pdf, ps, html, other]
-
Title: A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory predicts multiple synthetic datasets to be especially beneficial for high-variance downstream predictors, and yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice by evaluating the performance of an ensemble over many synthetic datasets for several real datasets and downstream predictors. The results follow our theory, showing that our insights are practically relevant.
- [198] arXiv:2402.04906 (replaced) [pdf, ps, html, other]
-
Title: Conformal Convolution and Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects
Comments: 25 pages, 14 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Knowledge of the effect of interventions, known as the treatment effect, is paramount for decision-making. Approaches to estimating this treatment effect using conditional average treatment effect (CATE) meta-learners often provide only a point estimate of this treatment effect, while additional uncertainty quantification is frequently desired to enhance decision-making confidence. To address this, we introduce two novel approaches: the conformal convolution T-learner (CCT-learner) and conformal Monte Carlo (CMC) meta-learners. The approaches leverage weighted conformal predictive systems (WCPS), Monte Carlo sampling, and CATE meta-learners to generate predictive distributions of individual treatment effect (ITE) that could enhance individualized decision-making. Although we show how assumptions about the noise distribution of the outcome influence the uncertainty predictions, our experiments demonstrate that the CCT- and CMC meta-learners achieve strong coverage while maintaining narrow interval widths. They also generate probabilistically calibrated predictive distributions, providing reliable ranges of ITEs across various synthetic and semi-synthetic datasets.
Code: this https URL
- [199] arXiv:2402.05569 (replaced) [pdf, ps, html, other]
-
Title: Simplifying Hypergraph Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
Hypergraphs are crucial for modeling higher-order interactions in real-world data. Hypergraph neural networks (HNNs) effectively utilise these structures by message passing to generate informative node features for various downstream tasks like node classification. However, the message passing block in existing HNNs typically requires a computationally intensive training process, which limits their practical use. To tackle this challenge, we propose an alternative approach by decoupling the usage of the hypergraph structural information from the model training stage. The proposed model, simplified hypergraph neural network (SHNN), contains a training-free message-passing block that can be precomputed before the training of SHNN, thereby reducing the computational burden. We theoretically support the efficiency and effectiveness of SHNN by showing that: 1) It is more training-efficient compared to existing HNNs; 2) It utilises as much information as existing HNNs for node feature generation; and 3) It is robust against the oversmoothing issue while using long-range interactions. Experiments based on six real-world hypergraph benchmarks in node classification and hyperlink prediction show that, compared to state-of-the-art HNNs, SHNN achieves both competitive performance and superior training efficiency. Specifically, on Cora-CA, SHNN achieves the highest node classification accuracy with just 2% of the training time of the best baseline.
- [200] arXiv:2402.08193 (replaced) [pdf, ps, html, other]
-
Title: Gaussian Ensemble Belief Propagation for Efficient Inference in High-Dimensional Systems
Comments: Under conference submission
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Efficient inference in high-dimensional models remains a central challenge in machine learning. This paper introduces the Gaussian Ensemble Belief Propagation (GEnBP) algorithm, a fusion of the Ensemble Kalman filter and Gaussian Belief Propagation (GaBP) methods. GEnBP updates ensembles by passing low-rank local messages over a graphical model. This combination inherits favourable qualities from each method. Ensemble techniques allow GEnBP to handle high-dimensional states, parameters and intricate, noisy, black-box generation processes. The use of local messages in a graphical model structure ensures that the approach can efficiently handle complex dependence structures. GEnBP is advantageous when the ensemble size may be considerably smaller than the inference dimension. This scenario often arises in fields such as spatiotemporal modelling, image processing and physical model inversion. GEnBP can be applied to general problem structures, including data assimilation, system identification and hierarchical models.
Supporting code is available at this https URL
- [201] arXiv:2402.10445 (replaced) [pdf, ps, html, other]
-
Title: Collaborative Learning with Different Labeling Functions
Comments: To appear at ICML 2024; v2 and v3 included additional discussion on related work
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in [Crammer and Mansour, 2012] in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class.
In terms of computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample- and computationally-efficient.
- [202] arXiv:2402.12875 (replaced) [pdf, ps, html, other]
-
Title: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Comments: 38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
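Iterated squaring, one of the tasks named in this abstract, gives a feel for what "inherently serial" means. The sketch below is only an illustration and not the paper's transformer construction: each squaring needs the previous result, and the list of intermediate values plays the role that written-out chain-of-thought steps would play. The function name and the specific numbers are assumptions made for the example.

```python
def iterated_squaring(x, steps, modulus):
    """Return [x, x^2, x^4, ..., x^(2^steps)] reduced modulo `modulus`.

    Each entry is computed from the one before it, so the loop cannot be
    reordered; the list is the record of intermediate steps.
    """
    trace = [x % modulus]
    for _ in range(steps):
        trace.append(trace[-1] * trace[-1] % modulus)
    return trace


# Hypothetical instance: 5 squarings of 3 modulo 1009.
trace = iterated_squaring(3, 5, 1009)

# The final entry agrees with direct modular exponentiation to the power 2^5.
assert trace[-1] == pow(3, 2 ** 5, 1009)
print(trace)
```

Producing only the final value in one shot corresponds to answering without CoT, while emitting every entry of `trace` in order corresponds to answering with one intermediate step per squaring.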
- [203] arXiv:2402.13380 (replaced) [pdf, ps, html, other]
-
Title: Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with Transformers
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)
In this study, we introduce an innovative deep learning framework that employs a transformer model to address the challenges of mixed-integer programs, specifically focusing on the Capacitated Lot Sizing Problem (CLSP). Our approach, to our knowledge, is the first to utilize transformers to predict the binary variables of a mixed-integer programming (MIP) problem. Specifically, our approach harnesses the encoder-decoder transformer's ability to process sequential data, making it well-suited for predicting binary variables indicating production setup decisions in each period of the CLSP. This problem is inherently dynamic, and we need to handle sequential decision making under constraints. We present an efficient algorithm in which CLSP solutions are learned through a transformer neural network. The proposed post-processed transformer algorithm surpasses the state-of-the-art solver CPLEX and Long Short-Term Memory (LSTM) in solution time, optimal gap, and percent infeasibility over 240K benchmark CLSP instances tested. After the ML model is trained, conducting inference on the model reduces the MIP into a linear program (LP). This transforms the ML-based algorithm, combined with an LP solver, into a polynomial-time approximation algorithm to solve a well-known NP-Hard problem, with almost perfect solution quality.
- [204] arXiv:2402.17826 (replaced) [pdf, ps, html, other]
-
Title: Prediction-Powered Ranking of Large Language Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences asymptotically. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
- [205] arXiv:2403.02957 (replaced) [pdf, ps, html, other]
-
Title: On the Asymptotic Mean Square Error Optimality of Diffusion Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models (DMs) as generative priors have recently shown great potential for denoising tasks but lack theoretical understanding with respect to their mean square error (MSE) optimality. This paper proposes a novel denoising strategy inspired by the structure of the MSE-optimal conditional mean estimator (CME). The resulting DM-based denoiser can be conveniently employed using a pre-trained DM, being particularly fast by truncating reverse diffusion steps and not requiring stochastic re-sampling. We present a comprehensive (non-)asymptotic optimality analysis of the proposed diffusion-based denoiser, demonstrating polynomial-time convergence to the CME under mild conditions. Our analysis also derives a novel Lipschitz constant that depends solely on the DM's hyperparameters. Further, we offer a new perspective on DMs, showing that they inherently combine an asymptotically optimal denoiser with a powerful generator, modifiable by switching re-sampling in the reverse process on or off. The theoretical findings are thoroughly validated with experiments based on various benchmark datasets.
- [206] arXiv:2403.04919 (replaced) [pdf, ps, html, other]
-
Title: Identifying Causal Effects Under Functional Dependencies
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Methodology (stat.ME)
We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects.
- [207] arXiv:2405.00675 (replaced) [pdf, ps, html, other]
-
Title: Self-Play Probabilistic Preference Optimization for Language Model Alignment
Comments: 26 pages, 4 figures, 5 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Traditional reinforcement learning from human feedback (RLHF) approaches that rely on parametric models such as the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-play Probabilistic Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, SPPO uses only 60k prompts (without responses) from the UltraFeedback dataset, no prompt augmentation, and the pre-trained preference model PairRM with only 0.4B parameters. Fine-tuning Mistral-7B-Instruct-v0.2 in this way yields a model that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses or preferences) from GPT-4 or other stronger language models.
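The claim that a Bradley-Terry model cannot capture intransitive preferences can be illustrated with a small sketch. This is not the SPPO method; it only shows that a scalar-score model always induces a consistent ranking, whereas a cyclic preference table admits none. The function names and the preference values are hypothetical, chosen for illustration.

```python
import math
from itertools import permutations


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """P(a is preferred over b) under Bradley-Terry with scalar scores."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def consistent_ordering(items, prefs):
    """Return an ordering (best first) that agrees with every pairwise
    majority preference in `prefs`, or None if no such ordering exists.

    `prefs` maps a pair (a, b) to the probability that a is preferred to b.
    """
    for order in permutations(items):
        rank = {item: i for i, item in enumerate(order)}
        if all((rank[a] < rank[b]) == (p > 0.5) for (a, b), p in prefs.items()):
            return order
    return None


# Hypothetical preference tables (illustrative values only).
transitive = {("A", "B"): 0.7, ("B", "C"): 0.7, ("A", "C"): 0.8}
cyclic = {("A", "B"): 0.7, ("B", "C"): 0.7, ("C", "A"): 0.7}

print(consistent_ordering(["A", "B", "C"], transitive))  # an ordering exists
print(consistent_ordering(["A", "B", "C"], cyclic))      # no ordering exists
```

Because Bradley-Terry probabilities exceed 0.5 exactly when one scalar score exceeds the other, any table it produces passes the ordering check; the cyclic table does not, which is the kind of preference structure that motivates working with preference probabilities directly.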
- [208] arXiv:2405.06627 (replaced) [pdf, ps, html, other]
-
Title: Conformal Validity Guarantees Exist for Any Data Distribution
Comments: ICML 2024. Code available at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
As machine learning (ML) gains widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when ML systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction has emerged as a promising approach to uncertainty and risk quantification, but prior variants' validity guarantees have assumed some form of "quasi-exchangeability" on the data distribution, thereby excluding many types of sequential shifts. In this paper we prove that conformal prediction can theoretically be extended to any joint data distribution, not just exchangeable or quasi-exchangeable ones, although it is exceedingly impractical to compute in the most general case. For practical applications, we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of ML-agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
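For context, the baseline that this abstract generalizes can be sketched as standard split conformal prediction, whose coverage guarantee assumes exchangeable data. The sketch below is that textbook procedure, not the paper's algorithm for shifted distributions; the function names and the toy predictor are assumptions made for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    Returns infinity when the calibration set is too small for the
    requested coverage level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Split conformal interval using absolute residuals as scores.

    `predict` is any fitted point predictor (hypothetical here); coverage of
    at least 1 - alpha holds only if calibration and test points are
    exchangeable.
    """
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


# Toy usage with a hypothetical predictor y = 2x.
interval = predict_interval(
    lambda x: 2 * x,
    calib_x=[0, 1, 2, 3],
    calib_y=[0.5, 1.0, 4.0, 8.0],
    x_new=5,
    alpha=0.5,
)
print(interval)
```

When the data-collecting agent changes the distribution between calibration and test time, the exchangeability assumption in this sketch fails, which is the gap the abstract addresses.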
- [209] arXiv:2405.09784 (replaced) [pdf, ps, html, other]
-
Title: Online bipartite matching with imperfect advice
Comments: Accepted into ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study the problem of online unweighted bipartite matching with $n$ offline vertices and $n$ online vertices where one wishes to be competitive against the optimal offline algorithm. While the classic RANKING algorithm of Karp et al. [1990] provably attains a competitive ratio of $1-1/e > 1/2$, we show that no learning-augmented method can be both 1-consistent and strictly better than $1/2$-robust under the adversarial arrival model. Meanwhile, under the random arrival model, we show how one can utilize methods from distribution testing to design an algorithm that takes in external advice about the online vertices and provably achieves a competitive ratio interpolating between any ratio attainable by advice-free methods and the optimal ratio of 1, depending on the advice quality.
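The advice-free RANKING baseline mentioned above can be sketched as follows: draw one random permutation of the offline vertices up front, then match each arriving online vertex to its highest-ranked unmatched neighbor. This is a minimal sketch of the classic algorithm only, not the advice-based method of the paper; the function name and the input representation (a list of `(online_vertex, neighbors)` pairs in arrival order) are assumptions.

```python
import random


def ranking_matching(offline, arrivals, rng=None):
    """Run RANKING on an online bipartite matching instance.

    offline  : iterable of offline vertex labels
    arrivals : list of (online_vertex, neighbor_list) in arrival order
    rng      : optional random.Random instance for reproducibility
    Returns a dict mapping each matched offline vertex to its online partner.
    """
    if rng is None:
        rng = random.Random()
    order = list(offline)
    rng.shuffle(order)  # one random permutation, fixed before any arrival
    rank = {u: i for i, u in enumerate(order)}

    matched = {}
    for v, neighbors in arrivals:
        free = [u for u in neighbors if u not in matched]
        if free:
            best = min(free, key=rank.__getitem__)  # highest-ranked free neighbor
            matched[best] = v
    return matched


# Toy instance: each online vertex has a distinct single neighbor,
# so every arrival is matched regardless of the permutation.
offline = ["u1", "u2", "u3"]
arrivals = [("v1", ["u1"]), ("v2", ["u2"]), ("v3", ["u3"])]
print(ranking_matching(offline, arrivals, rng=random.Random(0)))
```

Since every arrival is matched whenever it has a free neighbor, the output is always a maximal matching, which already guarantees at least half the optimum; the $1-1/e$ ratio comes from the random permutation.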
- [210] arXiv:2405.10093 (replaced) [pdf, ps, html, other]
-
Title: LaT-PFN: A Joint Embedding Predictive Architecture for In-context Time-series Forecasting
Comments: 9 pages plus references and appendix, 2 tables, 11 figures, added seeds, corrections
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We introduce LatentTimePFN (LaT-PFN), a foundational time series model with a strong embedding space that enables zero-shot forecasting. To achieve this, we perform in-context learning in latent space utilizing a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks. We leverage the JEPA framework to create a prediction-optimized latent representation of the underlying stochastic process that generates time series and combine it with contextual learning, using a PFN. Furthermore, we improve on preceding works by utilizing related time series as a context and introducing a normalized abstract time axis. This reduces training time and increases the versatility of the model by allowing any time granularity and forecast horizon. We show that this results in superior zero-shot predictions compared to established baselines. We also demonstrate that our latent space produces informative embeddings of both individual time steps and fixed-length summaries of entire series. Finally, we observe the emergence of multi-step patch embeddings without explicit training, suggesting the model actively learns discrete tokens that encode local structures in the data, analogous to vision transformers.
- [211] arXiv:2405.10289 (replaced) [pdf, ps, html, other]
-
Title: Subgradient Convergence Implies Subdifferential Convergence on Weakly Convex Functions: With Uniform Rates Guarantees
Comments: This revision adds Lemma 1 and corrects several typos
Subjects: Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
In nonsmooth, nonconvex stochastic optimization, understanding the uniform convergence of subdifferential mappings is crucial for analyzing stationary points of sample average approximations of risk as they approach the population risk. Yet, characterizing this convergence remains a fundamental challenge.
This work introduces a novel perspective by connecting the uniform convergence of subdifferential mappings to that of subgradient mappings as empirical risk converges to the population risk. We prove that, for stochastic weakly-convex objectives, and within any open set, a uniform bound on the convergence of subgradients (chosen arbitrarily from the corresponding subdifferential sets) translates to a uniform bound on the convergence of the subdifferential sets themselves, measured by the Hausdorff metric.
Using this technique, we derive uniform convergence rates for subdifferential sets of stochastic convex-composite objectives. Our results do not rely on key distributional assumptions in the literature, which require the population and finite sample subdifferentials to be continuous in the Hausdorff metric, yet still provide tight convergence rates. These guarantees lead to new insights into the nonsmooth landscapes of such objectives within finite samples.
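For reference, the Hausdorff metric used above has the following standard definition for nonempty sets $A, B \subseteq \mathbb{R}^d$. The notation here ($d_H$, $\|\cdot\|$) is an assumption and may differ from the paper's.

```latex
d_H(A, B) \;=\; \max\Bigl\{\, \sup_{a \in A} \inf_{b \in B} \|a - b\|,\;\; \sup_{b \in B} \inf_{a \in A} \|a - b\| \,\Bigr\}
```

Convergence of subdifferential sets in this metric therefore requires every subgradient of one set to be close to some subgradient of the other, in both directions, which is a stronger statement than convergence of a single selected subgradient.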