Statistics Theory
- [1] arXiv:2406.08732 [pdf, html, other]
Title: Relative belief inferences from decision theory
Comments: arXiv admin note: substantial text overlap with arXiv:1104.3258
Subjects: Statistics Theory (math.ST)
Relative belief inferences are shown to arise as Bayes rules or limiting Bayes rules. These inferences are invariant under reparameterizations and possess a number of optimal properties. In particular, relative belief inferences are based on a direct measure of statistical evidence.
- [2] arXiv:2406.08808 [pdf, html, other]
Title: Smoothed NPMLEs in nonparametric Poisson mixtures and beyond
Comments: 20 pages
Subjects: Statistics Theory (math.ST)
We discuss nonparametric mixing distribution estimation under the Gaussian-smoothed optimal transport (GOT) distance. It is shown that a recently formulated conjecture -- that the Poisson nonparametric maximum likelihood estimator can achieve root-$n$ rate of convergence under the GOT distance -- holds up to some logarithmic terms. We also establish the same conclusion for other minimum-distance estimators, and discuss mixture models beyond the Poisson.
- [3] arXiv:2406.08892 [pdf, html, other]
Title: Minimaxity under the half-Cauchy prior
Comments: The title of this article is quite similar to that of our previous article on arXiv 2308.09339, in which we discussed some variants of the half-Cauchy prior. In this article, we focus on the half-Cauchy prior itself.
Subjects: Statistics Theory (math.ST)
This is a follow-up to Polson and Scott (2012, Bayesian Analysis), which claimed that the half-Cauchy prior is a sensible default prior for a scale parameter in hierarchical models. For estimation of a p-variate normal mean under the quadratic loss, they demonstrated through numerical experiments that the Bayes estimator with respect to the half-Cauchy prior seems to be minimax. In this paper, we theoretically establish the minimaxity of the corresponding Bayes estimator using interval arithmetic.
New submissions for Friday, 14 June 2024 (showing 3 of 3 entries)
- [4] arXiv:2406.08918 (cross-list from cs.CR) [pdf, html, other]
Title: Beyond the Calibration Point: Mechanism Comparison in Differential Privacy
Comments: ICML 2024
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
In differentially private (DP) machine learning, the privacy guarantees of DP mechanisms are often reported and compared on the basis of a single $(\varepsilon, \delta)$-pair. This practice overlooks that DP guarantees can vary substantially \emph{even between mechanisms sharing a given $(\varepsilon, \delta)$}, and potentially introduces privacy vulnerabilities which can remain undetected. This motivates the need for robust, rigorous methods for comparing DP guarantees in such cases. Here, we introduce the $\Delta$-divergence between mechanisms which quantifies the worst-case excess privacy vulnerability of choosing one mechanism over another in terms of $(\varepsilon, \delta)$, $f$-DP and in terms of a newly presented Bayesian interpretation. Moreover, as a generalisation of the Blackwell theorem, it is endowed with strong decision-theoretic foundations. Through application examples, we show that our techniques can facilitate informed decision-making and reveal gaps in the current understanding of privacy risks, as current practices in DP-SGD often result in choosing mechanisms with high excess privacy vulnerabilities.
- [5] arXiv:2406.09048 (cross-list from stat.ML) [pdf, other]
Title: Central Limit Theorem for Bayesian Neural Network trained with Variational Inference
Authors: Arnaud Descours (MAGNET), Tom Huix (X), Arnaud Guillin (LMBP), Manon Michel (LMBP), Éric Moulines (X), Boris Nectoux (LMBP)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
In this paper, we rigorously derive Central Limit Theorems (CLTs) for Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task. The different networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers have already been rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving CLTs, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which differs from that of the Minimal VI scheme. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of larger variances, thanks to its substantial gain in computational complexity.
- [6] arXiv:2406.09049 (cross-list from cs.LG) [pdf, html, other]
Title: Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path Diagrams
Comments: To appear in the proceedings of the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
For causal discovery in the presence of latent confounders, constraints beyond conditional independences exist that can enable causal discovery algorithms to distinguish more pairs of graphs. Such constraints are not well-understood yet. In the setting of linear structural equation models without bows, we study algebraic constraints and argue that these provide the most fine-grained resolution achievable. We propose efficient algorithms that decide whether two graphs impose the same algebraic constraints, or whether the constraints imposed by one graph are a subset of those imposed by another graph.
- [7] arXiv:2406.09169 (cross-list from cs.SI) [pdf, html, other]
Title: Empirical Networks are Sparse: Enhancing Multi-Edge Models with Zero-Inflation
Comments: 18 pages article + 9 pages SI, 4 figures
Subjects: Social and Information Networks (cs.SI); Statistics Theory (math.ST); Physics and Society (physics.soc-ph); Methodology (stat.ME)
Real-world networks are sparse. As we show in this article, even when a large number of interactions is observed, most node pairs remain disconnected. We demonstrate that classical multi-edge network models, such as the $G(N,p)$, configuration models, and stochastic block models, fail to accurately capture this phenomenon. To mitigate this issue, zero-inflation must be integrated into these traditional models. Through zero-inflation, we incorporate a mechanism that accounts for the excess number of zeroes (disconnected pairs) observed in empirical data. By performing an analysis on all the datasets from the Sociopatterns repository, we illustrate how zero-inflated models more accurately reflect the sparsity and heavy-tailed edge count distributions observed in empirical data. Our findings underscore that failing to account for these ubiquitous properties in real-world networks inadvertently leads to biased models which do not accurately represent complex systems and their dynamics.
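The zero-inflation mechanism described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that edge counts between a node pair follow a Poisson baseline; the abstract does not state this, and the names `poisson_pmf`, `zip_pmf`, `lam`, and `pi0` are hypothetical. With probability `pi0` a pair is structurally disconnected, and otherwise its edge count is drawn from the baseline, so the probability of observing zero edges is inflated.

```python
import math


def poisson_pmf(k, lam):
    """Probability of k edges under a Poisson(lam) baseline (illustrative assumption)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, lam, pi0):
    """Zero-inflated Poisson pmf.

    With probability pi0 the node pair is structurally disconnected (count 0);
    with probability 1 - pi0 the count follows Poisson(lam).
    """
    base = (1.0 - pi0) * poisson_pmf(k, lam)
    return pi0 + base if k == 0 else base


if __name__ == "__main__":
    lam, pi0 = 2.0, 0.5
    print("P(0 edges), plain Poisson:", poisson_pmf(0, lam))
    print("P(0 edges), zero-inflated:", zip_pmf(0, lam, pi0))
```

Setting `pi0 = 0` recovers the plain baseline, which is one way to see zero-inflation as an extension of a classical model rather than a replacement.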
- [8] arXiv:2406.09183 (cross-list from stat.ML) [pdf, html, other]
Title: Ridge interpolators in correlated factor regression models -- exact risk analysis
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider correlated \emph{factor} regression models (FRM) and analyze the performance of classical ridge interpolators. Utilizing the powerful \emph{Random Duality Theory} (RDT) mathematical engine, we obtain \emph{precise} closed form characterizations of the underlying optimization problems and all associated optimizing quantities. In particular, we provide \emph{excess prediction risk} characterizations that clearly show the dependence on all key model parameters, covariance matrices, loadings, and dimensions. As a function of the over-parametrization ratio, the generalized least squares (GLS) risk also exhibits the well known \emph{double-descent} (non-monotonic) behavior. Similarly to the classical linear regression models (LRM), we demonstrate that such an FRM phenomenon can be smoothened out by the optimally tuned ridge regularization. The theoretical results are supplemented by numerical simulations, and an excellent agreement between the two is observed. Moreover, we note that ``ridge smoothening'' is often of limited effect already for over-parametrization ratios above $5$ and of virtually no effect for those above $10$. This solidifies the notion that one of the recently most popular neural networks paradigms -- \emph{zero-training (interpolating) generalizes well} -- enjoys wider applicability, including the one within the FRM estimation/prediction context.
- [9] arXiv:2406.09194 (cross-list from stat.ML) [pdf, other]
Title: Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST)
Recent advances in machine learning theory showed that interpolation to noisy samples using over-parameterized machine learning algorithms always leads to inconsistency. However, this work surprisingly discovers that interpolated machine learning can exhibit benign overfitting and consistency when using physics-informed learning for supervised tasks governed by partial differential equations (PDEs) describing laws of physics. An analysis provides an asymptotic Sobolev norm learning curve for kernel ridge(less) regression addressing linear inverse problems involving elliptic PDEs. The results reveal that the PDE operators can stabilize variance and lead to benign overfitting for fixed-dimensional problems, contrasting standard regression settings. The impact of various inductive biases introduced by minimizing different Sobolev norms as implicit regularization is also examined. Notably, the convergence rate is independent of the specific (smooth) inductive bias for both ridge and ridgeless regression. For regularized least squares estimators, all (smooth enough) inductive biases can achieve optimal convergence rates when the regularization parameter is properly chosen. The smoothness requirement recovers a condition previously found in the Bayesian setting and extends conclusions to minimum norm interpolation estimators.
- [10] arXiv:2406.09195 (cross-list from stat.ME) [pdf, html, other]
Title: When Pearson $\chi^2$ and other divisible statistics are not goodness-of-fit tests
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)
Thousands of experiments are analyzed and papers are published each year involving the statistical analysis of grouped data. While this area of statistics is often perceived - somewhat naively - as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and what are the new possibilities in their hands.
Motivated by this need, the article introduces a unifying approach to the analysis of grouped data which allows us to study the class of divisible statistics - which includes Pearson's $\chi^2$ and the likelihood ratio as special cases - with a fresh perspective. The contributions collected in this manuscript span from modeling and estimation to distribution-free goodness-of-fit tests.
Perhaps the most surprising result presented here is that, in a sparse regime, all tests proposed in the literature are dominated by a class of weighted linear statistics.
- [11] arXiv:2406.09199 (cross-list from stat.ML) [pdf, html, other]
Title: Precise analysis of ridge interpolators under heavy correlations -- a Random Duality Theory view
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider fully row/column-correlated linear regression models and study several classical estimators (including minimum norm interpolators (GLS), ordinary least squares (LS), and ridge regressors). We show that \emph{Random Duality Theory} (RDT) can be utilized to obtain precise closed form characterizations of all estimator-related optimizing quantities of interest, including the \emph{prediction risk} (testing or generalization error). On a qualitative level, our results recover the risk's well known non-monotonic (so-called double-descent) behavior as the number of features/sample size ratio increases. On a quantitative level, our closed form results show how the risk explicitly depends on all key model parameters, including the problem dimensions and covariance matrices. Moreover, a special case of our results, obtained when intra-sample (or time-series) correlations are not present, precisely matches the corresponding ones obtained via spectral methods in [6,16,17,24].
- [12] arXiv:2406.09375 (cross-list from stat.ML) [pdf, html, other]
Title: Learning conditional distributions on continuous spaces
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal configurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For efficiency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical findings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally. For reproducibility, our code is available at \url{this https URL}.
Cross submissions for Friday, 14 June 2024 (showing 9 of 9 entries)
- [13] arXiv:2105.03122 (replaced) [pdf, html, other]
Title: The Coreness and H-Index of Random Geometric Graphs
Subjects: Statistics Theory (math.ST); Probability (math.PR)
In network analysis, a measure of node centrality provides a scale indicating how central a node is within a network. The coreness is a popular notion of centrality that accounts for the maximal smallest degree of a subgraph containing a given node. In this paper, we study the coreness of random geometric graphs and show that, with an increasing number of nodes and properly chosen connectivity radius, the coreness converges to a new object, that we call the continuum coreness. In the process, we show that other popular notions of centrality measures, namely the H-index and its iterates, also converge under the same setting to new limiting objects.
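For readers unfamiliar with the two centrality notions named in this abstract, the following is a minimal sketch on a finite graph, not the paper's continuum construction. It assumes an undirected simple graph stored as a dictionary of neighbour sets; the function names are hypothetical. Coreness is computed by the standard peeling procedure (repeatedly remove a minimum-degree node), and the H-index of a node is taken to be the H-index of its neighbours' degrees. Iterating the H-index operator, starting from degrees, is known to converge to the coreness on finite graphs, which the small example reproduces.

```python
def coreness(adj):
    """Coreness of every node via peeling; adj maps node -> set of neighbours."""
    deg = {v: len(nb) for v, nb in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=deg.__getitem__)  # current minimum-degree node
        k = max(k, deg[v])                       # core level never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core


def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    h = 0
    for i, x in enumerate(sorted(values, reverse=True), start=1):
        if x >= i:
            h = i
        else:
            break
    return h


def iterate_h(adj, rounds):
    """Apply the H-index operator `rounds` times, starting from node degrees."""
    vals = {v: len(adj[v]) for v in adj}
    for _ in range(rounds):
        vals = {v: h_index([vals[u] for u in adj[v]]) for v in adj}
    return vals


if __name__ == "__main__":
    # Triangle a-b-c with a pendant node d attached to c.
    adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
    print(coreness(adj))      # a, b, c have coreness 2; d has coreness 1
    print(iterate_h(adj, 1))  # H-index of each node
```

The peeling loop here is quadratic in the number of nodes because of the repeated `min`; a bucket queue would make it linear, but the simpler form keeps the definition visible.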
- [14] arXiv:2110.02318 (replaced) [pdf, other]
Title: Approximate Message Passing for orthogonally invariant ensembles: Multivariate non-linearities and spectral initialization
Comments: 68 pages, 4 figures. Accepted to Information and Inference: A Journal of the IMA
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We study a class of Approximate Message Passing (AMP) algorithms for symmetric and rectangular spiked random matrix models with orthogonally invariant noise. The AMP iterates have fixed dimension $K \geq 1$, a multivariate non-linearity is applied in each AMP iteration, and the algorithm is spectrally initialized with $K$ super-critical sample eigenvectors. We derive the forms of the Onsager debiasing coefficients and corresponding AMP state evolution, which depend on the free cumulants of the noise spectral distribution. This extends previous results for such models with $K=1$ and an independent initialization.
Applying this approach to Bayesian principal components analysis, we introduce a Bayes-OAMP algorithm that uses as its non-linearity the posterior mean conditional on all preceding AMP iterates. We describe a practical implementation of this algorithm, where all debiasing and state evolution parameters are estimated from the observed data, and we illustrate the accuracy and stability of this approach in simulations.
- [15] arXiv:2305.12883 (replaced) [pdf, html, other]
Title: Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors
Comments: 19 pages, 5 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
In recent years, there has been a significant growth in research focusing on minimum $\ell_2$ norm (ridgeless) interpolation least squares estimators. However, the majority of these analyses have been limited to an unrealistic regression error structure, assuming independent and identically distributed errors with zero mean and common variance. In this paper, we explore prediction risk as well as estimation risk under more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Notably, we establish that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data.
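As a small illustration of the estimator named in this abstract: when there are more features than samples, the minimum $\ell_2$ norm (ridgeless) interpolator is the shortest coefficient vector that fits the data exactly, $\hat\beta = X^\top (X X^\top)^{-1} y$. The sketch below is restricted, as a simplifying assumption of this example only, to $n = 2$ samples so that the $2 \times 2$ Gram matrix can be inverted in closed form; the function name is hypothetical and nothing here reflects the paper's error structure.

```python
def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X beta = y for exactly two samples.

    X is a list of two rows (each of length p >= 2, linearly independent);
    y is a list of two responses. Computes beta = X^T (X X^T)^{-1} y.
    """
    r1, r2 = X
    # Entries of the 2x2 Gram matrix X X^T.
    a = sum(u * u for u in r1)
    b = sum(u * v for u, v in zip(r1, r2))
    d = sum(v * v for v in r2)
    det = a * d - b * b
    if det == 0:
        raise ValueError("rows of X must be linearly independent")
    # alpha = (X X^T)^{-1} y via the closed-form 2x2 inverse.
    alpha1 = (d * y[0] - b * y[1]) / det
    alpha2 = (-b * y[0] + a * y[1]) / det
    # beta = X^T alpha is a combination of the rows of X.
    return [alpha1 * u + alpha2 * v for u, v in zip(r1, r2)]


if __name__ == "__main__":
    X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    y = [1.0, 2.0]
    beta = min_norm_interpolator(X, y)
    fitted = [sum(c * x for c, x in zip(beta, row)) for row in X]
    print(beta, fitted)  # fitted values reproduce y exactly (interpolation)
```

Because the solution lies in the row space of `X`, any other interpolating vector differs from it by a component orthogonal to the rows and therefore has a larger norm.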
- [16] arXiv:2403.12110 (replaced) [pdf, html, other]
Title: Robust estimations from distribution structures: I. Mean
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME); Other Statistics (stat.OT)
As the most fundamental problem in statistics, robust location estimation has many prominent solutions, such as the trimmed mean, Winsorized mean, Hodges Lehmann estimator, Huber M estimator, and median of means. Recent studies suggest that their maximum biases concerning the mean can be quite different, but the underlying mechanisms largely remain unclear. This study exploited a semiparametric method to classify distributions by the asymptotic orderliness of quantile combinations with varying breakdown points, showing their interrelations and connections to parametric distributions. Further deductions explain why the Winsorized mean typically has smaller biases compared to the trimmed mean; two sequences of semiparametric robust mean estimators emerge, particularly highlighting the superiority of the median Hodges Lehmann mean. This article sheds light on the understanding of the common nature of probability distributions.
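Three of the classical location estimators listed in this abstract can be sketched in a few lines. The definitions below are the textbook ones, not the semiparametric estimators the paper derives; the function names and the symmetric trimming count `k` are assumptions of this example. The Hodges-Lehmann estimator is taken here as the median of pairwise averages over distinct pairs (some definitions also include each point paired with itself).

```python
import statistics
from itertools import combinations


def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values (requires 2k < len(xs))."""
    s = sorted(xs)
    return statistics.fmean(s[k:len(s) - k])


def winsorized_mean(xs, k):
    """Mean after clamping the k smallest and k largest values to their nearest kept neighbours."""
    s = sorted(xs)
    lo, hi = s[k], s[len(s) - k - 1]
    return statistics.fmean([min(max(x, lo), hi) for x in s])


def hodges_lehmann(xs):
    """Median of the pairwise averages (x_i + x_j) / 2 over i < j."""
    return statistics.median([(a + b) / 2 for a, b in combinations(xs, 2)])


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100]          # one gross outlier
    print(statistics.fmean(data))     # 22.0, pulled far from the bulk
    print(trimmed_mean(data, 1))      # 3.0
    print(winsorized_mean(data, 1))   # 3.0
    print(hodges_lehmann(data))       # 3.25
```

On this sample all three robust estimators stay near the bulk of the data while the ordinary mean does not, which is the behaviour that motivates comparing their biases.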
- [17] arXiv:2403.14570 (replaced) [pdf, html, other]
Title: Robust estimations from distribution structures: II. Central Moments
Subjects: Statistics Theory (math.ST)
In descriptive statistics, $U$-statistics arise naturally in producing minimum-variance unbiased estimators. In 1984, Serfling considered the distribution formed by evaluating the kernel of the $U$-statistics and proposed generalized $L$-statistics, which include the Hodges-Lehmann estimator and the Bickel-Lehmann spread as special cases. However, the structures of the kernel distributions remain unclear. In 1954, Hodges and Lehmann demonstrated that if $X$ and $Y$ are independently sampled from the same unimodal distribution, $X-Y$ will exhibit symmetrical unimodality with its peak centered at zero. Building upon this foundational work, the current study delves into the structure of the kernel distribution. It is shown that the $\mathbf{k}$th central moment kernel distributions ($\mathbf{k}>2$) derived from a unimodal distribution are location invariant and also nearly unimodal, with the mode and median close to zero. This article provides an approach to study the general structure of kernel distributions.
- [18] arXiv:2403.16039 (replaced) [pdf, html, other]
Title: Robust estimations from distribution structures: III. Invariant Moments
Subjects: Statistics Theory (math.ST)
Descriptive statistics for parametric models are currently highly sensitive to departures, gross errors, and/or random errors. Here, leveraging the structures of parametric distributions and their central moment kernel distributions, a class of estimators, consistent simultaneously for both a semiparametric distribution and a distinct parametric distribution, is proposed. These efficient estimators are robust to both gross errors and departures from parametric assumptions, making them ideal for estimating the mean and central moments of common unimodal distributions. This article opens up the possibility of utilizing the common nature of probability models to construct near-optimal estimators that are suitable for various scenarios.
- [19] arXiv:2206.13037 (replaced) [pdf, other]
Title: Universality of Approximate Message Passing algorithms and tensor networks
Comments: 54 pages. Accepted to The Annals of Applied Probability
Subjects: Probability (math.PR); Information Theory (cs.IT); Statistics Theory (math.ST)
Approximate Message Passing (AMP) algorithms provide a valuable tool for studying mean-field approximations and dynamics in a variety of applications. Although these algorithms are often first derived for matrices having independent Gaussian entries or satisfying rotational invariance in law, their state evolution characterizations are expected to hold over larger universality classes of random matrix ensembles.
We develop several new results on AMP universality. For AMP algorithms tailored to independent Gaussian entries, we show that their state evolutions hold over broadly defined generalized Wigner and white noise ensembles, including matrices with heavy-tailed entries and heterogeneous entrywise variances that may arise in data applications. For AMP algorithms tailored to rotational invariance in law, we show that their state evolutions hold over delocalized sign-and-permutation-invariant matrix ensembles that have a limit distribution over the diagonal, including sensing matrices composed of subsampled Hadamard or Fourier transforms and diagonal operators.
We establish these results via a simplified moment-method proof, reducing AMP universality to the study of products of random matrices and diagonal tensors along a tensor network. As a by-product of our analyses, we show that the aforementioned matrix ensembles satisfy a notion of asymptotic freeness with respect to such tensor networks, which parallels usual definitions of freeness for traces of matrix products.
- [20] arXiv:2401.11422 (replaced) [pdf, html, other]
Title: Local Identification in Instrumental Variable Multivariate Quantile Regression Models
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)
In the instrumental variable quantile regression (IVQR) model of Chernozhukov and Hansen (2005), a one-dimensional unobserved rank variable monotonically determines a single potential outcome. Even when multiple outcomes are simultaneously of interest, it is common to apply the IVQR model to each of them separately. This practice implicitly assumes that the rank variable of each regression model affects only the corresponding outcome and does not affect the other outcomes. In reality, however, it is often the case that all rank variables together determine the outcomes, which leads to a systematic correlation between the outcomes. To deal with this, we propose a nonlinear IV model that allows for multivariate unobserved heterogeneity, each of which is considered as a rank variable for an observed outcome. We show that the structural function of our model is locally identified under the assumption that the IV and the treatment variable are sufficiently positively correlated.
- [21] arXiv:2404.19707 (replaced) [pdf, html, other]
Title: Identification by non-Gaussianity in structural threshold and smooth transition vector autoregressive models
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
Linear structural vector autoregressive models can be identified statistically without imposing restrictions on the model if the shocks are mutually independent and at most one of them is Gaussian. We show that this result extends to structural threshold and smooth transition vector autoregressive models incorporating a time-varying impact matrix defined as a weighted sum of the impact matrices of the regimes. We also discuss labelling of the shocks, maximum likelihood estimation of the parameters, and stationarity of the model. The introduced methods are implemented in the accompanying R package sstvars. Our empirical application studies the effects of the climate policy uncertainty shock on the U.S. macroeconomy. In a structural logistic smooth transition vector autoregressive model consisting of two regimes, we find that a positive climate policy uncertainty shock decreases production in times of low economic policy uncertainty but slightly increases it in times of high economic policy uncertainty.