Statistics Theory
Showing new listings for Tuesday, 5 November 2024
- [1] arXiv:2411.01112 [pdf, html, other]
Title: Optimal low-rank approximations of posteriors for linear Gaussian inverse problems on Hilbert spaces
Subjects: Statistics Theory (math.ST)
For linear inverse problems with Gaussian priors and observation noise, the posterior is Gaussian, with mean and covariance determined by the conditioning formula. We analyse measure approximation problems of finding the best approximation to the posterior in a family of Gaussians with approximate covariance or approximate mean, for Hilbert parameter spaces and finite-dimensional observations. We quantify the error of the approximating Gaussian either with the Kullback-Leibler divergence or the family of Rényi divergences. Using the Feldman-Hajek theorem and recent results on reduced-rank operator approximations, we identify optimal solutions to these measure approximation problems. Our results extend those of Spantini et al. (SIAM J. Sci. Comput. 2015) to Hilbertian parameter spaces. In addition, our results show that the posterior differs from the prior only on a subspace of dimension equal to the rank of the Hessian of the negative log-likelihood, and that this subspace is a subspace of the Cameron-Martin space of the prior.
- [2] arXiv:2411.01234 [pdf, html, other]
Title: Identifying and bounding the probability of necessity for causes of effects with ordinal outcomes
Subjects: Statistics Theory (math.ST)
Although the existing causal inference literature focuses on the forward-looking perspective by estimating effects of causes, the backward-looking perspective can provide insights into causes of effects. In backward-looking causal inference, the probability of necessity measures the probability that a certain event is caused by the treatment given the observed treatment and outcome. Most existing results focus on binary outcomes. Motivated by applications with ordinal outcomes, we propose a general definition of the probability of necessity. However, identifying the probability of necessity is challenging because it involves the joint distribution of the potential outcomes. We propose a novel assumption of monotonic incremental treatment effect to identify the probability of necessity with ordinal outcomes. We also discuss the testable implications of this key identification assumption. When it fails, we derive explicit formulas of the sharp large-sample bounds on the probability of necessity.
- [3] arXiv:2411.01237 [pdf, html, other]
Title: Sparse Linear Regression: Sequential Convex Relaxation, Robust Restricted Null Space Property, and Variable Selection
Comments: 38 pages, 4 figures
Subjects: Statistics Theory (math.ST)
For high dimensional sparse linear regression problems, we propose a sequential convex relaxation algorithm (iSCRA-TL1) by solving inexactly a sequence of truncated $\ell_1$-norm regularized minimization problems, in which the working index sets are constructed iteratively with an adaptive strategy. We employ the robust restricted null space property and sequential restricted null space property (rRNSP and rSRNSP) to provide the theoretical certificates of iSCRA-TL1. Specifically, under a mild rRNSP or rSRNSP, iSCRA-TL1 is shown to identify the support of the true $r$-sparse vector by solving at most $r$ truncated $\ell_1$-norm regularized problems, and the $\ell_1$-norm error bound of its iterates from the oracle solution is also established. As a consequence, an oracle estimator of high-dimensional linear regression problems can be achieved by solving at most $r\!+\!1$ truncated $\ell_1$-norm regularized problems. To the best of our knowledge, this is the first sequential convex relaxation algorithm to produce an oracle estimator under a weaker NSP condition within a specific number of steps, provided that the Lasso estimator lacks high quality, say, the supports of its first $r$ largest (in modulus) entries do not coincide with those of the true vector.
- [4] arXiv:2411.01275 [pdf, html, other]
Title: Optimal Private and Communication Constraint Distributed Goodness-of-Fit Testing for Discrete Distributions in the Large Sample Regime
Comments: To appear in the Thirty-eighth Conference on Neural Information Processing Systems -- 10 page article + 20 pages appendix and references
Subjects: Statistics Theory (math.ST)
We study distributed goodness-of-fit testing for discrete distributions under bandwidth and differential privacy constraints. Information constraint distributed goodness-of-fit testing is a problem that has received considerable attention recently. The important case of discrete distributions is theoretically well understood in the classical case where all data is available in one "central" location. In a federated setting, however, data is distributed across multiple "locations" (e.g. servers) and cannot readily be shared due to e.g. bandwidth or privacy constraints that each server needs to satisfy. We show how recently derived results for goodness-of-fit testing for the mean of a multivariate Gaussian model extend to discrete distributions, by leveraging Le Cam's theory of statistical equivalence. In doing so, we derive matching minimax upper and lower bounds for goodness-of-fit testing for discrete distributions under bandwidth or privacy constraints in the regime where the number of samples held locally is large.
- [5] arXiv:2411.01563 [pdf, other]
Title: Statistical guarantees for denoising reflected diffusion models
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
In recent years, denoising diffusion models have become a crucial area of research due to their abundance in the rapidly expanding field of generative AI. While recent statistical advances have delivered explanations for the generation ability of idealised denoising diffusion models for high-dimensional target data, implementations introduce thresholding procedures for the generating process to overcome issues arising from the unbounded state space of such models. This mismatch between theoretical design and implementation of diffusion models has been addressed empirically by using a reflected diffusion process as the driver of noise instead. In this paper, we study statistical guarantees of these denoising reflected diffusion models. In particular, we establish minimax optimal rates of convergence in total variation, up to a polylogarithmic factor, under Sobolev smoothness assumptions. Our main contributions include the statistical analysis of this novel class of denoising reflected diffusion models and a refined score approximation method in both time and space, leveraging spectral decomposition and rigorous neural network analysis.
- [6] arXiv:2411.01884 [pdf, html, other]
Title: Frequentist Oracle Properties of Bayesian Stacking Estimators
Subjects: Statistics Theory (math.ST)
Compromise estimation entails using a weighted average of outputs from several candidate models, and is a viable alternative to model selection when the choice of model is not obvious. As such, it is a tool used by both frequentists and Bayesians, and in both cases, the literature is vast and includes studies of performance in simulations and applied examples. However, frequentist researchers often prove oracle properties, showing that a proposed average asymptotically performs at least as well as any other average comprising the same candidates. On the Bayesian side, such oracle properties are yet to be established. This paper considers Bayesian stacking estimators, and evaluates their performance using frequentist asymptotics. Oracle properties are derived for estimators stacking Bayesian linear and logistic regression models, and combined with Monte Carlo experiments that show Bayesian stacking may outperform the best candidate model included in the stack. Thus, the result is not only a frequentist motivation of a fundamentally Bayesian procedure, but also an extended range of methods available to frequentist practitioners.
- [7] arXiv:2411.01888 [pdf, html, other]
Title: The Long Time Limit of Diffusion Means
Comments: 21 pages, no figures
Subjects: Statistics Theory (math.ST); Differential Geometry (math.DG)
In statistics on manifolds, the notion of the mean of a probability distribution becomes more involved than in a linear space. Several location statistics have been proposed, which reduce to the ordinary mean in Euclidean space. A relatively new family of contenders in this field are Diffusion Means, which are a one parameter family of location statistics modeled as initial points of isotropic diffusion with the diffusion time as parameter. It is natural to consider limit cases of the diffusion time parameter and it turns out that for short times the diffusion mean set approaches the intrinsic mean set. For long diffusion times, the limit is less obvious but for spheres of arbitrary dimension the diffusion mean set has been shown to converge to the extrinsic mean set. Here, we extend this result to the real projective spaces in their unique smooth isometric embedding into a linear space. We conjecture that the long time limit is always given by the extrinsic mean in the isometric embedding for connected compact symmetric spaces with unique isometric embedding.
- [8] arXiv:2411.02137 [pdf, html, other]
Title: Finite-sample performance of the maximum likelihood estimator in logistic regression
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Logistic regression is a classical model for describing the probabilistic dependence of binary responses to multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second that of its accuracy when it exists. These properties depend on both the dimension of covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
- [9] arXiv:2411.02357 [pdf, html, other]
Title: On statistical independence and density independence
Subjects: Statistics Theory (math.ST); Number Theory (math.NT)
The object of observation in the present paper is the statistical independence of real sequences and its description as independence with respect to a certain class of densities.
New submissions (showing 9 of 9 entries)
- [10] arXiv:2411.01382 (cross-list from stat.ME) [pdf, html, other]
Title: On MCMC mixing under unidentified nonparametric models with an application to survival predictions under transformation models
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The multi-modal posterior under unidentified nonparametric models yields poor mixing of Markov Chain Monte Carlo (MCMC), which is a stumbling block to Bayesian predictions. In this article, we conceptualize a prior informativeness threshold that is essentially the variance of posterior modes and is expressed by the uncertainty hyperparameters of nonparametric priors. The threshold plays the role of a lower bound on the within-chain MCMC variance to ensure MCMC mixing, and drives prior modification through hyperparameter tuning to reduce the mode variance. Our method differs from existing postprocessing methods in that it directly samples well-mixed MCMC chains on the unconstrained space, and inherits the original posterior predictive distribution in predictive inference. Our method succeeds in Bayesian survival predictions under an unidentified nonparametric transformation model, guarded by the inferential theories of the posterior variance, under elicitation of two delicate nonparametric priors. Comprehensive simulations and real-world data analysis demonstrate that our method achieves MCMC mixing and outperforms existing approaches in survival predictions.
- [11] arXiv:2411.01588 (cross-list from stat.ME) [pdf, html, other]
Title: Statistical Inference on High Dimensional Gaussian Graphical Regression Models
Comments: 27 Pages, 4 figures, 4 tables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Gaussian graphical regressions have emerged as a powerful approach for regressing the precision matrix of a Gaussian graphical model on covariates, which, unlike traditional Gaussian graphical models, can help determine how graphs are modulated by high dimensional subject-level covariates, and recover both the population-level and subject-level graphs. To fit the model, a multi-task learning approach achieves lower error rates compared to node-wise regressions. However, due to the high complexity and dimensionality of the Gaussian graphical regression problem, the important task of statistical inference remains unexplored. We propose a class of debiased estimators based on multi-task learners for statistical inference in Gaussian graphical regressions. We show that debiasing can be performed quickly and separately for the multi-task learners. In a key debiasing step that estimates the inverse covariance matrix, we propose a novel projection technique that dramatically reduces computational costs in optimization to scale only with the sample size $n$. We show that our debiased estimators enjoy a fast convergence rate and asymptotically follow a normal distribution, enabling valid statistical inference such as constructing confidence intervals and performing hypothesis testing. Simulation studies confirm the practical utility of the proposed approach, and we further apply it to analyze gene co-expression graph data from a brain cancer study, revealing meaningful biological relationships.
- [12] arXiv:2411.01629 (cross-list from stat.ML) [pdf, html, other]
Title: Denoising Diffusions with Optimal Transport: Localization, Curvature, and Multi-Scale Complexity
Comments: 29 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Adding noise is easy; what about denoising? Diffusion is easy; what about reverting a diffusion? Diffusion-based generative models aim to denoise a Langevin diffusion chain, moving from a log-concave equilibrium measure $\nu$, say isotropic Gaussian, back to a complex, possibly non-log-concave initial measure $\mu$. The score function performs denoising, going backward in time, predicting the conditional mean of the past location given the current. We show that score denoising is the optimal backward map in transportation cost. What is its localization uncertainty? We show that the curvature function determines this localization uncertainty, measured as the conditional variance of the past location given the current. We study in this paper the effectiveness of the diffuse-then-denoise process: the contraction of the forward diffusion chain, offset by the possible expansion of the backward denoising chain, governs the denoising difficulty. For any initial measure $\mu$, we prove that this offset net contraction at time $t$ is characterized by the curvature complexity of a smoothed $\mu$ at a specific signal-to-noise ratio (SNR) scale $r(t)$. We discover that the multi-scale curvature complexity collectively determines the difficulty of the denoising chain. Our multi-scale complexity quantifies a fine-grained notion of average-case curvature instead of the worst-case. Curiously, it depends on an integrated tail function, measuring the relative mass of locations with positive curvature versus those with negative curvature; denoising at a specific SNR scale is easy if such an integrated tail is light. We conclude with several non-log-concave examples to demonstrate how the multi-scale complexity probes the bottleneck SNR for the diffuse-then-denoise process.
- [13] arXiv:2411.02141 (cross-list from math.PR) [pdf, html, other]
Title: More on Round-Robin Tournament Models with a Unique Maximum Score
Subjects: Probability (math.PR); Combinatorics (math.CO); Statistics Theory (math.ST)
In this note we extend a recent result showing the uniqueness of the maximum score in a classical round-robin tournament to the round-robin tournament models with equally strong players.
- [14] arXiv:2411.02184 (cross-list from stat.ML) [pdf, html, other]
Title: Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
While overparameterization is known to benefit generalization, its impact on Out-Of-Distribution (OOD) detection is less understood. This paper investigates the influence of model complexity in OOD detection. We propose an expected OOD risk metric to evaluate classifier confidence on both training and OOD samples. Leveraging Random Matrix Theory, we derive bounds for the expected OOD risk of binary least-squares classifiers applied to Gaussian data. We show that the OOD risk exhibits an infinite peak when the number of parameters is equal to the number of samples, which we associate with the double descent phenomenon. Our experimental study on different OOD detection methods across multiple neural architectures extends our theoretical insights and highlights a double descent curve. Our observations suggest that overparameterization does not necessarily lead to better OOD detection. Using the Neural Collapse framework, we provide insights to better understand this behavior. To facilitate reproducibility, our code will be made publicly available upon publication.
- [15] arXiv:2411.02225 (cross-list from stat.ML) [pdf, html, other]
Title: Variable Selection in Convex Piecewise Linear Regression
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
This paper presents Sparse Gradient Descent as a solution for variable selection in convex piecewise linear regression where the model is given as $\mathrm{max}\langle a_j^\star, x \rangle + b_j^\star$ for $j = 1,\dots,k$ where $x \in \mathbb R^d$ is the covariate vector. Here, $\{a_j^\star\}_{j=1}^k$ and $\{b_j^\star\}_{j=1}^k$ denote the ground-truth weight vectors and intercepts. A non-asymptotic local convergence analysis is provided for Sp-GD under sub-Gaussian noise when the covariate distribution satisfies sub-Gaussianity and anti-concentration property. When the model order and parameters are fixed, Sp-GD provides an $\epsilon$-accurate estimate given $\mathcal{O}(\max(\epsilon^{-2}\sigma_z^2,1)s\log(d/s))$ observations where $\sigma_z^2$ denotes the noise variance. This also implies the exact parameter recovery by Sp-GD from $\mathcal{O}(s\log(d/s))$ noise-free observations. Since optimizing the squared loss for sparse max-affine is non-convex, an initialization scheme is proposed to provide a suitable initial estimate within the basin of attraction for Sp-GD, i.e. sufficiently accurate to invoke the convergence guarantees. The initialization scheme uses sparse principal component analysis to estimate the subspace spanned by $\{ a_j^\star\}_{j=1}^k$ then applies an $r$-covering search to estimate the model parameters. A non-asymptotic analysis is presented for this initialization scheme when the covariates and noise samples follow Gaussian distributions. When the model order and parameters are fixed, this initialization scheme provides an $\epsilon$-accurate estimate given $\mathcal{O}(\epsilon^{-2}\max(\sigma_z^4,\sigma_z^2,1)s^2\log^4(d))$ observations. Numerical Monte Carlo results corroborate theoretical findings for Sp-GD and the initialization scheme.
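The max-affine model in the abstract above can be made concrete with a small sketch. This is only an illustration of the model form $\max_j \langle a_j, x \rangle + b_j$, not the Sp-GD algorithm or its initialization scheme. The function names `max_affine` and `joint_support` are hypothetical, and reading sparsity as "the set of covariates used by at least one weight vector" is an assumption made here for illustration.

```python
from typing import List, Sequence


def max_affine(x: Sequence[float],
               weights: Sequence[Sequence[float]],
               intercepts: Sequence[float]) -> float:
    """Evaluate max_j (<a_j, x> + b_j) for a single covariate vector x."""
    if not weights or len(weights) != len(intercepts):
        raise ValueError("need one intercept per weight vector")
    return max(
        sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
        for a, b in zip(weights, intercepts)
    )


def joint_support(weights: Sequence[Sequence[float]], tol: float = 0.0) -> List[int]:
    """Indices of covariates with a nonzero coefficient in at least one a_j.

    Treating this set as the 'selected variables' is an assumption of this sketch.
    """
    d = len(weights[0])
    return [i for i in range(d) if any(abs(a[i]) > tol for a in weights)]


# Toy example with k = 2 affine pieces in d = 3 dimensions (made-up numbers).
A = [[1.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
b = [0.0, 0.5]
x = [1.0, 5.0, -1.0]
value = max_affine(x, A, b)     # pieces evaluate to 1.0 and 2.5
selected = joint_support(A)     # the second covariate is never used
```

In this toy setup the second covariate has no influence on the response, which is the kind of structure variable selection is meant to detect.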
- [16] arXiv:2411.02298 (cross-list from cs.LG) [pdf, html, other]
Title: Sample-Efficient Private Learning of Mixtures of Gaussians
Comments: 52 pages. To appear in Neural Information Processing Systems (NeurIPS), 2024
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
We study the problem of learning mixtures of Gaussians with approximate differential privacy. We prove that roughly $kd^2 + k^{1.5} d^{1.75} + k^2 d$ samples suffice to learn a mixture of $k$ arbitrary $d$-dimensional Gaussians up to low total variation distance, with differential privacy. Our work improves over the previous best result [AAL24b] (which required roughly $k^2 d^4$ samples) and is provably optimal when $d$ is much larger than $k^2$. Moreover, we give the first optimal bound for privately learning mixtures of $k$ univariate (i.e., $1$-dimensional) Gaussians. Importantly, we show that the sample complexity for privately learning mixtures of univariate Gaussians is linear in the number of components $k$, whereas the previous best sample complexity [AAL21] was quadratic in $k$. Our algorithms utilize various techniques, including the inverse sensitivity mechanism [AD20b, AD20a, HKMN23], sample compression for distributions [ABDH+20], and methods for bounding volumes of sumsets.
Cross submissions (showing 7 of 7 entries)
- [17] arXiv:2006.02329 (replaced) [pdf, html, other]
Title: Conformal e-testing
Comments: 21 pages and 2 figures
Subjects: Statistics Theory (math.ST)
There is a useful counterpart of conformal prediction for e-values, called conformal e-prediction. Conformal prediction can serve as a basis for testing the assumption of exchangeability, leading to conformal testing. Similarly, conformal e-prediction can also serve as a basis for testing. The resulting conformal e-testing looks very different from but inherits some strengths of conformal testing; it even has some advantages over conformal testing. In this paper we discuss systematically both strengths and limitations of conformal e-testing.
- [18] arXiv:2204.10488 (replaced) [pdf, html, other]
Title: The Equivariance Criterion in a Linear Model for Fixed-X Cases
Subjects: Statistics Theory (math.ST)
In this article, we explored the use of the equivariance criterion for estimation in the linear model with fixed X and extended the model to allow multiple populations, which, in turn, leads to a larger transformation group. The minimum risk equivariant estimators of the coefficient vector and the covariance matrix were derived via the maximal invariants, a result consistent with earlier works. This article serves as an early exploration of the equivariance criterion in the linear model.
- [19] arXiv:2210.10395 (replaced) [pdf, html, other]
Title: Grenander--Stone estimator: stacked constrained estimation of a discrete distribution over a general directed acyclic graph
Subjects: Statistics Theory (math.ST)
In this paper we integrate isotonic regression with Stone's cross-validation-based method to estimate a distribution with a general countable support with a partial order relation defined on it. We prove that the estimator is strongly consistent for any underlying distribution, derive its rate of convergence, and in the case of one-dimensional support we obtain a Marshall-type inequality for the cumulative distribution function of the estimator. Also, we construct the asymptotically correct conservative global confidence band for the estimator. It is shown that, first, the estimator performs well even for small data sets, second, the estimator outperforms in the case of a non-isotonic underlying distribution, and, third, it performs almost as well as the Grenander estimator when the true distribution is isotonic. Therefore, the new estimator provides a trade-off between goodness-of-fit, monotonicity and quality of probabilistic forecast. We apply the estimator to the time-to-onset data of visceral leishmaniasis in Brazil collected from 2007 to 2014.
- [20] arXiv:2308.06899 (replaced) [pdf, html, other]
Title: Improved dimension dependence in the Bernstein von Mises Theorem via a new Laplace approximation bound
Comments: Changes from v2: BvM on logistic regression extended to arbitrary GLMs
Subjects: Statistics Theory (math.ST)
The Bernstein-von Mises theorem (BvM) gives conditions under which the posterior distribution of a parameter $\theta\in\Theta\subseteq\mathbb R^d$ based on $n$ independent samples is asymptotically normal. In the high-dimensional regime, a key question is to determine the growth rate of $d$ with $n$ required for the BvM to hold. We show that up to a model-dependent coefficient, $n\gg d^2$ suffices for the BvM to hold in two settings: arbitrary generalized linear models, which include exponential families as a special case, and multinomial data, in which the parameter of interest is an unknown probability mass function on $d+1$ states. Our results improve on the tightest previously known condition for posterior asymptotic normality, $n\gg d^3$. Our statements of the BvM are nonasymptotic, taking the form of explicit high-probability bounds. To prove the BvM, we derive a new simple and explicit bound on the total variation distance between a measure $\pi\propto e^{-nf}$ on $\Theta\subseteq\mathbb R^d$ and its Laplace approximation.
- [21] arXiv:2405.06877 (replaced) [pdf, html, other]
Title: On the orthogonally equivariant estimators of a covariance matrix
Subjects: Statistics Theory (math.ST)
In this note, when the dimension $p$ is large, we draw on the insight of the Marčenko-Pastur equation to get an explicit equality relationship, and use the obtained equality to establish a new kind of orthogonally equivariant estimator of the population covariance matrix. Under some regularity conditions, the proposed novel estimators of the population eigenvalues are shown to be consistent for the eigenvalues of the population covariance matrix. It is also shown that the proposed estimator is the best orthogonally equivariant estimator of the population covariance matrix under the normalized Stein loss function.
- [22] arXiv:2405.11246 (replaced) [pdf, html, other]
Title: On the consistent estimators of the population covariance matrix and its reparameterizations
Comments: arXiv admin note: text overlap with arXiv:2405.06877
Subjects: Statistics Theory (math.ST)
For the high-dimensional covariance estimation problem, when $\lim_{n\to \infty}p/n=c \in (0,1)$ the orthogonally equivariant estimator of the population covariance matrix proposed by Tsai and Tsai (2024b) enjoys some optimal properties. Under some regularity conditions, they showed that their novel estimators of eigenvalues are consistent for the eigenvalues of the population covariance matrix. In this note, first, we show that their novel estimator is a consistent estimator of the population covariance matrix under a high-dimensional asymptotic setup. Moreover, we may show that the novel estimator is the MLE of the population covariance matrix when $c \in (0, 1)$. The novel estimator is incorporated to establish the optimal decomposite $T_{T}^{2}$-test for a high-dimensional statistical hypothesis testing problem and to make the statistical inference for the high-dimensional principal component analysis-related problems without the sparsity assumption. Some remarks when $p > n$, especially for the high-dimensional low-sample size categorical data models $p \gg n$, are made in the final section.
- [23] arXiv:2405.20909 (replaced) [pdf, html, other]
Title: Nonparametric regression on random geometric graphs sampled from submanifolds
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the nonparametric regression problem when the covariates are located on an unknown smooth compact submanifold of a Euclidean space. By defining a random geometric graph structure over the covariates, we analyze the asymptotic frequentist behaviour of the posterior distribution arising from Bayesian priors designed through random basis expansion in the graph Laplacian eigenbasis. Under Hölder smoothness assumptions on the regression function and the density of the covariates over the submanifold, we prove that the posterior contraction rates of such methods are minimax optimal (up to logarithmic factors) for any positive smoothness index.
- [24] arXiv:2406.07972 (replaced) [pdf, html, other]
Title: Expected value and a Cayley-Menger type formula for the generalized earth mover's distance
Comments: 20 pages; the second half of this paper supersedes my preprint arXiv:2306.12030, and thus there is some overlap of exposition
Subjects: Statistics Theory (math.ST)
The earth mover's distance (EMD), also known as the 1-Wasserstein metric, measures the minimum amount of work required to transform one probability distribution into another. The EMD can be naturally generalized to measure the "distance" between any number (say $d$) of distributions. In previous work (2021), we found a recursive formula for the expected value of the generalized EMD, assuming the uniform distribution on the standard $n$-simplex. This recursion, however, was computationally expensive, requiring $\binom{d+n}{d}$ many iterations. The main result of the present paper is a nonrecursive formula for this expected value, expressed as the integral of a certain polynomial of degree at most $dn$. As a secondary result, we resolve an unanswered problem by giving a formula for the generalized EMD in terms of pairwise EMDs; this can be viewed as an analogue of the Cayley-Menger determinant formula that gives the hypervolume of a simplex in terms of its edge lengths.
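For readers unfamiliar with the EMD, the classical two-distribution case can be sketched in a few lines. The sketch below assumes two probability vectors on the same ordered bins with unit distance between adjacent bins (an assumption made here, not a detail taken from the paper), in which case the 1-Wasserstein distance equals the sum of absolute differences between the two cumulative distribution functions. It does not implement the generalized EMD for $d$ distributions or the expected-value formula; the name `emd_1d` is hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Earth mover's distance between two histograms on the same ordered bins.

    Assumes unit ground distance between adjacent bins, so the EMD is the
    L1 distance between the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    if abs(sum(p) - 1.0) > 1e-9 or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("p and q must each sum to 1")
    # Running sum of (p_i - q_i) is the difference of the two CDFs at bin i.
    cdf_gap = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(g) for g in cdf_gap)


# Moving all mass from the first bin to the third costs 2 units of work.
far = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# Shifting half of the mass by two bins costs 1 unit of work.
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```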
- [25] arXiv:2407.05281 (replaced) [pdf, html, other]
Title: Tail Index Estimation for Discrete Heavy-Tailed Distributions
Subjects: Statistics Theory (math.ST)
It is the purpose of this paper to investigate the issue of estimating the regularity index $\beta>0$ of a discrete heavy-tailed r.v. $S$, i.e. a r.v. $S$ valued in $\mathbb{N}^*$ such that $\mathbb{P}(S>n)=L(n)\cdot n^{-\beta}$ for all $n\geq 1$, where $L:\mathbb{R}^*_+\to \mathbb{R}_+$ is a slowly varying function. As a first go, we consider the situation where inference is based on independent copies $S_1,\; \ldots,\; S_n$ of the generic variable $S$. Just like the popular Hill estimator in the continuous heavy-tail situation, the estimator $\widehat{\beta}$ we propose can be derived by means of a suitable reformulation of the regularly varying condition, replacing $S$'s survivor function by its empirical counterpart. Under mild assumptions, a non-asymptotic bound for the deviation between $\widehat{\beta}$ and $\beta$ is established, as well as limit results (consistency and asymptotic normality). Beyond the i.i.d. case, the proposed inference method is extended to the estimation of the regularity index of a regenerative $\beta$-null recurrent Markov chain. Since the parameter $\beta$ can then be viewed as the tail index of the (regularly varying) distribution of the return time of the chain $X$ to any (pseudo-) regenerative set, in this case, the estimator is constructed from the successive regeneration times. Because the durations between consecutive regeneration times are asymptotically independent, we can prove that the consistency of the proposed estimator is preserved. In addition to the theoretical analysis carried out, simulation results provide empirical evidence of the relevance of the proposed inference technique.
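As background for the comparison drawn in the abstract above, here is a minimal sketch of the classical Hill estimator for continuous heavy tails. It is not the discrete estimator $\widehat{\beta}$ proposed in the paper. The function name `hill_tail_index` and the choice of `k` (the number of upper order statistics used) are illustrative assumptions.

```python
import math
from typing import Sequence


def hill_tail_index(sample: Sequence[float], k: int) -> float:
    """Classical Hill estimate of the tail index beta from the k largest values.

    Computes gamma = (1/k) * sum_{i=1..k} log(X_(n-i+1) / X_(n-k)) and
    returns 1 / gamma, an estimate of beta when P(X > x) ~ L(x) * x**(-beta).
    """
    n = len(sample)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    xs = sorted(sample, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold order statistic must be positive")
    gamma = sum(math.log(x / threshold) for x in xs[:k]) / k
    if gamma == 0.0:
        raise ValueError("the k largest values are all tied with the threshold")
    return 1.0 / gamma


# Deterministic check on exact Pareto quantiles with tail index 2.
n, beta = 10_000, 2.0
pareto_quantiles = [((i - 0.5) / n) ** (-1.0 / beta) for i in range(1, n + 1)]
estimate = hill_tail_index(pareto_quantiles, k=500)
```

On real data the estimate depends on `k`, so it is common to inspect it over a range of `k` values rather than rely on a single choice.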
- [26] arXiv:2407.15256 (replaced) [pdf, other]
Title: Weak-instrument-robust subvector inference in instrumental variables regression: A subvector Lagrange multiplier test and properties of subvector Anderson-Rubin confidence sets
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM)
We propose a weak-instrument-robust subvector Lagrange multiplier test for instrumental variables regression. We show that it is asymptotically size-correct under a technical condition. This is the first weak-instrument-robust subvector test for instrumental variables regression to recover the degrees of freedom of the commonly used non-weak-instrument-robust Wald test. Additionally, we provide a closed-form solution for subvector confidence sets obtained by inverting the subvector Anderson-Rubin test. We show that they are centered around a k-class estimator. Also, we show that the subvector confidence sets for single coefficients of the causal parameter are jointly bounded if and only if Anderson's likelihood-ratio test rejects the hypothesis that the first-stage regression parameter is of reduced rank, that is, that the causal parameter is not identified. Finally, we show that if a confidence set obtained by inverting the Anderson-Rubin test is bounded and nonempty, it is equal to a Wald-based confidence set with a data-dependent confidence level. We explicitly compute this Wald-based confidence test.
- [27] arXiv:2410.01427 (replaced) [pdf, html, other]
-
Title: Regularized e-processes: anytime valid inference with knowledge-based efficiency gains
Comments: Comments welcome (via email or) at this https URL
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)
Classical statistical methods have theoretical justification when the sample size is predetermined. In applications, however, it's often the case that sample sizes aren't predetermined; instead, they're often data-dependent. Since those methods designed for static sample sizes aren't reliable when sample sizes are dynamic, there's been recent interest in e-processes and corresponding tests and confidence sets that are anytime valid in the sense that their justification holds up for arbitrary dynamic data-collection plans. But if the investigator has relevant-yet-incomplete prior information about the quantity of interest, then there's an opportunity for efficiency gain, but existing approaches can't accommodate this. The present paper offers a new, regularized e-process framework that features a knowledge-based, imprecise-probabilistic regularization with improved efficiency. A generalized version of Ville's inequality is established, ensuring that inference based on the regularized e-process remains anytime valid in a novel, knowledge-dependent sense. In addition, the proposed regularized e-processes facilitate possibility-theoretic uncertainty quantification with strong frequentist-like calibration properties and other desirable Bayesian-like features: satisfies the likelihood principle, avoids sure-loss, and offers formal decision-making with reliability guarantees.
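To make the notion of an anytime-valid test concrete, here is a minimal sketch of an ordinary, unregularized e-process: a running likelihood ratio for a simple Bernoulli null. It does not implement the regularized framework of the paper. The Bernoulli model, the parameter names `p_null` and `p_alt`, and the helper names are assumptions made for illustration. By the classical Ville inequality, rejecting the first time the process reaches $1/\alpha$ has type I error at most $\alpha$ under any stopping rule.

```python
def likelihood_ratio_e_process(observations, p_null, p_alt):
    """Running product of Bernoulli likelihood ratios (alternative over null).

    Under the simple null Bernoulli(p_null) this product is a nonnegative
    martingale started at 1, which is what Ville's inequality requires.
    """
    e_value = 1.0
    path = []
    for x in observations:
        if x == 1:
            ratio = p_alt / p_null
        else:
            ratio = (1.0 - p_alt) / (1.0 - p_null)
        e_value *= ratio
        path.append(e_value)
    return path


def first_rejection_time(path, alpha):
    """Return the first n (1-indexed) with E_n >= 1/alpha, or None if never."""
    for n, e_value in enumerate(path, start=1):
        if e_value >= 1.0 / alpha:
            return n
    return None


if __name__ == "__main__":
    path = likelihood_ratio_e_process([1, 1, 0, 1, 1, 1], p_null=0.5, p_alt=0.7)
    print(path, first_rejection_time(path, alpha=0.05))
```

Because the guarantee holds at every time simultaneously, the analyst may look at the data after each observation and stop whenever the threshold is crossed.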
- [28] arXiv:2410.02629 (replaced) [pdf, html, other]
-
Title: Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression
Comments: Camera-ready version of NeurIPS 2024 paper
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)
This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates.
- [29] arXiv:2109.02024 (replaced) [pdf, html, other]
-
Title: On the dependence between a Wiener process and its running maxima and running minima processes
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We study a triple of stochastic processes: a Wiener process $W_t$, $t \geq 0$, its running maxima process $M_t=\sup \{W_s: s \in [0,t]\}$ and its running minima process $m_t=\inf \{W_s: s \in [0,t]\}$. We derive the analytical formulas for the joint distribution function and the corresponding copula. As an application we draw out an analytical formula for pricing double barrier options.
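The triple $(W_t, M_t, m_t)$ defined above can be illustrated by simulating a discretised path. The sketch below approximates the Wiener process by a Gaussian random walk on a uniform grid, which is an assumption of this example and not a construction from the paper. On a grid the running extrema are slightly biased toward zero relative to the continuous-time supremum and infimum. The function name and parameters are hypothetical.

```python
import math
import random


def simulate_wiener_with_extrema(n_steps, horizon=1.0, seed=0):
    """Simulate W on a uniform grid of [0, horizon] with its running max and min.

    Returns three lists of length n_steps + 1: the path W, the running
    maximum M, and the running minimum m, all starting at 0.
    """
    rng = random.Random(seed)
    step_sd = math.sqrt(horizon / n_steps)  # increments are N(0, dt)
    w = [0.0]
    running_max = [0.0]
    running_min = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, step_sd))
        running_max.append(max(running_max[-1], w[-1]))
        running_min.append(min(running_min[-1], w[-1]))
    return w, running_max, running_min


if __name__ == "__main__":
    w, big_m, small_m = simulate_wiener_with_extrema(1000, seed=42)
    print(w[-1], big_m[-1], small_m[-1])
```

Repeating the simulation over many seeds gives an empirical joint sample of $(W_t, M_t, m_t)$ against which an analytical distribution function could be compared.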
- [30] arXiv:2204.10291 (replaced) [pdf, html, other]
-
Title: Structural Nested Mean Models Under Parallel Trends Assumptions
Authors: Zach Shahn, Oliver Dukes, Meghana Shamsunder, David Richardson, Eric Tchetgen Tchetgen, James Robins
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We link and extend two approaches to estimating time-varying treatment effects on repeated continuous outcomes--time-varying Difference in Differences (DiD; see Roth et al. (2023) and Chaisemartin et al. (2023) for reviews) and Structural Nested Mean Models (SNMMs; see Vansteelandt and Joffe (2014) for a review). In particular, we show that SNMMs, which were previously only known to be nonparametrically identified under a no unobserved confounding assumption, are also identified under a generalized version of the parallel trends assumption typically used to justify time-varying DiD methods. Because SNMMs model a broader set of causal estimands, our results allow practitioners of existing time-varying DiD approaches to address additional types of substantive questions under similar assumptions. SNMMs enable estimation of time-varying effect heterogeneity, lasting effects of a `blip' of treatment at a single time point, effects of sustained interventions (possibly on continuous or multi-dimensional treatments) when treatment repeatedly changes value in the data, controlled direct effects, effects of dynamic treatment strategies that depend on covariate history, and more. Our results also allow analysts who apply SNMMs under the no unobserved confounding assumption to estimate some of the same causal effects under alternative identifying assumptions. We provide a method for sensitivity analysis to violations of our parallel trends assumption. We further explain how to estimate optimal treatment regimes via optimal regime SNMMs under parallel trends assumptions plus an assumption that there is no effect modification by unobserved confounders. Finally, we illustrate our methods with real data applications estimating effects of Medicaid expansion on uninsurance rates, effects of floods on flood insurance take-up, and effects of sustained changes in temperature on crop yields.
- [31] arXiv:2305.15592 (replaced) [pdf, html, other]
-
Title: Large Sample Theory for Bures-Wasserstein Barycentres
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We establish a strong law of large numbers and a central limit theorem in the Bures-Wasserstein space of covariance operators -- or equivalently centred Gaussian measures -- over a general separable Hilbert space. Specifically, we show that empirical barycentre sequences indexed by sample size are almost certainly relatively compact, with accumulation points comprising population barycentres. We give a sufficient regularity condition for the limit to be unique. When the limit is unique, we also establish a central limit theorem under a refined pair of moment and regularity conditions.
Finally, we prove strong operator convergence of the empirical optimal transport maps to their population counterparts. Though our results naturally extend finite-dimensional counterparts, including associated regularity conditions, our techniques are distinctly different owing to the functional nature of the problem in the general setting. A key element is the characterisation of compact sets in the Bures-Wasserstein topology that reflects an ordered Heine-Borel property of the Bures-Wasserstein space.
- [32] arXiv:2309.13441 (replaced) [pdf, html, other]
-
Title: Anytime valid and asymptotically optimal inference driven by predictive recursion
Comments: Comments welcome at this https URL
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Distinguishing two candidate models is a fundamental and practically important statistical problem. Error rate control is crucial to the testing logic but, in complex nonparametric settings, can be difficult to achieve, especially when the stopping rule that determines the data collection process is not available. This paper proposes an e-process construction based on the predictive recursion (PR) algorithm originally designed to recursively fit nonparametric mixture models. The resulting PRe-process affords anytime valid inference and is asymptotically efficient in the sense that its growth rate is first-order optimal relative to PR's mixture model.
- [33] arXiv:2401.14277 (replaced) [pdf, html, other]
-
Title: An Instance-Based Approach to the Trace Reconstruction Problem
Comments: 7 pages, part of this paper was presented at the 58th Annual Conference on Information Sciences and Systems (CISS 2024), funding information added in updated document, an error in the presentation of the main results in the CISS 2024 version of the paper is fixed in the updated document
Subjects: Information Theory (cs.IT); Data Structures and Algorithms (cs.DS); Probability (math.PR); Statistics Theory (math.ST)
In the trace reconstruction problem, one observes the output of passing a binary string $s \in \{0,1\}^n$ through a deletion channel $T$ times and wishes to recover $s$ from the resulting $T$ "traces." Most of the literature has focused on characterizing the hardness of this problem in terms of the number of traces $T$ needed for perfect reconstruction either in the worst case or in the average case (over input sequences $s$). In this paper, we propose an alternative, instance-based approach to the problem. We define the "Levenshtein difficulty" of a problem instance $(s,T)$ as the probability that the resulting traces do not provide enough information for correct recovery with full certainty. One can then try to characterize, for a specific $s$, how $T$ needs to scale in order for the Levenshtein difficulty to go to zero, and seek reconstruction algorithms that match this scaling for each $s$. We derive a lower bound on the Levenshtein difficulty, and prove that $T$ needs to scale exponentially fast in $n$ for the Levenshtein difficulty to approach zero for a very broad class of strings. For a class of binary strings with alternating long runs, we design an algorithm whose probability of reconstruction error approaches zero whenever the Levenshtein difficulty approaches zero. For this class, we also prove that the error probability of this algorithm decays to zero at least as fast as the Levenshtein difficulty.
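The observation model described above can be sketched in a few lines. The abstract does not state the channel parameters, so this example assumes the standard i.i.d. deletion channel in which each bit is deleted independently with probability `q`. That assumption, the function names, and the example string are all illustrative and not taken from the paper.

```python
import random


def deletion_channel(s, q, rng):
    """Pass the string s through a deletion channel.

    Each symbol is deleted independently with probability q (an assumed
    i.i.d. model); the surviving symbols keep their original order.
    """
    return "".join(ch for ch in s if rng.random() >= q)


def collect_traces(s, num_traces, q, seed=0):
    """Generate num_traces independent traces of s."""
    rng = random.Random(seed)
    return [deletion_channel(s, q, rng) for _ in range(num_traces)]


def is_subsequence(trace, s):
    """Check that trace can be obtained from s by deleting symbols."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so order is respected.
    return all(ch in remaining for ch in trace)


if __name__ == "__main__":
    for trace in collect_traces("0011100011", num_traces=5, q=0.3, seed=7):
        print(trace)
```

Every trace is a subsequence of $s$, and distinct input strings can produce identical sets of traces, which is why recovery with full certainty is not always possible from a given collection.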
- [34] arXiv:2404.15060 (replaced) [pdf, html, other]
-
Title: Fast and reliable confidence intervals for a variance component
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We show that confidence intervals in a variance component model, with asymptotically correct uniform coverage probability, can be obtained by inverting certain test-statistics based on the score for the restricted likelihood. The results apply in settings where the variance is near or at the boundary of the parameter set. Simulations indicate the proposed test-statistics are approximately pivotal and lead to confidence intervals with near-nominal coverage even in small samples. We illustrate our methods' application in spatially-resolved transcriptomics where we compute approximately 15,000 confidence intervals, used for gene ranking, in less than 4 minutes. In the settings we consider, the proposed method is between two and 28,000 times faster than popular alternatives, depending on how many confidence intervals are computed.
- [35] arXiv:2409.12799 (replaced) [pdf, html, other]
-
Title: The Central Role of the Loss Function in Reinforcement Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy's cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide for analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.
- [36] arXiv:2409.20207 (replaced) [pdf, other]
-
Title: New matrix perturbation bounds via combinatorial expansion I: Perturbation of eigenspaces
Subjects: Spectral Theory (math.SP); Combinatorics (math.CO); Functional Analysis (math.FA); Probability (math.PR); Statistics Theory (math.ST)
Matrix perturbation bounds (such as Weyl and Davis-Kahan) are frequently used in many branches of mathematics. Most of the classical results in this area are optimal in the worst-case analysis. However, in modern applications, both the ground and the noise matrices frequently have extra structural properties. For instance, it is often assumed that the ground matrix is essentially low rank, and the noise matrix is random or pseudo-random. We aim to rebuild a part of perturbation theory, adapting to these modern assumptions. The key idea is to exploit the skewness between the leading eigenvectors of the ground matrix and the noise matrix. We will do this by combining the classical contour integration method with combinatorial ideas, resulting in a new machinery, which has a wide range of applications. Our new bounds are optimal under mild assumptions, with direct applications to central problems in many different areas. Among others, we derive a sharp result for the perturbation of a low rank matrix with random perturbation, answering an open question in this area. Next, we derive new, optimal results concerning the covariance estimator of the spiked model, an important model in statistics, bridging two different directions of current research. Finally, and somewhat unexpectedly, we can use our results on the perturbation of eigenspaces to derive new results concerning eigenvalues of deterministic and random matrices. In particular, we obtain new results concerning the outliers in the deformed Wigner model and the least singular value of random matrices with non-zero mean.
- [37] arXiv:2410.21922 (replaced) [pdf, html, other]
-
Title: Prior Knowledge Accelerate Variance Computing
Subjects: Computation (stat.CO); Statistics Theory (math.ST)
Variance is a basic measure of data dispersion and is used throughout statistics. However, computing the variance of a large dataset is time-consuming, so there is a need to accelerate the computation. The paper proposes a method that reduces this cost. It assumes a scenario in which the variance of the original dataset is already known and new data are merged into it, and it expresses the variance of the merged dataset as the sum of the original variance and a remainder term. To obtain the total variance after the merge, the method computes only the remainder instead of recalculating the variance from scratch. The authors name this type of method PKA (Prior Knowledge Acceleration). The paper mathematically proves the effectiveness of PKA in variance calculation and gives the conditions under which the method accelerates the computation.
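The idea of updating a known variance instead of recomputing it can be illustrated with the standard identity for merging two summarised datasets. This sketch uses that textbook identity for the population variance (division by $n$). It is not necessarily the decomposition or remainder term derived in the paper, and the function name and argument layout are assumptions.

```python
from statistics import pvariance


def merge_variance(n1, mean1, var1, n2, mean2, var2):
    """Combine counts, means, and population variances of two datasets.

    Uses the standard pairwise identity
        var = (n1*var1 + n2*var2)/n + n1*n2*(mean2 - mean1)^2 / n^2,
    so the original data never has to be revisited.
    """
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    var = (n1 * var1 + n2 * var2) / n + n1 * n2 * delta ** 2 / n ** 2
    return n, mean, var


if __name__ == "__main__":
    old = [1.0, 2.0, 3.0, 4.0]   # dataset whose variance is already known
    new = [10.0, 12.0]           # newly arrived data
    summary_old = (len(old), sum(old) / len(old), pvariance(old))
    summary_new = (len(new), sum(new) / len(new), pvariance(new))
    print(merge_variance(*summary_old, *summary_new))
    print(pvariance(old + new))  # direct recomputation, for comparison
```

Only the new batch needs to be scanned, so the update costs time proportional to the size of the added data and not to the size of the merged dataset.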