Statistics
Showing new listings for Thursday, 21 November 2024
- [1] arXiv:2411.12786 [pdf, html, other]
Title: Off-policy estimation with adaptively collected data: the power of online learning
Comments: 37 pages. Accepted to the 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, British Columbia, Canada
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
We consider estimation of a linear functional of the treatment effect using adaptively collected data. This task has a variety of applications, including off-policy evaluation (OPE) in contextual bandits and estimation of the average treatment effect (ATE) in causal inference. While a certain class of augmented inverse propensity weighting (AIPW) estimators enjoys desirable asymptotic properties, including semi-parametric efficiency, much less is known about their non-asymptotic theory with adaptively collected data. To fill this gap, we first establish generic upper bounds on the mean-squared error of the class of AIPW estimators that crucially depend on a sequentially weighted error between the treatment effect and its estimates. Motivated by this, we also propose a general reduction scheme that allows one to produce a sequence of estimates for the treatment effect via online learning to minimize the sequentially weighted estimation error. To illustrate this, we provide three concrete instantiations in (i) the tabular case; (ii) the case of linear function approximation; and (iii) the case of general function approximation for the outcome model. We then provide a local minimax lower bound to show the instance-dependent optimality of the AIPW estimator using no-regret online learning algorithms.
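As a point of reference for the abstract above, the textbook AIPW estimator of the ATE can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not the paper's setting: the data are i.i.d. rather than adaptively collected, the propensity is known, and the outcome models are fixed functions rather than a sequence produced by online learning. The names `aipw_ate` and `simulate` are hypothetical.

```python
import random

def aipw_ate(data, mu0, mu1, propensity):
    """Textbook AIPW estimate of the ATE from (x, a, y) triples.

    mu0, mu1: outcome-model predictions for control and treatment.
    propensity: probability of treatment given x (assumed known here).
    """
    total = 0.0
    for x, a, y in data:
        e = propensity(x)
        m0, m1 = mu0(x), mu1(x)
        # Outcome-model difference plus inverse-propensity-weighted residuals.
        total += (m1 - m0) + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e)
    return total / len(data)

def simulate(n, seed=0):
    """Toy i.i.d. data with a constant treatment effect of 2 (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        e = 0.3 + 0.4 * (x > 0)          # propensity is 0.3 or 0.7
        a = 1 if rng.random() < e else 0
        y = 1.0 + x + 2.0 * a + rng.gauss(0, 1)
        data.append((x, a, y))
    return data

data = simulate(20000)
estimate = aipw_ate(
    data,
    mu0=lambda x: 1.0 + x,
    mu1=lambda x: 3.0 + x,
    propensity=lambda x: 0.3 + 0.4 * (x > 0),
)
print(round(estimate, 2))
```

With correctly specified outcome models, the correction terms average pure noise, so the estimate should land close to the simulated effect of 2.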
- [2] arXiv:2411.12840 [pdf, other]
Title: The Aldous-Hoover Theorem in Categorical Probability
Comments: 39 pages
Subjects: Statistics Theory (math.ST); Logic in Computer Science (cs.LO); Category Theory (math.CT); Probability (math.PR)
The Aldous-Hoover Theorem concerns an infinite matrix of random variables whose distribution is invariant under finite permutations of rows and columns. It states that, up to equality in distribution, each random variable in the matrix can be expressed as a function only depending on four key variables: one common to the entire matrix, one that encodes information about its row, one that encodes information about its column, and a fourth one specific to the matrix entry.
We state and prove the theorem within a category-theoretic approach to probability, namely the theory of Markov categories. This makes the proof more transparent and intuitive when compared to measure-theoretic ones. A key role is played by a newly identified categorical property, the Cauchy-Schwarz axiom, which also facilitates a new synthetic de Finetti Theorem.
We further provide a variant of our proof using the ordered Markov property and the d-separation criterion, both generalized from Bayesian networks to Markov categories. We expect that this approach will facilitate a systematic development of more complex results in the future, such as categorical approaches to hierarchical exchangeability.
- [3] arXiv:2411.12854 [pdf, other]
Title: A new Input Convex Neural Network with application to options pricing
Comments: 29 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We introduce a new class of neural networks designed to be convex functions of their inputs, leveraging the principle that any convex function can be represented as the supremum of the affine functions it dominates. These neural networks, inherently convex with respect to their inputs, are particularly well-suited for approximating the prices of options with convex payoffs. We detail the architecture of these networks and establish theoretical convergence bounds that validate their approximation capabilities. We also introduce a scrambling phase to improve the training of these networks. Finally, we demonstrate numerically the effectiveness of these networks in estimating prices for three types of options with convex payoffs: Basket, Bermudan, and Swing options.
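The principle quoted in the abstract above (a convex function is the supremum of the affine functions it dominates) can be illustrated with a plain max-affine function. This is only a sketch of that principle, not the paper's network architecture or its scrambling phase; `max_affine` and the tangent-line construction for f(t) = t^2 are illustrative assumptions.

```python
def max_affine(x, slopes, intercepts):
    """Evaluate max_k (a_k . x + b_k), which is convex in x by construction."""
    return max(
        sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
        for a, b in zip(slopes, intercepts)
    )

# Tangent lines of f(t) = t^2 at t0 have slope 2*t0 and intercept -t0^2.
# Each tangent lies below f, so their maximum approximates f from below.
points = [i / 10 for i in range(-20, 21)]
slopes = [[2 * t] for t in points]
intercepts = [-t * t for t in points]

approx = max_affine([0.55], slopes, intercepts)
print(approx, 0.55 ** 2)
```

Adding more tangent points tightens the approximation, which mirrors the idea that richer families of affine pieces approximate a convex target more closely.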
- [4] arXiv:2411.12871 [pdf, html, other]
Title: Modelling Directed Networks with Reciprocity
Subjects: Methodology (stat.ME)
Asymmetric relational data is increasingly prevalent across diverse fields, underscoring the need for directed network models to address the complex challenges posed by their unique structures. Unlike undirected models, directed models can capture reciprocity, the tendency of nodes to form mutual links. In this work, we address a fundamental question: what is the effective sample size for modeling reciprocity? We examine this by analyzing the Bernoulli model with reciprocity, allowing for varying sparsity levels between non-reciprocal and reciprocal effects. We then extend this framework to a model that incorporates node-specific heterogeneity and link-specific reciprocity using covariates. Our findings reveal intriguing interplays between non-reciprocal and reciprocal effects in sparse networks. We propose a straightforward inference procedure based on maximum likelihood estimation that operates without prior knowledge of sparsity levels, whether covariates are included or not.
- [5] arXiv:2411.12878 [pdf, other]
Title: Local Anti-Concentration Class: Logarithmic Regret for Greedy Linear Contextual Bandit
Comments: NeurIPS 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the performance guarantees of exploration-free greedy algorithms for the linear contextual bandit problem. We introduce a novel condition, named the Local Anti-Concentration (LAC) condition, which enables a greedy bandit algorithm to achieve provable efficiency. We show that the LAC condition is satisfied by a broad class of distributions, including Gaussian, exponential, uniform, Cauchy, and Student's $t$ distributions, along with other exponential family distributions and their truncated variants. This significantly expands the class of distributions under which greedy algorithms can perform efficiently. Under our proposed LAC condition, we prove that the cumulative expected regret of the greedy algorithm for the linear contextual bandit is bounded by $O(\operatorname{poly} \log T)$. Our results establish the widest range of distributions known to date that allow a sublinear regret bound for greedy algorithms, further achieving a sharp poly-logarithmic regret.
- [6] arXiv:2411.12889 [pdf, html, other]
Title: Goodness-of-fit tests for generalized Poisson distributions
Subjects: Methodology (stat.ME)
This paper presents and examines computationally convenient goodness-of-fit tests for the family of generalized Poisson distributions, which encompasses notable distributions such as the Compound Poisson and the Katz distributions. The tests are consistent against fixed alternatives and their null distribution can be consistently approximated by a parametric bootstrap. The goodness of the bootstrap estimator and the power for finite sample sizes are numerically assessed through an extensive simulation experiment, including comparisons with other tests. In many cases, the novel tests either outperform or match the performance of existing ones. Real data applications are considered for illustrative purposes.
- [7] arXiv:2411.12936 [pdf, html, other]
Title: Statistical inference for mean-field queueing systems
Subjects: Statistics Theory (math.ST); Probability (math.PR)
Mean-field limits are now a standard tool for approximation, including for networks with a large number of nodes. Statistical inference on mean-field models has attracted more attention recently, mainly due to the rapid emergence of data-driven systems. However, studies reported in the literature have been mainly limited to continuous models. In this paper, we initiate a study of statistical inference on discrete mean-field models (or jump processes) in terms of a well-known and extensively studied model, known as the power-of-L, or the supermarket model, to demonstrate how to deal with new challenges in discrete models. We focus on system parameter estimation based on the observations of system states at discrete time epochs over a finite period. We show that by harnessing the weak convergence results developed for the supermarket model in the literature, an asymptotic inference scheme based on an approximate least squares estimation can be obtained from the mean-field limiting equation. Also, by leveraging the law of large numbers alongside the central limit theorem, the consistency of the estimator and its asymptotic normality can be established when the number of servers and the number of observations go to infinity. Moreover, numerical results for the power-of-two model are provided to show the efficiency and accuracy of the proposed estimator.
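For orientation, the classical mean-field limiting equation of the supermarket model can be integrated numerically as sketched below. The sketch assumes the standard form with arrival rate `lam` per server, unit-rate exponential service, and `L >= 2` choices, where `s[k]` is the fraction of queues holding at least `k` jobs. It illustrates the limiting equation only, not the paper's least squares estimator; the function names, Euler scheme, and truncation depth are hypothetical choices.

```python
def supermarket_fixed_point(lam, L, k):
    """Known fixed point s_k = lam ** ((L**k - 1) / (L - 1)), valid for L >= 2."""
    return lam ** ((L ** k - 1) / (L - 1))

def integrate_mean_field(lam, L, depth=20, dt=0.01, horizon=200.0):
    """Euler integration of ds_k/dt = lam*(s_{k-1}^L - s_k^L) - (s_k - s_{k+1})."""
    # s[0] = 1 always; s[depth + 1] stays 0 as a truncation of the infinite system.
    s = [1.0] + [0.0] * (depth + 1)
    for _ in range(int(horizon / dt)):
        ds = [0.0] * (depth + 2)
        for k in range(1, depth + 1):
            ds[k] = lam * (s[k - 1] ** L - s[k] ** L) - (s[k] - s[k + 1])
        for k in range(1, depth + 1):
            s[k] += dt * ds[k]
    return s

s = integrate_mean_field(lam=0.5, L=2)
for k in (1, 2, 3):
    print(k, round(s[k], 4), round(supermarket_fixed_point(0.5, 2, k), 4))
```

Starting from an empty system, the trajectory should settle near the fixed point, whose tail decays doubly exponentially in `k` when `L >= 2`.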
- [8] arXiv:2411.12938 [pdf, html, other]
Title: Probability distributions and calculations for Hake's ratio statistics in measuring effect size
Comments: 23 pages, 5 figures, 1 table
Subjects: Computation (stat.CO); Data Analysis, Statistics and Probability (physics.data-an)
Ratio statistics and distributions play a crucial role in various fields, including linear regression, metrology, nuclear physics, operations research, econometrics, biostatistics, genetics, and engineering. In this work, we examine the statistical properties and probability calculations of the Hake normalized gain as a measure of effect size and educational effectiveness in physics education. Leveraging existing knowledge about the Hake ratio as a ratio of normal variables and utilizing open data science tools, we developed two novel computational approaches for computing ratio distributions. Our pilot numerical study demonstrates the speed, accuracy, and reliability of calculating ratio distributions through (1) DE quadrature with/without barycentric interpolation, a very quick and efficient quadrature method, and (2) a 2D vectorized numerical inversion of characteristic functions, which offers broader applicability by not requiring knowledge of PDFs or the independence of ratio constituents. These numerical explorations not only deepen the understanding of the Hake ratio's distribution but also showcase the efficiency, precision, and versatility of our proposed methods, making them highly suitable for fast data analysis based on exact probability ratio distributions. This capability has potential applications in multidimensional statistics and uncertainty analysis in metrology, where precise and reliable data handling is essential.
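The Hake normalized gain discussed above is commonly defined as (post - pre) / (max - pre), with scores on a 0 to 100 scale. The sketch below computes it and runs a brute-force Monte Carlo check of its distribution under normal scores. The Monte Carlo is a stand-in for illustration, not the paper's DE quadrature or characteristic-function inversion, and the means and standard deviations are made-up assumptions.

```python
import random

def hake_gain(pre, post, max_score=100.0):
    """Normalized gain: fraction of the available improvement actually achieved."""
    if pre >= max_score:
        raise ValueError("pre-test score must be below the maximum score")
    return (post - pre) / (max_score - pre)

def simulate_gains(n, seed=0):
    """Sample gains with independent normal pre/post scores (illustrative values)."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n):
        pre = rng.gauss(40.0, 5.0)
        post = rng.gauss(70.0, 5.0)
        gains.append(hake_gain(pre, post))
    return sorted(gains)

gains = simulate_gains(20000)
median_gain = gains[len(gains) // 2]
print(hake_gain(40.0, 70.0), round(median_gain, 3))
```

Because the pre-test score appears in both numerator and denominator, the two are correlated even when pre and post scores are independent, which is one reason ratio distributions of this kind need dedicated numerical treatment.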
- [9] arXiv:2411.12944 [pdf, html, other]
Title: From Estimands to Robust Inference of Treatment Effects in Platform Trials
Subjects: Methodology (stat.ME)
A platform trial is an innovative clinical trial design that uses a master protocol (i.e., one overarching protocol) to evaluate multiple treatments in an ongoing manner and can accelerate the evaluation of new treatments. However, the flexibility that marks the potential of platform trials also creates inferential challenges. Two key challenges are the precise definition of treatment effects and the robust and efficient inference on these effects. To address these challenges, we first define a clinically meaningful estimand that characterizes the treatment effect as a function of the expected outcomes under two given treatments among concurrently eligible patients. Then, we develop weighting and post-stratification methods for estimation of treatment effects with minimal assumptions. To fully leverage the efficiency potential of data from concurrently eligible patients, we also consider a model-assisted approach for baseline covariate adjustment to gain efficiency while maintaining robustness against model misspecification. We derive and compare asymptotic distributions of proposed estimators in theory and propose robust variance estimators. The proposed estimators are empirically evaluated in a simulation study and illustrated using the SIMPLIFY trial. Our methods are implemented in the R package RobinCID.
- [10] arXiv:2411.12965 [pdf, html, other]
Title: On adaptivity and minimax optimality of two-sided nearest neighbors
Comments: 29 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a Hölder function class that is less smooth than Lipschitz. Our results establish the following favorable properties for a suitable two-sided NN: (1) the mean squared error (MSE) of NN adapts to the smoothness of the non-linearity; (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors; and (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps.
- [11] arXiv:2411.12995 [pdf, html, other]
Title: Eliminating Ratio Bias for Gradient-based Simulated Parameter Estimation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
This article addresses the challenge of parameter calibration in stochastic models where the likelihood function is not analytically available. We propose a gradient-based simulated parameter estimation framework, leveraging a multi-time scale algorithm that tackles the issue of ratio bias in both maximum likelihood estimation and posterior density estimation problems. Additionally, we introduce a nested simulation optimization structure, providing theoretical analyses including strong convergence, asymptotic normality, convergence rate, and budget allocation strategies for the proposed algorithm. The framework is further extended to neural network training, offering a novel perspective on stochastic approximation in machine learning. Numerical experiments show that our algorithm can improve the estimation accuracy and save computational costs.
- [12] arXiv:2411.13080 [pdf, html, other]
Title: Distribution-free Measures of Association based on Optimal Transport
Comments: 24 pages. To appear in the Indian J. Pure Appl. Math, special issue in honor of Prof. K. R. Parthasarathy. arXiv admin note: text overlap with arXiv:2010.01768
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)
In this paper we propose and study a class of nonparametric, yet interpretable measures of association between two random vectors $X$ and $Y$ taking values in $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$ respectively ($d_1, d_2\ge 1$). These nonparametric measures -- defined using the theory of reproducing kernel Hilbert spaces coupled with optimal transport -- capture the strength of dependence between $X$ and $Y$ and have the property that they are 0 if and only if the variables are independent and 1 if and only if one variable is a measurable function of the other. Further, these population measures can be consistently estimated using the general framework of geometric graphs which include $k$-nearest neighbor graphs and minimum spanning trees. Additionally, these measures can also be readily used to construct an exact finite sample distribution-free test of mutual independence between $X$ and $Y$. In fact, as far as we are aware, these are the only procedures that possess all the above mentioned desirable properties. The correlation coefficient proposed in Dette et al. (2013), Chatterjee (2021), Azadkia and Chatterjee (2021), at the population level, can be seen as a special case of this general class of measures.
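The abstract above names the correlation coefficient of Chatterjee (2021) as a special case. For univariate data without ties, that coefficient has a simple rank-based form, sketched below. This covers only the special case, not the kernel and optimal-transport measures proposed in the paper, and the no-ties assumption is a simplification made here.

```python
import random

def chatterjee_xi(xs, ys):
    """Chatterjee's xi_n = 1 - 3 * sum|r_{i+1} - r_i| / (n^2 - 1), assuming no ties."""
    n = len(xs)
    # Reorder the y values by increasing x.
    order = sorted(range(n), key=lambda i: xs[i])
    y_sorted = [ys[i] for i in order]
    # Rank of each y value among all y values (1 = smallest).
    rank_of = {v: r for r, v in enumerate(sorted(y_sorted), start=1)}
    ranks = [rank_of[v] for v in y_sorted]
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)

rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
ys_dependent = [x * x for x in xs]              # y is a function of x
ys_independent = [rng.random() for _ in xs]     # y is independent of x
print(round(chatterjee_xi(xs, ys_dependent), 3))
print(round(chatterjee_xi(xs, ys_independent), 3))
```

The first value should be close to 1 and the second close to 0, matching the property stated in the abstract for the population-level measure.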
- [13] arXiv:2411.13131 [pdf, html, other]
Title: Bayesian Parameter Estimation of Normal Distribution from Sample Mean and Extreme Values
Subjects: Methodology (stat.ME)
This paper proposes a Bayesian method for estimating the parameters of a normal distribution when only limited summary statistics (sample mean, minimum, maximum, and sample size) are available. We introduce a data augmentation approach using the Gibbs sampler, in which intermediate values are treated as missing values and sampled from a truncated normal distribution conditional on the observed sample mean, minimum, and maximum values. Through simulation studies, we demonstrate that our method achieves estimation accuracy comparable to theoretical expectations.
- [14] arXiv:2411.13199 [pdf, html, other]
Title: Optimal Rates for Multiple Models in Matrix Completion
Comments: 35 pages. All comments are warmly welcomed
Subjects: Statistics Theory (math.ST)
In this paper, we demonstrate how a class of advanced matrix concentration inequalities, introduced in \cite{brailovskaya2024universality}, can be used to eliminate the dimensional factor in the convergence rate of matrix completion. This dimensional factor represents a significant gap between the upper bound and the minimax lower bound, especially in high dimension. Through a more precise spectral norm analysis, we remove the dimensional factors for five different estimators in various settings, thereby establishing their minimax rate optimality.
- [15] arXiv:2411.13203 [pdf, other]
Title: A computational framework for integrating Predictive processes with evidence Accumulation Models (PAM)
Authors: Antonino Visalli, Francesco Maria Calistroni, Margherita Calderan, Francesco Donnarumma, Marco Zorzi, Ettore Ambrosini
Subjects: Applications (stat.AP)
Evidence Accumulation Models (EAMs) have been widely used to investigate speeded decision-making processes, but they have largely neglected the role of predictive processes emphasized by theories of the predictive brain. In this paper, we present the Predictive evidence Accumulation Models (PAM), a novel computational framework that integrates predictive processes into EAMs. Grounded in the "observing the observer" framework, PAM combines models of Bayesian perceptual inference, such as the Hierarchical Gaussian Filter, with three established EAMs (the Diffusion Decision Model, Lognormal Race Model, and Race Diffusion Model) to model decision-making under uncertainty. We validate PAM through parameter recovery simulations, demonstrating its accuracy and computational efficiency across various decision-making scenarios. Additionally, we provide a step-by-step tutorial using real data to illustrate PAM's application and discuss its theoretical implications. PAM represents a significant advancement in the computational modeling of decision-making, bridging the gap between predictive brain theories and EAMs, and offers a promising tool for future empirical research.
- [16] arXiv:2411.13356 [pdf, html, other]
Title: Optimal Designs for Spherical Harmonic Regression
Subjects: Applications (stat.AP)
This short paper is concerned with the use of spherical t-designs as optimal designs for the spherical harmonic regression model in three dimensions over a range of specified criteria. The nature of the designs is explored and their availability and suitability is reviewed.
- [17] arXiv:2411.13370 [pdf, html, other]
Title: Analysis of Higher Education Dropouts Dynamics through Multilevel Functional Decomposition of Recurrent Events in Counting Processes
Subjects: Applications (stat.AP)
This paper analyzes the dynamics of higher education dropouts through an innovative approach that integrates recurrent events modeling and point process theory with functional data analysis. We propose a novel methodology that extends existing frameworks to accommodate hierarchical data structures, demonstrating its potential through a simulation study. Using administrative data from student careers at Politecnico di Milano, we explore dropout patterns during the first year across different bachelor's degree programs and schools. Specifically, we employ Cox-based recurrent event models, treating dropouts as repeated occurrences within both programs and schools. Additionally, we apply functional modeling of recurrent events and multilevel principal component analysis to disentangle latent effects associated with degree programs and schools, identifying critical periods of dropout risk and providing valuable insights for institutions seeking to implement strategies aimed at reducing dropout rates.
- [18] arXiv:2411.13396 [pdf, html, other]
Title: Sensitivity Analysis on Policy-Augmented Graphical Hybrid Models with Shapley Value Estimation
Subjects: Machine Learning (stat.ML); Computation (stat.CO)
Driven by the critical challenges in biomanufacturing, including high complexity and high uncertainty, we propose a comprehensive and computationally efficient sensitivity analysis framework for general nonlinear policy-augmented knowledge graphical (pKG) hybrid models that characterize the risk- and science-based understandings of underlying stochastic decision process mechanisms. The criticality of each input (i.e., random factors, policy parameters, and model parameters) is measured by applying Shapley value (SV) sensitivity analysis to pKG (called SV-pKG), accounting for process causal interdependences. To quickly assess the SV for heavily instrumented bioprocesses, we approximate their dynamics with linear Gaussian pKG models and improve the SV estimation efficiency by utilizing the linear Gaussian properties. In addition, we propose an effective permutation sampling method with TFWW transformation and variance reduction techniques, namely the quasi-Monte Carlo and antithetic sampling methods, to further improve the sampling efficiency and estimation accuracy of SV for both general nonlinear and linear Gaussian pKG models. Our proposed framework can benefit efficient interpretation and support stable optimal process control in biomanufacturing.
- [19] arXiv:2411.13432 [pdf, html, other]
Title: Spatial error models with heteroskedastic normal perturbations and joint modeling of mean and variance
Subjects: Methodology (stat.ME)
This work presents the spatial error model with heteroskedasticity, which allows the joint modeling of the parameters associated with both the mean and the variance, within a traditional approach to spatial econometrics. The estimation algorithm is based on the log-likelihood function and uses GAMLSS models iteratively. Two theoretical results show the advantages of the model over the usual models of spatial econometrics and allow obtaining the bias of weighted least squares estimators. The proposed methodology is tested through simulations, showing notable results in terms of the ability to recover all parameters and the consistency of its estimates. Finally, this model is applied to identify the factors associated with school desertion in Colombia.
- [20] arXiv:2411.13479 [pdf, html, other]
Title: Conformal Prediction for Hierarchical Data
Authors: Guillaume Principato, Yvenn Amara-Ouali, Yannig Goude, Bachir Hamrouche, Jean-Michel Poggi, Gilles Stoltz
Comments: 14 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Reconciliation has become an essential tool in multivariate point forecasting for hierarchical time series. However, there is still a lack of understanding of the theoretical properties of probabilistic Forecast Reconciliation techniques. Meanwhile, Conformal Prediction is a general framework with growing appeal that provides prediction sets with probabilistic guarantees in finite sample. In this paper, we propose a first step towards combining Conformal Prediction and Forecast Reconciliation by analyzing how including a reconciliation step in the Split Conformal Prediction (SCP) procedure enhances the resulting prediction sets. In particular, we show that the validity granted by SCP remains while improving the efficiency of the prediction sets. We also advocate a variation of the theoretical procedure for practical use. Finally, we illustrate these results with simulations.
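The plain Split Conformal Prediction (SCP) procedure that the abstract above builds on can be sketched as follows. This shows only the standard calibration step with absolute-residual scores; it contains no reconciliation step and no hierarchy, and the toy model and data are assumptions made for illustration.

```python
import math
import random

def split_conformal_radius(cal_residuals, alpha):
    """Half-width of the SCP interval: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(cal_residuals)[k - 1]

def predict(x):
    """Stand-in for a model fitted on a separate training split (an assumption)."""
    return 2.0 * x

rng = random.Random(0)

def draw():
    x = rng.uniform(0, 1)
    return x, 2.0 * x + rng.gauss(0, 0.5)

# Calibration split: absolute residuals of the fixed model.
calibration = [draw() for _ in range(1000)]
radius = split_conformal_radius([abs(y - predict(x)) for x, y in calibration], alpha=0.1)

# Empirical coverage of [predict(x) - radius, predict(x) + radius] on fresh data.
test_points = [draw() for _ in range(5000)]
coverage = sum(abs(y - predict(x)) <= radius for x, y in test_points) / len(test_points)
print(round(radius, 3), round(coverage, 3))
```

The finite-sample guarantee is marginal over the calibration draw, so the empirical coverage on one run fluctuates around the 90% target rather than matching it exactly.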
- [21] arXiv:2411.13542 [pdf, html, other]
Title: The Rényi Outlier Test
Comments: 4 pages
Subjects: Methodology (stat.ME)
Cox and Kartsonaki proposed a simple outlier test for a vector of p-values based on the Rényi transformation that is fast for large $p$ and numerically stable for very small p-values -- key properties for large data analysis. We propose and implement a generalization of this procedure we call the Rényi Outlier Test (ROT). This procedure maintains the key properties of the original but is much more robust to uncertainty in the number of outliers expected a priori among the p-values. The ROT can also account for two types of prior information that are common in modern data analysis. The first is the prior probability that a given p-value may be outlying. The second is an estimate of how far of an outlier a p-value might be, conditional on it being an outlier; in other words, an estimate of effect size. Using a series of pre-calculated spline functions, we provide a fast and numerically stable implementation of the ROT in our R package renyi.
New submissions (showing 21 of 21 entries)
- [22] arXiv:2401.07876 (cross-list from math.PR) [pdf, other]
Title: Characterization of the asymptotic behavior of $U$-statistics on row-column exchangeable matrices
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We consider $U$-statistics on row-column exchangeable matrices. We derive a decomposition for them, based on orthogonal projections on probability spaces generated by sets of Aldous-Hoover-Kallenberg variables. The specificity of these sets is that they are indexed by bipartite graphs, which allows for the use of concepts from graph theory to describe this decomposition. The decomposition is used to investigate the asymptotic behavior of $U$-statistics of row-column exchangeable matrices, including in degenerate cases. In particular, it depends only on a few terms of the decomposition, corresponding to the non-zero elements that are indexed by the smallest graphs, named principal support graphs, after an analogous concept suggested by Janson and Nowicki (1991). Hence, we show that the asymptotic behavior of a $U$-statistic and its degeneracy are characterized by the properties of its principal support graphs. Indeed, their number of nodes gives the convergence rate of a $U$-statistic to its limit distribution. Specifically, the latter is degenerate if and only if this number is strictly greater than 1. Finally, when the principal support graphs are connected, we find that the limit distribution is Gaussian, even in degenerate cases.
- [23] arXiv:2411.12753 (cross-list from q-fin.TR) [pdf, html, other]
Title: Supervised Autoencoders with Fractionally Differentiated Features and Triple Barrier Labelling Enhance Predictions on Noisy Data
Comments: arXiv admin note: substantial text overlap with arXiv:2404.01866
Subjects: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
This paper investigates the enhancement of financial time series forecasting with the use of neural networks through supervised autoencoders (SAE), to improve investment strategy performance. Using the Sharpe and Information Ratios, it specifically examines the impact of noise augmentation and triple barrier labeling on risk-adjusted returns. The study focuses on Bitcoin, Litecoin, and Ethereum as the traded assets from January 1, 2016, to April 30, 2022. Findings indicate that supervised autoencoders, with balanced noise augmentation and bottleneck size, significantly boost strategy effectiveness. However, excessive noise and large bottleneck sizes can impair performance.
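The title of this entry mentions fractionally differentiated features. A common way to compute them is a fixed-window filter whose weights come from the binomial expansion of (1 - B)^d, where B is the backshift operator. The sketch below shows that construction; whether the paper uses this exact variant, window length, or order `d` is not stated in the abstract, so all of these are assumptions.

```python
def fracdiff_weights(d, size):
    """First `size` weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return w

def fracdiff(series, d, window):
    """Fixed-window fractional difference; drops the first window - 1 points."""
    w = fracdiff_weights(d, window)
    return [
        sum(w[k] * series[t - k] for k in range(window))
        for t in range(window - 1, len(series))
    ]

print(fracdiff_weights(0.5, 4))
print(fracdiff([1.0, 3.0, 6.0, 10.0], d=1.0, window=3))
```

With `d = 1` the weights reduce to `[1, -1, 0, ...]`, so the filter recovers the ordinary first difference, while fractional `d` keeps a slowly decaying memory of past values.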
- [24] arXiv:2411.12843 (cross-list from cs.LG) [pdf, html, other]
Title: Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., the label being either Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as "tied" between the two responses) and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to any arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct policy optimization (DPO). In particular, the theoretical analysis may be of independent interest when applying to a seemingly unrelated problem of knowledge distillation to interpret the bias-variance trade-off therein. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning for both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
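To make the binary Bradley-Terry setup in the abstract above concrete, the sketch below computes the BT preference probability and a cross-entropy loss with a soft label `z` in [0, 1]. Using `z = 1` or `z = 0` gives the usual binary BT loss. Treating intermediate values such as `z = 0.5` for a tie as the way ordinal feedback enters is an illustrative assumption here, not a statement of the paper's exact objective.

```python
import math

def bt_prob(r1, r2):
    """Bradley-Terry probability that response 1 is preferred, given rewards r1, r2."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

def soft_label_loss(r1, r2, z):
    """Cross-entropy between label z (probability that 1 is better) and the BT probability."""
    p = bt_prob(r1, r2)
    return -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))

# Binary feedback: response 1 preferred.
print(round(soft_label_loss(1.0, 0.0, z=1.0), 4))
# Hypothetical ordinal feedback: a tie pulls the two rewards together.
print(round(soft_label_loss(1.0, 0.0, z=0.5), 4))
print(round(soft_label_loss(0.0, 0.0, z=0.5), 4))
```

Under this loss a tied pair is fit best when the two rewards are equal, which is one way a sample that binary feedback would discard can still carry signal.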
- [25] arXiv:2411.12845 (cross-list from econ.GN) [pdf, html, other]
Title: Underlying Core Inflation with Multiple Regimes
Subjects: General Economics (econ.GN); Applications (stat.AP); Methodology (stat.ME)
This paper introduces a new approach for estimating core inflation indicators based on common factors across a broad range of price indices. Specifically, by utilizing procedures for detecting multiple regimes in high-dimensional factor models, we propose two types of core inflation indicators: one incorporating multiple structural breaks and another based on Markov switching. The structural breaks approach can eliminate revisions for past regimes, though it functions as an offline indicator, as real-time detection of breaks is not feasible with this method. On the other hand, the Markov switching approach can reduce revisions while being useful in real time, making it a simple and robust core inflation indicator suitable for real-time monitoring and as a short-term guide for monetary policy. Additionally, this approach allows us to estimate the probability of being in different inflationary regimes. To demonstrate the effectiveness of these indicators, we apply them to Canadian price data. To compare the real-time performance of the Markov switching approach to the benchmark model without regime-switching, we assess their abilities to forecast headline inflation and minimize revisions. We find that the Markov switching model delivers superior predictive accuracy and significantly reduces revisions during periods of substantial inflation changes. Hence, our findings suggest that accounting for time-varying factors and parameters enhances inflation signal accuracy and reduces data requirements, especially following sudden economic shifts.
- [26] arXiv:2411.12886 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: NPGPT: Natural Product-Like Compound Generation with GPT-based Chemical Language Models
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Natural products are substances produced by organisms in nature and often possess biological activity and structural diversity. Drug development based on natural products has been common for many years. However, the intricate structures of these compounds present challenges in terms of structure determination and synthesis, particularly compared to the efficiency of high-throughput screening of synthetic compounds. In recent years, deep learning-based methods have been applied to the generation of molecules. In this study, we trained chemical language models on a natural product dataset and generated natural product-like compounds. The results showed that the distribution of the compounds generated was similar to that of natural products. We also evaluated the effectiveness of the generated compounds as drug candidates. Our method can be used to explore the vast chemical space and reduce the time and cost of drug discovery of natural products.
- [27] arXiv:2411.12925 (cross-list from cs.LG) [pdf, html, other]
-
Title: Loss-to-Loss Prediction: Scaling Laws for All Datasets
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
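To make the idea of a shifted power law between two losses concrete, here is a minimal sketch. It assumes a relation of the form L_b ≈ K · (L_a − E_a)^κ + E_b, where E_a and E_b are irreducible-loss offsets that are taken as known. That functional form, the parameter names, and the fitting procedure (least squares in log space) are assumptions for illustration and may differ from what the paper does.

```python
import math


def shifted_power_law(loss_a, K, kappa, e_a, e_b):
    """Predict loss_b from loss_a under the assumed shifted power law."""
    return K * (loss_a - e_a) ** kappa + e_b


def fit_shifted_power_law(losses_a, losses_b, e_a, e_b):
    """Fit K and kappa by ordinary least squares in log space.

    Assumes the offsets e_a and e_b are known and that every loss
    strictly exceeds its offset.
    """
    xs = [math.log(a - e_a) for a in losses_a]
    ys = [math.log(b - e_b) for b in losses_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    kappa = sxy / sxx
    K = math.exp(mean_y - kappa * mean_x)
    return K, kappa


if __name__ == "__main__":
    # Synthetic paired losses generated from the assumed relation.
    la = [3.0, 2.5, 2.0, 1.6]
    lb = [shifted_power_law(a, 1.5, 0.8, 1.0, 1.2) for a in la]
    print(fit_shifted_power_law(la, lb, 1.0, 1.2))
```

In practice the offsets would also have to be estimated, which turns the fit into a nonlinear problem; the log-space regression above only covers the simpler case.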
- [28] arXiv:2411.13028 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Theory for Compressibility of Graph Transformers for Transductive Learning
Hamed Shirzad, Honghao Lin, Ameya Velingker, Balaji Venkatachalam, David Woodruff, Danica Sutherland
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are present during training, making them more akin to a semi-supervised task. These differences make the analysis of the models substantially different from other models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been extensively discussed, the hidden dimension or width of the network has received less attention. In this work, we establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.
- [29] arXiv:2411.13029 (cross-list from cs.LG) [pdf, html, other]
-
Title: Probably Approximately Precision and Recall Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Precision and Recall are foundational metrics in machine learning where both accurate predictions and comprehensive coverage are essential, such as in recommender systems and multi-label learning. In these tasks, balancing precision (the proportion of relevant items among those predicted) and recall (the proportion of relevant items successfully predicted) is crucial. A key challenge is that one-sided feedback--where only positive examples are observed during training--is inherent in many practical problems. For instance, in recommender systems like YouTube, training data only consists of videos that a user has actively selected, while unselected items remain unseen. Despite this lack of negative feedback in training, avoiding undesirable recommendations at test time is essential.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions, such as between users and items. This framework subsumes the classical binary and multi-class PAC learning models as well as multi-label learning with partial feedback, where only a single random correct label per example is observed, rather than all correct labels.
Our work uncovers a rich statistical and algorithmic landscape, with nuanced boundaries on what can and cannot be learned. Notably, classical methods like Empirical Risk Minimization fail in this setting, even for simple hypothesis classes with only two hypotheses. To address these challenges, we develop novel algorithms that learn exclusively from positive data, effectively minimizing both precision and recall losses. Specifically, in the realizable setting, we design algorithms that achieve optimal sample complexity guarantees. In the agnostic case, we show that it is impossible to achieve additive error guarantees--as is standard in PAC learning--and instead obtain meaningful multiplicative approximations.
- [30] arXiv:2411.13083 (cross-list from cs.LG) [pdf, html, other]
-
Title: Omnipredicting Single-Index Models with Multi-Index Models
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
Recent work on supervised learning [GKR+22] defined the notion of omnipredictors, i.e., predictor functions $p$ over features that are simultaneously competitive for minimizing a family of loss functions $\mathcal{L}$ against a comparator class $\mathcal{C}$. Omniprediction requires approximating the Bayes-optimal predictor beyond the loss minimization paradigm, and has generated significant interest in the learning theory community. However, even for basic settings such as agnostically learning single-index models (SIMs), existing omnipredictor constructions require impractically-large sample complexities and runtimes, and output complex, highly-improper hypotheses.
Our main contribution is a new, simple construction of omnipredictors for SIMs. We give a learner outputting an omnipredictor that is $\varepsilon$-competitive on any matching loss induced by a monotone, Lipschitz link function, when the comparator class is bounded linear predictors. Our algorithm requires $\approx \varepsilon^{-4}$ samples and runs in nearly-linear time, and its sample complexity improves to $\approx \varepsilon^{-2}$ if link functions are bi-Lipschitz. This significantly improves upon the only prior known construction, due to [HJKRR18, GHK+23], which used $\gtrsim \varepsilon^{-10}$ samples.
We achieve our construction via a new, sharp analysis of the classical Isotron algorithm [KS09, KKKS11] in the challenging agnostic learning setting, of potential independent interest. Previously, Isotron was known to properly learn SIMs in the realizable setting, as well as constant-factor competitive hypotheses under the squared loss [ZWDD24]. As they are based on Isotron, our omnipredictors are multi-index models with $\approx \varepsilon^{-2}$ prediction heads, bringing us closer to the tantalizing goal of proper omniprediction for general loss families and comparators.
- [31] arXiv:2411.13169 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Unified Analysis for Finite Weight Averaging
Comments: 34 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Averaging iterations of Stochastic Gradient Descent (SGD) have achieved empirical success in training deep learning models, such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA). Especially, with a finite weight averaging method, LAWA can attain faster convergence and better generalization. However, its theoretical explanation is still less explored since there are fundamental differences between finite and infinite settings. In this work, we first generalize SGD and LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization. A key challenge is the inapplicability of traditional methods in the sense of expectation or optimal values for infinite-dimensional settings in analyzing FWA's convergence. Second, the cumulative gradients introduced by FWA introduce additional confusion to the generalization analysis, especially making it more difficult to discuss them under different assumptions. Extending the final iteration convergence analysis to the FWA, this paper, under a convexity assumption, establishes a convergence bound $\mathcal{O}(\log\left(\frac{T}{k}\right)/\sqrt{T})$, where $k\in[1, T/2]$ is a constant representing the last $k$ iterations. Compared to SGD with $\mathcal{O}(\log(T)/\sqrt{T})$, we prove theoretically that FWA has a faster convergence rate and explain the effect of the number of average points. In the generalization analysis, we find a recursive representation for bounding the cumulative gradient using mathematical induction. We provide bounds for constant and decay learning rates and the convex and non-convex cases to show the good generalization performance of FWA. Finally, experimental results on several benchmarks verify our theoretical results.
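The two ingredients mentioned in the abstract, averaging the last $k$ iterates and comparing the rates $\log(T/k)/\sqrt{T}$ and $\log(T)/\sqrt{T}$, can be sketched as follows. This is only an illustration: the function names are hypothetical, constants hidden by the $\mathcal{O}(\cdot)$ notation are dropped, and the averaging is a plain uniform mean over the last $k$ weight vectors.

```python
import math


def finite_weight_average(iterates, k):
    """Uniformly average the last k iterates (each a list of floats)."""
    if not 1 <= k <= len(iterates):
        raise ValueError("k must satisfy 1 <= k <= len(iterates)")
    tail = iterates[-k:]
    dim = len(tail[0])
    return [sum(w[i] for w in tail) / k for i in range(dim)]


def fwa_rate(T, k):
    """Bound from the abstract for FWA, with constants dropped."""
    return math.log(T / k) / math.sqrt(T)


def sgd_rate(T):
    """Bound from the abstract for last-iterate SGD, with constants dropped."""
    return math.log(T) / math.sqrt(T)


if __name__ == "__main__":
    weights = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
    print(finite_weight_average(weights, 2))
    for k in (1, 10, 100, 5000):
        print(k, fwa_rate(10_000, k), sgd_rate(10_000))
```

With $k = 1$ the two expressions coincide, and the FWA expression shrinks as $k$ grows toward $T/2$, which matches the abstract's claim about the effect of the number of averaged points.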
- [32] arXiv:2411.13361 (cross-list from physics.comp-ph) [pdf, html, other]
-
Title: Integration of Active Learning and MCMC Sampling for Efficient Bayesian Calibration of Mechanical Properties
Comments: 28 pages, 14 figures
Subjects: Computational Physics (physics.comp-ph); Applications (stat.AP); Machine Learning (stat.ML)
Recent advancements in Markov chain Monte Carlo (MCMC) sampling and surrogate modelling have significantly enhanced the feasibility of Bayesian analysis across engineering fields. However, the selection and integration of surrogate models and cutting-edge MCMC algorithms, often depend on ad-hoc decisions. A systematic assessment of their combined influence on analytical accuracy and efficiency is notably lacking. The present work offers a comprehensive comparative study, employing a scalable case study in computational mechanics focused on the inference of spatially varying material parameters, that sheds light on the impact of methodological choices for surrogate modelling and sampling. We show that a priori training of the surrogate model introduces large errors in the posterior estimation even in low to moderate dimensions. We introduce a simple active learning strategy based on the path of the MCMC algorithm that is superior to all a priori trained models, and determine its training data requirements. We demonstrate that the choice of the MCMC algorithm has only a small influence on the amount of training data but no significant influence on the accuracy of the resulting surrogate model. Further, we show that the accuracy of the posterior estimation largely depends on the surrogate model, but not even a tailored surrogate guarantees convergence of the this http URL, we identify the forward model as the bottleneck in the inference process, not the MCMC algorithm. While related works focus on employing advanced MCMC algorithms, we demonstrate that the training data requirements render the surrogate modelling approach infeasible before the benefits of these gradient-based MCMC algorithms on cheap models can be reaped.
- [33] arXiv:2411.13392 (cross-list from math.AG) [pdf, html, other]
-
Title: Classification of real hyperplane singularities by real log canonical thresholds
Subjects: Algebraic Geometry (math.AG); Commutative Algebra (math.AC); Statistics Theory (math.ST)
The log canonical threshold (lct) is a fundamental invariant in birational geometry, essential for understanding the complexity of singularities in algebraic varieties. Its real counterpart, the real log canonical threshold (rlct), also known as the learning coefficient, has become increasingly relevant in statistics and machine learning, where it plays a critical role in model selection and error estimation for singular statistical models. In this paper, we investigate the rlct and its multiplicity for real (not necessarily reduced) hyperplane arrangements. We derive explicit combinatorial formulas for these invariants, generalizing earlier results that were limited to specific examples. Moreover, we provide a general algebraic theory for real log canonical thresholds, and present a SageMath implementation for efficiently computing the rlct and its multiplicity in the case of real hyperplane arrangements. Applications to examples are given, illustrating how the formulas can also be used to analyze the asymptotic behavior of high-dimensional volume integrals.
- [34] arXiv:2411.13443 (cross-list from math.NA) [pdf, other]
-
Title: Nonlinear Assimilation with Score-based Sequential Langevin Sampling
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
This paper presents a novel approach for nonlinear assimilation called score-based sequential Langevin sampling (SSLS) within a recursive Bayesian framework. SSLS decomposes the assimilation process into a sequence of prediction and update steps, utilizing dynamic models for prediction and observation data for updating via score-based Langevin Monte Carlo. An annealing strategy is incorporated to enhance convergence and facilitate multi-modal sampling. The convergence of SSLS in TV-distance is analyzed under certain conditions, providing insights into error behavior related to hyper-parameters. Numerical examples demonstrate its outstanding performance in high-dimensional and nonlinear scenarios, as well as in situations with sparse or partial measurements. Furthermore, SSLS effectively quantifies the uncertainty associated with the estimated states, highlighting its potential for error calibration.
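For readers unfamiliar with the Langevin Monte Carlo update that the method builds on, the sketch below runs a plain unadjusted Langevin chain, $x \leftarrow x + \eta\,\nabla \log p(x) + \sqrt{2\eta}\,\xi$, on a one-dimensional Gaussian target whose score is known in closed form. It is not SSLS: there is no learned score, no prediction/update recursion, and no annealing. The target, step size, and function names are assumptions chosen for illustration.

```python
import math
import random


def gaussian_score(x, mean=2.0, var=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian."""
    return -(x - mean) / var


def langevin_sample(score, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics; returns the whole trajectory."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score plus Gaussian noise scaled by sqrt(2 * step).
        x = x + step * score(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    chain = langevin_sample(gaussian_score, 0.0, 0.05, 20_000, rng)
    kept = chain[1_000:]  # discard burn-in
    print(sum(kept) / len(kept))
```

In a score-based setting the closed-form `gaussian_score` would be replaced by a learned score network, which is the part this sketch leaves out.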
- [35] arXiv:2411.13462 (cross-list from cs.DS) [pdf, html, other]
-
Title: Sampling and Integration of Logconcave Functions by Algorithmic Diffusion
Comments: 60 pages, 1 figure
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
We study the complexity of sampling, rounding, and integrating arbitrary logconcave functions. Our new approach provides the first complexity improvements in nearly two decades for general logconcave functions for all three problems, and matches the best-known complexities for the special case of uniform distributions on convex bodies. For the sampling problem, our output guarantees are significantly stronger than previously known, and lead to a streamlined analysis of statistical estimation based on dependent random samples.
Cross submissions (showing 14 of 14 entries)
- [36] arXiv:1710.06078 (replaced) [pdf, html, other]
-
Title: Estimate exponential memory decay in Hidden Markov Model and its applications
Subjects: Machine Learning (stat.ML); Methodology (stat.ME)
Inference in hidden Markov models has been challenging in terms of scalability due to dependencies in the observation data. In this paper, we utilize the inherent memory decay in hidden Markov models, such that the forward and backward probabilities can be computed with subsequences, enabling efficient inference over long sequences of observations. We formulate this forward filtering process in the setting of random dynamical systems, where Lyapunov exponents exist for the products of i.i.d. random matrices. The rate of the memory decay is given almost surely by $\lambda_2-\lambda_1$, the gap between the top two Lyapunov exponents. An efficient and accurate algorithm is proposed to numerically estimate the gap after the soft-max parametrization. The length of subsequences $B$ given the controlled error $\epsilon$ is $B=\log(\epsilon)/(\lambda_2-\lambda_1)$. We theoretically prove the validity of the algorithm and demonstrate its effectiveness with numerical examples. The method developed here can be applied to widely used algorithms, such as the mini-batch stochastic gradient method. Moreover, the continuity of the Lyapunov spectrum ensures that the estimated $B$ can be reused for nearby parameters during inference.
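The subsequence-length formula $B=\log(\epsilon)/(\lambda_2-\lambda_1)$ is easy to evaluate directly. The sketch below assumes that $\lambda_1$ is the largest Lyapunov exponent, so the gap $\lambda_2-\lambda_1$ is negative, and that $0 < \epsilon < 1$; it also rounds up to an integer length. The function name and the rounding are illustrative choices, not taken from the paper.

```python
import math


def subsequence_length(eps: float, lyapunov_gap: float) -> int:
    """Evaluate B = log(eps) / (lambda_2 - lambda_1), rounded up.

    eps: target error, assumed to lie in (0, 1).
    lyapunov_gap: lambda_2 - lambda_1, assumed negative.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if lyapunov_gap >= 0.0:
        raise ValueError("lyapunov_gap must be negative")
    return math.ceil(math.log(eps) / lyapunov_gap)


if __name__ == "__main__":
    for eps in (1e-2, 1e-3, 1e-6):
        print(eps, subsequence_length(eps, -0.5))
```

Because $B$ grows only logarithmically in $1/\epsilon$, tightening the error tolerance by several orders of magnitude lengthens the required subsequence only modestly.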
- [37] arXiv:2104.07773 (replaced) [pdf, other]
-
Title: Jointly Modeling and Clustering Tensors in High Dimensions
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the problem of jointly modeling and clustering populations of tensors by introducing a high-dimensional tensor mixture model with heterogeneous covariances. To effectively tackle the high dimensionality of tensor objects, we employ plausible dimension reduction assumptions that exploit the intrinsic structures of tensors such as low-rankness in the mean and separability in the covariance. In estimation, we develop an efficient high-dimensional expectation-conditional-maximization (HECM) algorithm that breaks the intractable optimization in the M-step into a sequence of much simpler conditional optimization problems, each of which is convex, admits regularization and has closed-form updating formulas. Our theoretical analysis is challenged by both the non-convexity in the EM-type estimation and having access to only the solutions of conditional maximizations in the M-step, leading to the notion of dual non-convexity. We demonstrate that the proposed HECM algorithm, with an appropriate initialization, converges geometrically to a neighborhood that is within statistical precision of the true parameter. The efficacy of our proposed method is demonstrated through comparative numerical experiments and an application to a medical study, where our proposal achieves an improved clustering accuracy over existing benchmarking methods.
- [38] arXiv:2111.07650 (replaced) [pdf, html, other]
-
Title: Joint FCLT for Sample Quantile and Measures of Dispersion for Functionals of Mixing Processes
Comments: 26 pages, 1 figure, 3 tables; Reworking of Section 4.2, mainly due to a correction of Lemma 15 (main results themselves remain unchanged)
Subjects: Statistics Theory (math.ST)
In this paper, we establish a joint (bivariate) functional central limit theorem of the sample quantile and the $r$-th absolute centred sample moment for functionals of mixing processes. More precisely, we consider $L_2$-near epoch dependent processes that are functionals of either $\phi$-mixing or absolutely regular processes. The general results we obtain can be used for two classes of popular and important processes in applications: The class of augmented GARCH($p$,$q$) processes with independent and identically distributed innovations (including many GARCH variations used in practice) and the class of ARMA($p$,$q$) processes with mixing innovations (including, e.g., ARMA-GARCH processes). For selected examples, we provide exact conditions on the moments and parameters of the process for the joint asymptotics to hold.
- [39] arXiv:2210.05558 (replaced) [pdf, html, other]
-
Title: Causal and Counterfactual Views of Missing Data Models
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
It is often said that the fundamental problem of causal inference is a missing data problem -- the comparison of responses to two hypothetical treatment assignments is made difficult because for every experimental unit only one potential response is observed. In this paper, we consider the implications of the converse view: that missing data problems are a form of causal inference. We make explicit how the missing data problem of recovering the complete data law from the observed law can be viewed as identification of a joint distribution over counterfactual variables corresponding to values had we (possibly contrary to fact) been able to observe them. Drawing analogies with causal inference, we show how identification assumptions in missing data can be encoded in terms of graphical models defined over counterfactual and observed variables. We review recent results in missing data identification from this viewpoint. In doing so, we note interesting similarities and differences between missing data and causal identification theories.
- [40] arXiv:2211.12692 (replaced) [pdf, other]
-
Title: Empirical Bayes estimation: When does $g$-modeling beat $f$-modeling in theory (and in practice)?
Comments: New results including general positive results on g-modeling and negative results on f-modeling; optimality of unregularized NPMLE; extension to compound setting
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Empirical Bayes (EB) is a popular framework for large-scale inference that aims to find data-driven estimators to compete with the Bayesian oracle that knows the true prior. Two principled approaches to EB estimation have emerged over the years: $f$-modeling, which constructs an approximate Bayes rule by estimating the marginal distribution of the data, and $g$-modeling, which estimates the prior from data and then applies the learned Bayes rule. For the Poisson model, the prototypical examples are the celebrated Robbins estimator and the nonparametric MLE (NPMLE), respectively. It has long been recognized in practice that the Robbins estimator, while being conceptually appealing and computationally simple, lacks robustness and can be easily derailed by ``outliers'', unlike the NPMLE which provides more stable and interpretable fit thanks to its Bayes form. On the other hand, not only do the existing theories shed little light on this phenomenon, but they all point to the opposite, as both methods have recently been shown optimal in terms of regret (excess over the Bayes risk) for compactly supported and subexponential priors.
In this paper we provide a theoretical justification for the superiority of $g$-modeling over $f$-modeling for heavy-tailed data by considering priors with bounded $p>1$th moment. We show that with mild regularization, any $g$-modeling method that is Hellinger rate-optimal in density estimation achieves an optimal total regret $\tilde \Theta(n^{\frac{3}{2p+1}})$; in particular, the special case of NPMLE succeeds without regularization. In contrast, there exists an $f$-modeling estimator whose density estimation rate is optimal but whose EB regret is suboptimal by a polynomial factor. These results show that the proper Bayes form provides a ``general recipe of success'' for optimal EB estimation that applies to all $g$-modeling (but not $f$-modeling) methods.
- [41] arXiv:2303.00301 (replaced) [pdf, html, other]
-
Title: Auxiliary MCMC and particle Gibbs samplers for parallelisable inference in latent dynamical systems
Comments: 32 pages (incl. references and TOC) + 9 pages appendix, 10 figures. Code also updated at this https URL. Additions include better background description and a discussion of failure modes of the different methods
Subjects: Computation (stat.CO); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
We study the problem of designing efficient exact MCMC algorithms for sampling from the full posterior distribution of high-dimensional (in the number of time steps and the dimension of the latent space) non-linear non-Gaussian latent dynamical models. Particle Gibbs, also known as conditional sequential Monte Carlo (SMC), constitutes the de facto golden standard to do so, but suffers from degeneracy problems when the dimension of the latent space increases. On the other hand, the routinely employed globally Gaussian-approximated (e.g., extended Kalman filtering) biased solutions are seldom used for this same purpose even though they are more robust than their SMC counterparts. In this article, we show how, by introducing auxiliary observation variables in the model, we can both implement efficient exact Kalman-based samplers for large state-space models, as well as dramatically improve the mixing speed of particle Gibbs algorithms when the dimension of the latent space increases. We demonstrate when and how we can parallelise these auxiliary samplers along the time dimension, resulting in algorithms that scale logarithmically with the number of time steps when implemented on graphics processing units (GPUs). Both algorithms are easily tuned and can be extended to accommodate sophisticated approximation techniques. We demonstrate the improved statistical and computational performance of our auxiliary samplers compared to state-of-the-art alternatives for high-dimensional (in both time and state space) latent dynamical systems.
- [42] arXiv:2307.16720 (replaced) [pdf, html, other]
-
Title: Clustering multivariate functional data using the epigraph and hypograph indices: a case study on Madrid air quality
Subjects: Methodology (stat.ME); Applications (stat.AP)
With the rapid growth of data generation, advancements in functional data analysis (FDA) have become essential, especially for approaches that handle multiple variables at the same time. This paper introduces a novel formulation of the epigraph and hypograph indices, along with their generalized expressions, specifically designed for multivariate functional data (MFD). These new definitions account for interrelationships between variables, enabling effective clustering of MFD based on the original data curves and their first two derivatives. The methodology developed here has been tested on simulated datasets, demonstrating strong performance compared to state-of-the-art methods. Its practical utility is further illustrated with two environmental datasets: the Canadian weather dataset and a 2023 air quality study in Madrid. These applications highlight the potential of the method as a great tool for analyzing complex environmental data, offering valuable insights for researchers and policymakers in climate and environmental research.
- [43] arXiv:2308.05858 (replaced) [pdf, other]
-
Title: Inconsistency and Acausality of Model Selection in Bayesian Inverse Problems
Comments: This article is withdrawn because it is obsolete after a new and significantly different article about Bayesian Inference by Klaus Mosegaard and Andrew Cutis, entitled "Inconsistency and Acausality in Bayesian Inference for Physical Problems", has been submitted. The new article reveals more profound and serious problems with Bayesian methods than exposed in the withdrawn article
Subjects: Methodology (stat.ME)
Bayesian inference paradigms are regarded as powerful tools for solution of inverse problems. However, when applied to inverse problems in physical sciences, Bayesian formulations suffer from a number of inconsistencies that are often overlooked. A well known, but mostly neglected, difficulty is connected to the notion of conditional probability densities. Borel, and later Kolmogorov (1933/1956), found that the traditional definition of conditional densities is incomplete: In different parameterizations it leads to different results. We will show an example where two apparently correct procedures applied to the same problem lead to two widely different results. Another type of inconsistency involves violation of causality. This problem is found in model selection strategies in Bayesian inversion, such as Hierarchical Bayes and Trans-Dimensional Inversion where so-called hyperparameters are included as variables to control either the number (or type) of unknowns, or the prior uncertainties on data or model parameters. For Hierarchical Bayes we demonstrate that the calculated 'prior' distributions of data or model parameters are not prior, but posterior information. In fact, the calculated 'standard deviations' of the data are a measure of the inability of the forward function to model the data, rather than uncertainties of the data. For trans-dimensional inverse problems we show that the so-called evidence is, in fact, not a measure of the success of fitting the data for the given choice (or number) of parameters, as often claimed. We also find that the notion of Natural Parsimony is ill-defined, because of its dependence on the parameter prior. Based on this study, we find that careful rethinking of Bayesian inversion practices is required, with special emphasis on ways of avoiding the Borel-Kolmogorov inconsistency, and on the way we interpret model selection results.
- [44] arXiv:2311.15485 (replaced) [pdf, html, other]
-
Title: Calibrated Generalized Bayesian Inference
Comments: This paper is a substantially revised version of arXiv:2302.06031v1. This revised version has a slightly different focus, additional examples, and theoretical results, as well as different authors
Subjects: Methodology (stat.ME)
We provide a simple and general solution for accurate uncertainty quantification of Bayesian inference in misspecified or approximate models, and for generalized posteriors more generally. While existing solutions are based on explicit Gaussian posterior approximations, or post-processing procedures, we demonstrate that correct uncertainty quantification can be achieved by substituting the usual posterior with an intuitively appealing alternative posterior that conveys the same information. This solution applies to both likelihood-based and loss-based posteriors, and we formally demonstrate the reliable uncertainty quantification of this approach. The new approach is demonstrated through a range of examples, including linear models, and doubly intractable models.
- [45] arXiv:2404.00221 (replaced) [pdf, other]
-
Title: Robust Learning for Optimal Dynamic Treatment Regimes with Observational Data
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST); Machine Learning (stat.ML)
Public policies and medical interventions often involve dynamics in their treatment assignments, where individuals receive a series of interventions over multiple stages. We study the statistical learning of optimal dynamic treatment regimes (DTRs) that guide the optimal treatment assignment for each individual at each stage based on the individual's evolving history. We propose a doubly robust, classification-based approach to learning the optimal DTR using observational data under the assumption of sequential ignorability. This approach learns the optimal DTR through backward induction. At each step, it constructs an augmented inverse probability weighting (AIPW) estimator of the policy value function and maximizes it to learn the optimal policy for the corresponding stage. We show that the resulting DTR can achieve an optimal convergence rate of $n^{-1/2}$ for welfare regret under mild convergence conditions on estimators of the nuisance components.
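The AIPW estimator of a policy value mentioned above can be illustrated in its simplest, single-stage form: the outcome-model prediction under the policy's action, plus an inverse-propensity-weighted residual whenever the observed action agrees with the policy. The sketch below is a textbook version under stated assumptions; the function signature, the toy data, and the nuisance functions are hypothetical, and the paper's multi-stage backward induction and policy maximization are not reproduced.

```python
def aipw_policy_value(data, policy, propensity, outcome_model):
    """Single-stage AIPW estimate of the value of `policy`.

    data: list of (x, a, y) tuples (covariate, observed action, outcome).
    policy(x): action the policy would take.
    propensity(x, a): estimated probability of action a given x.
    outcome_model(x, a): estimated mean outcome given x and a.
    """
    total = 0.0
    for x, a, y in data:
        target = policy(x)
        # Direct-method term: predicted outcome under the policy's action.
        value = outcome_model(x, target)
        # Correction term: weighted residual when the observed action matches.
        if a == target:
            value += (y - outcome_model(x, a)) / propensity(x, a)
        total += value
    return total / len(data)


if __name__ == "__main__":
    toy = [(0, 1, 1.0), (0, 0, 0.0), (1, 1, 3.0), (1, 0, 1.0)]
    always_treat = lambda x: 1
    half = lambda x, a: 0.5
    no_model = lambda x, a: 0.0          # reduces to plain IPW
    good_model = lambda x, a: a * (1.0 + 2.0 * x)
    print(aipw_policy_value(toy, always_treat, half, no_model))
    print(aipw_policy_value(toy, always_treat, half, good_model))
```

The double robustness referred to in the abstract shows up here in miniature: the estimate remains sensible if either the propensity or the outcome model is accurate, even when the other is not.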
- [46] arXiv:2405.19312 (replaced) [pdf, html, other]
-
Title: Design-based Causal Inference for Incomplete Block Designs
Subjects: Methodology (stat.ME)
Researchers often turn to block randomization to increase the precision of their inference or due to practical considerations, such as in multi-site trials. However, if the number of treatments under consideration is large it might not be practical or even feasible to assign all treatments within each block. We develop novel inference results under the finite-population design-based framework for natural alternatives to the complete block design that do not require reducing the number of treatment arms, the incomplete block design (IBD) and the balanced incomplete block design (BIBD). This includes deriving the properties of two estimators and proposing conservative variance estimators. To assist practitioners in understanding the trade-offs of using these designs, precision comparisons are made to standard estimators for the complete block, cluster-randomized, and completely randomized designs. Simulations and a data illustration further demonstrate the trade-offs. This work highlights IBDs as practical and currently underutilized designs.
- [47] arXiv:2406.05607 (replaced) [pdf, html, other]
-
Title: HAL-based Plugin Estimation of the Causal Dose-Response Curve
Subjects: Methodology (stat.ME); Applications (stat.AP)
Estimating the marginally adjusted dose-response curve for continuous treatments is a longstanding statistical challenge critical across multiple fields. In the context of parametric models, misspecification may result in substantial bias, hindering the accurate discernment of the true data-generating distribution and the associated dose-response curve. In contrast, non-parametric models face difficulties because the dose-response curve is not pathwise differentiable, so no $\sqrt{n}$-consistent estimator exists. The emergence of the Highly Adaptive Lasso (HAL) MLE by van der Laan [2015] and van der Laan [2017], together with the subsequent theoretical evidence by van der Laan [2023] regarding its pointwise asymptotic normality and uniform convergence rates, has highlighted the asymptotic efficacy of the HAL-based plug-in estimator for this intricate problem. This paper examines HAL-based plug-in estimators, including those with cross-validation and undersmoothing selectors, and introduces the undersmoothed smoothness-adaptive HAL-based plug-in estimator. We assess these estimators through extensive simulations, employing detailed evaluation metrics. Building upon the theoretical proofs in van der Laan [2023], our empirical findings underscore the asymptotic effectiveness of the undersmoothed smoothness-adaptive HAL-based plug-in estimator in estimating the marginally adjusted dose-response curve.
- [48] arXiv:2407.07072 (replaced) [pdf, html, other]
-
Title: Assumption Smuggling in Intermediate Outcome Tests of Causal Mechanisms
Comments: 43 pages, 5 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)
Political scientists are increasingly interested in assessing causal mechanisms, or determining not just if a causal effect exists but also why it occurs. Even so, many researchers avoid formal causal mediation analyses due to their stringent assumptions, instead opting to explore causal mechanisms through what we call intermediate outcome tests. These tests estimate the effect of the treatment on one or more mediators and view such effects as suggestive evidence of a causal mechanism. In this paper, we use nonparametric bounding analysis to show that, without further assumptions, these tests can neither establish nor rule out the existence of a causal mechanism. To use intermediate outcome tests as a falsification test of causal mechanisms, researchers must make a very strong but rarely discussed monotonicity assumption. We develop a way to assess the plausibility of this monotonicity assumption and estimate our bounds for two recent experiments that use these tests.
- [49] arXiv:2410.00125 (replaced) [pdf, html, other]
-
Title: Relative Cumulative Residual Information Measure
Subjects: Methodology (stat.ME)
In this paper, we develop a relative cumulative residual information (RCRI) measure that quantifies the divergence between two survival functions. The dynamic relative cumulative residual information (DRCRI) measure is also introduced. We establish some characterization results under the proportional hazards model assumption. Additionally, we obtain non-parametric estimators of the RCRI and DRCRI measures based on a kernel density type estimator for the survival function. The effectiveness of the estimators is assessed through an extensive Monte Carlo simulation study. We use data from the third Gaia data release (Gaia DR3) to demonstrate the use of the proposed measure. For this study, we have collected epoch photometry data for the objects Gaia DR3 4111834567779557376 and Gaia DR3 5090605830056251776.
- [50] arXiv:2410.23525 (replaced) [pdf, other]
-
Title: On the consistency of bootstrap for matching estimators
Comments: This version simplifies the proof of Lemma 4.1, revises some notation, and corrects some minor typos in the proof
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM)
In a landmark paper, Abadie and Imbens (2008) showed that the naive bootstrap is inconsistent when applied to nearest neighbor matching estimators of the average treatment effect with a fixed number of matches. Since then, this finding has inspired numerous efforts to address the inconsistency issue, typically by employing alternative bootstrap methods. In contrast, this paper shows that the naive bootstrap is provably consistent for the original matching estimator, provided that the number of matches, $M$, diverges. The bootstrap inconsistency identified by Abadie and Imbens (2008) thus arises solely from the use of a fixed $M$.
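To make the objects in this abstract concrete, here is a minimal sketch of a nearest-neighbor matching estimator of the average treatment effect with `M` matches, together with a naive (resample the rows) bootstrap. It assumes a single scalar covariate and matching with replacement. The function names `matching_ate` and `naive_bootstrap` are hypothetical, and the rate at which `M` must grow for consistency is a result of the paper that this sketch does not encode.

```python
import numpy as np

def matching_ate(x, d, y, M):
    """Nearest-neighbor matching estimate of the ATE (scalar covariate, with replacement)."""
    x, d, y = np.asarray(x, float), np.asarray(d), np.asarray(y, float)
    diffs = np.empty(len(y))
    for i in range(len(y)):
        # Candidate matches are the units in the opposite treatment arm.
        opp = np.flatnonzero(d != d[i])
        order = np.argsort(np.abs(x[opp] - x[i]), kind="stable")
        imputed = y[opp[order[:M]]].mean()  # average outcome of the M closest
        # Observed outcome minus imputed counterfactual, signed as Y(1) - Y(0).
        diffs[i] = y[i] - imputed if d[i] == 1 else imputed - y[i]
    return float(diffs.mean())

def naive_bootstrap(x, d, y, M, B, rng):
    """Recompute the matching estimate on B samples drawn with replacement."""
    x, d, y = np.asarray(x, float), np.asarray(d), np.asarray(y, float)
    n, draws = len(y), []
    while len(draws) < B:
        idx = rng.integers(0, n, size=n)
        if d[idx].min() == d[idx].max():
            continue  # skip resamples that contain only one treatment arm
        draws.append(matching_ate(x[idx], d[idx], y[idx], M))
    return np.array(draws)
```

The spread of the returned draws (for example `naive_bootstrap(...).std()`) is the naive bootstrap standard error whose validity depends on whether `M` is fixed or diverging.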
- [51] arXiv:2411.12578 (replaced) [pdf, html, other]
-
Title: Robust Inference for High-dimensional Linear Models with Heavy-tailed Errors via Partial Gini Covariance
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
This paper introduces the partial Gini covariance, a novel dependence measure that addresses the challenges of high-dimensional inference with heavy-tailed errors, often encountered in fields like finance, insurance, climate, and biology. Conventional high-dimensional regression inference methods suffer from inaccurate type I errors and reduced power in heavy-tailed contexts, limiting their effectiveness. Our proposed approach leverages the partial Gini covariance to construct a robust statistical inference framework that requires minimal tuning and does not impose restrictive moment conditions on error distributions. Unlike traditional methods, it circumvents the need for estimating the density of random errors and enhances the computational feasibility and robustness. Extensive simulations demonstrate the proposed method's superior power and robustness over standard high-dimensional inference approaches, such as those based on the debiased Lasso. The asymptotic relative efficiency analysis provides additional theoretical insight on the improved efficiency of the new approach in the heavy-tailed setting. Additionally, the partial Gini covariance extends to the multivariate setting, enabling chi-square testing for a group of coefficients. We illustrate the method's practical application with a real-world data example.
- [52] arXiv:2007.01930 (replaced) [pdf, html, other]
-
Title: Integrating Neural Networks and Dictionary Learning for Multidimensional Clinical Characterizations from Functional Connectomics Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a unified optimization framework that combines neural networks with dictionary learning to model complex interactions between resting state functional MRI and behavioral data. The dictionary learning objective decomposes patient correlation matrices into a collection of shared basis networks and subject-specific loadings. These subject-specific features are simultaneously input into a neural network that predicts multidimensional clinical information. Our novel optimization framework combines the gradient information from the neural network with that of a conventional matrix factorization objective. This procedure collectively estimates the basis networks, subject loadings, and neural network weights most informative of clinical severity. We evaluate our combined model on a multi-score prediction task using 52 patients diagnosed with Autism Spectrum Disorder (ASD). Our integrated framework outperforms state-of-the-art methods in a ten-fold cross validated setting to predict three different measures of clinical severity.
- [53] arXiv:2301.13105 (replaced) [pdf, html, other]
-
Title: Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Comments: extended JMLR version of the original ICML 2023 paper
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation for the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can potentially be corrected when undesirable.
- [54] arXiv:2304.08184 (replaced) [pdf, other]
-
Title: Adjustment with Many Regressors Under Covariate-Adaptive Randomizations
Comments: 92 pages, including appendix
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
Our paper discovers a new trade-off of using regression adjustments (RAs) in causal inference under covariate-adaptive randomizations (CARs). On one hand, RAs can improve the efficiency of causal estimators by incorporating information from covariates that are not used in the randomization. On the other hand, RAs can degrade estimation efficiency due to their estimation errors, which are not asymptotically negligible when the number of regressors is of the same order as the sample size. Ignoring the estimation errors of RAs may result in serious over-rejection of causal inference under the null hypothesis. To address the issue, we construct a new ATE estimator by optimally linearly combining the estimators with and without RAs. We then develop a unified inference theory for this estimator under CARs. It has two features: (1) the Wald test based on it achieves the exact asymptotic size under the null hypothesis, regardless of whether the number of covariates is fixed or diverges no faster than the sample size; and (2) it guarantees weak efficiency improvement over estimators both with and without RAs.
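The idea of optimally linearly combining two estimators can be illustrated with the textbook minimum-variance combination below. This is only a sketch under the assumption that the variances `v1`, `v2` and covariance `c` of the two estimators are known or consistently estimated. The function name `combine_min_variance` is hypothetical, and the paper's actual weights and variance estimators under covariate-adaptive randomization may differ.

```python
def combine_min_variance(t1, t2, v1, v2, c):
    """Combine two estimates t1, t2 of the same parameter with weights w and 1 - w.

    v1, v2 are their variances and c their covariance. The weight
    w = (v2 - c) / (v1 + v2 - 2c) minimizes the variance of w*t1 + (1-w)*t2.
    """
    denom = v1 + v2 - 2.0 * c  # equals Var(t1 - t2), so it is non-negative
    # If the two estimators are (almost surely) identical, any weight works.
    w = 0.5 if denom <= 0.0 else (v2 - c) / denom
    estimate = w * t1 + (1.0 - w) * t2
    variance = w**2 * v1 + (1.0 - w) ** 2 * v2 + 2.0 * w * (1.0 - w) * c
    return estimate, variance, w
```

Because `w = 1` and `w = 0` are both admissible, the minimized variance can never exceed `min(v1, v2)`, which mirrors the "weak efficiency improvement" over either estimator alone.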
- [55] arXiv:2305.09957 (replaced) [pdf, html, other]
-
Title: Quantum neural networks form Gaussian processes
Comments: 14+37 pages, 4+6 figures
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
It is well known that artificial neural networks initialized from independent and identically distributed priors converge to Gaussian processes in the limit of a large number of neurons per hidden layer. In this work we prove an analogous result for Quantum Neural Networks (QNNs). Namely, we show that the outputs of certain models based on Haar random unitary or orthogonal deep QNNs converge to Gaussian processes in the limit of large Hilbert space dimension $d$. The derivation of this result is more nuanced than in the classical case due to the role played by the input states, the measurement observable, and the fact that the entries of unitary matrices are not independent. Then, we show that the efficiency of predicting measurements at the output of a QNN using Gaussian process regression depends on the observable's bodyness. Furthermore, our theorems imply that the concentration of measure phenomenon in Haar random QNNs is worse than previously thought, as we prove that expectation values and gradients concentrate as $\mathcal{O}\left(\frac{1}{e^d \sqrt{d}}\right)$. Finally, we discuss how our results improve our understanding of concentration in $t$-designs.
- [56] arXiv:2309.01837 (replaced) [pdf, html, other]
-
Title: Delegating Data Collection in Decentralized Machine Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Motivated by the emergence of decentralized machine learning (ML) ecosystems, we study the delegation of data collection. Taking the field of contract theory as our starting point, we design optimal and near-optimal contracts that deal with two fundamental information asymmetries that arise in decentralized ML: uncertainty in the assessment of model quality and uncertainty regarding the optimal performance of any model. We show that a principal can cope with such asymmetry via simple linear contracts that achieve a $1-1/e$ fraction of the optimal utility. To address the lack of a priori knowledge regarding the optimal performance, we give a convex program that can adaptively and efficiently compute the optimal contract. We also study linear contracts and derive the optimal utility in the more complex setting of multiple interactions.
- [57] arXiv:2404.13986 (replaced) [pdf, html, other]
-
Title: Stochastic Volatility in Mean: Efficient Analysis by a Generalized Mixture Sampler
Comments: 38 pages, 11 figures, 14 tables
Subjects: Econometrics (econ.EM); Mathematical Finance (q-fin.MF); Applications (stat.AP); Computation (stat.CO)
In this paper we consider the simulation-based Bayesian analysis of stochastic volatility in mean (SVM) models. Extending the highly efficient Markov chain Monte Carlo mixture sampler for the SV model proposed in Kim et al. (1998) and Omori et al. (2007), we develop an accurate approximation of the non-central chi-squared distribution as a mixture of thirty normal distributions. Under this mixture representation, we sample the parameters and latent volatilities in one block. We also detail a correction of the small approximation error by using additional Metropolis-Hastings steps. The proposed method is extended to the SVM model with leverage. The methodology and models are applied to excess holding yields and S&P500 returns in empirical studies, and the SVM models are shown to outperform other volatility models based on marginal likelihoods.
- [58] arXiv:2410.07117 (replaced) [pdf, other]
-
Title: Classification of Buried Objects from Ground Penetrating Radar Images by using Second Order Deep Learning Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
In this paper, a new classification model based on covariance matrices is built in order to classify buried objects. The inputs of the proposed models are the hyperbola thumbnails obtained with a classical Ground Penetrating Radar (GPR) system. These thumbnails are then inputs to the first layers of a classical CNN, which then produces a covariance matrix using the outputs of the convolutional filters. Next, the covariance matrix is given to a network composed of specific layers to classify Symmetric Positive Definite (SPD) matrices. We show on a large database that our approach outperforms shallow networks designed for GPR data and conventional CNNs typically used in computer vision applications, particularly when the number of training data decreases and in the presence of mislabeled data. We also illustrate the value of our models when training data and test sets are obtained from different weather modes or considerations.
- [59] arXiv:2410.19774 (replaced) [pdf, other]
-
Title: Copula-Linked Parallel ICA: A Method for Coupling Structural and Functional MRI brain Networks
Oktay Agcaoglu, Rogers F. Silva, Deniz Alacam, Sergey Plis, Tulay Adali, Vince Calhoun (for the Alzheimers Disease Neuroimaging Initiative)
Comments: 25 pages, 10 figures, journal article
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
Different brain imaging modalities offer unique insights into brain function and structure. Combining them enhances our understanding of neural mechanisms. Prior multimodal studies fusing functional MRI (fMRI) and structural MRI (sMRI) have shown the benefits of this approach. Since sMRI lacks temporal data, existing fusion methods often compress fMRI temporal information into summary measures, sacrificing rich temporal dynamics. Motivated by the observation that covarying networks are identified in both sMRI and resting-state fMRI, we developed a novel fusion method, named copula-linked parallel ICA (CLiP-ICA), that combines deep learning frameworks, copulas, and independent component analysis (ICA). This method estimates independent sources for each modality and links the spatial sources of fMRI and sMRI using a copula-based model for more flexible integration of temporal and spatial data. We tested CLiP-ICA using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our results showed that CLiP-ICA effectively captures both strongly and weakly linked sMRI and fMRI networks, including the cerebellum, sensorimotor, visual, cognitive control, and default mode networks. It revealed more meaningful components and fewer artifacts, addressing the long-standing issue of optimal model order in ICA. CLiP-ICA also detected complex functional connectivity patterns across stages of cognitive decline, with cognitively normal subjects generally showing higher connectivity in sensorimotor and visual networks compared to patients with Alzheimer's disease, along with patterns suggesting potential compensatory mechanisms.
- [60] arXiv:2410.22498 (replaced) [pdf, html, other]
-
Title: The VIX as Stochastic Volatility for Corporate Bonds
Comments: 12 pages, 2 figures, 8 graphs. Keywords: stochastic volatility, ergodic Markov process, stationary distribution, autoregression, kurtosis
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)
Classic stochastic volatility models assume volatility is unobservable. We instead observe it through the S&P 500 Volatility Index (VIX), which makes the model easier to fit. We apply it to corporate bonds. We fit an autoregression for corporate rates and for risk spreads between these rates and Treasury rates. Next, we divide the residuals by the VIX. Our main idea is that this division makes the residuals closer to the ideal case of Gaussian white noise. This is remarkable, since these residuals and the VIX come from separate market segments. Similarly, we model corporate bond returns as a linear function of rates and rate changes. Our article has two main parts: Moody's AAA and BAA spreads; Bank of America investment-grade and high-yield rates, spreads, and returns. We analyze long-term stability of these models.
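The normalization step described in this abstract can be sketched on simulated data. The example below is an illustration only: it assumes an AR(1) series whose innovations are scaled by an observed volatility series (a stand-in for the VIX), fits the autoregression by ordinary least squares, and compares the excess kurtosis of the raw residuals with that of the residuals divided by the volatility. All parameter values and the helper names `fit_ar1`, `excess_kurtosis`, and `simulate_and_normalize` are hypothetical and not taken from the paper.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis (0 for a Gaussian)."""
    z = np.asarray(x, float) - np.mean(x)
    return float(np.mean(z**4) / np.mean(z**2) ** 2 - 3.0)

def fit_ar1(y):
    """OLS fit of y[t] = a + b * y[t-1] + residual; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return coef, y[1:] - X @ coef

def simulate_and_normalize(n=5000, seed=0):
    """Simulate an AR(1) with observed stochastic volatility, then normalize residuals."""
    rng = np.random.default_rng(seed)
    log_vol = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        # Hypothetical log-volatility dynamics standing in for the observed index.
        log_vol[t] = 0.9 * log_vol[t - 1] + 0.2 * rng.standard_normal()
        y[t] = 0.1 + 0.8 * y[t - 1] + np.exp(log_vol[t]) * rng.standard_normal()
    vol = np.exp(log_vol)
    _, resid = fit_ar1(y)
    # resid[k] belongs to time k + 1, so it is divided by vol[k + 1].
    return excess_kurtosis(resid), excess_kurtosis(resid / vol[1:])
```

Calling `simulate_and_normalize()` returns the pair (raw, normalized) excess kurtosis. In this simulated setting, dividing by the volatility removes the heavy tails because the volatility enters multiplicatively by construction. Whether the same holds for real bond data is the empirical question the paper addresses.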
- [61] arXiv:2411.03699 (replaced) [pdf, html, other]
-
Title: Zero-Coupon Treasury Yield Curve with VIX as Stochastic Volatility
Comments: 15 pages, 2 figures, 5 graphs. Keywords: total returns, Ornstein-Uhlenbeck process, ergodic Markov processes, autoregression, long-term stability, stationary distribution, principal component analysis
Subjects: Statistical Finance (q-fin.ST); Probability (math.PR); Applications (stat.AP)
We study a multivariate autoregressive stochastic volatility model for the first 3 principal components (level, slope, curvature) of 10 series of zero-coupon Treasury bond rates with maturities from 1 to 10 years. We fit this model using monthly data from 1990. Next, we prove long-term stability for this discrete-time model and its continuous-time version. Unlike classic models with hidden stochastic volatility, here it is observed as VIX: the volatility index for the S&P 500 stock market index. It is surprising that this volatility, created for the stock market, also works for Treasury bonds. Since total returns of zero-coupon bonds can be easily found from these principal components, we prove long-term stability for total returns in discrete time.
- [62] arXiv:2411.06406 (replaced) [pdf, html, other]
-
Title: Locally Adaptive One-Class Classifier Fusion with Dynamic $\ell_p$-Norm Constraints for Robust Anomaly Detection
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper presents a novel approach to one-class classifier fusion through locally adaptive learning with dynamic $\ell_p$-norm constraints. We introduce a framework that dynamically adjusts fusion weights based on local data characteristics, addressing fundamental challenges in ensemble-based anomaly detection. Our method incorporates an interior-point optimization technique that significantly improves computational efficiency compared to traditional Frank-Wolfe approaches, achieving up to 19-fold speed improvements in complex scenarios. The framework is extensively evaluated on standard UCI benchmark datasets and specialized temporal sequence datasets, demonstrating superior performance across diverse anomaly types. Statistical validation through Skillings-Mack tests confirms our method's significant advantages over existing approaches, with consistent top rankings in both pure and non-pure learning scenarios. The framework's ability to adapt to local data patterns while maintaining computational efficiency makes it particularly valuable for real-time applications where rapid and accurate anomaly detection is crucial.
- [63] arXiv:2411.10377 (replaced) [pdf, html, other]
-
Title: Generation of synthetic gait data: application to multiple sclerosis patients' gait patterns
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Multiple sclerosis (MS) is the leading cause of severe non-traumatic disability in young adults and its incidence is increasing worldwide. The variability of gait impairment in MS necessitates the development of a non-invasive, sensitive, and cost-effective tool for quantitative gait evaluation. The eGait movement sensor, designed to characterize human gait through unit quaternion time series (QTS) representing hip rotations, is a promising approach. However, the small sample sizes typical of clinical studies pose challenges for the stability of gait data analysis tools. To address these challenges, this article presents two key scientific contributions. First, a comprehensive framework is proposed for transforming QTS data into a form that preserves the essential geometric properties of gait while enabling the use of any tabular synthetic data generation method. Second, a synthetic data generation method is introduced, based on nearest neighbors weighting, which produces high-fidelity synthetic QTS data suitable for small datasets and private data environments. The effectiveness of the proposed method is demonstrated through its application to MS gait data, showing very good fidelity and respect for the initial geometry of the data. Thanks to this work, we are able to produce synthetic data sets and work on the stability of clustering methods.
- [64] arXiv:2411.10482 (replaced) [pdf, html, other]
-
Title: The Noisy Work of Uncertainty Visualisation Research: A Review
Comments: 52 pages with 7 figures
Subjects: Human-Computer Interaction (cs.HC); Applications (stat.AP)
Uncertainty visualisation is quickly becoming a hot topic in information visualisation. Existing reviews in the field take the definition and purpose of an uncertainty visualisation to be self-evident, which results in a large amount of conflicting information. This conflict largely stems from a conflation between uncertainty visualisations designed for decision making and those designed to prevent false conclusions. We coin the term "signal suppression" to describe a visualisation that is designed for preventing false conclusions, as the approach demands that the signal (i.e. the collective takeaway of the estimates) is suppressed by the noise (i.e. the variance on those estimates). We argue that the current standards in visualisation suggest that uncertainty visualisations designed for decision making should not be considered uncertainty visualisations at all. Therefore, future work should focus on signal suppression. Effective signal suppression requires us to communicate the signal and the noise as a single "validity of signal" variable, and doing so proves to be difficult with current methods. We illustrate current approaches to uncertainty visualisation by showing how they would change the visual appearance of a choropleth map. These maps allow us to see why some methods succeed at signal suppression, while others fall short. Evaluating visualisations on how well they perform signal suppression also proves to be difficult, as it involves measuring the effect of noise, a variable we typically try to ignore. We suggest authors use qualitative studies or compare uncertainty visualisations to the relevant hypothesis tests.
- [65] arXiv:2411.10982 (replaced) [pdf, html, other]
-
Title: Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations
Yueyang Shen, Agus Sudjianto, Arun Prakash R, Anwesha Bhattacharyya, Maorong Rao, Yaqun Wang, Joel Vaughan, Nengfeng Zhou
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
We propose and study a minimalist approach towards synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with a contingent clustering step or log transformation to handle nonlinearity) and an XGBoost decoder, which is SOTA for structured data regression and classification tasks. We study and contrast the methodologies with (variational) autoencoders in several toy low-dimensional scenarios to derive necessary intuitions. The framework is applied to high-dimensional simulated credit scoring data, which parallels real-life financial applications. We applied the method to robustness testing to demonstrate practical use cases. The case study result suggests that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, guarantees interpretability all the way through, does not require extra tuning, and provides unique benefits.