Statistics
Showing new listings for Friday, 20 December 2024
- [1] arXiv:2412.14196 [pdf, other]
-
Title: Investigating Central England Temperature Variability: Statistical Analysis of Associations with North Atlantic Oscillation (NAO) and Pacific Decadal Oscillation (PDO)
Comments: 15 pages, 6 figures
Subjects: Methodology (stat.ME); Atmospheric and Oceanic Physics (physics.ao-ph)
This study investigates the variability of the Central England Temperature (CET) series in relation to the North Atlantic Oscillation (NAO) and the Pacific Decadal Oscillation (PDO) using advanced time series modeling techniques. Leveraging the world's longest continuous instrumental temperature dataset (1723-2023), this research applies ARIMA and ARIMAX models to quantify the impact of climatic oscillations on regional temperature variability, while also accounting for long-term warming trends. Spectral and coherence analyses further explore the periodic interactions between CET and the oscillations. Results reveal that NAO exerts a stronger influence on CET variability compared to PDO, with significant coherence observed at cycles of 5 to 7.5 years and 2 to 2.5 years for NAO, while PDO shows no statistically significant coherence. The ARIMAX model effectively captures both the upward warming trend and the influence of climatic oscillations, with robust diagnostics confirming its reliability. This study contributes to understanding the interplay between regional temperature variability and large-scale climatic drivers, providing a framework for future research on climatic oscillations and their role in shaping regional climate dynamics. Limitations and potential future directions, including the integration of additional climatic indices and comparative regional analyses, are also discussed.
- [2] arXiv:2412.14263 [pdf, html, other]
-
Title: Evaluation of the linear mixing model in fluorescence spectroscopy
Subjects: Applications (stat.AP)
Analyses of spectral data often assume a linear mixing hypothesis, which states that the spectrum of a mixed substance is approximately the mixture of the individual spectra of its constituent parts. We evaluate this hypothesis in the context of dissolved organic matter (DOM) fluorescence spectroscopy for endmember abundance recovery from mixtures of three different DOM endmembers. We quantify two key sources of experimental variation, and statistically evaluate the linear mixing hypothesis in the context of this variation. We find that there is not strong statistical evidence against this hypothesis for high-fluorescence readings, and that true abundances of high-fluorescence endmembers are accurately recovered from the excitation-emission fluorescence spectra of mixed samples using linear methods. However, abundances of a low-fluorescence endmember are less well-estimated, in that the abundance coefficient estimates exhibit a high degree of variability across replicate experiments.
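To make the linear mixing hypothesis concrete, the following is a minimal sketch of abundance recovery by ordinary least squares. It is not the authors' analysis: the spectra, the five-point wavelength grid, and the function names `solve_linear_system` and `estimate_abundances` are hypothetical, and the mixture is noise-free, so it ignores the experimental variation the abstract discusses.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations touch both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def estimate_abundances(endmembers, mixed):
    """Least-squares abundances a minimizing ||mixed - sum_k a_k * endmember_k||^2."""
    k = len(endmembers)
    # Normal equations: (S S^T) a = S mixed, with endmember spectra as rows of S.
    gram = [[sum(u * v for u, v in zip(endmembers[i], endmembers[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(u * v for u, v in zip(endmembers[i], mixed)) for i in range(k)]
    return solve_linear_system(gram, rhs)


# Hypothetical spectra: three endmembers measured at five wavelengths.
endmembers = [
    [1.0, 0.8, 0.3, 0.1, 0.0],
    [0.1, 0.5, 1.0, 0.5, 0.1],
    [0.0, 0.1, 0.2, 0.6, 1.0],
]
true_abundances = [0.5, 0.3, 0.2]
# Under exact linear mixing, the mixed spectrum is the weighted sum of endmembers.
mixed = [sum(a * s[i] for a, s in zip(true_abundances, endmembers))
         for i in range(5)]
estimated = estimate_abundances(endmembers, mixed)
```

With exact mixing the estimate matches the true abundances up to rounding; with replicate noise, a weakly fluorescent endmember would contribute a small signal relative to that noise, which is one plausible reading of why its coefficient is harder to pin down.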
- [3] arXiv:2412.14284 [pdf, html, other]
-
Title: Optimal design of experiments for functional linear models with dynamic factors
Comments: 15 figures
Subjects: Methodology (stat.ME)
In this work we build optimal experimental designs for precise estimation of the functional coefficient of a function-on-function linear regression model where both the response and the factors are continuous functions of time. After obtaining the variance-covariance matrix of the estimator of the functional coefficient which minimizes the integrated sum of squared errors, we extend the classical definition of optimal design to this estimator, and we provide expressions for the A-optimal and D-optimal designs. Examples of optimal designs for dynamic experimental factors are then computed through a suitable algorithm, and we discuss different scenarios in terms of the set of basis functions used for their representation. Finally, we present an example with simulated data to illustrate the feasibility of our methodology.
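For orientation, the classical (non-functional) versions of the two criteria can be sketched as follows. This is only an illustration of the scalar definitions the abstract says it extends, not the paper's function-on-function construction: the model `y = b0 + b1 * x`, the two four-run designs, and the function names are assumptions made for the example. A-optimality minimizes the trace of the estimator's covariance matrix and D-optimality minimizes its determinant, both proportional to functions of the inverse information matrix.

```python
def information_matrix(points):
    """X^T X for the model y = b0 + b1 * x with one run per design point."""
    n = len(points)
    sx = sum(points)
    sxx = sum(x * x for x in points)
    return [[n, sx], [sx, sxx]]


def a_and_d_criteria(points):
    """Return (trace, determinant) of the inverse information matrix."""
    (a, b), (c, d) = information_matrix(points)
    det = a * d - b * c
    # The inverse of a 2x2 matrix is [[d, -b], [-c, a]] / det.
    trace_inv = (a + d) / det
    det_inv = 1.0 / det
    return trace_inv, det_inv


# Two hypothetical four-run designs on the interval [-1, 1].
endpoint_design = [-1.0, -1.0, 1.0, 1.0]
interior_design = [-0.5, -0.5, 0.5, 0.5]
a_end, d_end = a_and_d_criteria(endpoint_design)
a_int, d_int = a_and_d_criteria(interior_design)
```

Placing the runs at the endpoints gives the smaller value on both criteria, which is the familiar result for straight-line regression on an interval.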
- [4] arXiv:2412.14315 [pdf, html, other]
-
Title: On the Robustness of Spectral Algorithms for Semirandom Stochastic Block Models
Authors: Aditya Bhaskara, Agastya Vibhuti Jha, Michael Kapralov, Naren Sarayu Manoj, Davide Mazzali, Weronika Wrzos-Kaminska
Comments: 45 pages. NeurIPS 2024
Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
In a graph bisection problem, we are given a graph $G$ with two equally-sized unlabeled communities, and the goal is to recover the vertices in these communities. A popular heuristic, known as spectral clustering, is to output an estimated community assignment based on the eigenvector corresponding to the second smallest eigenvalue of the Laplacian of $G$. Spectral algorithms can be shown to provably recover the cluster structure for graphs generated from certain probabilistic models, such as the Stochastic Block Model (SBM). However, spectral clustering is known to be non-robust to model mis-specification. Techniques based on semidefinite programming have been shown to be more robust, but they incur significant computational overheads.
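The spectral bisection heuristic described above can be sketched in a few lines. This is a toy illustration rather than the authors' experiments: the graph (two 4-cliques joined by one edge), the function names, and the use of shifted power iteration in place of a library eigensolver are all choices made for the example. It uses the unnormalized Laplacian L = D - A and splits vertices by the sign of the eigenvector for the second smallest eigenvalue.

```python
import random


def two_cliques_with_bridge(size):
    """Adjacency matrix of two cliques of the given size joined by one edge."""
    n = 2 * size
    adj = [[0] * n for _ in range(n)]
    for block in (range(size), range(size, n)):
        for i in block:
            for j in block:
                if i != j:
                    adj[i][j] = 1
    adj[size - 1][size] = adj[size][size - 1] = 1
    return adj


def fiedler_vector(adjacency, iterations=500, seed=0):
    """Approximate the eigenvector of the second smallest Laplacian eigenvalue."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    shift = 2.0 * max(degrees)  # upper bound on the largest Laplacian eigenvalue
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Remove the component along the all-ones vector (eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Apply (shift * I - L), where L = D - A, so small eigenvalues dominate.
        lv = [degrees[i] * v[i] - sum(adjacency[i][j] * v[j] for j in range(n))
              for i in range(n)]
        v = [shift * v[i] - lv[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


adjacency = two_cliques_with_bridge(4)
v = fiedler_vector(adjacency)
# The overall sign of an eigenvector is arbitrary, so only the split matters.
community = [x > 0 for x in v]
```

On this graph the sign pattern separates the two cliques. The example says nothing about robustness to semirandom changes, which is the question the paper studies.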
In this work, we study the robustness of spectral algorithms against semirandom adversaries. Informally, a semirandom adversary is allowed to "helpfully" change the specification of the model in a way that is consistent with the ground-truth solution. Our semirandom adversaries in particular are allowed to add edges inside clusters or increase the probability that an edge appears inside a cluster. Semirandom adversaries are a useful tool to determine the extent to which an algorithm has overfit to statistical assumptions on the input.
On the positive side, we identify classes of semirandom adversaries under which spectral bisection using the _unnormalized_ Laplacian is strongly consistent, i.e., it exactly recovers the planted partitioning. On the negative side, we show that in these classes spectral bisection with the _normalized_ Laplacian outputs a partitioning that makes a classification mistake on a constant fraction of the vertices. Finally, we demonstrate numerical experiments that complement our theoretical findings.
- [5] arXiv:2412.14339 [pdf, html, other]
-
Title: Forecasting Influenza Hospitalizations Using a Bayesian Hierarchical Nonlinear Model with Discrepancy
Subjects: Applications (stat.AP)
The annual influenza outbreak leads to significant public health and economic burdens, making it desirable to have prompt and accurate probabilistic forecasts of the disease spread. The United States Centers for Disease Control and Prevention (CDC) annually hosts a national flu forecasting competition which has led to the development of a variety of flu forecast modeling methods. Beginning in 2013, the target to be forecast was weekly percentage of patients with an influenza-like illness (ILI), but in 2021 the target was changed to weekly hospitalizations. Reliable hospitalization data has only been available since 2021, but ILI data has been available since 2010 and has been successfully forecast for several seasons. In this manuscript, we introduce a two-component modeling framework for forecasting hospitalizations utilizing both hospitalization and ILI data. The first component is for modeling ILI data using a nonlinear Bayesian model. The second component is for modeling hospitalizations as a function of ILI. For hospitalization forecasts, ILI is first forecast, then hospitalizations are forecast with ILI forecasts used as a predictor. In a simulation study, the hospitalization forecast model is assessed and two previously successful ILI forecast models are compared. Also assessed is the usefulness of including a systematic model discrepancy term in the ILI model. Forecasts of state and national hospitalizations for the 2023-24 flu season are made, and different modeling decisions are compared. We found that including a discrepancy component in the ILI model tends to improve forecasts during certain weeks of the year. We also found that other modeling decisions such as the exact nonlinear function to be used in the ILI model or the error distribution for hospitalization models may or may not be better than other decisions, depending on the season, location, or week of the forecast.
- [6] arXiv:2412.14343 [pdf, html, other]
-
Title: Revisiting the Nowosiółka skull with RMaCzek
Comments: Presented at the XXVII National Conference on Applications of Mathematics to Biology and Medicine, Wisła, Poland, 23-27 September 2022, this https URL
Journal-ref: Mathematica Applicanda (Matematyka Stosowana) 50(2): 255-266, 2022
Subjects: Applications (stat.AP); Populations and Evolution (q-bio.PE)
One of the first fully quantitative distance matrix visualization methods was proposed by Jan Czekanowski at the beginning of the previous century. Recently, a software package, RMaCzek, was made available that allows for producing such diagrams in R. Here we reanalyze the original data that Czekanowski used for introducing his method, and in the accompanying code show how the user can specify their own custom distance functions in the package.
- [7] arXiv:2412.14346 [pdf, html, other]
-
Title: Strong Gaussian approximations with random multipliers
Subjects: Statistics Theory (math.ST)
One reason why standard formulations of the central limit theorem are not applicable in high-dimensional and non-stationary regimes is the lack of a suitable limit object. Instead, suitable distributional approximations can be used, where the approximating object is not constant, but a sequence as well. We extend Gaussian approximation results for the partial sum process by allowing each summand to be multiplied by a data-dependent matrix. The results allow for serial dependence of the data, and for high-dimensionality of both the data and the multipliers. In the finite-dimensional and locally-stationary setting, we obtain a functional central limit theorem as a direct consequence. An application to sequential testing in non-stationary environments is described.
- [8] arXiv:2412.14357 [pdf, html, other]
-
Title: Nonparametric Regression in Dirichlet Spaces: A Random Obstacle Approach
Subjects: Statistics Theory (math.ST)
In this paper, we consider nonparametric estimation over general Dirichlet metric measure spaces. Unlike the more commonly studied reproducing kernel Hilbert space, whose elements may be defined pointwise, a Dirichlet space typically only contains equivalence classes, i.e. its elements are only unique almost everywhere. This lack of pointwise definition presents significant challenges in the context of nonparametric estimation; for example, the classical ridge regression problem is ill-posed. In this paper, we develop a new technique for renormalizing the ridge loss by replacing pointwise evaluations with certain _local means_ around the boundaries of obstacles centered at each data point. The resulting renormalized empirical risk functional is well-posed and even admits a representer theorem in terms of certain equilibrium potentials, which are truncated versions of the associated Green function, cut-off at a data-driven threshold. We study the global, out-of-sample consistency of the sample minimizer, and derive an adaptive upper bound on its convergence rate that highlights the interplay of the analytic, geometric, and probabilistic properties of the Dirichlet form. We also construct a simple regressogram type estimator that achieves the minimax optimal estimation rate over certain $L^p$ subsets of a Dirichlet ball with some knowledge of the geometry of the metric measure space. Our framework notably does not require the smoothness of the underlying space, and is applicable to both manifold and fractal settings. To the best of our knowledge, this is the first paper to obtain out-of-sample convergence guarantees in the framework of general metric measure Dirichlet spaces.
- [9] arXiv:2412.14391 [pdf, html, other]
-
Title: Randomization Tests for Conditional Group Symmetry
Comments: Theorems 2.2 and 4.1 appeared in arXiv:2307.15834, which is superseded by this article
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Symmetry plays a central role in the sciences, machine learning, and statistics. While statistical tests for the presence of distributional invariance with respect to groups have a long history, tests for conditional symmetry in the form of equivariance or conditional invariance are absent from the literature. This work initiates the study of nonparametric randomization tests for symmetry (invariance or equivariance) of a conditional distribution under the action of a specified locally compact group. We develop a general framework for randomization tests with finite-sample Type I error control and, using kernel methods, implement tests with finite-sample power lower bounds. We also describe and implement approximate versions of the tests, which are asymptotically consistent. We study their properties empirically on synthetic examples, and on applications to testing for symmetry in two problems from high-energy particle physics.
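As background for the randomization-test idea, here is a minimal sketch of the classical unconditional case that the abstract contrasts with: testing invariance of a distribution under the two-element sign-flip group. It is not the conditional test developed in the paper and uses no kernel methods. The statistic (absolute sample sum), the number of draws, the seeds, and the function name are assumptions for illustration. The p-value uses the standard (1 + count) / (draws + 1) form, which gives finite-sample Type I error control when the null holds.

```python
import random


def sign_flip_randomization_test(sample, num_draws=999, seed=0):
    """P-value for the null that the sample's distribution is symmetric about 0."""
    rng = random.Random(seed)
    observed = abs(sum(sample))
    exceed = 0
    for _ in range(num_draws):
        # Apply an independent random group element (+1 or -1) to each point.
        flipped = sum(x if rng.random() < 0.5 else -x for x in sample)
        if abs(flipped) >= observed:
            exceed += 1
    # Counting the observed statistic itself keeps the test valid in finite samples.
    return (1 + exceed) / (num_draws + 1)


rng = random.Random(1)
symmetric = [rng.gauss(0.0, 1.0) for _ in range(50)]  # null holds
shifted = [rng.gauss(1.0, 1.0) for _ in range(50)]    # null is violated
p_symmetric = sign_flip_randomization_test(symmetric)
p_shifted = sign_flip_randomization_test(shifted)
```

The shifted sample should yield a small p-value, while the symmetric sample has no reason to. Extending this to a conditional distribution is exactly where the simple recipe stops working and the paper's framework begins.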
- [10] arXiv:2412.14423 [pdf, html, other]
-
Title: Cross-Validation with Antithetic Gaussian Randomization
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We introduce a method for performing cross-validation without sample splitting. The method is well-suited for problems where traditional sample splitting is infeasible, such as when data are not assumed to be independently and identically distributed. Even in scenarios where sample splitting is possible, our method offers a computationally efficient alternative for estimating prediction error, achieving comparable or even lower error than standard cross-validation at a significantly reduced computational cost.
Our approach constructs train-test data pairs using externally generated Gaussian randomization variables, drawing inspiration from recent randomization techniques such as data-fission and data-thinning. The key innovation lies in a carefully designed correlation structure among these randomization variables, referred to as antithetic Gaussian randomization. This correlation is crucial in maintaining a bounded variance while allowing the bias to vanish, offering an additional advantage over standard cross-validation, whose performance depends heavily on the bias-variance tradeoff dictated by the number of folds. We provide a theoretical analysis of the mean squared error of the proposed estimator, proving that as the level of randomization decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions. This analysis highlights the benefits of the antithetic Gaussian randomization over independent randomization. Simulation studies corroborate our theoretical findings, illustrating the robust performance of our cross-validated estimator across various data types and loss functions.
- [11] arXiv:2412.14478 [pdf, html, other]
-
Title: Time-Varying Functional Cox Model
Subjects: Methodology (stat.ME)
We propose two novel approaches for estimating time-varying effects of functional predictors within a linear functional Cox model framework. This model allows for time-varying associations of a functional predictor observed at baseline, estimated using penalized regression splines for smoothness across the functional domain and event time. The first approach, suitable for small-to-medium datasets, uses the Cox-Poisson likelihood connection for valid estimation and inference. The second, a landmark approach, significantly reduces computational burden for large datasets and high-dimensional functional predictors. Both methods address proportional hazards violations for functional predictors and model associations as a bivariate smooth coefficient. Motivated by analyzing diurnal motor activity patterns and all-cause mortality in NHANES (N=4445, functional predictor dimension=1440), we demonstrate the first method's computational limitations and the landmark approach's efficiency. These methods are implemented in stable, high-quality software using the mgcv package for penalized spline regression with automated smoothing parameter selection. Simulations show both methods achieve high accuracy in estimating functional coefficients, with the landmark approach being computationally faster but slightly less accurate. The Cox-Poisson method provides nominal coverage probabilities, while landmark inference was not assessed due to inherent bias. Sensitivity to landmark modeling choices was evaluated. Application to NHANES reveals an attenuation of diurnal effects on mortality over an 8-year follow-up.
- [12] arXiv:2412.14503 [pdf, html, other]
-
Title: dapper: Data Augmentation for Private Posterior Estimation in R
Subjects: Computation (stat.CO)
This paper serves as a reference and introduction to using the R package dapper. dapper encodes a sampling framework which allows exact Markov chain Monte Carlo simulation of parameters and latent variables in a statistical model given privatized data. The goal of this package is to fill an urgent need by providing applied researchers with a flexible tool to perform valid Bayesian inference on data protected by differential privacy, allowing them to properly account for the noise introduced for privacy protection in their statistical analysis. dapper offers a significant step forward in providing general-purpose statistical inference tools for privatized data.
- [13] arXiv:2412.14527 [pdf, html, other]
-
Title: Statistical Undersampling with Mutual Information and Support Points
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
- [14] arXiv:2412.14563 [pdf, html, other]
-
Title: Transfer Learning Meets Functional Linear Regression: No Negative Transfer under Posterior Drift
Comments: 27 pages, 7 figures; accepted by AAAI-25
Subjects: Methodology (stat.ME)
Posterior drift refers to changes in the relationship between responses and covariates while the distributions of the covariates remain unchanged. In this work, we explore functional linear regression under posterior drift with transfer learning. Specifically, we investigate when and how auxiliary data can be leveraged to improve the estimation accuracy of the slope function in the target model when posterior drift occurs. We employ the approximated least squares method together with a lasso penalty to construct an estimator that transfers beneficial knowledge from source data. Theoretical analysis indicates that our method avoids negative transfer under posterior drift, even when the contrast between slope functions is quite large. Specifically, the estimator is shown to perform at least as well as the classical estimator using only target data, and it enhances the learning of the target model when the source and target models are sufficiently similar. Furthermore, to address scenarios where covariate distributions may change, we propose an adaptive algorithm using aggregation techniques. This algorithm is robust against non-informative source samples and effectively prevents negative transfer. Simulation and real data examples are provided to demonstrate the effectiveness of the proposed algorithm.
- [15] arXiv:2412.14720 [pdf, html, other]
-
Title: MICG-AI: A multidimensional index of child growth based on digital phenotyping with Bayesian artificial intelligence
Comments: 15 pages, 0 figures
Subjects: Applications (stat.AP)
This document proposes an algorithm for a mobile application designed to monitor multidimensional child growth through digital phenotyping. Digital phenotyping offers a unique opportunity to collect and analyze high-frequency data in real time, capturing behavioral, psychological, and physiological states of children in naturalistic settings. Traditional models of child growth primarily focus on physical metrics, often overlooking multidimensional aspects such as emotional, social, and cognitive development. In this paper, we introduce a Bayesian artificial intelligence (AI) algorithm that leverages digital phenotyping to create a Multidimensional Index of Child Growth (MICG). This index integrates data from various dimensions of child development, including physical, emotional, cognitive, and environmental factors. By incorporating probabilistic modeling, the proposed algorithm dynamically updates its learning based on data collected by the mobile app used by mothers and children. The app also infers uncertainty from response times, adjusting the importance of each dimension of child growth accordingly. Our contribution applies state-of-the-art technology to track multidimensional child development, enabling families and healthcare providers to make more informed decisions in real time.
- [16] arXiv:2412.14745 [pdf, other]
-
Title: Union-Free Generic Depth for Non-Standard Data
Subjects: Methodology (stat.ME)
Non-standard data, which fall outside classical statistical data formats, challenge state-of-the-art analysis. Examples of non-standard data include partial orders and mixed categorical-numeric-spatial data. Most statistical methods require them to be represented in classical statistical spaces. However, this representation can distort their inherent structure and thus the results and interpretation. For practitioners, this creates a dilemma: using standard statistical methods can risk misrepresenting the data, while preserving their true structure often renders these methods inapplicable. To address this dilemma, we introduce the union-free generic depth (ufg-depth), a novel framework that respects the true structure of non-standard data while enabling robust statistical analysis. The ufg-depth extends the concept of simplicial depth from normed vector spaces to a much broader range of data types, by combining formal concept analysis and data depth. We provide a systematic analysis of the theoretical properties of the ufg-depth and demonstrate its application to mixed categorical-numerical-spatial data and hierarchical-nominal data. The ufg-depth is a unified approach that bridges the gap between preserving the data structure and applying statistical methods. With this, we provide a new perspective for non-standard data analysis.
- [17] arXiv:2412.14800 [pdf, html, other]
-
Title: Asymptotic Equivalence for Nonparametric Regression
Comments: 36 pages, 0 figures
Journal-ref: Mathematical Methods of Statistics, 2002, Vol. 11, No 1, pp. 1-36
Subjects: Statistics Theory (math.ST)
We consider a nonparametric model $\mathcal{E}^{n},$ generated by independent observations $X_{i},$ $i=1,...,n,$ with densities $p(x,\theta_{i}),$ $i=1,...,n,$ the parameters of which $\theta _{i}=f(i/n)\in \Theta $ are driven by the values of an unknown function $f:[0,1]\rightarrow \Theta $ in a smoothness class. The main result of the paper is that, under regularity assumptions, this model can be approximated, in the sense of the Le Cam deficiency pseudodistance, by a nonparametric Gaussian shift model $Y_{i}=\Gamma (f(i/n))+\varepsilon _{i},$ where $\varepsilon_{1},...,\varepsilon _{n}$ are i.i.d. standard normal r.v.'s, the function $\Gamma (\theta ):\Theta \rightarrow \mathrm{R}$ satisfies $\Gamma ^{\prime}(\theta )=\sqrt{I(\theta )}$ and $I(\theta )$ is the Fisher information corresponding to the density $p(x,\theta ).$
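As a worked illustration of the condition $\Gamma^{\prime}(\theta)=\sqrt{I(\theta)}$, consider two textbook one-parameter families. These examples are not taken from the abstract, and whether they meet the paper's regularity assumptions is not stated there. Integrating $\sqrt{I(\theta)}$ gives $\Gamma$ up to an additive constant:

$$
\text{Poisson with mean } \theta:\quad I(\theta)=\frac{1}{\theta},\qquad \Gamma(\theta)=\int \theta^{-1/2}\,d\theta = 2\sqrt{\theta},
$$

$$
\text{Bernoulli with success probability } \theta:\quad I(\theta)=\frac{1}{\theta(1-\theta)},\qquad \Gamma(\theta)=\int \frac{d\theta}{\sqrt{\theta(1-\theta)}} = 2\arcsin\sqrt{\theta}.
$$

Both are the familiar variance-stabilizing transformations (square root and arcsine square root), which is consistent with reading $\Gamma$ as the map that turns the original model into a Gaussian shift with unit noise variance.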
- [18] arXiv:2412.14916 [pdf, html, other]
-
Title: From Point to probabilistic gradient boosting for claim frequency and severity prediction
Comments: 26 pages, 4 figures, 26 tables, 7 algorithms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Gradient boosting algorithms for decision trees are increasingly used in actuarial applications as they show superior predictive performance over traditional generalized linear models. Many improvements and refinements to the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting algorithms for decision trees: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive numerical study, we compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high cardinality) categorical variables. We explain how varying exposure-to-risk can be handled with boosting in frequency models. We compare the algorithms on the basis of computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. The fully interpretable EGBM achieves competitive predictive performance compared to the black box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.
- [19] arXiv:2412.14942 [pdf, html, other]
-
Title: Robust modestly weighted log-rank tests
Subjects: Methodology (stat.ME)
The introduction of checkpoint inhibitors in immuno-oncology has raised questions about the suitability of the log-rank test as the default primary analysis method in confirmatory studies, particularly when survival curves exhibit non-proportional hazards. The log-rank test, while effective in controlling false positive rates, may lose power in scenarios where survival curves remain similar for extended periods before diverging. To address this, various weighted versions of the log-rank test have been proposed, including the MaxCombo test, which combines multiple weighted log-rank statistics to enhance power across a range of alternative hypotheses.
Despite its potential, the MaxCombo test has seen limited adoption, possibly owing to its proneness to produce counterintuitive results in situations where the hazard functions on the two arms cross. In response, the modestly weighted log-rank test was developed to provide a balanced approach, giving greater weight to later event times while avoiding undue influence from early detrimental effects. However, this test also faces limitations, particularly if the possibility of early separation of survival curves cannot be ruled out a priori.
We propose a novel test statistic that integrates the strengths of the standard log-rank test, the modestly weighted log-rank test, and the MaxCombo test. By considering the maximum of the standard log-rank statistic and a modestly weighted log-rank statistic, the new test aims to maintain power under delayed effect scenarios while minimizing power loss, relative to the log-rank test, in worst-case scenarios. Simulation studies and a case study demonstrate the efficiency and robustness of this approach, highlighting its potential as a robust alternative for primary analysis in immuno-oncology trials.
- [20] arXiv:2412.14946 [pdf, html, other]
-
Title: Joint Models for Handling Non-Ignorable Missing Data using Bayesian Additive Regression Trees: Application to Leaf Photosynthetic Traits Data
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
Dealing with missing data poses significant challenges in predictive analysis, often leading to biased conclusions when oversimplified assumptions about the missing data process are made. In cases where the data are missing not at random (MNAR), jointly modeling the data and missing data indicators is essential. Motivated by a real data application with partially missing multivariate outcomes related to leaf photosynthetic traits and several environmental covariates, we propose two methods under a selection model framework for handling data with missingness in the response variables suitable for recovering various missingness mechanisms. Both approaches use a multivariate extension of Bayesian additive regression trees (BART) to flexibly model the outcomes. The first approach simultaneously uses a probit regression model to jointly model the missingness. In scenarios where the relationship between the missingness and the data is more complex or non-linear, we propose a second approach using a probit BART model to characterize the missing data process, thereby employing two BART models simultaneously. Both models also effectively handle ignorable covariate missingness. The efficacy of both models compared to existing missing data approaches is demonstrated through extensive simulations, in both univariate and multivariate settings, and through the aforementioned application to the leaf photosynthetic trait data.
- [21] arXiv:2412.15012 [pdf, html, other]
-
Title: Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods
Authors: Brian D. Williamson, Chloe Krakauer, Eric Johnson, Susan Gruber, Bryan E. Shepherd, Mark J. van der Laan, Thomas Lumley, Hana Lee, Jose J. Hernandez Munoz, Fengyu Zhao, Sarah K. Dutcher, Rishi Desai, Gregory E. Simon, Susan M. Shortreed, Jennifer C. Nelson, Pamela A. Shaw
Comments: 142 pages (27 main, 115 supplemental); 6 figures, 2 tables
Subjects: Methodology (stat.ME)
In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and therefore missing on a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare anti-depressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.
- [22] arXiv:2412.15041 [pdf, other]
-
Title: Boosting Distributional Copula Regression for Bivariate Right-Censored Time-to-Event Data
Subjects: Methodology (stat.ME)
We propose a highly flexible distributional copula regression model for bivariate time-to-event data in the presence of right-censoring. The joint survival function of the response is constructed using parametric copulas, allowing for a separate specification of the dependence structure between the time-to-event outcome variables and their respective marginal survival distributions. The latter are specified using well-known parametric distributions such as the log-Normal, log-Logistic (proportional odds model), or Weibull (proportional hazards model) distributions. Hence, the marginal univariate event times can be specified as parametric (also known as Accelerated Failure Time, AFT) models. Embedding our model into the class of generalized additive models for location, scale and shape, possibly all distribution parameters of the joint survival function can depend on covariates. We develop a component-wise gradient-based boosting algorithm for estimation. This way, our approach is able to conduct data-driven variable selection. To the best of our knowledge, this is the first implementation of multivariate AFT models via distributional copula regression with automatic variable selection via statistical boosting. A special merit of our approach is that it works for high-dimensional (p>>n) settings. We illustrate the practical potential of our method on a high-dimensional application related to semi-competing risks responses in ovarian cancer. All of our methods are implemented in the open source statistical software R as add-on functions of the package gamboostLSS.
- [23] arXiv:2412.15049 [pdf, html, other]
-
Title: A linear regression model for quantile function data applied to paired pulmonary 3d CT scans
Comments: 36 pages, 10 figures, 3 tables
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
This paper introduces a new objective measure for assessing treatment response in asthmatic patients using computed tomography (CT) imaging data. For each patient, CT scans were obtained before and after one year of monoclonal antibody treatment. Following image segmentation, the Hounsfield unit (HU) values of the voxels were encoded through quantile functions. It is hypothesized that patients with improved conditions after treatment will exhibit better expiration, reflected in higher HU values and an upward shift in the quantile curve. To objectively measure treatment response, a novel linear regression model on quantile functions is developed, drawing inspiration from Verde and Irpino (2010). Unlike their framework, the proposed model is parametric and incorporates distributional assumptions on the errors, enabling statistical inference. The model allows for the explicit calculation of regression coefficient estimators and confidence intervals, similar to conventional linear regression. The corresponding data and R code are available on GitHub to facilitate the reproducibility of the analyses presented.
- [24] arXiv:2412.15057 [pdf, html, other]
-
Title: Asymptotic Equivalence for Nonparametric Generalized Linear Models
Comments: 39 pages, 0 figures
Journal-ref: Probab. Theory Relat. Fields 111, 167-214 (1998)
Subjects: Statistics Theory (math.ST)
We establish that a non-Gaussian nonparametric regression model is asymptotically equivalent to a regression model with Gaussian noise. The approximation is in the sense of Le Cam's deficiency distance $\Delta$; the models are then asymptotically equivalent for all purposes of statistical decision with bounded loss. Our result concerns a sequence of independent but not identically distributed observations with each distribution in the same real-indexed exponential family. The canonical parameter is a value $f(t_i)$ of a regression function $f$ at a grid point $t_i$ (nonparametric GLM). When $f$ is in a Hölder ball with exponent $\beta > \frac{1}{2}$, we establish global asymptotic equivalence to observations of a signal $\Gamma(f(t))$ in Gaussian white noise, where $\Gamma$ is related to a variance stabilizing transformation in the exponential family. The result is a regression analog of the recently established Gaussian approximation for the i.i.d. model. The proof is based on a functional version of the Hungarian construction for the partial sum process.
- [25] arXiv:2412.15076 [pdf, other]
-
Title: Digital N-of-1 Trials and their Application in Experimental Physiology
Comments: Accepted in Experimental Physiology
Subjects: Applications (stat.AP)
Traditionally, studies in experimental physiology have been conducted in small groups of human participants, animal models or cell lines. Important challenges include achieving sufficient statistical power in hypothesis tests with small sample sizes and identifying optimal study designs. Here, we introduce N-of-1 trials as an innovative study design that can improve studies in experimental physiology. N-of-1 trials are multi-crossover trials in single participants that allow valid statistical inference on the individual level. Also, series of N-of-1 trials conducted on multiple study participants can be aggregated for population-level inference and provide a more efficient study design compared to standard randomized controlled trials. In this manuscript, we first introduce key components and design features of N-of-1 trials. Then we lay out how N-of-1 trials can be analyzed statistically and give different examples of their applicability in experimental physiological studies. In summary, we provide an overview of the main components for designing N-of-1 trials, give direct examples in experimental physiology, and offer practical recommendations on their proper use.
- [26] arXiv:2412.15128 [pdf, html, other]
-
Title: Estimating Heterogeneous Treatment Effects for Spatio-Temporal Causal Inference: How Economic Assistance Moderates the Effects of Airstrikes on Insurgent Violence
Subjects: Methodology (stat.ME)
Scholars from diverse fields now increasingly rely on high-frequency spatio-temporal data. Yet, causal inference with these data remains challenging due to the twin threats of spatial spillover and temporal carryover effects. We develop methods to estimate heterogeneous treatment effects by allowing for arbitrary spatial and temporal causal dependencies. We focus on common settings where the treatment and outcomes are time-varying spatial point patterns and where moderators are either spatial or spatio-temporal in nature. We define causal estimands based on stochastic interventions where researchers specify counterfactual distributions of treatment events. We propose the Hajek-type estimator of the conditional average treatment effect (CATE) as a function of spatio-temporal moderator variables, and establish its asymptotic normality as the number of time periods increases. We then introduce a statistical test of no heterogeneous treatment effects. Through simulations, we evaluate the finite-sample performance of the proposed CATE estimator and its inferential properties. Our motivating application examines the heterogeneous effects of US airstrikes on insurgent violence in Iraq. Drawing on declassified spatio-temporal data, we examine how prior aid distributions moderate airstrike effects. Contrary to expectations from counterinsurgency theories, we find that prior aid distribution, along with greater amounts of aid per capita, is associated with increased insurgent attacks following airstrikes.
New submissions (showing 26 of 26 entries)
- [27] arXiv:2412.14222 (cross-list from cs.HC) [pdf, other]
-
Title: A Survey on Large Language Model-based Agents for Statistics and Data Science
Subjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Software Engineering (cs.SE); Applications (stat.AP)
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.
- [28] arXiv:2412.14226 (cross-list from cs.LG) [pdf, html, other]
-
Title: FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning
Comments: 6 pages, 3 figures, to be submitted to ICML
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Federated learning (FL) is a machine learning methodology that involves the collaborative training of a global model across multiple decentralized clients in a privacy-preserving way. Several FL methods have been introduced to tackle communication inefficiencies, but they do not address how to sample participating clients in each round effectively and in a privacy-preserving manner. In this paper, we propose FedSTaS, a client and data-level sampling method inspired by FedSTS and FedSampling. In each federated learning round, FedSTaS stratifies clients based on their compressed gradients, re-allocates the number of clients to sample using an optimal Neyman allocation, and samples local data from each participating client using a data uniform sampling strategy. Experiments on three datasets show that FedSTaS can achieve higher accuracy scores than those of FedSTS within a fixed number of training rounds.
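Neyman allocation, which the abstract uses to decide how many clients to draw from each stratum, assigns a sample size to stratum h proportional to N_h * S_h, where N_h is the stratum size and S_h its within-stratum standard deviation. The sketch below is a generic textbook version with largest-remainder rounding; the function name, the toy strata, and the idea of using per-stratum gradient variability as S_h are assumptions for illustration and are not taken from the FedSTaS code.

```python
def neyman_allocation(stratum_sizes, stratum_stds, total_sample):
    """Allocate total_sample units across strata proportionally to N_h * S_h.

    Fractional allocations are rounded with the largest-remainder rule so
    that the integer allocations sum exactly to total_sample. This sketch
    does not cap an allocation at the stratum size.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all strata have zero size or zero variability")
    raw = [total_sample * w / total_weight for w in weights]
    alloc = [int(r) for r in raw]  # floor, since every raw value is >= 0
    leftover = total_sample - sum(alloc)
    # Hand the remaining units to the strata with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in order[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical strata: 50, 30 and 20 clients with increasing variability.
allocation = neyman_allocation([50, 30, 20], [1.0, 2.0, 4.0], total_sample=10)
```

Compared with proportional allocation, small but highly variable strata receive more of the sampling budget, which is what reduces the variance of the stratified estimate.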
- [29] arXiv:2412.14291 (cross-list from math.OC) [pdf, html, other]
-
Title: Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the "vanilla" PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.
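To make the idea of a projected gradient step with a data-driven Lipschitz estimate concrete, here is a minimal sketch on a box constraint. It uses a generic heuristic, the running maximum of the secant ratio ||g_k - g_{k-1}|| / ||x_k - x_{k-1}||, as the Lipschitz estimate; this is an assumption made for illustration and is not the AC-PG stepsize rule from the paper. The toy objective is a convex quadratic chosen only so that the answer is easy to check; all names are hypothetical.

```python
import math


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d."""
    return [min(max(xi, lo), hi) for xi in x]


def projected_gradient(grad, x0, lo, hi, lip0=1.0, iters=100, tol=1e-10):
    """Projected gradient with stepsize 1 / lip, where lip is estimated
    from previous iterates instead of being supplied by the user."""
    x = project_box(x0, lo, hi)
    g = grad(x)
    lip = lip0
    for _ in range(iters):
        x_new = project_box([xi - gi / lip for xi, gi in zip(x, g)], lo, hi)
        step = math.dist(x_new, x)
        if step < tol:
            break  # the projected step no longer moves the iterate
        g_new = grad(x_new)
        # Secant estimate of the gradient's Lipschitz constant; keeping the
        # running maximum means the stepsize never increases.
        lip = max(lip, math.dist(g_new, g) / step)
        x, g = x_new, g_new
    return x


# Toy problem: minimize 0.5 * ||x - c||^2 over [0, 1]^2 with c = (2, -1).
# The initial guess lip0 = 0.1 deliberately underestimates the true value 1.
target = [2.0, -1.0]
solution = projected_gradient(
    lambda x: [xi - ci for xi, ci in zip(x, target)],
    x0=[0.5, 0.5], lo=0.0, hi=1.0, lip0=0.1,
)
```

On this problem the constrained minimizer is the projection of `target` onto the box, so the loop stops at the corner (1, 0) once the Lipschitz estimate has been corrected.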
- [30] arXiv:2412.14297 (cross-list from cs.LG) [pdf, html, other]
-
Title: Distributionally Robust Policy Learning under Concept Drifts
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem -- robust policy learning under concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.
- [31] arXiv:2412.14318 (cross-list from math.DS) [pdf, other]
-
Title: Long-time accuracy of ensemble Kalman filters for chaotic and machine-learned dynamical systems
Comments: 40 pages, 4 figures
Subjects: Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Filtering is concerned with online estimation of the state of a dynamical system from partial and noisy observations. In applications where the state is high dimensional, ensemble Kalman filters are often the method of choice. This paper establishes long-time accuracy of ensemble Kalman filters. We introduce conditions on the dynamics and the observations under which the estimation error remains small in the long-time horizon. Our theory covers a wide class of partially-observed chaotic dynamical systems, which includes the Navier-Stokes equations and Lorenz models. In addition, we prove long-time accuracy of ensemble Kalman filters with surrogate dynamics, thus validating the use of machine-learned forecast models in ensemble data assimilation.
- [32] arXiv:2412.14421 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Comparing noisy neural population dynamics using optimal transport distances
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Biological and artificial neural systems form high-dimensional neural representations that underpin their computational capabilities. Methods for quantifying geometric similarity in neural representations have become a popular tool for identifying computational principles that are potentially shared across neural systems. These methods generally assume that neural responses are deterministic and static. However, responses of biological systems, and some artificial systems, are noisy and dynamically unfold over time. Furthermore, these characteristics can have substantial influence on a system's computational capabilities. Here, we demonstrate that existing metrics can fail to capture key differences between neural systems with noisy dynamic responses. We then propose a metric for comparing the geometry of noisy neural trajectories, which can be derived as an optimal transport distance between Gaussian processes. We use the metric to compare models of neural responses in different regions of the motor system and to compare the dynamics of latent diffusion models for text-to-image synthesis.
- [33] arXiv:2412.14474 (cross-list from cs.LG) [pdf, other]
-
Title: Benign Overfitting in Out-of-Distribution Generalization of Linear Models
Comments: 58 pages, 1 figure
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Benign overfitting refers to the phenomenon where an over-parameterized model fits the training data perfectly, including noise in the data, but still generalizes well to the unseen test data. While prior work provides some theoretical understanding of this phenomenon under the in-distribution setup, modern machine learning often operates in a more challenging Out-of-Distribution (OOD) regime, where the target (test) distribution can be rather different from the source (training) distribution. In this work, we take an initial step towards understanding benign overfitting in the OOD regime by focusing on the basic setup of over-parameterized linear models under covariate shift. We provide non-asymptotic guarantees proving that benign overfitting occurs in standard ridge regression, even under the OOD regime when the target covariance satisfies certain structural conditions. We identify several vital quantities relating to source and target covariance, which govern the performance of OOD generalization. Our result is sharp: it provably recovers the prior in-distribution benign overfitting guarantee [Tsigler and Bartlett, 2023], as well as the under-parameterized OOD guarantee [Ge et al., 2024], when specialized to each setup. Moreover, we also present theoretical results for a more general family of target covariance matrices, where standard ridge regression only achieves a slow statistical rate of $O(1/\sqrt{n})$ for the excess risk, while Principal Component Regression (PCR) is guaranteed to achieve the fast rate $O(1/n)$, where $n$ is the number of samples.
- [34] arXiv:2412.14477 (cross-list from cs.LG) [pdf, html, other]
-
Title: Graph-Structured Topic Modeling for Documents with Spatial or Covariate Dependencies
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and edges denoting similarities, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of our proposed method by deriving high-probability bounds and develop a specialized cross-validation method to optimize our regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.
- [35] arXiv:2412.14497 (cross-list from cs.LG) [pdf, html, other]
-
Title: Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder
Comments: 21 pages, 6 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Estimating individual treatment effect (ITE) from observational data has gained increasing attention across various domains, with a key challenge being the identification of latent confounders affecting both treatment and outcome. Networked observational data offer new opportunities to address this issue by utilizing network information to infer latent confounders. However, most existing approaches assume observed variables and network information serve only as proxy variables for latent confounders, which often fails in practice, as some variables influence treatment but not outcomes, and vice versa. Recent advances in disentangled representation learning, which disentangle latent factors into instrumental, confounding, and adjustment factors, have shown promise for ITE estimation. Building on this, we propose a novel disentangled variational graph autoencoder that learns disentangled factors for treatment effect estimation on networked observational data. Our graph encoder further ensures factor independence using the Hilbert-Schmidt Independence Criterion. Extensive experiments on two semi-synthetic datasets derived from real-world social networks and one synthetic dataset demonstrate that our method achieves state-of-the-art performance.
- [36] arXiv:2412.14650 (cross-list from math.PR) [pdf, html, other]
-
Title: Permutation recovery of spikes in noisy high-dimensional tensor estimation
Comments: 29 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2408.06401
Subjects: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
We study the dynamics of gradient flow in high dimensions for the multi-spiked tensor problem, where the goal is to estimate $r$ unknown signal vectors (spikes) from noisy Gaussian tensor observations. Specifically, we analyze the maximum likelihood estimation procedure, which involves optimizing a highly nonconvex random function. We determine the sample complexity required for gradient flow to efficiently recover all spikes, without imposing any assumptions on the separation of the signal-to-noise ratios (SNRs). More precisely, our results provide the sample complexity required to guarantee recovery of the spikes up to a permutation. Our work builds on our companion paper [Ben Arous, Gerbelot, Piccolo 2024], which studies Langevin dynamics and determines the sample complexity and separation conditions for the SNRs necessary for ensuring exact recovery of the spikes (where the recovered permutation matches the identity). During the recovery process, the correlations between the estimators and the hidden vectors increase in a sequential manner. The order in which these correlations become significant depends on their initial values and the corresponding SNRs, which ultimately determines the permutation of the recovered spikes.
- [37] arXiv:2412.14660 (cross-list from cs.CV) [pdf, html, other]
-
Title: Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
Comments: Accepted to COLING 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: this https URL.
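Temperature scaling, one of the calibration techniques the abstract mentions, divides a model's logits by a single scalar T before the softmax and chooses T to minimize the negative log-likelihood on held-out data. The sketch below is a generic version with a grid search; the toy logits, the labels, and the grid are made up for illustration, and nothing here is taken from the paper's code.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)


def fit_temperature(logit_rows, labels, grid):
    """Pick the temperature on the grid with the lowest validation NLL."""
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical overconfident model: each prediction carries about 98%
# confidence at T = 1, yet only 3 of the 4 predictions are correct.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]
grid = [0.5 + 0.25 * k for k in range(39)]  # 0.5, 0.75, ..., 10.0
best_t = fit_temperature(val_logits, val_labels, grid)
calibrated_conf = softmax(val_logits[0], best_t)[0]
```

A fitted temperature above 1 softens the probabilities without changing which class is predicted, so accuracy is unaffected while confidence moves toward the observed accuracy.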
- [38] arXiv:2412.14740 (cross-list from math.PR) [pdf, other]
-
Title: Recovering semipermeable barriers from reflected Brownian motion
Comments: 62 pages, 11 figures
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We study the recovery of one-dimensional semipermeable barriers for a stochastic process in a planar domain. The considered process acts like Brownian motion when away from the barriers and is reflected upon contact until a sufficient but random amount of interaction has occurred, determined by the permeability, after which it passes through. Given a sequence of samples, we ask when one can determine the location and shape of the barriers.
This paper identifies several different recovery regimes, determined by the available observation period and the time between samples, with qualitatively different behavior. The observation period $T$ dictates if the full barriers or only certain pieces can be recovered, and the sampling rate significantly influences the convergence rate as $T\to \infty$. This rate turns out to be polynomial for fixed-frequency data, but exponential in a high-frequency regime.
Further, the environment's impact on the difficulty of the problem is quantified using interpretable parameters in the recovery guarantees, and is found to also be regime-dependent. For instance, the curvature of the barriers affects the convergence rate for fixed-frequency data, but becomes irrelevant when $T\to \infty$ with high-frequency data.
The results are accompanied by explicit algorithms, and we conclude by illustrating the application to real-life data.
- [39] arXiv:2412.14753 (cross-list from quant-ph) [pdf, html, other]
-
Title: Opportunities and limitations of explaining quantum machine learning
Comments: 16+16 pages, 3+4 figures
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
A common trait of many machine learning models is that it is often difficult to understand and explain what caused the model to produce the given output. While the explainability of neural networks has been an active field of research in recent years, comparatively little is known for quantum machine learning models. Despite a few recent works analyzing some specific aspects of explainability, as of now there is no clear big picture perspective as to what can be expected from quantum learning models in terms of explainability. In this work, we address this issue by identifying promising research avenues in this direction and outlining the expected future results. We additionally propose two explanation methods designed specifically for quantum machine learning models, which are, to the best of our knowledge, the first of their kind. Alongside our preview of the field, we compare both existing and novel methods to explain the predictions of quantum learning models. By studying explainability in quantum machine learning, we can contribute to the sustainable development of the field, preventing trust issues in the future.
- [40] arXiv:2412.15063 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
-
Title: Graph-neural-network predictions of solid-state NMR parameters from spherical tensor decomposition
Comments: 13 pages, 7 figures
Subjects: Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Nuclear magnetic resonance (NMR) is a powerful spectroscopic technique that is sensitive to the local atomic structure of matter. Computational predictions of NMR parameters can help to interpret experimental data and validate structural models, and machine learning (ML) has emerged as an efficient route to making such predictions. Here, we systematically study graph-neural-network approaches to representing and learning tensor quantities for solid-state NMR -- specifically, the anisotropic magnetic shielding and the electric field gradient. We assess how the numerical accuracy of different ML models translates into prediction quality for experimentally relevant NMR properties: chemical shifts, quadrupolar coupling constants, tensor orientations, and even static 1D spectra. We apply these ML models to a structurally diverse dataset of amorphous SiO$_2$ configurations, spanning a wide range of density and local order, to larger configurations beyond the reach of traditional first-principles methods, and to the dynamics of the $\alpha$–$\beta$ inversion in cristobalite. Our work marks a step toward streamlining ML-driven NMR predictions for both static and dynamic behavior of complex materials, and toward bridging the gap between first-principles modeling and real-world experimental data.
Cross submissions (showing 14 of 14 entries)
- [41] arXiv:1812.05741 (replaced) [pdf, html, other]
-
Title: Posterior Projection for Inference in Constrained Spaces
Comments: Submitted to the Journal of Machine Learning Research
Subjects: Methodology (stat.ME)
Estimation of parameters that obey specific constraints is crucial in statistics and machine learning; for example, when parameters are required to satisfy boundedness, monotonicity, or linear inequalities. Traditional approaches impose these constraints via constraint-specific transformations or by truncating the posterior distribution. Such methods often result in computational challenges, limited flexibility, and a lack of generality. We propose a generalized framework for constrained Bayesian inference by projecting the unconstrained posterior distribution into the space of the parameter constraints, providing a computationally efficient and easily implementable solution for a large class of problems. We rigorously establish the theoretical foundations of the projected posterior distribution, as well as providing asymptotic results for posterior consistency, posterior contraction, and optimal coverage properties. Our methodology is validated through both theoretical arguments and practical applications, including bounded-monotonic regression and emulation of a computer model with directional outputs.
- [42] arXiv:2202.06374 (replaced) [pdf, html, other]
-
Title: Holdouts set for safe predictive model updating
Comments: Manuscript includes supplementary materials and figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Predictive risk scores for adverse outcomes are increasingly crucial in guiding health interventions. Such scores may need to be periodically updated due to change in the distributions they model. However, directly updating risk scores used to guide intervention can lead to biased risk estimates. To address this, we propose updating using a 'holdout set', a subset of the population that does not receive interventions guided by the risk score. Balancing the holdout set size is essential to ensure good performance of the updated risk score whilst minimising the number of held out samples. We prove that this approach reduces adverse outcome frequency to an asymptotically optimal level and argue that often there is no competitive alternative. We describe conditions under which an optimal holdout size (OHS) can be readily identified, and introduce parametric and semi-parametric algorithms for OHS estimation. We apply our methods to the ASPRE risk score for pre-eclampsia to recommend a plan for updating it in the presence of change in the underlying data distribution. We show that the number of pre-eclampsia cases over time is best minimised using a holdout set of around 10,000 individuals.
- [43] arXiv:2302.09526 (replaced) [pdf, html, other]
-
Title: Mixed Semi-Supervised Generalized-Linear-Regression with Applications to Deep-Learning and Interpolators
Comments: 58 pages, 10 figures
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
We present a methodology for using unlabeled data to design semi-supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on the Generalized Linear Model (GLM) and linear interpolator classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is beneficial to integrate the unlabeled data with some nonzero mixing ratio $\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $\alpha^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand.
The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks.
- [44] arXiv:2305.03552 (replaced) [pdf, html, other]
-
Title: Designing Proposal Distributions for Particle Filters using Integrated Nested Laplace Approximation
Subjects: Computation (stat.CO)
State-space models are used to describe and analyse dynamical systems. They are used in many scientific fields, such as signal processing, finance and ecology. Particle filters are popular inferential methods used for state-space models. Integrated Nested Laplace Approximation (INLA), an approximate Bayesian inference method, can also be used for this kind of model when the transition distribution is Gaussian. We present a way to use this framework in order to approximate the particle filter's proposal distribution that incorporates information about the observations, parameters and the previous latent variables. Further, we demonstrate the performance of this proposal on data simulated from a Poisson state-space model used for count data. We also show how INLA can be used to estimate the parameters of certain state-space models (a task that is often challenging) that would be used for Sequential Monte Carlo algorithms.
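For context on what a proposal distribution does in a particle filter, the sketch below runs the simplest baseline, a bootstrap filter, on a Poisson state-space model for counts. It assumes a Gaussian AR(1) latent state x_t = phi * x_{t-1} + sigma * eps_t and observations y_t ~ Poisson(exp(x_t)); the parameter values and the toy counts are hypothetical. The bootstrap filter proposes from the transition density and ignores the current observation, which is exactly the weakness an observation-aware proposal such as the INLA-based one in this abstract aims to address. It is not the authors' method.

```python
import math
import random


def poisson_loglik(y, log_rate):
    """Log-probability of count y under Poisson(exp(log_rate))."""
    return y * log_rate - math.exp(log_rate) - math.lgamma(y + 1)


def bootstrap_filter(ys, n_particles=500, phi=0.9, sigma=0.3, seed=0):
    """Return the filtered means of the latent log-intensity x_t."""
    rng = random.Random(seed)
    # Start from the stationary distribution of the AR(1) state.
    sd0 = sigma / math.sqrt(1.0 - phi ** 2)
    particles = [rng.gauss(0.0, sd0) for _ in range(n_particles)]
    means = []
    for y in ys:
        # Propose from the transition density (the "bootstrap" proposal).
        particles = [phi * x + rng.gauss(0.0, sigma) for x in particles]
        # Weight each particle by the likelihood of the observed count.
        logw = [poisson_loglik(y, x) for x in particles]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        w = [wi / total for wi in w]
        means.append(sum(wi * x for wi, x in zip(w, particles)))
        # Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return means


# Hypothetical count series.
counts = [0, 1, 3, 5, 4, 2]
filtered = bootstrap_filter(counts)
```

When observations are informative, many bootstrap particles land in low-likelihood regions and receive negligible weight; a proposal that conditions on the observation spends fewer particles there.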
- [45] arXiv:2305.09565 (replaced) [pdf, html, other]
-
Title: Toward Falsifying Causal Graphs Using a Permutation-Based Test
Comments: Camera-ready version for AAAI 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Understanding causal relationships among the variables of a system is paramount to explain and control its behavior. For many real-world systems, however, the true causal graph is not readily available and one must resort to predictions made by algorithms or domain experts. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an $\textit{absolute}$ number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a baseline through node permutations. By comparing the number of inconsistencies with those on the baseline, we derive an interpretable metric that captures whether the graph is significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.
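As a rough illustration of the permutation baseline idea in this abstract, the Python sketch below relabels the nodes of a candidate graph and reports the fraction of relabelled graphs that have no more inconsistencies than the original. The function name `permutation_baseline_fraction`, the `count_inconsistencies` callable, and the toy data are assumptions made for illustration; this is not the authors' implementation, and the inconsistency measure used in the paper is not reproduced here.

```python
import itertools

def permutation_baseline_fraction(edges, nodes, count_inconsistencies):
    """Fraction of node relabellings whose graph is at least as consistent.

    `count_inconsistencies` is a hypothetical callable mapping a set of
    directed edges to a non-negative inconsistency count.
    """
    observed = count_inconsistencies(set(edges))
    perms = list(itertools.permutations(nodes))
    hits = 0
    for perm in perms:
        relabel = dict(zip(nodes, perm))
        permuted = {(relabel[u], relabel[v]) for u, v in edges}
        if count_inconsistencies(permuted) <= observed:
            hits += 1
    return hits / len(perms)

# Toy stand-in for a data-driven check: an edge counts as inconsistent
# when it is missing from a fixed set of "supported" edges.
supported = {("a", "b"), ("b", "c")}
toy_count = lambda es: sum(1 for e in es if e not in supported)

fraction = permutation_baseline_fraction(
    [("a", "b"), ("b", "c")], ["a", "b", "c"], toy_count
)
```

A small fraction suggests the candidate graph fits better than most relabelled graphs. Enumerating every permutation is only feasible for tiny graphs; for larger ones a random sample of permutations would be the natural substitute.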
- [46] arXiv:2306.06845 (replaced) [pdf, html, other]
-
Title: Information-Theoretic Limits and Strong Consistency on Binary Non-uniform Hypergraph Stochastic Block Models
Comments: The paper is written. New results concern the information-theoretic limits are added. Provide a new refinement algorithm to achieve strong consistency. Thus the title of the paper is changed
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME); Machine Learning (stat.ML)
Consider the unsupervised classification problem in random hypergraphs under the non-uniform Hypergraph Stochastic Block Model (HSBM) with two equal-sized communities, where each edge appears independently with some probability depending only on the labels of its vertices. In this paper, the information-theoretic limits on the clustering accuracy and the strong consistency threshold are established, expressed in terms of the generalized Hellinger distance. Below the threshold, it is impossible to assign all vertices to their own communities, and the lower bound of the expected mismatch ratio is derived. On the other hand, the problem space is (sometimes) divided into two disjoint subspaces when above the threshold. When only the contracted adjacency matrix is given, with high probability, one-stage spectral algorithms succeed in assigning every vertex correctly in the subspace far away from the threshold but fail in the other one. Two subsequent refinement algorithms are proposed to improve the clustering accuracy, which attain the lowest possible mismatch ratio, previously derived from the information-theoretical perspective. The failure of spectral algorithms in the second subspace arises from the loss of information induced by tensor contraction. The origin of this loss and possible solutions to minimize the impact are presented. Moreover, different from uniform hypergraphs, strong consistency is achievable by aggregating information from all uniform layers, even if it is impossible when each layer is considered alone.
- [47] arXiv:2311.07419 (replaced) [pdf, html, other]
-
Title: Diaconis-Ylvisaker prior penalized likelihood for $p/n \to \kappa \in (0,1)$ logistic regression
Comments: 25 pages, 8 figures, pdf attached
Subjects: Statistics Theory (math.ST)
We characterise the behaviour of the maximum Diaconis-Ylvisaker prior penalized likelihood estimator in high-dimensional logistic regression, where the number of covariates is a fraction $\kappa \in (0,1)$ of the number of observations $n$, as $n \to \infty$. We derive the estimator's aggregate asymptotic behaviour under this proportional asymptotic regime, when covariates are independent normal random variables with mean zero and the linear predictor has asymptotic variance $\gamma^2$. From this foundation, we devise adjusted $Z$-statistics, penalized likelihood ratio statistics, and aggregate asymptotic results with arbitrary covariate covariance. While the maximum likelihood estimate asymptotically exists only for a narrow range of $(\kappa, \gamma)$ values, the maximum Diaconis-Ylvisaker prior penalized likelihood estimate not only exists always but is also directly computable using maximum likelihood routines. Thus, our asymptotic results also hold for $(\kappa, \gamma)$ values where results for maximum likelihood are not attainable, with no overhead in implementation or computation. We study the estimator's shrinkage properties, compare it to alternative estimation methods that can operate with proportional asymptotics, and present procedures for the estimation of unknown constants that describe the asymptotic behaviour of our estimator. We also provide a conjecture about the behaviour of our estimator when an intercept parameter is present in the model. We present results from extensive numerical studies to demonstrate the theoretical advances and strong evidence to support the conjecture, and illustrate the methodology we put forward through the analysis of a real-world data set on digit recognition.
- [48] arXiv:2311.17605 (replaced) [pdf, other]
-
Title: Improving the Balance of Unobserved Covariates From Information Theory in Multi-Arm Randomization with Unequal Allocation Ratio
Comments: The article's structure and theoretical framework have undergone substantial revisions to improve clarity and rigor. Additionally, the numerical experiments have been entirely re-implemented to ensure consistency with the updated theoretical developments. We plan to resubmit the revised version after completing these improvements
Subjects: Applications (stat.AP); Methodology (stat.ME)
Multi-arm randomization is increasingly widely applied, and it is crucial to ensure that the distributions of important observed covariates, as well as potential unobserved covariates, are similar and comparable among all treatment arms. However, the theoretical properties of unobserved covariate imbalance in multi-arm randomization with unequal allocation ratios remain unknown. In this paper, we give a general framework for analysing the moments and distributions of unobserved covariate imbalance and apply it to different procedures, including complete randomization (CR), stratified permuted block (STR-PB) and covariate-adaptive randomization (CAR). General procedures for multi-arm STR-PB and CAR with unequal allocation ratios are also proposed. In addition, we introduce the concept of entropy to measure the correlation between discrete covariates and verify that this correlation can be used to select observed covariates that help better balance the unobserved covariates.
- [49] arXiv:2401.01500 (replaced) [pdf, html, other]
-
Title: Log-concave Density Estimation with Independent Components
Comments: 44 pages, 10 figures. Various improvements over the previous version (v1), and substantial reorganization of Section 3. Some missing assumptions required by Theorem 3.10 of the previous version (v1) have now been made explicit (Lemma 3.13 of the current version)
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We propose a method for estimating a log-concave density on $\mathbb R^d$ from samples, under the assumption that there exists an orthogonal transformation that makes the components of the random vector independent. While log-concave density estimation is hard both computationally and statistically, the independent components assumption alleviates both issues, while still maintaining a large non-parametric class. We prove that under mild conditions, at most $\tilde{\mathcal{O}}(\epsilon^{-4})$ samples (suppressing constants and log factors) suffice for our proposed estimator to be within $\epsilon$ of the original density in squared Hellinger distance. On the computational front, while the usual log-concave maximum likelihood estimate can be obtained via a finite-dimensional convex program, it is slow to compute -- especially in higher dimensions. We demonstrate through numerical experiments that our estimator can be computed efficiently, making it more practical to use.
- [50] arXiv:2403.07728 (replaced) [pdf, html, other]
-
Title: CAP: A General Algorithm for Online Selective Conformal Prediction with FCR Control
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR), which measures the overall miscoverage level. We develop a general framework named CAP (Calibration after Adaptive Pick) that performs an adaptive pick rule on historical data to construct a calibration set if the current individual is selected and then outputs a conformal prediction interval for the unobserved label. We provide tractable procedures for constructing the calibration set for popular online selection rules. We prove that CAP can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. To account for the distribution shift in online data, we also embed CAP into some recent dynamic conformal prediction algorithms and show that the proposed method can deliver long-run FCR control. Numerical results on both synthetic and real data corroborate that CAP can effectively control FCR around the target level and yield narrower prediction intervals than existing baselines across various settings.
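For readers unfamiliar with the building block this abstract relies on, here is a minimal Python sketch of a standard split conformal prediction interval computed from a calibration set. It is background only and does not implement CAP's adaptive pick rule or FCR control; the function name `split_conformal_interval` and the toy residuals are assumptions for illustration.

```python
import numpy as np

def split_conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Return (lower, upper) around y_pred from calibration residuals."""
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); if it exceeds n,
    # the interval is unbounded.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else scores[k - 1]
    return y_pred - q, y_pred + q

# Toy usage: four calibration residuals, point prediction 10.0.
lower, upper = split_conformal_interval([-1.0, 2.0, -3.0, 4.0], 10.0, alpha=0.2)
```

In a selective setting such as the one described above, the main design question is which historical points enter `cal_residuals`; the sketch simply takes that set as given.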
- [51] arXiv:2404.17398 (replaced) [pdf, other]
-
Title: Online Policy Learning and Inference by Matrix Completion
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Is it possible to make online decisions when personalized covariates are unavailable? We take a collaborative-filtering approach for decision-making based on collective preferences. By assuming low-dimensional latent features, we formulate the covariate-free decision-making problem as a matrix completion bandit. We propose a policy learning procedure that combines an $\varepsilon$-greedy policy for decision-making with an online gradient descent algorithm for bandit parameter estimation. Our novel two-phase design balances policy learning accuracy and regret performance. For policy inference, we develop an online debiasing method based on inverse propensity weighting and establish its asymptotic normality. Our methods are applied to data from the San Francisco parking pricing project, revealing intriguing discoveries and outperforming the benchmark policy.
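The $\varepsilon$-greedy rule mentioned in this abstract can be sketched in a few lines of Python. This is a generic version, assuming a vector of current reward estimates per action; the name `epsilon_greedy` and the toy estimates are illustrative, and the matrix completion and online gradient descent parts of the paper are not shown.

```python
import numpy as np

def epsilon_greedy(estimated_rewards, epsilon, rng):
    """Pick a random action with probability epsilon, else the best estimate."""
    if rng.random() < epsilon:
        # Explore: uniform draw over all actions.
        return int(rng.integers(len(estimated_rewards)))
    # Exploit: action with the highest current estimate.
    return int(np.argmax(estimated_rewards))

rng = np.random.default_rng(0)
estimates = np.array([0.2, 0.9, 0.5])
greedy_choice = epsilon_greedy(estimates, epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy(estimates, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the rule always exploits, and with `epsilon=1.0` it always explores; intermediate values trade off the two.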
- [52] arXiv:2405.04269 (replaced) [pdf, other]
-
Title: An Analysis of Sea Level Spatial Variability by Topological Indicators and $k$-means Clustering Algorithm
Comments: the paper contains error
Subjects: Applications (stat.AP)
The time-series data of sea level rise and fall contain crucial information on the variability of sea level patterns. Traditional $k$-means clustering is commonly used for categorizing regional variability of sea level; however, its results are not robust against a number of factors. This study analyzed fourteen datasets of monthly sea level in fourteen shoreline regions of Peninsular Malaysia. We combined a clustering technique for data categorization with a topological data analysis method to enhance the performance of our clustering analysis. Specifically, our approach utilized persistent homology and $k$-means/$k$-means++ clustering. The fourteen data sets from fourteen tide gauge stations were categorized into classes based on a prior categorization determined by topological information and on the probability, yielded by $k$-means/$k$-means++ clustering, that data points belong to certain groups. Our results demonstrated that our method significantly improves the performance of traditional clustering techniques.
- [53] arXiv:2405.14492 (replaced) [pdf, html, other]
-
Title: Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages.
- [54] arXiv:2407.13267 (replaced) [pdf, html, other]
-
Title: A Partially Pooled NSUM Model: Detailed estimation of CSEM trafficking prevalence in Philippine municipalities
Albert Nyarko-Agyei, Scott Moser, Rowland G Seymour, Ben Brewster, Sabrina Li, Esther Weir, Todd Landman, Emily Wyman, Christine Belle Torres, Imogen Fell, Doreen Boyd
Subjects: Applications (stat.AP)
Effective policy and intervention strategies to combat human trafficking for child sexual exploitation material (CSEM) production require accurate prevalence estimates. Traditional Network Scale Up Method (NSUM) models often necessitate standalone surveys for each geographic region, escalating costs and complexity. This study introduces a partially pooled NSUM model, using a hierarchical Bayesian framework that efficiently aggregates and utilizes data across multiple regions without increasing sample sizes. We developed this model for a novel national survey dataset from the Philippines and we demonstrate its ability to produce detailed municipal-level prevalence estimates of trafficking for CSEM production. Our results not only underscore the model's precision in estimating hidden populations but also highlight its potential for broader application in other areas of social science and public health research, offering significant implications for resource allocation and intervention planning.
- [55] arXiv:2407.15461 (replaced) [pdf, html, other]
-
Title: Forecasting mortality rates with functional signatures
Comments: 40 pages, 26 figures, 9 tables
Subjects: Methodology (stat.ME)
This study introduces an innovative methodology for mortality forecasting, which integrates signature-based methods within the functional data framework of the Hyndman-Ullah (HU) model. This new approach, termed the Hyndman-Ullah with truncated signatures (HUts) model, aims to enhance the accuracy and robustness of mortality predictions. By utilizing signature regression, the HUts model is able to capture complex, nonlinear dependencies in mortality data, which enhances forecasting accuracy across various demographic conditions. The model is applied to mortality data from 12 countries, comparing its forecasting performance against variants of the HU models across multiple forecast horizons. Our findings indicate that overall the HUts model not only provides more precise point forecasts but also shows robustness against data irregularities, such as those observed in countries with historical outliers. The integration of signature-based methods enables the HUts model to capture complex patterns in mortality data, making it a powerful tool for actuaries and demographers. Prediction intervals are also constructed with bootstrapping methods.
- [56] arXiv:2408.06401 (replaced) [pdf, other]
-
Title: Langevin dynamics for high-dimensional optimization: the case of multi-spiked tensor PCA
Comments: 65 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
We study nonconvex optimization in high dimensions through Langevin dynamics, focusing on the multi-spiked tensor PCA problem. This tensor estimation problem involves recovering $r$ hidden signal vectors (spikes) from noisy Gaussian tensor observations using maximum likelihood estimation. We study the number of samples required for Langevin dynamics to efficiently recover the spikes and determine the necessary separation condition on the signal-to-noise ratios (SNRs) for exact recovery, distinguishing the cases $p \ge 3$ and $p=2$, where $p$ denotes the order of the tensor. In particular, we show that the sample complexity required for recovering the spike associated with the largest SNR matches the well-known algorithmic threshold for the single-spike case, while this threshold degrades when recovering all $r$ spikes. As a key step, we provide a detailed characterization of the trajectory and interactions of low-dimensional projections that capture the high-dimensional dynamics.
- [57] arXiv:2409.01519 (replaced) [pdf, other]
-
Title: Hybridization of Persistent Homology with Neural Networks for Time-Series Prediction: A Case Study in Wave Height
Comments: the paper contain errors
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Time-series prediction is an active area of research across various fields, often challenged by the fluctuating influence of short-term and long-term factors. In this study, we introduce a feature engineering method that enhances the predictive performance of neural network models. Specifically, we leverage computational topology techniques to derive valuable topological features from input data, boosting the predictive accuracy of our models. Our focus is on predicting wave heights, utilizing models based on topological features within feedforward neural networks (FNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTM), and RNNs with gated recurrent units (GRU). For time-ahead predictions, the enhancements in $R^2$ score were significant for FNNs, RNNs, LSTM, and GRU models. Additionally, these models also showed significant reductions in maximum errors and mean squared errors.
- [58] arXiv:2410.03581 (replaced) [pdf, html, other]
-
Title: Nonstationary Sparse Spectral Permanental Process
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Existing permanental processes often impose constraints on kernel types or stationarity, limiting the model's expressiveness. To overcome these limitations, we propose a novel approach utilizing the sparse spectral representation of nonstationary kernels. This technique relaxes the constraints on kernel types and stationarity, allowing for more flexible modeling while reducing computational complexity to the linear level. Additionally, we introduce a deep kernel variant by hierarchically stacking multiple spectral feature mappings, further enhancing the model's expressiveness to capture complex patterns in data. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach, particularly in scenarios with pronounced data nonstationarity. Additionally, ablation studies are conducted to provide insights into the impact of various hyperparameters on model performance.
- [59] arXiv:2411.02231 (replaced) [pdf, html, other]
-
Title: Sharp Bounds for Continuous-Valued Treatment Effects with Unobserved Confounders
Subjects: Methodology (stat.ME)
In causal inference, treatment effects are typically estimated under the ignorability, or unconfoundedness, assumption, which is often unrealistic in observational data. By relaxing this assumption and conducting a sensitivity analysis, we introduce novel bounds and derive confidence intervals for the Average Potential Outcome (APO) - a standard metric for evaluating continuous-valued treatment or exposure effects. We demonstrate that these bounds are sharp under a continuous sensitivity model, in the sense that they give the smallest possible interval under this model, and propose a doubly robust version of our estimators. In a comparative analysis with the method of Jesson et al. (2022) (arXiv:2204.10022), using both simulated and real datasets, we show that our approach not only yields sharper bounds but also achieves good coverage of the true APO, with significantly reduced computation times.
- [60] arXiv:2412.12014 (replaced) [pdf, other]
-
Title: Generalization Analysis for Deep Contrastive Representation Learning
Comments: Accepted at AAAI 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we present generalization bounds for the unsupervised risk in the Deep Contrastive Representation Learning framework, which employs deep neural networks as representation functions. We approach this problem from two angles. On the one hand, we derive a parameter-counting bound that scales with the overall size of the neural networks. On the other hand, we provide a norm-based bound that scales with the norms of neural networks' weight matrices. Ignoring logarithmic factors, the bounds are independent of $k$, the size of the tuples provided for contrastive learning. To the best of our knowledge, this property is only shared by one other work, which employed a different proof strategy and suffers from very strong exponential dependence on the depth of the network which is due to a use of the peeling technique. Our results circumvent this by leveraging powerful results on covering numbers with respect to uniform norms over samples. In addition, we utilize loss augmentation techniques to further reduce the dependency on matrix norms and the implicit dependence on network depth. In fact, our techniques allow us to produce many bounds for the contrastive learning setting with similar architectural dependencies as in the study of the sample complexity of ordinary loss functions, thereby bridging the gap between the learning theories of contrastive learning and DNNs.
- [61] arXiv:2105.00879 (replaced) [pdf, other]
-
Title: Identification and Estimation of Average Causal Effects in Fixed Effects Logit Models
Comments: 93 pages (online appendix starting at p.46). Major rewriting compared to v3. In particular, addition of a literature review, study of general parameters (not only the AME) in the identification, estimation and inference
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
This paper studies identification and estimation of average causal effects, such as average marginal or treatment effects, in fixed effects logit models with short panels. Relating the identified set of these effects to an extremal moment problem, we first show how to obtain sharp bounds on such effects simply, without any optimization. We also consider even simpler outer bounds, which, contrary to the sharp bounds, do not require any first-step nonparametric estimators. We build confidence intervals based on these two approaches and show their asymptotic validity. Monte Carlo simulations suggest that both approaches work well in practice, the second being typically competitive in terms of interval length. Finally, we show that our method is also useful to measure treatment effect heterogeneity.
- [62] arXiv:2203.08224 (replaced) [pdf, html, other]
-
Title: Predicting Value at Risk for Cryptocurrencies With Generalized Random Forests
Subjects: Statistical Finance (q-fin.ST); Applications (stat.AP)
We study the prediction of Value at Risk (VaR) for cryptocurrencies. In contrast to classic assets, returns of cryptocurrencies are often highly volatile and characterized by large fluctuations around single events. Analyzing a comprehensive set of 105 major cryptocurrencies, we show that Generalized Random Forests (GRF) (Athey, Tibshirani & Wager, 2019) adapted to quantile prediction have superior performance over other established methods such as quantile regression, GARCH-type and CAViaR models. This advantage is especially pronounced in unstable times and for classes of highly-volatile cryptocurrencies. Furthermore, we identify important predictors during such times and show their influence on forecasting over time. Moreover, a comprehensive simulation study also indicates that the GRF methodology is at least on par with existing methods in VaR predictions for standard types of financial returns and clearly superior in the cryptocurrency setup.
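To make the target quantity concrete, the Python sketch below computes a simple historical (empirical quantile) Value at Risk from a sample of returns. It is a baseline illustration only and is not the GRF-based method studied in the paper; the function name `historical_var` and the toy returns are assumptions.

```python
import numpy as np

def historical_var(returns, alpha=0.05):
    """Empirical VaR at level alpha, reported as a positive loss."""
    # The alpha-quantile of returns is the loss threshold exceeded with
    # probability roughly alpha; negate it to express a loss.
    return -float(np.quantile(np.asarray(returns, dtype=float), alpha))

toy_returns = [-0.05, -0.02, 0.0, 0.01, 0.03]
var_25 = historical_var(toy_returns, alpha=0.25)
```

Conditional approaches such as those compared in the abstract replace this unconditional quantile with one that depends on predictors available at forecast time.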
- [63] arXiv:2302.09826 (replaced) [pdf, html, other]
-
Title: On the Expressivity of Persistent Homology in Graph Learning
Comments: Accepted at the 3rd Learning on Graphs Conference (LoG) 2024
Subjects: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Persistent homology, a technique from computational topology, has recently shown strong empirical performance in the context of graph classification. Being able to capture long range graph properties via higher-order topological features, such as cycles of arbitrary length, in combination with multi-scale topological descriptors, has improved predictive performance for data sets with prominent topological structures, such as molecules. At the same time, the theoretical properties of persistent homology have not been formally assessed in this context. This paper intends to bridge the gap between computational topology and graph machine learning by providing a brief introduction to persistent homology in the context of graphs, as well as a theoretical discussion and empirical analysis of its expressivity for graph learning tasks.
- [64] arXiv:2307.09423 (replaced) [pdf, html, other]
-
Title: Scaling Laws for Imitation Learning in Single-Agent Games
Comments: Accepted at TMLR 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, many works find it is often unable to fully recover the underlying expert behavior, even in constrained environments like single-agent games. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting for single-agent games. We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by 1.5x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.
- [65] arXiv:2311.04686 (replaced) [pdf, html, other]
-
Title: Robust and Communication-Efficient Federated Domain Adaptation via Random Features
Comments: 22 pages, 7 figures, 17 tables, accepted by IEEE Trans. KDE
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge.
Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability.
In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to the FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is independent of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network conditions) of FedRF-TCA.
- [66] arXiv:2404.04549 (replaced) [pdf, html, other]
-
Title: Stable Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)
We study the learning problem associated with spiking neural networks. Specifically, we focus on spiking neural networks composed of simple spiking neurons having only positive synaptic weights, equipped with an affine encoder and decoder. These neural networks are shown to depend continuously on their parameters, which facilitates classical covering number-based generalization statements and supports stable gradient-based training. We demonstrate that the positivity of the weights continues to enable a wide range of expressivity results, including rate-optimal approximation of smooth functions and dimension-independent approximation of Barron regular functions. In particular, we show in theory and simulations that affine spiking neural networks are capable of approximating shallow ReLU neural networks. Furthermore, we apply these neural networks to standard machine learning benchmarks, reaching competitive results. Finally, and remarkably, we observe that from a generalization perspective, contrary to feedforward neural networks or previous results for general spiking neural networks, the depth has little to no adverse effect on the generalization capabilities.
- [67] arXiv:2406.02507 (replaced) [pdf, other]
-
Title: Guiding a Diffusion Model with a Bad Version of Itself
Comments: NeurIPS 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
- [68] arXiv:2406.14742 (replaced) [pdf, html, other]
-
Title: Latent Variable Sequence Identification for Cognitive Models with Neural Network Estimators
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Extracting time-varying latent variables from computational cognitive models is a key step in model-based neural analysis, which aims to understand the neural correlates of cognitive processes. However, existing methods only allow researchers to infer latent variables that explain subjects' behavior in a relatively small class of cognitive models. For example, a broad class of relevant cognitive models with analytically intractable likelihood is currently out of reach of standard techniques based on Maximum a Posteriori parameter estimation. Here, we present an approach that extends neural Bayes estimation to learn a direct mapping between experimental data and the targeted latent variable space using recurrent neural networks and simulated datasets. We show that our approach achieves competitive performance in inferring latent variable sequences in both tractable and intractable models. Furthermore, the approach is generalizable across different computational models and is adaptable for both continuous and discrete latent spaces. We then demonstrate its applicability in real-world datasets. Our work underscores that combining recurrent neural networks and simulation-based inference to identify latent variable sequences can enable researchers to access a wider class of cognitive models for model-based neural analyses, and thus test a broader set of theories.
- [69] arXiv:2407.02419 (replaced) [pdf, html, other]
-
Title: Quantum Curriculum Learning
Comments: main 6 pages, supplementary materials 11 pages (update the supplementary materials with more explanation on data-based Q-CurL)
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Quantum machine learning (QML) requires significant quantum resources to address practical real-world problems. When the underlying quantum information exhibits hierarchical structures in the data, limitations persist in training complexity and generalization. Research should prioritize both the efficient design of quantum architectures and the development of learning strategies to optimize resource usage. We propose a framework called quantum curriculum learning (Q-CurL) for quantum data, where the curriculum introduces simpler tasks or data to the learning model before progressing to more challenging ones. Q-CurL exhibits robustness to noise and data limitations, which is particularly relevant for current and near-term noisy intermediate-scale quantum devices. We achieve this through a curriculum design based on quantum data density ratios and a dynamic learning schedule that prioritizes the most informative quantum data. Empirical evidence shows that Q-CurL significantly enhances training convergence and generalization for unitary learning and improves the robustness of quantum phase recognition tasks. Q-CurL is effective with broad physical learning applications in condensed matter physics and quantum chemistry.
- [70] arXiv:2407.17401 (replaced) [pdf, html, other]
-
Title: Estimation of bid-ask spreads in the presence of serial dependence
Subjects: Statistical Finance (q-fin.ST); Mathematical Finance (q-fin.MF); Trading and Market Microstructure (q-fin.TR); Applications (stat.AP); Methodology (stat.ME)
Starting from a basic model in which the dynamic of the transaction prices is a geometric Brownian motion disrupted by a microstructure white noise, corresponding to the random alternation of bids and asks, we propose moment-based estimators along with their statistical properties. We then make the model more realistic by considering serial dependence: we assume a geometric fractional Brownian motion for the price, then an Ornstein-Uhlenbeck process for the microstructure noise. In these two cases of serial dependence, we propose again consistent and asymptotically normal estimators. All our estimators are compared on simulated data with existing approaches, such as Roll, Corwin-Schultz, Abdi-Ranaldo, or Ardia-Guidotti-Kroencke estimators.
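For orientation, the sketch below implements the classical Roll estimator, which the abstract names as one of the existing approaches. It is not the authors' own estimator. It recovers the spread from the lag-1 autocovariance of price changes as 2·sqrt(−cov). The simulation is a simplified version of the basic model: a driftless Gaussian random walk for the log mid-price plus independent ±half-spread noise. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def roll_spread(prices):
    """Roll-type moment estimator: 2 * sqrt(-cov(dp_t, dp_{t-1}))."""
    dp = [b - a for a, b in zip(prices, prices[1:])]
    x, y = dp[1:], dp[:-1]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    # Assumption: report 0.0 when the sample autocovariance is non-negative,
    # since the square root is then undefined.
    return 2.0 * math.sqrt(-cov) if cov < 0 else 0.0


def simulate_log_prices(n, sigma, spread, seed=0):
    """Random-walk log mid-price observed at a randomly chosen bid or ask."""
    rng = random.Random(seed)
    mid, out = 0.0, []
    for _ in range(n):
        mid += rng.gauss(0.0, sigma)          # efficient-price innovation
        side = rng.choice((-1.0, 1.0))        # -1 = bid, +1 = ask
        out.append(mid + 0.5 * spread * side)
    return out


prices = simulate_log_prices(n=50_000, sigma=0.001, spread=0.01)
estimate = roll_spread(prices)
print(estimate)  # should land close to the simulated spread of 0.01
```

The estimator relies on the microstructure noise being white. Under the serial dependence considered in the abstract, the lag-1 autocovariance no longer equals minus a quarter of the squared spread, which is the motivation for modified estimators.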
- [71] arXiv:2408.05428 (replaced) [pdf, html, other]
-
Title: Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements act as instrumental variables (IVs), facilitating the identification of causal effects through leveraging exogenous perturbations in discrete treatment scenarios. However, real-world applications of encouragement designs often face challenges such as incomplete randomization, limited experimental data, and significantly fewer encouragements compared to treatments, hindering precise causal effect estimation. To address this, this paper introduces novel theories and algorithms for identifying the Conditional Average Treatment Effect (CATE) using variations in encouragement. Further, by leveraging both observational and encouragement data, we propose a generalized IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR), to effectively estimate the causal effects. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of EnCounteR over existing methods.
- [72] arXiv:2410.05016 (replaced) [pdf, html, other]
-
Title: T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Self-supervision is often used for pre-training to foster performance on a downstream task by constructing meaningful representations of samples. Self-supervised learning (SSL) generally involves generating different views of the same sample and thus requires data augmentations that are challenging to construct for tabular data. This constitutes one of the main challenges of self-supervision for structured data. In the present work, we propose a novel augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on a Joint Embedding Predictive Architecture (JEPA) and is akin to mask reconstruction in the latent space. It involves predicting the latent representation of one subset of features from the latent representation of a different subset within the same sample, thereby learning rich representations without augmentations. We use our method as a pre-training technique and train several deep classifiers on the obtained representation. Our experimental results demonstrate a substantial improvement in both classification and regression tasks, outperforming models trained directly on samples in their original data space. Moreover, T-JEPA enables some methods to consistently outperform or match the performance of traditional methods like Gradient Boosted Decision Trees. To understand why, we extensively characterize the obtained representations and show that T-JEPA effectively identifies relevant features for downstream tasks without access to the labels. Additionally, we introduce regularization tokens, a novel regularization method critical for training of JEPA-based models on structured data.
- [73] arXiv:2411.01757 (replaced) [pdf, html, other]
-
Title: Mitigating Spurious Correlations via Disagreement Probability
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Models trained with empirical risk minimization (ERM) are prone to be biased towards spurious correlations between target labels and bias attributes, which leads to poor performance on data groups lacking spurious correlations. It is particularly challenging to address this problem when access to bias labels is not permitted. To mitigate the effect of spurious correlations without bias labels, we first introduce a novel training objective designed to robustly enhance model performance across all data samples, irrespective of the presence of spurious correlations. From this objective, we then derive a debiasing method, Disagreement Probability based Resampling for debiasing (DPR), which does not require bias labels. DPR leverages the disagreement between the target label and the prediction of a biased model to identify bias-conflicting samples (those without spurious correlations) and upsamples them according to the disagreement probability. Empirical evaluations on multiple benchmarks demonstrate that DPR achieves state-of-the-art performance over existing baselines that do not use bias labels. Furthermore, we provide a theoretical analysis that details how DPR reduces dependency on spurious correlations.
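The resampling step described in this abstract might look roughly like the sketch below. It takes the disagreement probability of a sample to be one minus the probability the biased model assigns to the target label, and draws training indices in proportion to it. That definition, the function names, and the toy probabilities are assumptions made for illustration; the abstract does not give the exact formula.

```python
import random


def disagreement_weights(p_target_under_biased_model):
    """Normalized sampling weights from assumed disagreement probabilities.

    Each input is the (hypothetical) biased model's probability for the
    sample's target label; disagreement is taken as 1 minus that value.
    """
    disagreement = [1.0 - p for p in p_target_under_biased_model]
    total = sum(disagreement)
    return [d / total for d in disagreement]


def resample_indices(weights, k, seed=0):
    """Draw k training indices with replacement, proportional to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)


# Sample 2 is one the biased model gets wrong, i.e. likely bias-conflicting.
probs = [0.95, 0.90, 0.20, 0.85]
weights = disagreement_weights(probs)
batch = resample_indices(weights, k=8)
print(weights)
print(batch)
```

Under these assumptions the sample the biased model disagrees with receives the largest weight, so it appears more often in the resampled batches.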
- [74] arXiv:2411.02664 (replaced) [pdf, other]
-
Title: Explanations that reveal all through the definition of encoding
Comments: 36 pages, 7 figures, 6 tables, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Journal-ref: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, evaluations produce scores better than what appears possible from the values in the explanation for a class of explanations, called encoding explanations. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. This definition implies, in contrast to encoding explanations, that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a "what you see is what you get" property, which makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and develop STRIPE-X which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to show that despite prompting an LLM to produce non-encoding explanations for a sentiment analysis task, the LLM-generated explanations encode.
- [75] arXiv:2412.01763 (replaced) [pdf, other]
-
Title: The Data-Driven Censored Newsvendor Problem
Comments: 72 pages, 9 tables, 7 figures
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
We study a censored variant of the data-driven newsvendor problem, where the decision-maker must select an ordering quantity that minimizes expected overage and underage costs based only on offline censored sales data, rather than historical demand realizations. Our goal is to understand how the degree of historical demand censoring affects the performance of any learning algorithm for this problem. To isolate this impact, we adopt a distributionally robust optimization framework, evaluating policies according to their worst-case regret over an ambiguity set of distributions. This set is defined by the largest historical order quantity (the observable boundary of the dataset), and contains all distributions matching the true demand distribution up to this boundary, while allowing them to be arbitrary afterwards. We demonstrate a spectrum of achievability under demand censoring by deriving a natural necessary and sufficient condition under which vanishing regret is an achievable goal. In regimes in which it is not, we exactly characterize the information loss due to censoring: an insurmountable lower bound on the performance of any policy, even when the decision-maker has access to infinitely many demand samples. We then leverage these sharp characterizations to propose a natural robust algorithm that adapts to the historical level of demand censoring. We derive finite-sample guarantees for this algorithm across all possible censoring regimes and show its near-optimality with matching lower bounds (up to polylogarithmic factors). We moreover demonstrate its robust performance via extensive numerical experiments on both synthetic and real-world datasets.
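As background for the problem this abstract builds on, the sketch below shows the classical uncensored baseline, not the robust algorithm proposed in the paper. With fully observed demand samples, the order quantity that minimizes average overage and underage cost is the empirical quantile at level b / (b + h), where b is the per-unit underage cost and h the per-unit overage cost. The function names and the toy demand data are hypothetical.

```python
import math


def saa_order_quantity(demands, underage_cost, overage_cost):
    """Empirical critical-quantile order quantity from uncensored demand."""
    ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(demands)
    # Smallest sample value whose empirical CDF reaches the critical ratio.
    k = max(1, math.ceil(ratio * len(ordered)))
    return ordered[k - 1]


def average_cost(q, demands, underage_cost, overage_cost):
    """Mean underage plus overage cost of ordering q against the samples."""
    total = sum(
        underage_cost * max(d - q, 0) + overage_cost * max(q - d, 0)
        for d in demands
    )
    return total / len(demands)


demands = [3, 7, 5, 9, 4, 6, 8, 5, 7, 6]
q = saa_order_quantity(demands, underage_cost=3.0, overage_cost=1.0)
print(q, average_cost(q, demands, 3.0, 1.0))
```

In the censored setting of the abstract, only sales (demand capped at the historical order quantity) are observed, so this quantile is identifiable only when it lies below the largest historical order quantity.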
- [76] arXiv:2412.09265 (replaced) [pdf, html, other]
-
Title: Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation
Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Machine Learning (stat.ML)
Visual-motor policy learning has advanced with architectures like diffusion-based policies, known for modeling complex robotic trajectories. However, their prolonged inference times hinder high-frequency control tasks requiring real-time feedback. While consistency distillation (CD) accelerates inference, it introduces errors that compromise action quality. To address these limitations, we propose the Score and Distribution Matching Policy (SDM Policy), which transforms diffusion-based policies into single-step generators through a two-stage optimization process: score matching ensures alignment with true action distributions, and distribution matching minimizes KL divergence for consistency. A dual-teacher mechanism integrates a frozen teacher for stability and an unfrozen teacher for adversarial training, enhancing robustness and alignment with target distributions. Evaluated on a 57-task simulation benchmark, SDM Policy achieves a 6x inference speedup while having state-of-the-art action quality, providing an efficient and reliable framework for high-frequency robotic tasks.
- [77] arXiv:2412.13574 (replaced) [pdf, other]
-
Title: Revisiting Interactions of Multiple Driver States in Heterogenous Population and Cognitive Tasks
Subjects: Human-Computer Interaction (cs.HC); Applications (stat.AP)
In real-world driving scenarios, multiple states occur simultaneously due to individual differences and environmental factors, complicating the analysis and estimation of driver states. Previous studies, limited by experimental design and analytical methods, may not be able to disentangle the relationships among multiple driver states and environmental factors. This paper introduces the Double Machine Learning (DML) analysis method to the field of driver state analysis to tackle this challenge. To train and test the DML model, a driving simulator experiment with 42 participants was conducted. All participants drove SAE level-3 vehicles and conducted three types of cognitive tasks in a 3-hour driving experiment. Drivers' subjective cognitive load and drowsiness levels were collected throughout the experiment. Then, we isolated individual and environmental factors affecting driver state variations and the factors affecting drivers' physiological and eye-tracking metrics when they are under specific states. The results show that our approach successfully decoupled and inferred the complex causal relationships between multiple types of drowsiness and cognitive load. Additionally, we identified key physiological and eye-tracking indicators in the presence of multiple driver states and under the influence of a single state, excluding the influence of other driver states, environmental factors, and individual characteristics. Our causal inference analytical framework can offer new insights for subsequent analysis of drivers' states. Further, the updated causal relation graph based on the DML analysis can provide theoretical bases for driver state monitoring based on physiological and eye-tracking measures.
- [78] arXiv:2412.14031 (replaced) [pdf, html, other]
-
Title: Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the Euclidean output space. Using tools from Riemannian optimization, we prove last-iterate convergence of the Riemannian gradient flow to the optimal in-class predictor at an exponential rate that is independent of the conditioning of the Gram matrix, without requiring explicit regularization. We further characterize the critical impacts of the neural network scaling factor and the initialization on the convergence behavior. In the overparameterized regime, we show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels, analogous to the underparameterized regime. These findings demonstrate the potential of Gauss-Newton methods for efficiently optimizing neural networks, particularly in ill-conditioned problems where kernel and Gram matrices have small singular values.
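To make the update rule concrete, the sketch below runs a discrete-time damped Gauss-Newton (Levenberg-Marquardt-style) iteration on a toy problem. This is an illustration only: the abstract analyzes continuous-time dynamics on full networks, whereas the example fits a single tanh unit with one scalar weight to noiseless data. The model, the fixed damping value, and all names are assumptions.

```python
import math


def lm_step(w, xs, ys, damping):
    """One damped Gauss-Newton step for the model f(x) = tanh(w * x)."""
    residuals = [y - math.tanh(w * x) for x, y in zip(xs, ys)]
    # Derivative of tanh(w * x) with respect to w, per data point.
    jac = [x * (1.0 - math.tanh(w * x) ** 2) for x in xs]
    jtj = sum(j * j for j in jac)                      # J^T J (a scalar here)
    jtr = sum(j * r for j, r in zip(jac, residuals))   # J^T r
    return w + jtr / (jtj + damping)


xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
true_w = 0.8
ys = [math.tanh(true_w * x) for x in xs]  # noiseless targets

w = 0.1
for _ in range(50):
    w = lm_step(w, xs, ys, damping=0.1)
print(w)  # approaches true_w
```

Setting `damping=0.0` gives the plain Gauss-Newton step. The damping term keeps the step well defined when `jtj` is close to zero, which is the scalar analogue of an ill-conditioned Gram matrix.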