Statistics
Showing new listings for Friday, 7 February 2025
- [1] arXiv:2502.03479 [pdf, html, other]
Title: A Tutorial on Markov Renewal and Semi-Markov Proportional Hazards Model
Subjects: Applications (stat.AP); Computation (stat.CO)
Transition probability estimation plays a critical role in multi-state modeling, especially in clinical research. This paper investigates the application of semi-Markov and Markov renewal frameworks to the EBMT dataset, focusing on six clinical states encountered during hematopoietic stem cell transplantation. By comparing Aalen-Johansen (AJ) and Dabrowska-Sun-Horowitz (DSH) estimators, we demonstrate that semi-Markov models, which incorporate sojourn times, provide a more nuanced and temporally sensitive depiction of patient trajectories compared to memoryless Markov models. The DSH estimator consistently yields smoother probability curves, particularly for transitions involving prolonged states. These findings underscore the importance of selecting appropriate models and estimators in multi-state analysis. Future work includes extending the framework to accommodate advanced covariate structures and non-Markovian dynamics.
- [2] arXiv:2502.03480 [pdf, html, other]
Title: Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Species Distribution Models (SDMs) often suffer from spatial autocorrelation (SAC), leading to biased performance estimates. We tested cross-validation (CV) strategies - random splits, spatial blocking with varied distances, environmental (ENV) clustering, and a novel spatio-temporal method - under two proposed training schemes: LAST FOLD, widely used in spatial CV at the cost of data loss, and RETRAIN, which maximizes data usage but risks reintroducing SAC. LAST FOLD consistently yielded lower errors and stronger correlations. Spatial blocking at an optimal distance (SP 422) and ENV performed best, achieving Spearman and Pearson correlations of 0.485 and 0.548, respectively, although ENV may be unsuitable for long-term forecasts involving major environmental shifts. A spatio-temporal approach yielded modest benefits in our moderately variable dataset, but may excel with stronger temporal changes. These findings highlight the need to align CV approaches with the spatial and temporal structure of SDM data, ensuring rigorous validation and reliable predictive outcomes.
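The idea behind spatial blocking can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `spatial_block_folds`, the square grid, and the round-robin assignment of blocks to folds are all assumptions made here for clarity, and the block size is an arbitrary input rather than the optimal distance reported in the abstract.

```python
def spatial_block_folds(coords, block_size, n_folds):
    """Assign each (x, y) point to a CV fold so that all points falling in the
    same square block of side `block_size` end up in the same fold."""
    # Integer grid cell containing each point (hypothetical square blocking).
    block_ids = [(int(x // block_size), int(y // block_size)) for x, y in coords]
    # Deal the distinct blocks out to folds round-robin (an assumption).
    unique_blocks = sorted(set(block_ids))
    fold_of_block = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    return [fold_of_block[b] for b in block_ids]


if __name__ == "__main__":
    pts = [(0.5, 0.5), (0.9, 0.1), (1.5, 0.2), (2.5, 2.5)]
    print(spatial_block_folds(pts, block_size=1.0, n_folds=2))
```

Because nearby points share a block, and hence a fold, a held-out fold is never interleaved point by point with the training data, which is what limits the leakage caused by spatial autocorrelation.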
- [3] arXiv:2502.03503 [pdf, html, other]
Title: Two in context learning tasks with complex functions
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We examine two in context learning (ICL) tasks with mathematical functions in several train and test settings for transformer models. Our study generalizes work on linear functions by showing that small transformers, even models with attention layers only, can approximate arbitrary polynomial functions and hence continuous functions under certain conditions. Our models also can approximate previously unseen classes of polynomial functions, as well as the zeros of complex functions. Our models perform far better on this task than LLMs like GPT4 and involve complex reasoning when provided with suitable training data and methods. Our models also have important limitations; they fail to generalize outside of training distributions and so don't learn class forms of functions. We explain why this is so.
- [4] arXiv:2502.03551 [pdf, html, other]
Title: Online Learning Algorithms in Hilbert Spaces with $\beta-$ and $\phi-$Mixing Sequences
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)
In this paper, we study an online algorithm in a reproducing kernel Hilbert space (RKHS) based on a class of dependent processes called mixing processes. For such a process, the degree of dependence is measured by various mixing coefficients. As a representative example, we analyze a strictly stationary Markov chain, where the dependence structure is characterized by the \(\beta-\) and \(\phi-\)mixing coefficients. For these dependent samples, we derive nearly optimal convergence rates. Our findings extend existing error bounds for i.i.d. observations, demonstrating that the i.i.d. case is a special instance of our framework. Moreover, we explicitly account for an additional factor introduced by the dependence structure in the Markov chain.
- [5] arXiv:2502.03609 [pdf, html, other]
Title: Multivariate Conformal Prediction using Optimal Transport
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Conformal prediction (CP) quantifies the uncertainty of machine learning models by constructing sets of plausible outputs. These sets are constructed by leveraging a so-called conformity score, a quantity computed using the input point of interest, a prediction model, and past observations. CP sets are then obtained by evaluating the conformity score of all possible outputs, and selecting them according to the rank of their scores. Due to this ranking step, most CP approaches rely on score functions that are univariate. The challenge in extending these scores to multivariate spaces lies in the fact that no canonical order for vectors exists. To address this, we leverage a natural extension of multivariate score ranking based on optimal transport (OT). Our method, OTCP, offers a principled framework for constructing conformal prediction sets in multidimensional settings, preserving distribution-free coverage guarantees with finite data samples. We demonstrate tangible gains in a benchmark dataset of multivariate regression problems and address computational and statistical trade-offs that arise when estimating conformity scores through OT maps.
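The univariate ranking step described above can be illustrated with standard split conformal prediction. The sketch below is not OTCP: it assumes a scalar response, uses the absolute residual as the conformity score, and the names `conformal_quantile` and `split_conformal_interval` are hypothetical helpers introduced here.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the set is unbounded.
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_interval(predict, calibration, x_new, alpha=0.1):
    """Interval for a scalar response, using |y - predict(x)| as the score."""
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


if __name__ == "__main__":
    model = lambda x: 2 * x  # stand-in for a fitted prediction model
    calib = [(1, 2.5), (2, 3.0), (3, 6.25)]
    print(split_conformal_interval(model, calib, x_new=4, alpha=0.5))
```

The sort inside `conformal_quantile` is exactly the step that has no canonical analogue for vector-valued scores, which is the gap the abstract says optimal transport is used to fill.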
- [6] arXiv:2502.03650 [pdf, html, other]
Title: Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach
Authors: Eduardo Santos de Oliveira Marques, Arthur Caio Vargas Pinto, Kaike Sa Teles Rocha Alves, Eduardo Pestana de Aguiar
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Real-world data contain uncertainty and variations that can be correlated to external variables, known as randomness. An alternative cause of randomness is chaos, which can be an important component of chaotic time series. One of the existing methods to deal with this type of data is the use of evolving Fuzzy Systems (eFSs), which have been proven to be a powerful class of models for time series forecasting, due to their autonomy in handling the data and highly complex problems in real-world applications. However, due to their working structure, type-2 fuzzy sets can outperform type-1 fuzzy sets in highly uncertain scenarios. We then propose ePL-KRLS-FSM+, an enhanced class of evolving fuzzy modeling approach that combines participatory learning (PL), a kernel recursive least squares method (KRLS), type-2 fuzzy logic and data transformation into fuzzy sets (FSs). This improvement allows type-2 fuzzy sets to be created and measured for better handling of uncertainties in the data, generating a model that can predict chaotic data with increased accuracy. The model is evaluated using two complex datasets: the chaotic time series Mackey-Glass delay differential equation with different degrees of chaos, and the main stock index of the Taiwan Capitalization Weighted Stock Index - TAIEX. Model performance is compared to related state-of-the-art rule-based eFS models and classical approaches and is analyzed in terms of error metrics, runtime and the number of final rules. Forecasting results show that the proposed model is competitive and performs consistently compared with type-1 models, also outperforming other forecasting methods by showing the lowest error metrics and number of final rules.
- [7] arXiv:2502.03792 [pdf, html, other]
Title: Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints
Comments: 26 pages, 8 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
- [8] arXiv:2502.03809 [pdf, html, other]
Title: Bayesian Time-Varying Meta-Analysis via Hierarchical Mean-Variance Random-effects Models
Comments: 25 pages (Main document)
Subjects: Methodology (stat.ME)
Meta-analysis is widely used to integrate results from multiple experiments to obtain generalized insights. Since meta-analysis datasets are often heteroscedastic due to varying subgroups and temporal heterogeneity arising from experiments conducted at different time points, the typical meta-analysis approach, which assumes homoscedasticity, fails to adequately address this heteroscedasticity among experiments. This paper proposes a new Bayesian estimation method that simultaneously shrinks estimates of the means and variances of experiments using a hierarchical Bayesian approach while accounting for time effects through a Gaussian process. This method connects experiments via the hierarchical framework, enabling "borrowing strength" between experiments to achieve high-precision estimates of each experiment's mean. The method can flexibly capture potential time trends in datasets by modeling time effects with the Gaussian process. We demonstrate the effectiveness of the proposed method through simulation studies and illustrate its practical utility using a real marketing promotions dataset.
- [9] arXiv:2502.03846 [pdf, html, other]
Title: On the limits of some Bayesian model evaluation statistics
Subjects: Statistics Theory (math.ST)
Model selection and order selection problems frequently arise in statistical practice. A popular approach to addressing these problems in the frequentist setting involves information criteria based on penalized maxima of log-likelihoods for competing models. In the Bayesian context, similar criteria are employed, replacing the maxima of log-likelihoods with their posterior expectations. Despite their popularity in applications, the large-sample behavior of these criteria -- such as the deviance information criterion (DIC), Bayesian predictive information criterion (BPIC), and widely-applicable Bayesian information criterion (WBIC) -- has received relatively little attention. In this work, we investigate the almost sure limits of these criteria and establish novel results on posterior and generalized posterior consistency, which are of independent interest. The utility of our theoretical findings is demonstrated via illustrative technical and numerical examples.
- [10] arXiv:2502.03848 [pdf, other]
Title: Consistent model selection in a collection of stochastic block models
Authors: Lucie Arts (LPSM)
Subjects: Statistics Theory (math.ST)
We introduce the penalized Krichevsky-Trofimov (KT) estimator as a convergent method for estimating the number of node clusters when observing multiple networks within both multi-layer and dynamic Stochastic Block Models. We establish the consistency of the KT estimator, showing that it converges to the correct number of clusters in both types of models when the number of nodes in the networks increases. Our estimator does not require a known upper bound on this number to be consistent. Furthermore, we show that these consistency results hold in both dense and sparse regimes, making the penalized KT estimator robust across various network configurations. We illustrate its performance on synthetic datasets.
- [11] arXiv:2502.03849 [pdf, other]
Title: A fast algorithm to compute a curve of confidence upper bounds for the False Discovery Proportion using a reference family with a forest structure
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
This paper presents a new algorithm (and an additional trick) that allows fast computation of an entire curve of post hoc bounds for the False Discovery Proportion when the construction of the underlying bound $V^*_{\mathfrak{R}}$ is based on a reference family $\mathfrak{R}$ with a forest structure à la Durand et al. (2020). By an entire curve, we mean the values $V^*_{\mathfrak{R}}(S_1),\dotsc,V^*_{\mathfrak{R}}(S_m)$ computed on a path of increasing selection sets $S_1\subsetneq\dotsb\subsetneq S_m$, $|S_t|=t$. The new algorithm leverages the fact that going from $S_t$ to $S_{t+1}$ is done by adding only one hypothesis.
- [12] arXiv:2502.03920 [pdf, html, other]
Title: Unbiased Parameter Estimation for Bayesian Inverse Problems
Subjects: Methodology (stat.ME)
In this paper we consider the estimation of unknown parameters in Bayesian inverse problems. In most cases of practical interest, there are several barriers to performing such estimation. These include a numerical approximation of a solution of a differential equation and, even if exact solutions are available, an analytical intractability of the marginal likelihood and its associated gradient, which is used for parameter estimation. The focus of this article is to deliver unbiased estimates of the unknown parameters, that is, stochastic estimators that, in expectation, are equal to the maximizer of the marginal likelihood, and possess no numerical approximation error. Based upon the ideas of [4] we develop a new approach for unbiased parameter estimation for Bayesian inverse problems. We prove unbiasedness and establish numerically that the associated estimation procedure is faster than the current state-of-the-art methodology for this problem. We demonstrate the performance of our methodology on a range of problems which include a PDE and ODE.
- [13] arXiv:2502.03942 [pdf, html, other]
Title: A retake on the analysis of scores truncated by terminal events
Subjects: Methodology (stat.ME)
Analysis of data from randomized controlled trials in vulnerable populations requires special attention when assessing treatment effect by a score measuring, e.g., disease stage or activity together with onset of prevalent terminal events. In reality, it is impossible to disentangle a disease score from the terminal event, since the score is not clinically meaningful after this event. In this work, we propose to assess treatment interventions simultaneously on disease score and the terminal event. Our proposal is based on a natural data-generating mechanism respecting that a disease score does not exist beyond the terminal event. We use modern semi-parametric statistical methods to provide robust and efficient estimation of the risk of terminal event and expected disease score conditional on no terminal event at a pre-specified landmark time. We also use the simultaneous asymptotic behavior of our estimators to develop a powerful closed testing procedure for confirmatory assessment of treatment effect on both onset of terminal event and level of disease score. A simulation study mimicking a large-scale outcome trial in chronic kidney patients as well as an analysis of that trial is provided to assess performance.
- [14] arXiv:2502.03969 [pdf, html, other]
Title: Spectrally Deconfounded Random Forests
Subjects: Computation (stat.CO)
We introduce a modification of Random Forests to estimate functions when unobserved confounding variables are present. The technique is tailored for high-dimensional settings with many observed covariates. We use spectral deconfounding techniques to minimize a deconfounded version of the least squares objective, resulting in the Spectrally Deconfounded Random Forests (SDForests). We show how the omitted variable bias gets small given some assumptions. We compare the performance of SDForests to classical Random Forests in a simulation study and a semi-synthetic setting using single-cell gene expression data. Empirical results suggest that SDForests outperform classical Random Forests in estimating the direct regression function, even if the theoretical assumptions, requiring linear and dense confounding, are not perfectly met, and that SDForests have comparable performance in the non-confounded case.
- [15] arXiv:2502.04027 [pdf, html, other]
Title: High-Frequency Market Manipulation Detection with a Markov-modulated Hawkes process
Comments: 35 pages, 15 figures
Subjects: Methodology (stat.ME); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
This work focuses on a self-exciting point process defined by a Hawkes-like intensity and a switching mechanism based on a hidden Markov chain. Previous works in such a setting assume constant intensities between consecutive events. We extend the model to general Hawkes excitation kernels that are piecewise constant between events. We develop an expectation-maximization algorithm for the statistical inference of the Hawkes intensity parameters as well as the state transition probabilities. The numerical convergence of the estimators is extensively tested on simulated data. Using high-frequency cryptocurrency data on a top centralized exchange, we apply the model to the detection of anomalous bursts of trades. We benchmark the goodness-of-fit of the model with the Markov-modulated Poisson process and demonstrate the relevance of the model in detecting suspicious activities.
- [16] arXiv:2502.04046 [pdf, html, other]
Title: A method for sparse and robust independent component analysis
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
This work presents sparse invariant coordinate analysis, SICS, a new method for sparse and robust independent component analysis. SICS is based on classical invariant coordinate analysis, which is presented in such a form that a LASSO-type penalty can be applied to promote sparsity. Robustness is achieved by using robust scatter matrices. In the first part of the paper, the background and building blocks (scatter matrices, measures of robustness, ICS and independent component analysis) are carefully introduced. Then the proposed new method and its algorithm are derived and presented. This part also includes a consistency result for a general case of sparse ICS-like methods. The performance of SICS in identifying sparse independent component loadings is investigated with simulations. The method is also illustrated with an example of constructing sparse causal graphs.
- [17] arXiv:2502.04082 [pdf, html, other]
Title: Market-based insurance ratemaking: application to pet insurance
Subjects: Applications (stat.AP)
This paper introduces a method for pricing insurance policies using market data. The approach is designed for scenarios in which the insurance company seeks to enter a new market (in our case, pet insurance) for which it lacks historical data. The methodology involves an iterative two-step process. First, a suitable parameter is proposed to characterize the underlying risk. Second, the resulting pure premium is linked to the observed commercial premium using an isotonic regression model. To validate the method, comprehensive testing is conducted on synthetic data, followed by its application to a dataset of actual pet insurance rates. To facilitate practical implementation, we have developed an R package called IsoPriceR. By addressing the challenge of pricing insurance policies in the absence of historical data, this method helps enhance pricing strategies in emerging markets.
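The isotonic regression step mentioned above can be sketched with the classical pool-adjacent-violators algorithm. This is a generic illustration and not the IsoPriceR interface: the function name `isotonic_fit` is hypothetical, and the input `y` is assumed to be the observed commercial premiums already sorted by increasing pure premium.

```python
def isotonic_fit(y, weights=None):
    """Pool-adjacent-violators: weighted least-squares nondecreasing fit to y.

    `y` is assumed to be ordered by the explanatory variable
    (here, hypothetically, the pure premium)."""
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block is [mean, total_weight, number_of_points]
    for value, w in zip(y, weights):
        blocks.append([float(value), float(w), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand each block back to one fitted value per observation.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


if __name__ == "__main__":
    print(isotonic_fit([1, 3, 2, 4]))
```

The fit replaces each run of order-violating observations by their weighted mean, so the fitted commercial premium never decreases as the pure premium increases.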
- [18] arXiv:2502.04085 [pdf, html, other]
Title: Accurate Estimates of Ultimate 100-Meter Records
Subjects: Applications (stat.AP)
We employ the novel theory of heterogeneous extreme value statistics to accurately estimate the ultimate world records for the 100-m running race, for men and for women. For this aim we collected data from 1991 through 2023 from thousands of top athletes, using multiple fast times per athlete. We consider the left endpoint of the probability distribution of the running times of a top athlete and define the ultimate world record as the minimum, over all top athletes, of all these endpoints. For men we estimate the ultimate world record to be 9.56 seconds. More prudently, employing this heterogeneous extreme value theory we construct an accurate asymptotic 95% lower confidence bound on the ultimate world record of 9.49 seconds, still quite close to the present world record of 9.58. For the women's 100-meter dash our point estimate of the ultimate world record is 10.34 seconds, somewhat lower than the world record of 10.49. The more prudent 95% lower confidence bound on the women's ultimate world record is 10.20.
- [19] arXiv:2502.04112 [pdf, other]
Title: Quasi maximum likelihood estimation of high-dimensional approximate dynamic matrix factor models via the EM algorithm
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
This paper considers an approximate dynamic matrix factor model that accounts for the time series nature of the data by explicitly modelling the time evolution of the factors. We study Quasi Maximum Likelihood estimation of the model parameters based on the Expectation Maximization (EM) algorithm, implemented jointly with the Kalman smoother which gives estimates of the factors. This approach makes it easy to handle arbitrary patterns of missing data. We establish the consistency of the estimated loadings and factor matrices as the sample size $T$ and the matrix dimensions $p_1$ and $p_2$ diverge to infinity. The finite sample properties of the estimators are assessed through a large simulation study and an application to a financial dataset of volatility proxies.
- [20] arXiv:2502.04118 [pdf, html, other]
Title: Maximum Likelihood Estimation of the Parameters of Matrix Variate Symmetric Laplace Distribution
Subjects: Statistics Theory (math.ST)
This paper considers an extension of the multivariate symmetric Laplace distribution to the matrix variate case. The symmetric Laplace distribution is a scale mixture of normal distributions. The maximum likelihood estimators (MLE) of the parameters of the multivariate and matrix variate symmetric Laplace distributions are proposed, which are not explicitly obtainable, as the density function involves the modified Bessel function of the third kind. Thus, the EM algorithm is applied to find the maximum likelihood estimators. The parameters of the matrix variate symmetric Laplace distribution and their maximum likelihood estimators are defined up to a positive multiplicative constant, with their Kronecker product uniquely defined. The condition for the existence of the MLE is given, and the stability of the estimators is discussed. The empirical bias and the dispersion of the Kronecker product of the estimators for different sample sizes are discussed using simulated data.
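The scale-mixture representation can be illustrated in the univariate case only: if W is standard exponential and Z is N(0, sigma^2), independent of W, then sqrt(W) * Z follows a symmetric Laplace law with variance sigma^2. The sketch below simulates this; the function name `sample_symmetric_laplace` and its parameters are assumptions made for illustration, and the matrix variate construction of the paper is not reproduced.

```python
import math
import random


def sample_symmetric_laplace(n, sigma=1.0, seed=0):
    """Draw n univariate symmetric Laplace variates as sqrt(W) * Z,
    with W ~ Exp(1) and Z ~ N(0, sigma^2) independent."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        w = rng.expovariate(1.0)   # exponential mixing variable
        z = rng.gauss(0.0, sigma)  # conditional normal component
        draws.append(math.sqrt(w) * z)
    return draws


def sample_variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


if __name__ == "__main__":
    xs = sample_symmetric_laplace(50_000, sigma=1.0)
    print(sample_variance(xs))  # should be close to sigma^2 = 1
```

Treating W as missing data is what makes an EM algorithm natural here: conditional on W the observation is Gaussian, so the complete-data likelihood avoids the Bessel function in the marginal density.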
- [21] arXiv:2502.04122 [pdf, html, other]
Title: How many unseen species are in multiple areas?
Subjects: Methodology (stat.ME)
In ecology, the description of species composition and biodiversity calls for statistical methods that involve estimating features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While early approaches rely on computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and we enable out-of-sample prediction for the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators which solely deal with the one-step ahead prediction. Furthermore, our results can be applied to address the sample size determination in sampling problems aimed at detecting shared species. Our results are illustrated in a real-world dataset concerning a population of ants in the city of Trieste.
- [22] arXiv:2502.04162 [pdf, html, other]
Title: A Pseudo Markov-Chain Model and Time-Elapsed Measures of Mobility from Collective Data
Comments: 27 pages, 11 figures
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
In this paper we develop a pseudo Markov-chain model to understand time-elapsed flows, over multiple intervals, from time and space aggregated collective inter-location trip data, given as a time-series. Building on the model, we develop measures of mobility that parallel those known for individual mobility data, such as the radius of gyration. We apply these measures to the NetMob 2024 Data Challenge data, and obtain interesting results that are consistent with published statistics and commuting patterns in cities. Besides building a new framework, we foresee applications of this approach to an improved understanding of human mobility in the context of environmental changes and sustainable development.
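For reference, the individual-level radius of gyration that the abstract cites as a model can be computed as the root-mean-square distance of visited locations from their centroid. The sketch below uses that standard definition with planar coordinates; the function name `radius_of_gyration` is hypothetical, and the paper's analogue for aggregated trip data is not reproduced here.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of (x, y) locations from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    # Four visits at the corners of a 2 x 2 square.
    print(radius_of_gyration([(0, 0), (2, 0), (2, 2), (0, 2)]))
```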
- [23] arXiv:2502.04163 [pdf, html, other]
Title: Multi-task Online Learning for Probabilistic Load Forecasting
Comments: 2024 IEEE Sustainable Power and Energy Conference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Load forecasting is essential for the efficient, reliable, and cost-effective management of power systems. Load forecasting performance can be improved by learning the similarities among multiple entities (e.g., regions, buildings). Techniques based on multi-task learning obtain predictions by leveraging consumption patterns from the historical load demand of multiple entities and their relationships. However, existing techniques cannot effectively assess inherent uncertainties in load demand or account for dynamic changes in consumption patterns. This paper proposes a multi-task learning technique for online and probabilistic load forecasting. This technique provides accurate probabilistic predictions for the loads of multiple entities by leveraging their dynamic similarities. The method's performance is evaluated using datasets that register the load demand of multiple entities and contain diverse and dynamic consumption patterns. The experimental results show that the proposed method can significantly enhance the effectiveness of current multi-task learning approaches across a wide variety of load consumption scenarios.
- [24] arXiv:2502.04171 [pdf, other]
Title: Cyclic functional causal models beyond unique solvability with a graph separation theorem
Comments: 33+16 pages. A companion paper by the same authors, focussing on cyclic quantum causal models has been submitted to the arXiv concurrently with primary class [quant-ph]. Comments are welcome
Subjects: Statistics Theory (math.ST); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Functional causal models (fCMs) specify functional dependencies between random variables associated to the vertices of a graph. In directed acyclic graphs (DAGs), fCMs are well-understood: a unique probability distribution on the random variables can be easily specified, and a crucial graph-separation result called the d-separation theorem allows one to characterize conditional independences between the variables. However, fCMs on cyclic graphs pose challenges due to the absence of a systematic way to assign a unique probability distribution to the fCM's variables, the failure of the d-separation theorem, and lack of a generalization of this theorem that is applicable to all consistent cyclic fCMs. In this work, we develop a causal modeling framework applicable to all cyclic fCMs involving finite-cardinality variables, except inconsistent ones admitting no solutions. Our probability rule assigns a unique distribution even to non-uniquely solvable cyclic fCMs and reduces to the known rule for uniquely solvable fCMs. We identify a class of fCMs, called averagely uniquely solvable, that we show to be the largest class where the probabilities admit a Markov factorization. Furthermore, we introduce a new graph-separation property, p-separation, and prove this to be sound and complete for all consistent finite-cardinality cyclic fCMs while recovering the d-separation theorem for DAGs. These results are obtained by considering classical post-selected teleportation protocols inspired by analogous protocols in quantum information theory. We discuss further avenues for exploration, linking in particular problems in cyclic fCMs and in quantum causality.
- [25] arXiv:2502.04179 [pdf, html, other]
Title: The Maximum Likelihood Degree of Gumbel's Type-I Bivariate Exponential Distribution
Subjects: Statistics Theory (math.ST); Commutative Algebra (math.AC)
In algebraic statistics, the maximum likelihood degree of a statistical model refers to the number of solutions (counted with multiplicity) of the score equations over the complex field. In this paper, the maximum likelihood degree of the association parameter of Gumbel's Type-I bivariate exponential distribution is investigated using algebraic techniques.
- [26] arXiv:2502.04208 [pdf, html, other]
Title: Supermartingales for One-Sided Tests: Sufficient Monotone Likelihood Ratios are Sufficient
Subjects: Statistics Theory (math.ST)
The t-statistic is a widely-used scale-invariant statistic for testing the null hypothesis that the mean is zero. Martingale methods enable sequential testing with the t-statistic at every sample size, while controlling the probability of falsely rejecting the null. For one-sided sequential tests, which reject when the t-statistic is too positive, a natural question is whether they also control false rejection when the true mean is negative. We prove that this is the case using monotone likelihood ratios and sufficient statistics. We develop applications to the scale-invariant t-test, the location-invariant $\chi^2$-test and sequential linear regression with nuisance covariates.
- [27] arXiv:2502.04220 [pdf, html, other]
Title: Dimension estimation in PCA model using high-dimensional data augmentation
Comments: 15 pages, 3 figures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We propose a modified, high-dimensional version of a recent dimension estimation procedure that determines the dimension via the introduction of augmented noise variables into the data. Our asymptotic results show that the proposal is consistent in wide high-dimensional scenarios, and further shed light on why the original method breaks down when the dimension of either the data or the augmentation becomes too large. Simulations are used to demonstrate the superiority of the proposal to competitors both under and outside of the theoretical model.
- [28] arXiv:2502.04247 [pdf, html, other]
Title: Student-t processes as infinite-width limits of posterior Bayesian neural networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
The asymptotic properties of Bayesian Neural Networks (BNNs) have been extensively studied, particularly regarding their approximations by Gaussian processes in the infinite-width limit. We extend these results by showing that posterior BNNs can be approximated by Student-t processes, which offer greater flexibility in modeling uncertainty. Specifically, we show that, if the parameters of a BNN follow a Gaussian prior distribution, and the variance of both the last hidden layer and the Gaussian likelihood function follows an Inverse-Gamma prior distribution, then the resulting posterior BNN converges to a Student-t process in the infinite-width limit. Our proof leverages the Wasserstein metric to establish control over the convergence rate of the Student-t process approximation.
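The Student-t form can be motivated by a standard one-dimensional scale-mixture calculation. This is a textbook conjugacy identity, not the paper's infinite-width argument, and the symbols $a$, $b$ and $\sigma^2$ are notation assumed here for illustration.

$$
x \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2), \qquad \sigma^2 \sim \text{Inv-Gamma}(a, b)
$$

$$
p(x) = \int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)} \cdot \frac{b^a}{\Gamma(a)}\, (\sigma^2)^{-a-1} e^{-b/\sigma^2} \, \mathrm{d}\sigma^2 \;\propto\; \left(b + \frac{x^2}{2}\right)^{-(a + 1/2)} \;\propto\; \left(1 + \frac{x^2}{2b}\right)^{-(a + 1/2)}
$$

The right-hand side is the kernel of a Student-t density with $\nu = 2a$ degrees of freedom and scale $\sqrt{b/a}$, since $(\nu + 1)/2 = a + 1/2$ and $\nu \cdot (b/a) = 2b$.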
- [29] arXiv:2502.04258 [pdf, html, other]
-
Title: Detecting Mild Traumatic Brain Injury with MEG Scan Data: One-vs-K-Sample Tests
Comments: 47 pages, 14 figures, 4 tables
Subjects: Methodology (stat.ME)
The magnetoencephalography (MEG) scanner has been shown to be more accurate than other medical devices in detecting mild traumatic brain injury (mTBI). However, MEG scan data in certain spectrum ranges can be skewed, multimodal and heterogeneous, which can mislead the conventional case-control analysis that requires the data to be homogeneous and normally distributed within the control group. To meet this challenge, we propose a flexible one-vs-K-sample testing procedure for detecting brain injury for a single case versus heterogeneous controls. The new procedure begins with source magnitude imaging using MEG scan data in the frequency domain, followed by region-wise contrast tests for abnormality between the case and controls. The critical values for these tests are automatically determined by cross-validation. We adjust the testing results for heterogeneity effects by similarity analysis. An asymptotic theory is established for the proposed test statistic. Through simulated and real data analyses in the context of neurotrauma, we show that the proposed test outperforms commonly used nonparametric methods in terms of overall accuracy and the ability to accommodate data non-normality and subject heterogeneity.
- [30] arXiv:2502.04276 [pdf, html, other]
-
Title: Gaussian Process Regression for Inverse Problems in Linear PDEs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Commutative Algebra (math.AC)
This paper introduces a computationally efficient algorithm in system theory for solving inverse problems governed by linear partial differential equations (PDEs). We model solutions of linear PDEs using Gaussian processes with priors defined based on advanced commutative algebra and algebraic analysis. The implementation of these priors is algorithmic and achieved using the Macaulay2 computer algebra software. An example application includes identifying the wave speed from noisy data for classical wave equations, which are widely used in physics. The method achieves high accuracy while enhancing computational efficiency.
- [31] arXiv:2502.04294 [pdf, html, other]
-
Title: Prediction-Powered E-Values
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Quality statistical inference requires a sufficient amount of data, which can be missing or hard to obtain. To this end, prediction-powered inference has risen as a promising methodology, but existing approaches are largely limited to Z-estimation problems such as inference of means and quantiles. In this paper, we apply ideas of prediction-powered inference to e-values. By doing so, we inherit all the usual benefits of e-values -- such as anytime-validity, post-hoc validity and versatile sequential inference -- as well as greatly expand the set of inferences achievable in a prediction-powered manner. In particular, we show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart, given by our method. We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. Our approach is modular and easily integrable into existing algorithms, making it a compelling choice for practical applications.
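For readers unfamiliar with e-values, the anytime-validity mentioned above can be illustrated with a plain likelihood-ratio e-process. This is a minimal sketch of the standard construction only, not the prediction-powered method of the paper; the function names `lr_e_process` and `first_rejection` and the Bernoulli setup are assumptions made for illustration.

```python
def lr_e_process(observations, p0, p1):
    """Running likelihood-ratio e-process for Bernoulli data.

    Tests H0: success probability p0 against the alternative p1.
    Under H0 the running product is a nonnegative martingale with mean 1.
    """
    e_value = 1.0
    path = []
    for x in observations:
        numerator = p1 if x else 1.0 - p1
        denominator = p0 if x else 1.0 - p0
        e_value *= numerator / denominator
        path.append(e_value)
    return path


def first_rejection(path, alpha):
    """Number of observations at which the e-process first reaches 1/alpha.

    By Ville's inequality, stopping at that moment keeps the type I error
    at most alpha, however long the data are monitored. Returns None if
    the threshold is never reached.
    """
    for n, e_value in enumerate(path, start=1):
        if e_value >= 1.0 / alpha:
            return n
    return None


if __name__ == "__main__":
    data = [1, 1, 0, 1, 1, 1, 1, 1]
    path = lr_e_process(data, p0=0.5, p1=0.75)
    print(path, first_rejection(path, alpha=0.2))
```

The threshold $1/\alpha$ is what makes the test valid at every sample size, which is the property the abstract refers to as anytime-validity.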
New submissions (showing 31 of 31 entries)
- [32] arXiv:2502.03495 (cross-list from math.PR) [pdf, html, other]
-
Title: Capacity Constraints in Ball and Urn Distribution Problems
Comments: This is a preprint version of the manuscript
Subjects: Probability (math.PR); Methodology (stat.ME)
This paper explores the distribution of indistinguishable balls into distinct urns with varying capacity constraints, a foundational issue in combinatorial mathematics with applications across various disciplines. We present a comprehensive theoretical framework that addresses both upper and lower capacity constraints under different distribution conditions, elaborating on the combinatorial implications of such variations. Through rigorous analysis, we derive analytical solutions that cater to different constrained environments, providing a robust theoretical basis for future empirical and theoretical investigations. These solutions are pivotal for advancing research in fields that rely on precise distribution strategies, such as physics and parallel processing. The paper not only generalizes classical distribution problems but also introduces novel methodologies for tackling capacity variations, thereby broadening the utility and applicability of distribution theory in practical and theoretical contexts.
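The counting problem described above can be checked numerically on small cases. The sketch below uses a simple dynamic program rather than the analytical solutions derived in the paper; the function name `count_distributions` and its argument layout are assumptions made for illustration.

```python
def count_distributions(n_balls, lower, upper):
    """Count placements of n_balls indistinguishable balls into distinct urns.

    Urn i must hold between lower[i] and upper[i] balls, inclusive.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have one entry per urn")
    # ways[b] = number of ways to put exactly b balls into the urns seen so far
    ways = [1] + [0] * n_balls
    for lo, hi in zip(lower, upper):
        new_ways = [0] * (n_balls + 1)
        for used, count in enumerate(ways):
            if count == 0:
                continue
            # Try every admissible load for the current urn
            for load in range(lo, min(hi, n_balls - used) + 1):
                new_ways[used + load] += count
        ways = new_ways
    return ways[n_balls]


if __name__ == "__main__":
    # With no binding constraints this reduces to stars and bars: C(7, 2)
    print(count_distributions(5, [0, 0, 0], [5, 5, 5]))
    # An upper capacity of 2 per urn leaves only the arrangements of (2, 2, 1)
    print(count_distributions(5, [0, 0, 0], [2, 2, 2]))
```

A brute-force count like this is mainly useful as a sanity check against closed-form expressions on small inputs.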
- [33] arXiv:2502.03500 (cross-list from eess.IV) [pdf, html, other]
-
Title: Efficient Image Restoration via Latent Consistency Flow Matching
Comments: 21 pages, 11 figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Recent advances in generative image restoration (IR) have demonstrated impressive results. However, these methods are hindered by their substantial size and computational demands, rendering them unsuitable for deployment on edge devices. This work introduces ELIR, an Efficient Latent Image Restoration method. ELIR operates in latent space by first predicting the latent representation of the minimum mean square error (MMSE) estimator and then transporting this estimate to high-quality images using a latent consistency flow-based model. Consequently, ELIR is more than 4x faster compared to the state-of-the-art diffusion and flow-based approaches. Moreover, ELIR is also more than 4x smaller, making it well-suited for deployment on resource-constrained edge devices. Comprehensive evaluations of various image restoration tasks show that ELIR achieves competitive results, effectively balancing distortion and perceptual quality metrics while offering improved efficiency in terms of memory and computation.
- [34] arXiv:2502.03587 (cross-list from cs.LG) [pdf, html, other]
-
Title: Stein Discrepancy for Unsupervised Domain Adaptation
Comments: 24 pages, 9 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Unsupervised domain adaptation (UDA) leverages information from a labeled source dataset to improve accuracy on a related but unlabeled target dataset. A common approach to UDA is aligning representations from the source and target domains by minimizing the distance between their data distributions. Previous methods have employed distances such as Wasserstein distance and maximum mean discrepancy. However, these approaches are less effective when the target data is significantly scarcer than the source data. Stein discrepancy is an asymmetric distance between distributions that relies on one distribution only through its score function. In this paper, we propose a novel UDA method that uses Stein discrepancy to measure the distance between source and target domains. We develop a learning framework using both non-kernelized and kernelized Stein discrepancy. Theoretically, we derive an upper bound for the generalization error. Numerical experiments show that our method outperforms existing methods using other domain discrepancy measures when only small amounts of target data are available.
- [35] arXiv:2502.03600 (cross-list from econ.EM) [pdf, html, other]
-
Title: Type 2 Tobit Sample Selection Models with Bayesian Additive Regression Trees
Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)
This paper introduces Type 2 Tobit Bayesian Additive Regression Trees (TOBART-2). BART can produce accurate individual-specific treatment effect estimates. However, in practice estimates are often biased by sample selection. We extend the Type 2 Tobit sample selection model to account for nonlinearities and model uncertainty by including sums of trees in both the selection and outcome equations. A Dirichlet Process Mixture distribution for the error terms allows for departure from the assumption of bivariate normally distributed errors. Soft trees and a Dirichlet prior on splitting probabilities improve modeling of smooth and sparse data generating processes. We include a simulation study and an application to the RAND Health Insurance Experiment data set.
- [36] arXiv:2502.03644 (cross-list from math.NA) [pdf, html, other]
-
Title: Quasi-Monte Carlo Methods: What, Why, and How?
Subjects: Numerical Analysis (math.NA); Computation (stat.CO)
Many questions in quantitative finance, uncertainty quantification, and other disciplines are answered by computing the population mean, $\mu := \mathbb{E}(Y)$, where instances of $Y:=f(\boldsymbol{X})$ may be generated by numerical simulation and $\boldsymbol{X}$ has a simple probability distribution. The population mean can be approximated by the sample mean, $\hat{\mu}_n := n^{-1} \sum_{i=0}^{n-1} f(\boldsymbol{x}_i)$ for a well chosen sequence of nodes, $\{\boldsymbol{x}_0, \boldsymbol{x}_1, \ldots\}$ and a sufficiently large sample size, $n$. Computing $\mu$ is equivalent to computing a $d$-dimensional integral, $\int f(\boldsymbol{x}) \varrho(\boldsymbol{x}) \, \mathrm{d} \boldsymbol{x}$, where $\varrho$ is the probability density for $\boldsymbol{X}$.
Quasi-Monte Carlo methods replace independent and identically distributed sequences of random vector nodes, $\{\boldsymbol{x}_i \}_{i = 0}^{\infty}$, by low discrepancy sequences. This accelerates the convergence of $\hat{\mu}_n$ to $\mu$ as $n \to \infty$.
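The contrast between the two kinds of nodes can be sketched with a hand-rolled Halton sequence, one common low discrepancy construction. This is an illustrative sketch, not code from the tutorial; the integrand $f(\boldsymbol{x}) = x_1 x_2$ on the unit square (with $\mu = 1/4$) and all function names are assumptions chosen for the example.

```python
import random


def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the digits of index about the radix point."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def halton_point(index, bases=(2, 3)):
    """The index-th Halton node, using one coprime base per coordinate."""
    return tuple(radical_inverse(index, base) for base in bases)


def f(x):
    """Example integrand; its mean over the uniform unit square is 1/4."""
    return x[0] * x[1]


def qmc_estimate(n):
    """Sample mean of f over the first n Halton nodes x_0, ..., x_{n-1}."""
    return sum(f(halton_point(i)) for i in range(n)) / n


def iid_estimate(n, seed=0):
    """Sample mean of f over n independent uniform nodes (plain Monte Carlo)."""
    rng = random.Random(seed)
    return sum(f((rng.random(), rng.random())) for _ in range(n)) / n


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        print(n, abs(qmc_estimate(n) - 0.25), abs(iid_estimate(n) - 0.25))
```

Libraries such as SciPy ship tested generators for Halton and Sobol' sequences, which would normally be preferred over a hand-written radical inverse.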
This tutorial describes low discrepancy sequences and their quality measures. We demonstrate the performance gains possible with quasi-Monte Carlo methods. Moreover, we describe how to formulate problems to realize the greatest performance gains using quasi-Monte Carlo. We also briefly describe the use of quasi-Monte Carlo methods for problems beyond computing the mean, $\mu$.
- [37] arXiv:2502.03669 (cross-list from cs.LG) [pdf, html, other]
-
Title: Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set
Comments: 24 pages, 7 figures, 8 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Optimization and Control (math.OC); Machine Learning (stat.ML)
AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Experiments on standard graph families show that AI-based algorithms fail to outperform and, in many cases, to match the solution quality of the state-of-the-art classical solver KaMIS running on a single CPU. Some GPU-based methods even perform similarly to the simplest heuristic, degree-based greedy. Even with post-processing techniques like local search, AI-based methods still perform worse than CPU-based solvers.
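The degree-based greedy baseline mentioned above is commonly implemented as follows: repeatedly take a vertex of minimum remaining degree, then delete it and its neighbours. The sketch below is one such variant and may differ in details (for example tie-breaking) from the version benchmarked in the paper; the name `greedy_mis` and the adjacency-dictionary input are assumptions made for illustration.

```python
def greedy_mis(adjacency):
    """Degree-based greedy heuristic for an independent set.

    adjacency maps each vertex to the set of its neighbours in an
    undirected graph without self-loops. Ties are broken by vertex label.
    """
    # Work on a copy so the caller's graph is left untouched
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    independent = []
    while remaining:
        # Pick the vertex with the fewest neighbours still in the graph
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        independent.append(v)
        # Remove the chosen vertex together with all of its neighbours
        removed = remaining[v] | {v}
        for u in removed:
            del remaining[u]
        for nbrs in remaining.values():
            nbrs -= removed
    return independent


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 - 4
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(greedy_mis(path))
```

The heuristic never backtracks, which is the property the abstract contrasts with solvers such as KaMIS.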
We develop a new mode of analysis to reveal that non-backtracking AI methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy approach, and thus worse than KaMIS. We also find that CPU-based algorithms, notably KaMIS, have strong performance on sparse random graphs, which appears to refute a well-known conjectured upper bound for efficient algorithms from Coja-Oghlan & Efthymiou (2015).
- [38] arXiv:2502.03685 (cross-list from cs.CL) [pdf, other]
-
Title: Controlled LLM Decoding via Discrete Auto-regressive Biasing
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Controlled text generation allows for enforcing user-defined constraints on large language model outputs, an increasingly important field as LLMs become more prevalent in everyday life. One common approach uses energy-based decoding, which defines a target distribution through an energy function that combines multiple constraints into a weighted average. However, these methods often struggle to balance fluency with constraint satisfaction, even with extensive tuning of the energy function's coefficients. In this paper, we identify that this suboptimal balance arises from sampling in continuous space rather than the natural discrete space of text tokens. To address this, we propose Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain. Specifically, we introduce a new formulation for controlled text generation by defining a joint distribution over the generated sequence and an auxiliary bias sequence. To efficiently sample from this joint distribution, we propose a Langevin-within-Gibbs sampling algorithm using gradient-based discrete MCMC. Our method significantly improves constraint satisfaction while maintaining comparable or better fluency, all with even lower computational costs. We demonstrate the advantages of our controlled decoding method on sentiment control, language detoxification, and keyword-guided generation.
- [39] arXiv:2502.03686 (cross-list from cs.LG) [pdf, html, other]
-
Title: Variational Control for Guidance in Diffusion Models
Comments: 8 pages in main text. Total of 20 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear and (blind) non-linear inverse problems without requiring additional model training or modifications. For instance, in ImageNet non-linear deblurring, our model achieves an FID score of 34.31, significantly improving over the best pretrained-method baseline (FID 78.07). We will make the code available in a future update.
- [40] arXiv:2502.03708 (cross-list from cs.CL) [pdf, html, other]
-
Title: Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at this https URL.
- [41] arXiv:2502.03795 (cross-list from cs.LG) [pdf, html, other]
-
Title: Distribution learning via neural differential equations: minimal energy regularization and approximation theory
Subjects: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Methodology (stat.ME); Machine Learning (stat.ML)
Neural ordinary differential equations (ODEs) provide expressive representations of invertible transport maps that can be used to approximate complex probability distributions, e.g., for generative modeling, density estimation, and Bayesian inference. We show that for a large class of transport maps $T$, there exists a time-dependent ODE velocity field realizing a straight-line interpolation $(1-t)x + tT(x)$, $t \in [0,1]$, of the displacement induced by the map. Moreover, we show that such velocity fields are minimizers of a training objective containing a specific minimum-energy regularization. We then derive explicit upper bounds for the $C^k$ norm of the velocity field that are polynomial in the $C^k$ norm of the corresponding transport map $T$; in the case of triangular (Knothe--Rosenblatt) maps, we also show that these bounds are polynomial in the $C^k$ norms of the associated source and target densities. Combining these results with stability arguments for distribution approximation via ODEs, we show that Wasserstein or Kullback--Leibler approximation of the target distribution to any desired accuracy $\epsilon > 0$ can be achieved by a deep neural network representation of the velocity field whose size is bounded explicitly in terms of $\epsilon$, the dimension, and the smoothness of the source and target densities. The same neural network ansatz yields guarantees on the value of the regularized training objective.
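The straight-line construction can be made concrete in one dimension. The sketch below assumes an affine map $T(x) = ax + b$ with $a > 0$, for which the interpolation $(1-t)x + tT(x)$ can be inverted in closed form; the map, the function names and the Euler integrator are illustrative assumptions and are not taken from the paper.

```python
def make_velocity(a, b):
    """Velocity field whose flow follows (1 - t) * x + t * T(x) for T(x) = a * x + b.

    Requires a > 0 so that the interpolation stays invertible on [0, 1].
    """
    if a <= 0:
        raise ValueError("a must be positive for the interpolation to be invertible")

    def velocity(t, y):
        # Recover the starting point x from y = (1 - t + t * a) * x + t * b
        x = (y - t * b) / (1.0 - t + t * a)
        # Each particle moves with its constant displacement T(x) - x
        return (a * x + b) - x

    return velocity


def integrate(velocity, x0, steps=100):
    """Forward Euler integration of dy/dt = velocity(t, y) from t = 0 to t = 1."""
    y, h = x0, 1.0 / steps
    for k in range(steps):
        y += h * velocity(k * h, y)
    return y


if __name__ == "__main__":
    velocity = make_velocity(a=2.0, b=1.0)
    x0 = 0.5
    print(integrate(velocity, x0), 2.0 * x0 + 1.0)  # flow endpoint versus T(x0)
```

Because every trajectory is a straight line traversed at constant speed, even a coarse Euler scheme reproduces $T(x_0)$ up to rounding error in this toy case.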
- [42] arXiv:2502.03802 (cross-list from cs.LG) [pdf, html, other]
-
Title: MXMap: A Multivariate Cross Mapping Framework for Causal Discovery in Dynamical Systems
Comments: Accepted by CLeaR 2025; Main manuscript 18 pages, appendix 24 pages, 30 tables
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Methodology (stat.ME)
Convergent Cross Mapping (CCM) is a powerful method for detecting causality in coupled nonlinear dynamical systems, providing a model-free approach to capture dynamic causal interactions. Partial Cross Mapping (PCM) was introduced as an extension of CCM to address indirect causality in three-variable systems by comparing cross-mapping quality between direct cause-effect mapping and indirect mapping through an intermediate conditioning variable. However, PCM remains limited to univariate delay embeddings in its cross-mapping processes. In this work, we extend PCM to the multivariate setting, introducing multiPCM, which leverages multivariate embeddings to more effectively distinguish indirect causal relationships. We further propose a multivariate cross-mapping framework (MXMap) for causal discovery in dynamical systems. This two-phase framework combines (1) pairwise CCM tests to establish an initial causal graph and (2) multiPCM to refine the graph by pruning indirect causal connections. Through experiments on simulated data and the ERA5 Reanalysis weather dataset, we demonstrate the effectiveness of MXMap. Additionally, MXMap is compared against several baseline methods, showing advantages in accuracy and causal graph refinement.
- [43] arXiv:2502.03952 (cross-list from cs.LG) [pdf, html, other]
-
Title: Bridging the inference gap in Mutimodal Variational Autoencoders
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
From medical diagnosis to autonomous vehicles, critical applications rely on the integration of multiple heterogeneous data modalities. Multimodal Variational Autoencoders offer versatile and scalable methods for generating unobserved modalities from observed ones. Recent models using mixtures-of-experts aggregation suffer from theoretically grounded limitations that restrict their generation quality on complex datasets. In this article, we propose a novel interpretable model able to learn both joint and conditional distributions without introducing mixture aggregation. Our model follows a multistage training process: first modeling the joint distribution with variational inference and then modeling the conditional distributions with Normalizing Flows to better approximate true posteriors. Importantly, we also propose to extract and leverage the information shared between modalities to improve the conditional coherence of generated samples. Our method achieves state-of-the-art results on several benchmark datasets.
- [44] arXiv:2502.04067 (cross-list from q-bio.PE) [pdf, html, other]
-
Title: Generalised Bayesian distance-based phylogenetics for the genomics era
Authors: Matthew J. Penn, Neil Scheidwasser, Mark P. Khurana, Christl A. Donnelly, David A. Duchêne, Samir Bhatt
Comments: 53 pages, 6 figures
Subjects: Populations and Evolution (q-bio.PE); Statistics Theory (math.ST)
As whole genomes become widely available, maximum likelihood and Bayesian phylogenetic methods are demonstrating their limits in meeting the escalating computational demands. Conversely, distance-based phylogenetic methods are efficient, but are rarely favoured due to their inferior performance. Here, we extend distance-based phylogenetics using an entropy-based likelihood of the evolution among pairs of taxa, allowing for fast Bayesian inference in genome-scale datasets. We provide evidence of a close link between the inference criteria used in distance methods and Felsenstein's likelihood, such that the methods are expected to have comparable performance in practice. Using the entropic likelihood, we perform Bayesian inference on three phylogenetic benchmark datasets and find that estimates closely correspond with previous inferences. We also apply this rapid inference approach to a 60-million-site alignment from 363 avian taxa, covering most avian families. The method has outstanding performance and reveals substantial uncertainty in the avian diversification events immediately after the K-Pg transition event. The entropic likelihood allows for efficient Bayesian phylogenetic inference, accommodating the analysis demands of the genomic era.
- [45] arXiv:2502.04168 (cross-list from quant-ph) [pdf, other]
-
Title: Cyclic quantum causal modelling with a graph separation theorem
Comments: 41+41 pages. A companion paper by the same authors, focussing on cyclic classical (functional) causal models has been submitted to the arXiv concurrently with primary class [math.ST]. Comments are welcome
Subjects: Quantum Physics (quant-ph); Statistics Theory (math.ST); Machine Learning (stat.ML)
Causal modelling frameworks link observable correlations to causal explanations, which is a crucial aspect of science. These models represent causal relationships through directed graphs, with vertices and edges denoting systems and transformations within a theory. Most studies focus on acyclic causal graphs, where well-defined probability rules and powerful graph-theoretic properties like the d-separation theorem apply. However, understanding complex feedback processes and exotic fundamental scenarios with causal loops requires cyclic causal models, where such results do not generally hold. While progress has been made in classical cyclic causal models, challenges remain in uniquely fixing probability distributions and identifying graph-separation properties applicable in general cyclic models. In cyclic quantum scenarios, existing frameworks have focussed on a subset of possible cyclic causal scenarios, with graph-separation properties yet unexplored. This work proposes a framework applicable to all consistent quantum and classical cyclic causal models on finite-dimensional systems. We address these challenges by introducing a robust probability rule and a novel graph-separation property, p-separation, which we prove to be sound and complete for all such models. Our approach maps cyclic causal models to acyclic ones with post-selection, leveraging the post-selected quantum teleportation protocol. We characterize these protocols and their success probabilities along the way. We also establish connections between this formalism and other classical and quantum frameworks to inform a more unified perspective on causality. This provides a foundation for more general cyclic causal discovery algorithms and for systematically extending open problems and techniques from acyclic informational networks (e.g., certification of non-classicality) to cyclic causal structures and networks.
- [46] arXiv:2502.04172 (cross-list from cs.LG) [pdf, html, other]
-
Title: Archetypal Analysis for Binary Data
Comments: 5 pages, Accepted at ICASSP 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Archetypal analysis (AA) is a matrix decomposition method that identifies distinct patterns using convex combinations of the data points, denoted archetypes, with each data point in turn reconstructed as a convex combination of the archetypes. AA thereby forms a polytope representing trade-offs of the distinct aspects in the data. Most existing methods for AA are designed for continuous data and do not exploit the structure of the data distribution. In this paper, we propose two new optimization frameworks for archetypal analysis for binary data: (i) a second-order approximation of the AA likelihood based on the Bernoulli distribution, with efficient closed-form updates using an active set procedure for learning the convex combinations defining the archetypes and a sequential minimal optimization strategy for learning the observation-specific reconstructions; and (ii) a Bernoulli-likelihood-based version of the principal convex hull analysis (PCHA) algorithm originally developed for least squares optimization. We compare these approaches with the only existing binary AA procedure, which relies on multiplicative updates, and demonstrate their superiority on both synthetic and real binary data. Notably, the proposed optimization frameworks for AA can easily be extended to other data distributions, providing generic efficient optimization frameworks for AA based on tailored likelihood functions reflecting the underlying data distribution.
- [47] arXiv:2502.04204 (cross-list from cs.LG) [pdf, html, other]
-
Title: "Short-length" Adversarial Training Helps LLMs Defend "Long-length" Jailbreak Attacks: Theoretical and Empirical Evidence
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. One way to mitigate such attacks is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the number of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the suffix length during AT. Our findings show that it is practical to defend against "long-length" jailbreak attacks via efficient "short-length" AT. The code is available at this https URL.
- [48] arXiv:2502.04226 (cross-list from cs.CV) [pdf, html, other]
-
Title: Keep It Light! Simplifying Image Clustering Via Text-Free Adapters
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computation (stat.CO); Machine Learning (stat.ML)
Many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders, and text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.
- [49] arXiv:2502.04249 (cross-list from cs.AI) [pdf, html, other]
-
Title: Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study
Comments: 9 pages, 1 figure
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We investigate the Free Energy Principle as a foundation for measuring risk in agentic and multi-agent systems. From these principles we introduce a Cumulative Risk Exposure metric that is flexible to differing contexts and needs. We contrast this to other popular theories for safe AI that hinge on massive amounts of data or describing arbitrarily complex world models. In our framework, stakeholders need only specify their preferences over system outcomes, providing straightforward and transparent decision rules for risk governance and mitigation. This framework naturally accounts for uncertainty in both world model and preference model, allowing for decision-making that is epistemically and axiologically humble, parsimonious, and future-proof. We demonstrate this novel approach in a simplified autonomous vehicle environment with multi-agent vehicles whose driving policies are mediated by gatekeepers that evaluate, in an online fashion, the risk to the collective safety in their neighborhood, and intervene through each vehicle's policy when appropriate. We show that the introduction of gatekeepers in an AV fleet, even at low penetration, can generate significant positive externalities in terms of increased system safety.
- [50] arXiv:2502.04262 (cross-list from cs.LG) [pdf, html, other]
Title: Efficient Randomized Experiments Using Foundation Models
Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M. Duch, Fanny Yang, Issa J. Dahabreh
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Randomized experiments are the preferred approach for evaluating the effects of interventions, but they are costly and often yield estimates with substantial uncertainty. On the other hand, in silico experiments leveraging foundation models offer a cost-effective alternative that can potentially attain higher statistical precision. However, the benefits of in silico experiments come with a significant risk: statistical inferences are not valid if the models fail to accurately predict experimental responses to interventions. In this paper, we propose a novel approach that integrates the predictions from multiple foundation models with experimental data while preserving valid statistical inference. Our estimator is consistent and asymptotically normal, with asymptotic variance no larger than the standard estimator based on experimental data alone. Importantly, these statistical properties hold even when model predictions are arbitrarily biased. Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone.
- [51] arXiv:2502.04270 (cross-list from cs.LG) [pdf, html, other]
Title: PILAF: Optimal Human Preference Sampling for Reward Modeling
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.
- [52] arXiv:2502.04290 (cross-list from cs.LG) [pdf, html, other]
Title: Every Call is Precious: Global Optimization of Black-Box Functions with Unknown Lipschitz Constants
Comments: Accepted at AISTATS 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Optimizing expensive, non-convex, black-box Lipschitz continuous functions presents significant challenges, particularly when the Lipschitz constant of the underlying function is unknown. Such problems often demand numerous function evaluations to approximate the global optimum, which can be prohibitive in terms of time, energy, or resources. In this work, we introduce Every Call is Precious (ECP), a novel global optimization algorithm that minimizes unpromising evaluations by strategically focusing on potentially optimal regions. Unlike previous approaches, ECP eliminates the need to estimate the Lipschitz constant, thereby avoiding additional function evaluations. ECP guarantees no-regret performance for infinite evaluation budgets and achieves minimax-optimal regret bounds within finite budgets. Extensive ablation studies validate the algorithm's robustness, while empirical evaluations show that ECP outperforms 10 benchmark algorithms including Lipschitz, Bayesian, bandits, and evolutionary methods across 30 multi-dimensional non-convex synthetic and real-world optimization problems, which positions ECP as a competitive approach for global optimization.
- [53] arXiv:2502.04297 (cross-list from cs.LG) [pdf, other]
Title: Statistical guarantees for continuous-time policy evaluation: blessing of ellipticity and new tradeoffs
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
We study the estimation of the value function for continuous-time Markov diffusion processes using a single, discretely observed ergodic trajectory. Our work provides non-asymptotic statistical guarantees for the least-squares temporal-difference (LSTD) method, with performance measured in the first-order Sobolev norm. Specifically, the estimator attains an $O(1 / \sqrt{T})$ convergence rate when using a trajectory of length $T$; notably, this rate is achieved as long as $T$ scales nearly linearly with both the mixing time of the diffusion and the number of basis functions employed.
A key insight of our approach is that the ellipticity inherent in the diffusion process ensures robust performance even as the effective horizon diverges to infinity. Moreover, we demonstrate that the Markovian component of the statistical error can be controlled by the approximation error, while the martingale component grows at a slower rate relative to the number of basis functions. By carefully balancing these two sources of error, our analysis reveals novel trade-offs between approximation and statistical errors.
- [54] arXiv:2502.04309 (cross-list from cs.LG) [pdf, html, other]
Title: Targeted Learning for Data Fairness
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Data and algorithms have the potential to produce and perpetuate discrimination and disparate treatment. As such, significant effort has been invested in developing approaches to defining, detecting, and eliminating unfair outcomes in algorithms. In this paper, we focus on performing statistical inference for fairness. Prior work in fairness inference has largely focused on inferring the fairness properties of a given predictive algorithm. Here, we expand fairness inference by evaluating fairness in the data generating process itself, referred to here as data fairness. We perform inference on data fairness using targeted learning, a flexible framework for nonparametric inference. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. Additionally, we find that our estimators for probabilistic metrics exploit double robustness. To validate our approach, we perform several simulations and apply our estimators to real data.
Cross submissions (showing 23 of 23 entries)
- [55] arXiv:2202.13689 (replaced) [pdf, html, other]
Title: Bayesian Hierarchical Copula Models with a Dirichlet-Laplace Prior
Subjects: Methodology (stat.ME)
We discuss a Bayesian hierarchical copula model for clusters of financial time series. A similar approach has been developed in a recent paper. However, the prior distributions proposed there do not always provide a proper posterior. In order to circumvent the problem, we adopt a proper global-local shrinkage prior, which is also able to account for potential dependence structures among different clusters. The performance of the proposed model is presented via simulations and a real data analysis.
- [56] arXiv:2301.07476 (replaced) [pdf, html, other]
Title: Negative Moment Bounds for Sample Autocovariance Matrices of Stationary Processes Driven by Conditional Heteroscedastic Errors and Their Applications
Subjects: Statistics Theory (math.ST)
We establish a negative moment bound for the sample autocovariance matrix of a stationary process driven by conditional heteroscedastic errors. This moment bound enables us to asymptotically express the mean squared prediction error (MSPE) of the least squares predictor as the sum of three terms related to model complexity, model misspecification, and conditional heteroscedasticity. A direct application of this expression is the development of a model selection criterion that can asymptotically identify the best (in the sense of MSPE) subset AR model in the presence of misspecification and conditional heteroscedasticity. Finally, numerical simulations are conducted to confirm our theoretical results.
- [57] arXiv:2302.13828 (replaced) [pdf, html, other]
Title: Random forests for binary geospatial data
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
The manuscript develops a new method and theory for non-linear regression for binary dependent data using random forests. Existing implementations of random forests for binary data cannot explicitly account for data correlation common in geospatial and time-series settings. For continuous outcomes, recent work has extended random forests (RF) to RF-GLS that incorporate spatial covariance using the generalized least squares (GLS) loss. However, adoption of this idea for binary data is challenging due to the use of the Gini impurity measure in classification trees, which has no known extension to model dependence. We show that for binary data, the GLS loss is also an extension of the Gini impurity measure, as the latter is exactly equivalent to the ordinary least squares (OLS) loss. This justifies using RF-GLS for non-parametric mean function estimation for binary dependent data. We then consider the special case of generalized mixed effects models, the traditional statistical model for binary geospatial data, which models the spatial random effects as a Gaussian process (GP). We propose a novel link-inversion technique that embeds the RF-GLS estimate of the mean function from the first step within the generalized mixed effects model framework, enabling estimation of non-linear covariate effects and offering spatial predictions. We establish consistency of our method, RF-GP, for both mean function and covariate effect estimation. The theory holds for a general class of stationary absolutely regular dependent processes that includes common choices like Gaussian processes with Matérn or compactly supported covariances and autoregressive processes. The theory relaxes the common assumption of additive mean functions and accounts for the non-linear link. We demonstrate that RF-GP outperforms competing methods for estimation and prediction in both simulated and real-world data.
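The equivalence between the Gini impurity and the OLS loss for binary outcomes can be checked with a short calculation. The following is only a sketch in notation assumed here for illustration (it is not taken from the paper): a single tree node holds $n$ binary responses $y_i \in \{0, 1\}$ with node mean $\bar{p} = \frac{1}{n}\sum_i y_i$, the node's OLS loss is the residual sum of squares around $\bar{p}$, and $G$ denotes the two-class Gini impurity.

$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \bar{p})^2
  &= \sum_{i=1}^{n} y_i^2 - 2\bar{p}\sum_{i=1}^{n} y_i + n\bar{p}^2
   = n\bar{p} - 2n\bar{p}^2 + n\bar{p}^2
   = n\,\bar{p}(1 - \bar{p}), \\
G &= 1 - \bar{p}^2 - (1 - \bar{p})^2 = 2\,\bar{p}(1 - \bar{p}),
\end{aligned}
$$

where the first line uses $y_i^2 = y_i$ for binary responses. Hence the node's OLS loss equals $\tfrac{n}{2} G$, so choosing a split to minimize the sum of OLS losses over the child nodes is the same as minimizing the size-weighted Gini impurity of the children.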
- [58] arXiv:2303.01186 (replaced) [pdf, html, other]
Title: Discrete-time Competing-Risks Regression with or without Penalization
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Many studies employ the analysis of time-to-event data that incorporates competing risks and right censoring. Most methods and software packages are geared towards analyzing data that comes from a continuous failure time distribution. However, failure-time data may sometimes be discrete either because time is inherently discrete or due to imprecise measurement. This paper introduces a new estimation procedure for discrete-time survival analysis with competing events. The proposed approach offers a key advantage over existing procedures: it allows for straightforward integration and application of widely used regularized regression and screening-features methods. We illustrate the benefits of our proposed approach by a comprehensive simulation study. Additionally, we showcase the utility of the proposed procedure by estimating a survival model for the length of stay of patients hospitalized in the intensive care unit, considering three competing events: discharge to home, transfer to another medical facility, and in-hospital death. A Python package, PyDTS, is available for applying the proposed method with additional features.
- [59] arXiv:2304.00059 (replaced) [pdf, html, other]
Title: Resolving power: A general approach to compare the distinguishing ability of threshold-free evaluation metrics
Comments: 23 pages, 9 figures, 3 tables
Journal-ref: Machine Learning, 114(1), 9 (2025)
Subjects: Methodology (stat.ME)
Selecting an evaluation metric is fundamental to model development, but uncertainty remains about when certain metrics are preferable and why. This paper introduces the concept of *resolving power* to describe the ability of an evaluation metric to distinguish between binary classifiers of similar quality. This ability depends on two attributes: 1. The metric's response to improvements in classifier quality (its signal), and 2. The metric's sampling variability (its noise). The paper defines resolving power generically as a metric's sampling uncertainty scaled by its signal. A simulation study compares the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) in a variety of contexts. It finds that the AUROC generally has greater resolving power, but that the AUPRC is better when searching among high-quality classifiers applied to low prevalence outcomes. The paper also proposes an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model. The AUROC is useful for developing the resolving power concept, but it has been criticized for being misleading. Newer metrics developed to address its interpretative issues can be easily incorporated into the resolving power framework. The best metrics for model search will be both interpretable and high in resolving power. Sometimes these objectives will conflict and how to address this tension remains an open question.
- [60] arXiv:2306.09518 (replaced) [pdf, html, other]
Title: Conditional variable screening for ultra-high dimensional longitudinal data with time interactions
Subjects: Methodology (stat.ME)
In recent years we have been able to gather large amounts of genomic data at a fast rate, creating situations where the number of variables greatly exceeds the number of observations. In these situations, most models that can handle a moderately high dimension will now become computationally infeasible or unstable. Hence, there is a need for a pre-screening of variables to reduce the dimension efficiently and accurately to a more moderate scale. There has been much work to develop such screening procedures for independent outcomes. However, much less work has been done for high-dimensional longitudinal data in which the observations can no longer be assumed to be independent. In addition, it is of interest to capture possible interactions between the genomic variable and time in many of these longitudinal studies. In this work, we propose a novel conditional screening procedure that ranks variables according to the likelihood value at the maximum likelihood estimates in a marginal linear mixed model, where the genomic variable and its interaction with time are included in the model. This is to our knowledge the first conditional screening approach for clustered data. We prove that this approach enjoys the sure screening property, and assess the finite sample performance of the method through simulations.
- [61] arXiv:2310.00107 (replaced) [pdf, other]
Title: Linear classification methods for multivariate repeated measures data -- a simulation study
Subjects: Methodology (stat.ME)
Researchers in the behavioral and social sciences use linear discriminant analysis (LDA) for predictions of group membership (classification) and for identifying the variables most relevant to group separation among a set of continuous correlated variables (description). In these and other disciplines, longitudinal data are often collected which provide additional temporal information. Linear classification methods for repeated measures data are more sensitive to actual group differences by taking the complex correlations between time points and variables into account, but are rarely discussed in the literature. Moreover, psychometric data rarely fulfill the multivariate normality assumption. In this paper, we compare existing linear classification algorithms for nonnormally distributed multivariate repeated measures data in a simulation study based on psychological questionnaire data comprising Likert scales. The results show that in data without any specific assumed structure and larger sample sizes, the robust alternatives to standard repeated measures LDA may not be needed. To our knowledge, this is one of the few studies discussing repeated measures classification techniques, and the first one comparing multiple alternatives among each other.
- [62] arXiv:2401.07111 (replaced) [pdf, html, other]
Title: Bayesian Signal Matching for Transfer Learning in ERP-Based Brain Computer Interface
Comments: 35 pages, 6 figures, 2 tables
Subjects: Applications (stat.AP); Computation (stat.CO)
An Event-Related Potential (ERP)-based Brain-Computer Interface (BCI) Speller System assists people with disabilities to communicate by decoding electroencephalogram (EEG) signals. A P300-ERP embedded in EEG signals arises in response to a rare, but relevant event (target) among a series of irrelevant events (non-target). Different machine learning methods have constructed binary classifiers to detect target events, known as calibration. The existing calibration strategy uses data from participants themselves with lengthy training time. Participants feel bored and distracted, which causes biased P300 estimation and decreased prediction accuracy. To resolve this issue, we propose a Bayesian signal matching (BSM) framework to calibrate EEG signals from a new participant using data from source participants. BSM specifies the joint distribution of stimulus-specific EEG signals among source participants via a Bayesian hierarchical mixture model. We apply the inference strategy. If source and new participants are similar, they share the same set of model parameters; otherwise, they keep their own sets of model parameters; we predict on the testing data using parameters of the baseline cluster directly. Our hierarchical framework can be generalized to other base classifiers with parametric forms. We demonstrate the advantages of BSM using simulations and focus on the real data analysis among participants with neuro-degenerative diseases.
- [63] arXiv:2403.17132 (replaced) [pdf, html, other]
Title: A Personalized Predictive Model that Jointly Optimizes Discrimination and Calibration
Subjects: Methodology (stat.ME); Applications (stat.AP)
Precision medicine is accelerating rapidly in the field of health research. This includes fitting predictive models for individual patients based on patient similarity in an attempt to improve model performance. We propose an algorithm which fits a personalized predictive model (PPM) using an optimal size of a similar subpopulation that jointly optimizes model discrimination and calibration. This is motivated by the criticism that calibration is assessed far less often than discrimination, even though poorly calibrated models can be misleading. We define a mixture loss function that considers model discrimination and calibration, and allows for flexibility in emphasizing one performance measure over another. We empirically show that the relationship between the size of subpopulation and calibration is quadratic, which motivates the development of our jointly optimized model. We also investigate the effect of within-population patient weighting on performance and conclude that the size of subpopulation has a larger effect on the predictive performance of the PPM compared to the choice of weight function.
- [64] arXiv:2403.18353 (replaced) [pdf, html, other]
Title: Early Stopping for Ensemble Kalman-Bucy Inversion
Subjects: Statistics Theory (math.ST)
Bayesian linear inverse problems aim to recover an unknown signal from noisy observations, incorporating prior knowledge. This paper analyses a data dependent method to choose the scale parameter of a Gaussian prior. The method we study arises from early stopping methods, which have been successfully applied to a range of problems for statistical inverse problems in the frequentist setting. These results are extended to the Bayesian setting. We study the use of a discrepancy based stopping rule in the setting of random noise. Our proposed stopping rule results in optimal rates under certain conditions on the prior covariance operator. We furthermore derive for which class of signals this method is adaptive. It is also shown that the associated posterior contracts at the optimal rate and provides a conservative measure of uncertainty. We implement the proposed stopping rule using the continuous-time ensemble Kalman-Bucy filter (EnKBF). The fictitious time parameter replaces the scale parameter, and the ensemble size is appropriately adjusted in order to not lose statistical optimality of the computed estimator. The EnKBF, then, gives a continuous process from the prior distribution to the posterior which is terminated using the proposed stopping rule.
- [65] arXiv:2404.16490 (replaced) [pdf, html, other]
Title: On Neighbourhood Cross Validation
Comments: Further improved covariance matrix estimation under short range autocorrelation (section 6), fuller discussion of NCV motivation (section 5) under autocorrelation, some referencing improvements (sections 1 and 2)
Subjects: Methodology (stat.ME); Computation (stat.CO)
Many varieties of cross validation would be statistically appealing for the estimation of smoothing and other penalized regression hyperparameters, were it not for the high cost of evaluating such criteria. Here it is shown how to efficiently and accurately compute and optimize a broad variety of cross validation criteria for a wide range of models estimated by minimizing a quadratically penalized loss. The leading order computational cost of hyperparameter estimation is made comparable to the cost of a single model fit given hyperparameters. In many cases this represents an $O(n)$ computational saving when modelling $n$ data. This development makes it feasible, for the first time, to use leave-out-neighbourhood cross validation to deal with the widespread problem of un-modelled short range autocorrelation which otherwise leads to underestimation of smoothing parameters. It is also shown how to accurately quantify uncertainty in this case, despite the un-modelled autocorrelation. Practical examples are provided including smooth quantile regression, generalized additive models for location scale and shape, and focussing particularly on dealing with un-modelled autocorrelation.
- [66] arXiv:2404.18370 (replaced) [pdf, html, other]
Title: Out-of-distribution generalization under random, dense distributional shifts
Subjects: Methodology (stat.ME)
Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that $p(y|x)$ remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we discuss empirical evidence for such random dense distributional shifts. Then, we develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we apply the framework to several real-world datasets and discuss diagnostics to evaluate the fit of the distributional uncertainty model.
- [67] arXiv:2405.13690 (replaced) [pdf, other]
Title: Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach
Subjects: Statistics Theory (math.ST); Disordered Systems and Neural Networks (cond-mat.dis-nn)
We study the asymptotic behaviour of the Regularized Maximum Partial Likelihood Estimator (RMPLE) in the proportional limit, considering an arbitrary convex regularizer and assuming that the covariates $\mathbf{X}_i\in\mathbb{R}^{p}$ follow a multivariate Gaussian law with covariance $\mathbf{I}_p/p$ for each $i=1, \dots, n$. In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, that we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates are large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory *solely from the data*, *without* actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.
- [68] arXiv:2405.19920 (replaced) [pdf, other]
Title: The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions
Subjects: Computation (stat.CO); Econometrics (econ.EM)
We present the ARR2 prior, a joint prior over the auto-regressive components in Bayesian time-series models and their induced $R^2$. Compared to other priors designed for time-series models, the ARR2 prior allows for flexible and intuitive shrinkage. We derive the prior for pure auto-regressive models, and extend it to auto-regressive models with exogenous inputs, and state-space models. Through both simulations and real-world modelling exercises, we demonstrate the efficacy of the ARR2 prior in improving sparse and reliable inference, while showing greater inference quality and predictive performance than other shrinkage priors. An open-source implementation of the prior is provided.
- [69] arXiv:2406.03258 (replaced) [pdf, other]
Title: Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise
Comments: Accepted at International Conference on Machine Learning (ICML) 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Constructing valid prediction intervals rather than point estimates is a well-established approach for uncertainty quantification in the regression setting. Models equipped with this capacity output an interval of values in which the ground truth target will fall with some prespecified probability. This is an essential requirement in many real-world applications where simple point predictions' inability to convey the magnitude and frequency of errors renders them insufficient for high-stakes decisions. Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the (non-parametric) distribution of outputs. This method is simple, computationally inexpensive, interpretable, assumption-free, and effective. However, it does require that the specific quantiles being learned are chosen a priori. This results in (a) intervals that are arbitrarily symmetric around the median which is sub-optimal for realistic skewed distributions, or (b) learning an excessive number of intervals. In this work, we propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint whilst maintaining its strengths. We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities (e.g. mean width) whilst retaining the essential coverage guarantees of quantile regression.
- [70] arXiv:2409.04412 (replaced) [pdf, html, other]
Title: Robust Elicitable Functionals
Subjects: Methodology (stat.ME); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
Elicitable functionals and (strictly) consistent scoring functions are of interest due to their utility in determining (uniquely) optimal forecasts, and thus the ability to effectively backtest predictions. However, in practice, assuming that a distribution is correctly specified is too strong a belief to reliably hold. To remediate this, we incorporate a notion of statistical robustness into the framework of elicitable functionals, meaning that our robust functional accounts for "small" misspecifications of a baseline distribution. Specifically, we propose a robustified version of elicitable functionals by using the Kullback-Leibler divergence to quantify potential misspecifications from a baseline distribution. We show that the robust elicitable functionals admit unique solutions lying at the boundary of the uncertainty region, and provide conditions for existence and uniqueness. Since every elicitable functional possesses infinitely many scoring functions, we propose the class of b-homogeneous strictly consistent scoring functions, for which the robust functionals maintain desirable statistical properties. We show the applicability of the robust elicitable functional in several examples: in a reinsurance setting and in robust regression problems.
- [71] arXiv:2409.06473 (replaced) [pdf, html, other]
Title: Some statistical aspects of the Covid-19 response
Simon N. Wood, Ernst C. Wit, Paul M. McKeigue, Danshu Hu, Beth Flood, Lauren Corcoran, Thea Abou Jawad
Comments: Version finally accepted by Journal of the Royal Statistical Society (Series A) as a discussion paper
Subjects: Applications (stat.AP)
This paper discusses some statistical aspects of the U.K. Covid-19 pandemic response, focussing particularly on cases where we believe that a statistically questionable approach or presentation has had a substantial impact on public perception, or government policy, or both. We discuss the presentation of statistics relating to Covid risk, and the risk of the response measures, arguing that biases tended to operate in opposite directions, overplaying Covid risk and underplaying the response risks. We also discuss some issues around presentation of life loss data, excess deaths and the use of case data. The consequences of neglect of most individual variability from epidemic models, alongside the consequences of some other statistically important omissions are also covered. Finally the evidence for full stay at home lockdowns having been necessary to reverse waves of infection is examined, with new analyses provided for a number of European countries.
- [72] arXiv:2410.19190 (replaced) [pdf, html, other]
Title: A novel longitudinal rank-sum test for multiple primary endpoints in clinical trials: Applications to neurodegenerative disorders
Comments: Accepted by Statistics in Biopharmaceutical Research
Subjects: Methodology (stat.ME); Applications (stat.AP)
Neurodegenerative disorders such as Alzheimer's disease (AD) present a significant global health challenge, characterized by cognitive decline, functional impairment, and other debilitating effects. Current AD clinical trials often assess multiple longitudinal primary endpoints to comprehensively evaluate treatment efficacy. Traditional methods, however, may fail to capture global treatment effects, require larger sample sizes due to multiplicity adjustments, and may not fully exploit multivariate longitudinal data. To address these limitations, we introduce the Longitudinal Rank Sum Test (LRST), a novel nonparametric rank-based omnibus test statistic. The LRST enables a comprehensive assessment of treatment efficacy across multiple endpoints and time points without multiplicity adjustments, effectively controlling Type I error while enhancing statistical power. It offers flexibility against various data distributions encountered in AD research and maximizes the utilization of longitudinal data. Extensive simulations and real-data applications demonstrate the LRST's performance, underscoring its potential as a valuable tool in AD clinical trials.
- [73] arXiv:2412.00412 (replaced) [pdf, other]
-
Title: Functional worst risk minimization
Subjects: Statistics Theory (math.ST); Probability (math.PR)
The aim of this paper is to extend worst risk minimization, also called worst average loss minimization, to the functional realm. This means finding a functional regression representation that will be robust to future distribution shifts on the basis of data from two environments. In the classical non-functional realm, structural equations are based on a transfer matrix $B$. In section~\ref{sec:sfr}, we generalize this to consider a linear operator $\mathcal{T}$ on square integrable processes that plays the part of $B$. By requiring that $(I-\mathcal{T})^{-1}$ is bounded -- as opposed to $\mathcal{T}$ -- this will allow for a large class of unbounded operators to be considered. Section~\ref{sec:worstrisk} considers two separate cases that both lead to the same worst-risk decomposition. Remarkably, this decomposition has the same structure as in the non-functional case. We consider any operator $\mathcal{T}$ that makes $(I-\mathcal{T})^{-1}$ bounded and define the future shift set in terms of the covariance functions of the shifts. In section~\ref{sec:minimizer}, we prove a necessary and sufficient condition for existence of a minimizer to this worst risk in the space of square integrable kernels. Previously, such minimizers were expressed in terms of the unknown eigenfunctions of the target and covariate integral operators (see for instance \cite{HeMullerWang} and \cite{YaoAOS}). This means that in order to estimate the minimizer, one must first estimate these unknown eigenfunctions. In contrast, the solution provided here will be expressed in any arbitrary ON-basis. This completely removes any necessity of estimating eigenfunctions. This pays dividends in section~\ref{sec:estimation}, where we provide a family of estimators that are consistent with a large sample bound. Proofs of all the results are provided in the appendix.
- [74] arXiv:2412.08606 (replaced) [pdf, html, other]
-
Title: Enhancing the use of family planning service statistics using a Bayesian modelling approach to inform estimates of modern contraceptive use in low- and middle-income countries
Subjects: Applications (stat.AP)
Monitoring family planning indicators, such as modern contraceptive prevalence rate (mCPR), is essential for family planning programming. The Family Planning Estimation Tool (FPET) uses survey data to estimate and forecast family planning indicators, including mCPR, over time. However, sole reliance on large-scale surveys, carried out on average every 3-5 years, can lead to data gaps. Service statistics are a readily available data source, routinely collected in conjunction with service delivery. Various service statistics data types can be used to derive a family planning indicator called Estimated Modern Use (EMU). In a number of countries, annual rates of change in EMU have been found to be predictive of true rates of change in mCPR. However, it has been challenging to capture the varying levels of uncertainty associated with the EMU indicator across different countries and service statistics data types and to subsequently quantify this uncertainty when using EMU in FPET. We present a new approach to using EMUs in FPET to inform mCPR estimates, using annual EMU rates of change as input, and accounting for uncertainty associated with the EMU derivation process. The approach also considers additional country-type-specific uncertainty. We assess the EMU type-specific uncertainty at the country level, via a Bayesian hierarchical modelling approach. Validation results and anonymised country-level case studies highlight improved predictive performance and provide insights into the impact of including EMU data on mCPR estimates compared to using survey data alone. Together, they demonstrate that EMUs can help countries monitor progress toward their family planning goals more effectively.
- [75] arXiv:2501.01935 (replaced) [pdf, html, other]
-
Title: On robust recovery of signals from indirect observations
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider an uncertain linear inverse problem as follows. Given observation $\omega=Ax_*+\zeta$ where $A\in {\bf R}^{m\times p}$ and $\zeta\in {\bf R}^{m}$ is observation noise, we want to recover unknown signal $x_*$, known to belong to a convex set ${\cal X}\subset{\bf R}^{n}$. As opposed to the "standard" setting of such problem, we suppose that the model noise $\zeta$ is "corrupted" -- contains an uncertain (deterministic dense or singular) component. Specifically, we assume that $\zeta$ decomposes into $\zeta=N\nu_*+\xi$ where $\xi$ is the random noise and $N\nu_*$ is the "adversarial contamination" with known $\cal N\subset {\bf R}^n$ such that $\nu_*\in \cal N$ and $N\in {\bf R}^{m\times n}$. We consider two "uncertainty setups" in which $\cal N$ is either a convex bounded set or is the set of sparse vectors (with at most $s$ nonvanishing entries). We analyse the performance of "uncertainty-immunized" polyhedral estimates -- a particular class of nonlinear estimates as introduced in [15, 16] -- and show how "presumably good" estimates of the sort may be constructed in the situation where the signal set is an ellitope (essentially, a symmetric convex set delimited by quadratic surfaces) by means of efficient convex optimization routines.
- [76] arXiv:2501.07061 (replaced) [pdf, html, other]
-
Title: A Beta Cauchy-Cauchy (BECCA) shrinkage prior for Bayesian variable selection
Subjects: Methodology (stat.ME); Computation (stat.CO)
This paper introduces a novel Bayesian approach for variable selection in high-dimensional and potentially sparse regression settings. Our method replaces the indicator variables in the traditional spike and slab prior with continuous, Beta-distributed random variables and places half Cauchy priors over the parameters of the Beta distribution, which significantly improves the predictive and inferential performance of the technique. Similar to shrinkage methods, our continuous parameterization of the spike and slab prior enables us to explore the posterior distributions of interest using fast gradient-based methods, such as Hamiltonian Monte Carlo (HMC), while at the same time explicitly allowing for variable selection in a principled framework. We study the frequentist properties of our model via simulation and show that our technique outperforms the latest Bayesian variable selection methods in both linear and logistic regression. The efficacy, applicability and performance of our approach are further underscored through its implementation on real datasets.
- [77] arXiv:2501.14805 (replaced) [pdf, html, other]
-
Title: Sequential Methods for Error Correction of Probabilistic Wind Power Forecasts
Subjects: Methodology (stat.ME); Applications (stat.AP)
Reliable probabilistic production forecasts are required to better manage the uncertainty that the rapid build-out of wind power capacity adds to future energy systems. In this article, we consider sequential methods to correct errors in power production forecast ensembles derived from numerical weather predictions. We propose combining neural networks with time-adaptive quantile regression to enhance the accuracy of wind power forecasts. We refer to this approach as Neural Adaptive Basis for (time-adaptive) Quantile Regression or NABQR. First, we use NABQR to correct power production ensembles with neural networks. We find that Long Short-Term Memory networks are the most effective architecture for this purpose. Second, we apply time-adaptive quantile regression to the corrected ensembles to obtain optimal median predictions along with quantiles of the forecast distribution. With the suggested method we achieve accuracy improvements up to 40% in mean absolute terms in an application to day-ahead forecasting of on- and offshore wind power production in Denmark. In addition, we explore the value of our method for applications in energy trading. We have implemented the NABQR method as an open-source Python package to support applications in renewable energy forecasting and future research.
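As background for the quantile regression step mentioned above, the following is a minimal sketch of the pinball (quantile) loss that quantile regression minimizes. It is not taken from the NABQR package; the function names and the toy numbers are hypothetical, and the time-adaptive updating described in the abstract is not shown.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for quantile level tau in (0, 1) (illustrative helper)."""
    diff = y_true - y_pred
    # Under-prediction is penalised by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball_loss(observations, prediction, tau):
    """Average pinball loss of one constant quantile prediction."""
    return sum(pinball_loss(y, prediction, tau) for y in observations) / len(observations)


# Hypothetical production values (arbitrary units), not data from the paper.
production = [2.0, 3.0, 5.0, 8.0, 13.0]
loss_at_median = mean_pinball_loss(production, 5.0, tau=0.5)
loss_off_median = mean_pinball_loss(production, 9.0, tau=0.5)
```

For `tau=0.5` the loss is half the absolute error, so the sample median (5.0 here) gives a lower average loss than a prediction far from it.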
- [78] arXiv:2501.15753 (replaced) [pdf, html, other]
-
Title: Scale-Insensitive Neural Network Significance Tests
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
This paper develops a scale-insensitive framework for neural network significance testing, substantially generalizing existing approaches through three key innovations. First, we replace metric entropy calculations with Rademacher complexity bounds, enabling the analysis of neural networks without requiring bounded weights or specific architectural constraints. Second, we weaken the regularity conditions on the target function to require only Sobolev space membership $H^s([-1,1]^d)$ with $s > d/2$, significantly relaxing previous smoothness assumptions while maintaining optimal approximation rates. Third, we introduce a modified sieve space construction based on moment bounds rather than weight constraints, providing a more natural theoretical framework for modern deep learning practices. Our approach achieves these generalizations while preserving optimal convergence rates and establishing valid asymptotic distributions for test statistics. The technical foundation combines localization theory, sharp concentration inequalities, and scale-insensitive complexity measures to handle unbounded weights and general Lipschitz activation functions. This framework better aligns theoretical guarantees with contemporary deep learning practice while maintaining mathematical rigor.
- [79] arXiv:2104.10751 (replaced) [pdf, html, other]
-
Title: Rule Generation for Classification: Scalability, Interpretability, and Fairness
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a new rule-based optimization method for classification with constraints. The proposed method leverages column generation for linear programming, and hence, is scalable to large datasets. The resulting pricing subproblem is shown to be NP-Hard. We recourse to a decision tree-based heuristic and solve a proxy pricing subproblem for acceleration. The method returns a set of rules along with their optimal weights indicating the importance of each rule for learning. We address interpretability and fairness by assigning cost coefficients to the rules and introducing additional constraints. In particular, we focus on local interpretability and generalize a separation criterion in fairness to multiple sensitive attributes and classes. We test the performance of the proposed methodology on a collection of datasets and present a case study to elaborate on its different aspects. The proposed rule-based learning method exhibits a good compromise between local interpretability and fairness on the one side, and accuracy on the other side.
- [80] arXiv:2309.08517 (replaced) [pdf, html, other]
-
Title: On the Forgetting of Particle Filters
Comments: 33 pages
Subjects: Probability (math.PR); Computation (stat.CO)
We study the forgetting properties of the particle filter when its state - the collection of particles - is regarded as a Markov chain. Under a strong mixing assumption on the particle filter's underlying Feynman-Kac model, we find that the particle filter is exponentially mixing, and forgets its initial state in $O(\log N )$ 'time', where $N$ is the number of particles and time refers to the number of particle filter algorithm steps, each comprising a selection (or resampling) and mutation (or prediction) operation. We present an example which shows that this rate is optimal. In contrast to our result, available results to date are extremely conservative, suggesting $O(\alpha^N)$ time steps are needed, for some $\alpha>1$, for the particle filter to forget its initialisation. We also study the conditional particle filter (CPF) and extend our forgetting result to this context. We establish a similar conclusion, namely, CPF is exponentially mixing and forgets its initial state in $O(\log N )$ time. To support this analysis, we establish new time-uniform $L^p$ error estimates for CPF, which can be of independent interest. We also establish new propagation of chaos type results using our proof techniques, and discuss implications for couplings of particle filters and an application to processing out-of-sequence measurements.
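For readers unfamiliar with the terminology, one particle filter step as described above (a selection operation followed by a mutation operation) can be sketched as follows. This is a minimal bootstrap-style sketch and not the authors' code: the Gaussian random-walk model, the function name `particle_filter_step`, and all parameter values are illustrative assumptions.

```python
import math
import random


def particle_filter_step(particles, observation, rng, obs_std=1.0, step_std=0.5):
    """One particle filter step: selection (resampling), then mutation.

    Assumed toy model (hypothetical): the state follows a Gaussian random
    walk with standard deviation step_std, and the observation equals the
    state plus Gaussian noise with standard deviation obs_std.
    """
    # Weight each particle by the Gaussian likelihood of the observation.
    log_w = [-0.5 * ((observation - x) / obs_std) ** 2 for x in particles]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    # Selection: draw N ancestors with probability proportional to weight.
    ancestors = rng.choices(particles, weights=weights, k=len(particles))
    # Mutation: propagate each selected particle through the dynamics.
    return [x + rng.gauss(0.0, step_std) for x in ancestors]


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # N = 1000 particles
for y in [0.3, 0.7, 1.1]:  # hypothetical observations
    particles = particle_filter_step(particles, y, rng)
```

The list `particles` is the state of the Markov chain whose forgetting the abstract studies; each call to `particle_filter_step` is one unit of 'time'.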
- [81] arXiv:2402.07770 (replaced) [pdf, other]
-
Title: Had enough of experts? Quantitative knowledge retrieval from large language models
Authors: David Selby, Kai Spriestersbach, Yuichiro Iwashita, Mohammad Saad, Dennis Bappert, Archana Warrier, Sumantrak Mukherjee, Koichi Kise, Sebastian Vollmer
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Applications (stat.AP)
Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences; however, their utility for quantitative information retrieval is less well understood. Here we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid two data analysis tasks: elicitation of prior distributions for Bayesian models and imputation of missing data. We introduce a framework that leverages LLMs to enhance Bayesian workflows by eliciting expert-like prior knowledge and imputing missing data. Tested on diverse datasets, this approach can improve predictive accuracy and reduce data requirements, offering significant potential in healthcare, environmental science and engineering applications. We discuss the implications and challenges of treating LLMs as 'experts'.
- [82] arXiv:2405.10618 (replaced) [pdf, other]
-
Title: Distributed Event-Based Learning via ADMM
Comments: 35 pages, 12 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) It substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data-distribution among the different agents. We therefore guarantee convergence even if the local data-distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates for the convex case. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to these. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35% or more due to the event-based communication strategy, show resilience towards heterogeneous data-distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, SCAFFOLD and FedADMM.
- [83] arXiv:2405.14425 (replaced) [pdf, html, other]
-
Title: When predict can also explain: few-shot prediction to select better neural latents
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. Ideally, the inferred dynamics should align with true ones. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. One widely-used method, *co-smoothing*, involves jointly estimating latent variables and predicting observations along held-out channels to assess model performance. In this study, we reveal the limitations of the co-smoothing prediction framework and propose a remedy. In a student-teacher setup with Hidden Markov Models, we demonstrate that the high co-smoothing model space encompasses models with arbitrary extraneous dynamics in their latent representations. To address this, we introduce a secondary metric -- *few-shot co-smoothing*, performing regression from the latent variables to held-out channels in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in the few-shot co-smoothing compared to 'minimal' models that are devoid of such dynamics. We provide analytical insights into the origin of this phenomenon and further validate our findings on real neural data using two state-of-the-art methods: LFADS and STNDT. In the absence of ground truth, we suggest a novel measure to validate our approach. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.
- [84] arXiv:2405.16563 (replaced) [pdf, other]
-
Title: Higher-Order Transformer Derivative Estimates for Explicit Pathwise Learning Guarantees
Comments: 11 pages (+30 appendix), 3 figures, 6 tables
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
An inherent challenge in computing fully-explicit generalization bounds for transformers involves obtaining covering number estimates for the given transformer class $T$. Crude estimates rely on a uniform upper bound on the local-Lipschitz constants of transformers in $T$, and finer estimates require an analysis of their higher-order partial derivatives. Unfortunately, these precise higher-order derivative estimates for (realistic) transformer models are not currently available in the literature as they are combinatorially delicate due to the intricate compositional structure of transformer blocks.
This paper fills this gap by precisely estimating the derivatives of all orders for the transformer model. We consider realistic transformers with multiple (non-linearized) attention heads per block and layer normalization. We obtain fully-explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers. Further, we explicitly analyze the impact of various standard activation function choices (e.g. SWISH and GeLU). As an application, we obtain explicit pathwise generalization bounds for transformers on a single trajectory of an exponentially-ergodic Markov process valid at a fixed future time horizon. We conclude that real-world transformers can learn from $N$ (non-i.i.d.) samples of a single Markov process's trajectory at a rate of ${O}(\operatorname{polylog}(N)/\sqrt{N})$.
- [85] arXiv:2406.07475 (replaced) [pdf, html, other]
-
Title: Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior
Comments: ICLR 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Trajectory inference seeks to recover the temporal dynamics of a population from snapshots of its (uncoupled) temporal marginals, i.e. where observed particles are not tracked over time. Prior works addressed this challenging problem under a stochastic differential equation (SDE) model with a gradient-driven drift in the observed space, introducing a minimum entropy estimator relative to the Wiener measure and a practical grid-free mean-field Langevin (MFL) algorithm using Schrödinger bridges. Motivated by the success of observable state space models in the traditional paired trajectory inference problem (e.g. target tracking), we extend the above framework to a class of latent SDEs in the form of observable state space models. In this setting, we use partial observations to infer trajectories in the latent space under a specified dynamics model (e.g. the constant velocity/acceleration models from target tracking). We introduce the PO-MFL algorithm to solve this latent trajectory inference problem and provide theoretical guarantees to the partially observed setting. Experiments validate the robustness of our method and the exponential convergence of the MFL dynamics, and demonstrate significant outperformance over the latent-free baseline in key scenarios.
- [86] arXiv:2407.05287 (replaced) [pdf, html, other]
-
Title: Model-agnostic meta-learners for estimating heterogeneous treatment effects over time
Comments: Accepted at ICLR 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Estimating heterogeneous treatment effects (HTEs) over time is crucial in many disciplines such as personalized medicine. For example, electronic health records are commonly collected over several time periods and then used to personalize treatment decisions. Existing works for this task have mostly focused on model-based learners (i.e., learners that adapt specific machine-learning models). In contrast, model-agnostic learners -- so-called meta-learners -- are largely unexplored. In our paper, we propose several meta-learners that are model-agnostic and thus can be used in combination with arbitrary machine learning models (e.g., transformers) to estimate HTEs over time. Here, our focus is on learners that can be obtained via weighted pseudo-outcome regressions, which allows for efficient estimation by targeting the treatment effect directly. We then provide a comprehensive theoretical analysis that characterizes the different learners and that allows us to offer insights into when specific learners are preferable. Finally, we confirm our theoretical insights through numerical experiments. In sum, while meta-learners are already state-of-the-art for the static setting, we are the first to propose a comprehensive set of meta-learners for estimating HTEs in the time-varying setting.
- [87] arXiv:2407.11676 (replaced) [pdf, html, other]
-
Title: SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities
Authors: Yanis Lalou, Théo Gnassounou, Antoine Collas, Antoine de Mathelin, Oleksii Kachaiev, Ambroise Odonnat, Alexandre Gramfort, Thomas Moreau, Rémi Flamary
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-bench, we propose a framework to evaluate DA methods on diverse modalities, beyond the computer vision tasks that have been largely explored in the literature. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors. SKADA-bench is available on GitHub at this https URL.
- [88] arXiv:2410.08847 (replaced) [pdf, html, other]
-
Title: Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Comments: Accepted to ICLR 2025; Code available at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
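To make the phenomenon concrete, the sketch below evaluates the standard per-example DPO objective, $-\log\sigma\big(\beta[(\log\pi(y_w)-\log\pi_{\mathrm{ref}}(y_w))-(\log\pi(y_l)-\log\pi_{\mathrm{ref}}(y_l))]\big)$, on made-up log-probabilities. It is not the authors' code, and the numbers are hypothetical; it only illustrates that the loss depends on the margin between the two responses, so it can decrease while the preferred response's log-probability also decreases.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred responses; ref_*: the same under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical values: the policy starts equal to the reference model.
loss_before = dpo_loss(-2.0, -2.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
# After training, BOTH log-probabilities dropped, the dispreferred one by more.
loss_after = dpo_loss(-3.0, -6.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```

Here `loss_after` is smaller than `loss_before` even though the preferred response went from log-probability -2.0 to -3.0; the probability mass it lost must have moved to other responses.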
- [89] arXiv:2410.10473 (replaced) [pdf, other]
-
Title: The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.
- [90] arXiv:2410.13211 (replaced) [pdf, html, other]
-
Title: Estimating the Probabilities of Rare Outputs in Language Models
Comments: 29 pages, 9 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model's output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model's logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
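The general idea behind importance sampling for rare events can be illustrated on a toy problem. The sketch below estimates a Gaussian tail probability, not a language model output probability, and is not the authors' method; the shifted-Gaussian proposal, the function names, and the sample sizes are all assumptions chosen for illustration.

```python
import math
import random


def naive_estimate(threshold, n, rng):
    """Fraction of standard normal draws exceeding the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold, n, rng):
    """Sample from N(threshold, 1) and reweight back to N(0, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # proposal centred on the rare region
        if x > threshold:
            # Density ratio N(0,1) / N(threshold,1) evaluated at x.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


def exact_tail(threshold):
    """Exact P(Z > threshold) for a standard normal Z."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


rng = random.Random(0)
naive = naive_estimate(5.0, 100_000, rng)
weighted = importance_estimate(5.0, 100_000, rng)
truth = exact_tail(5.0)
```

With a target probability near 3e-7, naive sampling with 100,000 draws will usually see no hits at all, whereas the reweighted estimate lands close to `truth` because roughly half of the proposal draws fall in the rare region.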
- [91] arXiv:2410.15483 (replaced) [pdf, html, other]
-
Title: Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
Authors: Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of the SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage's training when undergoing the second stage's training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework, while having a similar computational cost. Our code is available at this https URL.
- [92] arXiv:2410.22559 (replaced) [pdf, html, other]
-
Title: Unpicking Data at the Seams: Understanding Disentanglement in VAEs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Disentanglement, or identifying statistically independent factors of the data, is relevant to much of machine learning, from controlled data generation and robust classification to efficient encoding and improving our understanding of the data itself. Disentanglement arises in several generative paradigms including Variational Autoencoders (VAEs), Generative Adversarial Networks and diffusion models. Recent progress has been made in understanding disentanglement in VAEs, where a choice of diagonal posterior covariance matrices is shown to promote mutual orthogonality between columns of the decoder's Jacobian. We build on this to show how such orthogonality, a geometric property, translates to disentanglement, a statistical property, furthering our understanding of how a VAE identifies independent components of, or disentangles, the data.
- [93] arXiv:2411.01696 (replaced) [pdf, html, other]
-
Title: Conformal Risk Minimization with Variance Reduction
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Conformal prediction (CP) is a distribution-free framework for achieving probabilistic guarantees on black-box models. CP is generally applied to a model post-training. Recent research efforts, on the other hand, have focused on optimizing CP efficiency during training. We formalize this concept as the problem of conformal risk minimization (CRM). In this direction, conformal training (ConfTr) by Stutz et al. (2022) is a technique that seeks to minimize the expected prediction set size of a model by simulating CP in-between training updates. Despite its potential, we identify a strong source of sample inefficiency in ConfTr that leads to overly noisy estimated gradients, introducing training instability and limiting practical use. To address this challenge, we propose variance-reduced conformal training (VR-ConfTr), a CRM method that incorporates a variance reduction technique in the gradient estimation of the ConfTr objective function. Through extensive experiments on various benchmark datasets, we demonstrate that VR-ConfTr consistently achieves faster convergence and smaller prediction sets compared to baselines.
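As background, the post-training use of CP mentioned above can be sketched with plain split conformal prediction for classification. This is not ConfTr or VR-ConfTr; the nonconformity score (one minus the probability assigned to a label), the function names, and the calibration numbers are illustrative assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Conformal quantile of calibration scores for miscoverage level alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """All labels whose score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - (model probability of the true label).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70]
qhat = conformal_threshold(cal_scores, alpha=0.25)
labels = prediction_set([0.50, 0.35, 0.15], qhat)
```

The size of the returned set, `len(labels)`, is the quantity whose expectation conformal training methods try to reduce while keeping the coverage guarantee.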
- [94] arXiv:2411.08987 (replaced) [pdf, other]
-
Title: Non-Euclidean High-Order Smooth Convex Optimization
Comments: randomized and parallel lower bounds (and gen. to all norms), convexity of subproblems, inexactness of unacc. alg., better writing
Subjects: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives by using a $q$-th order oracle, for any $q \geq 1$. Our algorithms work for general norms under mild conditions, including the $\ell_p$-settings for $1\leq p\leq \infty$. We can also optimize structured functions that allow for inexactly implementing a non-Euclidean ball optimization oracle. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an \emph{inexact uniformly convex regularizer}. We show a lower bound for general norms that demonstrates our algorithms are nearly optimal in high dimensions in the black-box oracle model for $\ell_p$-settings and all $q \geq 1$, even in randomized and parallel settings. This new lower bound, when applied to the first-order smooth case, resolves an open question in parallel convex optimization.
- [95] arXiv:2411.16591 (replaced) [pdf, html, other]
-
Title: Adversarial Attacks for Drift Detection
Comments: Accepted at ESANN 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Concept drift refers to the change of data distributions over time. While drift poses a challenge for learning models, requiring their continual adaptation, it is also relevant in system monitoring to detect malfunctions, system failures, and unexpected behavior. In the latter case, the robust and reliable detection of drifts is imperative. This work studies the shortcomings of commonly used drift detection schemes. We show how to construct data streams that are drifting without being detected. We refer to those as drift adversarials. In particular, we compute all possible adversarials for common detection schemes and underpin our theoretical findings with empirical evaluations.
- [96] arXiv:2412.09814 (replaced) [pdf, html, other]
-
Title: Federated Learning of Dynamic Bayesian Network via Continuous Optimization from Time Series Data
Comments: 34 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Traditionally, learning the structure of a Dynamic Bayesian Network has been centralized, requiring all data to be pooled in one location. However, in real-world scenarios, data are often distributed across multiple entities (e.g., companies, devices) that seek to collaboratively learn a Dynamic Bayesian Network while preserving data privacy and security. More importantly, due to the presence of diverse clients, the data may follow different distributions, resulting in data heterogeneity. This heterogeneity poses additional challenges for centralized approaches. In this study, we first introduce a federated learning approach for estimating the structure of a Dynamic Bayesian Network from homogeneous time series data that are horizontally distributed across different parties. We then extend this approach to heterogeneous time series data by incorporating a proximal operator as a regularization term in a personalized federated learning framework. To this end, we propose \texttt{FDBNL} and \texttt{PFDBNL}, which leverage continuous optimization, ensuring that only model parameters are exchanged during the optimization process. Experimental results on synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art techniques, particularly in scenarios with many clients and limited individual sample sizes.
- [97] arXiv:2501.08411 (replaced) [pdf, html, other]
-
Title: BiDepth Multimodal Neural Network: Bidirectional Depth Deep Learning Architecture for Spatial-Temporal Prediction
Comments: This paper has been submitted to Applied Intelligence for review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Accurate prediction of spatial-temporal (ST) information in dynamic systems, such as urban mobility and weather patterns, is a crucial yet challenging problem. The complexity stems from the intricate interplay between spatial proximity and temporal relevance, where both long-term trends and short-term fluctuations are present in convoluted patterns. Existing approaches, including traditional statistical methods and conventional neural networks, may provide inaccurate results due to the lack of an effective mechanism that simultaneously incorporates information at variable temporal depths while maintaining spatial context, resulting in a trade-off between comprehensive long-term historical analysis and responsiveness to short-term new information. To bridge this gap, this paper proposes the BiDepth Multimodal Neural Network (BDMNN) with bidirectional depth modulation that enables a comprehensive understanding of both long-term seasonality and short-term fluctuations, adapting to the complex ST context. Case studies with real-world public data demonstrate significant improvements in prediction accuracy, with a 12% reduction in Mean Squared Error for urban traffic prediction and a 15% improvement in rain precipitation forecasting compared to state-of-the-art benchmarks, without demanding extra computational resources.
- [98] arXiv:2501.11622 (replaced) [pdf, html, other]
-
Title: Causal Learning for Heterogeneous Subgroups Based on Nonlinear Causal Kernel Clustering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, the nonlinear Causal Kernel Clustering method is introduced for heterogeneous subgroup causal learning, highlighting variations in causal relationships across diverse subgroups. The main component for clustering heterogeneous subgroups lies in the construction of the $u$-centered sample mapping function with the property of unbiased estimation, which assesses the differences in potential nonlinear causal relationships in various samples and is supported by causal identifiability theory. Experimental results indicate that the method performs well in identifying heterogeneous subgroups and enhancing causal learning, leading to a reduction in prediction error.
- [99] arXiv:2501.16521 (replaced) [pdf, html, other]
-
Title: On characterizing optimal learning trajectories in a class of learning problems
Comments: 5 Pages (A further extension of the paper: arXiv:2412.08772)
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
In this brief paper, we provide a mathematical framework that exploits the relationship between the maximum principle and dynamic programming for characterizing optimal learning trajectories in a class of learning problems related to point estimation for modeling high-dimensional nonlinear functions. Here, the characterization of the optimal learning trajectories is associated with the solution of an optimal control problem for a weakly-controlled gradient system with small parameters, whose time-evolution is guided by a model training dataset and its perturbed version, while the optimization problem consists of a cost functional that summarizes how to gauge the quality/performance of the estimated model parameters at a certain fixed final time w.r.t. a model validating dataset. Moreover, using a successive Galerkin approximation method, we provide an algorithmic recipe for constructing the corresponding optimal learning trajectories, which lead to the optimal estimated model parameters for this class of learning problems.
- [100] arXiv:2502.00463 (replaced) [pdf, html, other]
-
Title: Efficient Over-parameterized Matrix Sensing from Noisy Measurements via Alternating Preconditioned Gradient Descent
Comments: 18 pages, 8 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider the noisy matrix sensing problem in the over-parameterization setting, where the estimated rank $r$ is larger than the true rank $r_\star$. Specifically, our main objective is to recover a matrix $ X_\star \in \mathbb{R}^{n_1 \times n_2} $ with rank $ r_\star $ from noisy measurements using an over-parameterized factorized form $ LR^\top $, where $ L \in \mathbb{R}^{n_1 \times r}, \, R \in \mathbb{R}^{n_2 \times r} $ and $ \min\{n_1, n_2\} \ge r > r_\star $, with the true rank $ r_\star $ being unknown. Recently, preconditioning methods have been proposed to accelerate convergence on the matrix sensing problem relative to vanilla gradient descent, incorporating preconditioning terms $ (L^\top L + \lambda I)^{-1} $ and $ (R^\top R + \lambda I)^{-1} $ into the original gradient. However, these methods require careful tuning of the damping parameter $\lambda$ and are sensitive to initial points and step size. To address these limitations, we propose the alternating preconditioned gradient descent (APGD) algorithm, which alternately updates the two factor matrices, eliminating the need for the damping parameter and enabling faster convergence with larger step sizes. We theoretically prove that APGD achieves near-optimal error convergence at a linear rate, starting from arbitrary random initializations. Through extensive experiments, we validate our theoretical results and demonstrate that APGD outperforms other methods, achieving the fastest convergence rate. Notably, both our theoretical analysis and experimental results illustrate that APGD does not rely on the initialization procedure, making it more practical and versatile.
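As a reading aid, the damped preconditioned gradient step that the abstract attributes to earlier methods can be written out as below. This is a sketch under assumptions, not the paper's formulation: the least-squares loss $f$, the measurement matrices $A_i$, the observations $y_i$, the sample count $m$, the step size $\eta$, and the pairing of each factor's gradient with the other factor's preconditioner (the usual scaled-gradient-descent convention) are not stated in the abstract. The APGD update itself is not given in the abstract and is not reproduced here.

```latex
\begin{align*}
f(L, R) &= \frac{1}{2m} \sum_{i=1}^{m} \bigl( \langle A_i, L R^\top \rangle - y_i \bigr)^2, \\
L_{t+1} &= L_t - \eta \, \nabla_L f(L_t, R_t) \, (R_t^\top R_t + \lambda I)^{-1}, \\
R_{t+1} &= R_t - \eta \, \nabla_R f(L_t, R_t) \, (L_t^\top L_t + \lambda I)^{-1}.
\end{align*}
```

In this form the damping parameter $\lambda$ keeps the $r \times r$ preconditioners invertible when $r > r_\star$ makes $L_t^\top L_t$ or $R_t^\top R_t$ nearly singular, which is the tuning burden the abstract says APGD removes.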
- [101] arXiv:2502.03048 (replaced) [pdf, html, other]
-
Title: The Ensemble Kalman Update is an Empirical Matheron Update
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems. In this paper, we show that the ensemble update step of the EnKF is equivalent to an empirical version of the Matheron update popular in the study of Gaussian process regression. While this connection is simple, it seems not to be widely known, the literature about each technique seems distinct, and connections between the methods are not exploited. This paper exists to provide an informal introduction to the connection, with the necessary definitions so that it is intelligible to as broad an audience as possible.
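The connection can be illustrated with a small NumPy sketch. For jointly Gaussian $(x, y)$, Matheron's rule turns a prior sample into a posterior sample via $x + \mathrm{Cov}(x, y)\,\mathrm{Cov}(y, y)^{-1}(y_{\text{obs}} - y_{\text{sim}})$. The stochastic (perturbed-observation) EnKF analysis step has the same shape once the prior covariance is replaced by the ensemble's sample covariance. The code assumes a linear observation operator `H` and known noise covariance `R`; the function name `enkf_update` and the toy problem are illustrative choices, and the paper's own notation and construction may differ.

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic EnKF analysis step written in Matheron form.

    X: (d, N) prior ensemble, y: (m,) observation,
    H: (m, d) linear observation operator, R: (m, m) noise covariance.
    """
    m, N = len(y), X.shape[1]
    C = np.atleast_2d(np.cov(X))         # sample prior covariance, (d, d)
    eps = rng.multivariate_normal(np.zeros(m), R, size=N).T
    Y_sim = H @ X + eps                  # simulated noisy observations, (m, N)
    S = H @ C @ H.T + R                  # Cov(y, y) with C plugged in
    K = np.linalg.solve(S, H @ C).T      # Cov(x, y) Cov(y, y)^{-1}
    return X + K @ (y[:, None] - Y_sim)  # pathwise (Matheron-style) correction

# Toy check: standard normal prior in 2D, noisy observation of the first coordinate.
rng = np.random.default_rng(1)
X_prior = rng.standard_normal((2, 5000))
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
y = np.array([1.5])

X_post = enkf_update(X_prior, y, H, R, rng)
print("posterior mean:", X_post.mean(axis=1))
print("posterior variance:", X_post.var(axis=1, ddof=1))
```

For this toy problem the exact Gaussian posterior has mean $(1, 0)$ and variances $(1/3, 1)$, so the printed ensemble statistics should land close to those values, up to sampling error.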