Machine Learning

Total of 29 entries

[1] arXiv:2408.11977 [pdf, html, other]: Title: An Asymptotically Optimal Coordinate Descent Algorithm for Learning Bayesian Networks from Gaussian Models

Tong Xu, Armeen Taeb, Simge Küçükyavuz, Ali Shojaie

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

This paper studies the problem of learning Bayesian networks from continuous observational data, generated according to a linear Gaussian structural equation model. We consider an $\ell_0$-penalized maximum likelihood estimator for this problem which is known to have favorable statistical properties but is computationally challenging to solve, especially for medium-sized Bayesian networks. We propose a new coordinate descent algorithm to approximate this estimator and prove several remarkable properties of our procedure: the algorithm converges to a coordinate-wise minimum, and despite the non-convexity of the loss function, as the sample size tends to infinity, the objective value of the coordinate descent solution converges to the optimal objective value of the $\ell_0$-penalized maximum likelihood estimator. Finite-sample optimality and statistical consistency guarantees are also established. To the best of our knowledge, our proposal is the first coordinate descent procedure endowed with optimality and statistical guarantees in the context of learning Bayesian networks. Numerical experiments on synthetic and real data demonstrate that our coordinate descent method can obtain near-optimal solutions while being scalable.
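To make the coordinate-wise idea concrete, the sketch below runs coordinate descent on a much simpler problem, $\ell_0$-penalized least squares. It is not the algorithm of this paper, which works with a Gaussian likelihood and must respect the graph structure. The function name, the objective $\frac{1}{2n}\|y - X\beta\|^2 + \lambda\|\beta\|_0$, and the keep-or-zero rule are illustrative assumptions.

```python
def l0_coordinate_descent(X, y, lam, n_sweeps=50, tol=1e-12):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam * ||b||_0.

    Illustrative only: a plain least-squares stand-in, not the paper's
    Bayesian-network estimator. X is a list of rows, y a list of floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]

    for _ in range(n_sweeps):
        changed = False
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            candidate = rho / col_sq[j]
            # Drop in the smooth loss from using `candidate` instead of 0.
            gain = rho * rho / (2.0 * n * col_sq[j])
            new = candidate if gain > lam else 0.0
            if abs(new - beta[j]) > tol:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                changed = True
        if not changed:
            break  # coordinate-wise minimum reached
    return beta
```

Each coordinate is kept only when the loss reduction it buys exceeds the penalty `lam`, so the loop stops at a point that no single-coordinate change can improve.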
[2] arXiv:2408.12063 [pdf, html, other]: Title: A Deconfounding Approach to Climate Model Bias Correction

Wentao Gao, Jiuyong Li, Debo Cheng, Lin Liu, Jixue Liu, Thuc Duy Le, Xiaojing Du, Xiongren Chen, Yanchang Zhao, Yun Chen

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Global Climate Models (GCMs) are crucial for predicting future climate changes by simulating the Earth systems. However, GCM outputs exhibit systematic biases due to model uncertainties, parameterization simplifications, and inadequate representation of complex climate phenomena. Traditional bias correction methods, which rely on historical observation data and statistical techniques, often neglect unobserved confounders, leading to biased results. This paper proposes a novel bias correction approach to utilize both GCM and observational data to learn a factor model that captures multi-cause latent confounders. Inspired by recent advances in causality based time series deconfounding, our method first constructs a factor model to learn latent confounders from historical data and then applies them to enhance the bias correction process using advanced time series forecasting models. The experimental results demonstrate significant improvements in the accuracy of precipitation outputs. By addressing unobserved confounders, our approach offers a robust and theoretically grounded solution for climate model bias correction.
[3] arXiv:2408.12186 [pdf, html, other]: Title: Transformers are Minimax Optimal Nonparametric In-Context Learners

Juno Kim, Tai Nakamaki, Taiji Suzuki

Comments: 40 pages, 3 figures, ICML 2024 Workshop on Theoretical Foundations of Foundation Models

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In-context learning (ICL) of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. In this paper, we study the efficacy of ICL from the viewpoint of statistical learning theory. We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer, pretrained on nonparametric regression tasks sampled from general function spaces including the Besov space and piecewise $\gamma$-smooth class. We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context by encoding the most relevant basis representations during pretraining. Our analysis extends to high-dimensional or sequential data and distinguishes the pretraining and in-context generalization gaps. Furthermore, we establish information-theoretic lower bounds for meta-learners w.r.t. both the number of tasks and in-context examples. These findings shed light on the roles of task diversity and representation learning for ICL.
[4] arXiv:2408.12288 [pdf, html, other]: Title: Demystifying Functional Random Forests: Novel Explainability Tools for Model Transparency in High-Dimensional Spaces

Fabrizio Maturo, Annamaria Porreca

Comments: 33 pages

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

The advent of big data has raised significant challenges in analysing high-dimensional datasets across various domains such as medicine, ecology, and economics. Functional Data Analysis (FDA) has proven to be a robust framework for addressing these challenges, enabling the transformation of high-dimensional data into functional forms that capture intricate temporal and spatial patterns. However, despite advancements in functional classification methods and very high performance demonstrated by combining FDA and ensemble methods, a critical gap persists in the literature concerning the transparency and interpretability of black-box models, e.g. Functional Random Forests (FRF). In response to this need, this paper introduces a novel suite of explainability tools to illuminate the inner mechanisms of FRF. We propose using Functional Partial Dependence Plots (FPDPs), Functional Principal Component (FPC) Probability Heatmaps, various model-specific and model-agnostic FPCs' importance metrics, and the FPC Internal-External Importance and Explained Variance Bubble Plot. These tools collectively enhance the transparency of FRF models by providing a detailed analysis of how individual FPCs contribute to model predictions. By applying these methods to an ECG dataset, we demonstrate the effectiveness of these tools in revealing critical patterns and improving the explainability of FRF.
[5] arXiv:2408.12319 [pdf, html, other]: Title: Neural-ANOVA: Model Decomposition for Interpretable Machine Learning

Steffen Limmer, Steffen Udluft, Clemens Otte

Comments: 8 pages, 4 figures, 5 tables

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The analysis of variance (ANOVA) decomposition offers a systematic method to understand the interaction effects that contribute to a specific decision output. In this paper we introduce Neural-ANOVA, an approach to decompose neural networks into glassbox models using the ANOVA decomposition. Our approach formulates a learning problem, which enables rapid and closed-form evaluation of integrals over subspaces that appear in the calculation of the ANOVA decomposition. Finally, we conduct numerical experiments to illustrate the advantages of enhanced interpretability and model validation by a decomposition of the learned interaction effects.
[6] arXiv:2408.12353 [pdf, html, other]: Title: Distributed quasi-Newton robust estimation under differential privacy

Chuhan Wang, Lixing Zhu, Xuehu Zhu

Comments: 38 pages, 6 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

For distributed computing with Byzantine machines under Privacy Protection (PP) constraints, this paper develops a robust PP distributed quasi-Newton estimation, which only requires the node machines to transmit five vectors to the central processor with high asymptotic relative efficiency. Compared with the gradient descent strategy which requires more rounds of transmission and the Newton iteration strategy which requires the entire Hessian matrix to be transmitted, the novel quasi-Newton iteration has advantages in reducing privacy budgeting and transmission cost. Moreover, our PP algorithm does not depend on the boundedness of gradients and second-order derivatives. When gradients and second-order derivatives follow sub-exponential distributions, we offer a mechanism that can ensure PP with a sufficiently high probability. Furthermore, this novel estimator can achieve the optimal convergence rate and the asymptotic normality. The numerical studies on synthetic and real data sets evaluate the performance of the proposed algorithm.

[7] arXiv:2408.11979 (cross-list from cs.LG) [pdf, html, other]: Title: Only Strict Saddles in the Energy Landscape of Predictive Coding Networks?

Francesco Innocenti, El Mehdi Achour, Ryan Singh, Christopher L. Buckley

Comments: 26 pages, 12 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Predictive coding (PC) is an energy-based learning algorithm that performs iterative inference over network activities before weight updates. Recent work suggests that PC can converge in fewer learning steps than backpropagation thanks to its inference procedure. However, these advantages are not always observed, and the impact of PC inference on learning is theoretically not well understood. Here, we study the geometry of the PC energy landscape at the (inference) equilibrium of the network activities. For deep linear networks, we first show that the equilibrated energy is simply a rescaled mean squared error loss with a weight-dependent rescaling. We then prove that many highly degenerate (non-strict) saddles of the loss including the origin become much easier to escape (strict) in the equilibrated energy. Our theory is validated by experiments on both linear and non-linear networks. Based on these results, we conjecture that all the saddles of the equilibrated energy are strict. Overall, this work suggests that PC inference makes the loss landscape more benign and robust to vanishing gradients, while also highlighting the challenge of speeding up PC inference on large-scale models.
[8] arXiv:2408.12004 (cross-list from cs.LG) [pdf, html, other]: Title: CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies

Brian M Cho, Ana-Roxana Pop, Kyra Gan, Sam Corbett-Davies, Israel Nir, Ariel Evnine, Nathan Kallus

Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising. Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements, so too often they must revert to the baseline to maintain safety. We overcome these issues by leveraging the most powerful safety test in the asymptotic regime and allowing for multiple candidates to be tested for improvement over the baseline. We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level, even in moderate sample sizes. We present CSPI and CSPI-MT, two novel heuristics for selecting cutoff(s) to maximize the policy improvement from baseline. We demonstrate through both synthetic and external datasets that our approaches improve both the detection rates of safe policies and the realized improvement, particularly under stringent safety requirements and low signal-to-noise conditions.
[9] arXiv:2408.12007 (cross-list from cs.LG) [pdf, html, other]: Title: QuaCK-TSF: Quantum-Classical Kernelized Time Series Forecasting

Abdallah Aaraba, Soumaya Cherkaoui, Ola Ahmad, Jean-Frédéric Laprade, Olivier Nahman-Lévesque, Alexis Vieloszynski, Shengrui Wang

Comments: 12 pages, 15 figures, to be published in IEEE Quantum Week 2024's conference proceeding

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Forecasting in probabilistic time series is a complex endeavor that extends beyond predicting future values to also quantifying the uncertainty inherent in these predictions. Gaussian process regression stands out as a Bayesian machine learning technique adept at addressing this multifaceted challenge. This paper introduces a novel approach that blends the robustness of this Bayesian technique with the nuanced insights provided by the kernel perspective on quantum models, aimed at advancing quantum kernelized probabilistic forecasting. We incorporate a quantum feature map inspired by Ising interactions and demonstrate its effectiveness in capturing the temporal dependencies critical for precise forecasting. The optimization of our model's hyperparameters circumvents the need for computationally intensive gradient descent by employing gradient-free Bayesian optimization. Comparative benchmarks against established classical kernel models are provided, affirming that our quantum-enhanced approach achieves competitive performance.
[10] arXiv:2408.12136 (cross-list from cs.LG) [pdf, html, other]: Title: Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Weiqin Chen, Sandipan Mishra, Santiago Paternain

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Offline reinforcement learning (RL) learns effective policies from a static target dataset. Although state-of-the-art (SOTA) offline RL algorithms are promising, they rely heavily on the quality of the target dataset. The performance of SOTA algorithms can degrade in scenarios with limited samples in the target dataset, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. In this context, determining the optimal way to trade off the source and target datasets remains a critical challenge in offline RL. To the best of our knowledge, this paper proposes the first framework that theoretically and experimentally explores how the weight assigned to each dataset affects the performance of offline RL. We establish the performance bounds and convergence neighborhood of our framework, both of which depend on the selection of the weight. Furthermore, we identify the existence of an optimal weight for balancing the two datasets. All theoretical guarantees and optimal weight depend on the quality of the source dataset and the size of the target dataset. Our empirical results on the well-known Procgen Benchmark substantiate our theoretical contributions.
[11] arXiv:2408.12175 (cross-list from cs.LG) [pdf, html, other]: Title: How disentangled are your classification uncertainties?

Ivo Pascal de Jong, Andreea Ioana Sburlea, Matias Valdenegro-Toro

Comments: 11 pages, 11 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Uncertainty Quantification in Machine Learning has progressed to predicting the source of uncertainty in a prediction: Uncertainty from stochasticity in the data (aleatoric), or uncertainty from limitations of the model (epistemic). Generally, each uncertainty is evaluated in isolation, but this obscures the fact that they are often not truly disentangled. This work proposes a set of experiments to evaluate disentanglement of aleatoric and epistemic uncertainty, and uses these methods to compare two competing formulations for disentanglement (the Information Theoretic approach, and the Gaussian Logits approach). The results suggest that the Information Theoretic approach gives better disentanglement, but that either predicted source of uncertainty is still largely contaminated by the other for both methods. We conclude that with the current methods for disentangling, aleatoric and epistemic uncertainty are not reliably separated, and we provide a clear set of experimental criteria that good uncertainty disentanglement should follow.
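For readers unfamiliar with the information-theoretic formulation mentioned above, a common version splits the entropy of the averaged prediction into an expected-entropy term (aleatoric) and a mutual-information term (epistemic). The sketch below assumes predictions come from several ensemble members or posterior samples. The function names and this input format are illustrative assumptions, not details taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one input.

    member_probs: list of class-probability vectors, one per ensemble
    member (assumed input format). Returns (total, aleatoric, epistemic).
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    mean_p = [sum(m[c] for m in member_probs) / n for c in range(n_classes)]
    total = entropy(mean_p)                               # entropy of the mean
    aleatoric = sum(entropy(m) for m in member_probs) / n  # mean of entropies
    epistemic = total - aleatoric                         # mutual information
    return total, aleatoric, epistemic
```

Members that disagree with confident predictions yield high epistemic and low aleatoric values, while members that agree on a flat distribution yield the reverse.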
[12] arXiv:2408.12209 (cross-list from math.OC) [pdf, html, other]: Title: Zeroth-Order Stochastic Mirror Descent Algorithms for Minimax Excess Risk Optimization

Zhihao Gu, Zi Xu

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

The minimax excess risk optimization (MERO) problem is a new variation of the traditional distributionally robust optimization (DRO) problem, which achieves uniformly low regret across all test distributions under suitable conditions. In this paper, we propose a zeroth-order stochastic mirror descent (ZO-SMD) algorithm available for both smooth and non-smooth MERO to estimate the minimal risk of each distribution, and finally solve MERO as (non-)smooth stochastic convex-concave (linear) minimax optimization problems. The proposed algorithm is proved to converge at optimal convergence rates of $\mathcal{O}\left(1/\sqrt{t}\right)$ on the estimate of $R_i^*$ and $\mathcal{O}\left(1/\sqrt{t}\right)$ on the optimization error of both smooth and non-smooth MERO. Numerical results show the efficiency of the proposed algorithm.
[13] arXiv:2408.12332 (cross-list from stat.AP) [pdf, html, other]: Title: Simplifying Random Forests' Probabilistic Forecasts

Nils Koster, Fabian Krüger

Subjects: Applications (stat.AP); Machine Learning (stat.ML)

Since their introduction by Breiman, Random Forests (RFs) have proven to be useful for both classification and regression tasks. The RF prediction of a previously unseen observation can be represented as a weighted sum of all training sample observations. This nearest-neighbor-type representation is useful, among other things, for constructing forecast distributions (Meinshausen, 2006). In this paper, we consider simplifying RF-based forecast distributions by sparsifying them. That is, we focus on a small subset of nearest neighbors while setting the remaining weights to zero. This sparsification step greatly improves the interpretability of RF predictions. It can be applied to any forecasting task without re-training existing RF models. In empirical experiments, we document that the simplified predictions can be similar to or exceed the original ones in terms of forecasting performance. We explore the statistical sources of this finding via a stylized analytical model of RFs. The model suggests that simplification is particularly promising if the unknown true forecast distribution contains many small weights that are estimated imprecisely.
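The sparsification step described above can be sketched in a few lines, assuming the per-observation RF weights for a test point have already been extracted. The helper names and the choice to rescale the surviving weights so they sum to one are assumptions made for illustration, and the paper may handle normalization differently.

```python
def sparsify_weights(weights, k):
    """Keep the k largest weights, zero the rest, rescale to sum to 1.

    `weights` are the (hypothetical) RF weights that one test point
    assigns to each training observation.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    total = sum(weights[i] for i in top)
    if total == 0:
        raise ValueError("selected weights sum to zero")
    sparse = [0.0] * len(weights)
    for i in top:
        sparse[i] = weights[i] / total
    return sparse


def weighted_mean(weights, y_train):
    """Point forecast as a weighted sum of training outcomes."""
    return sum(w * y for w, y in zip(weights, y_train))
```

Because only the weights change, the same trained forest can be reused; the forecast distribution is then supported on the `k` retained training outcomes.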
[14] arXiv:2408.12564 (cross-list from math.ST) [pdf, html, other]: Title: Factor Adjusted Spectral Clustering for Mixture Models

Shange Tang, Soham Jana, Jianqing Fan

Comments: 37 pages, 8 figures, 1 table

Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

This paper studies a factor modeling-based approach for clustering high-dimensional data generated from a mixture of strongly correlated variables. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, wireless sensing, etc., with factor modeling being one of the popular techniques for explaining the common dependence. Standard techniques for clustering high-dimensional data, e.g., naive spectral clustering, often fail to yield insightful results as their performances heavily depend on the mixture components having a weakly correlated structure. To address the clustering problem in the presence of a latent factor model, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which uses an additional data denoising step via eliminating the factor component to cope with the data dependency. We prove this method achieves an exponentially low mislabeling rate, with respect to the signal to noise ratio under a general set of assumptions. Our assumption bridges many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. The FASC algorithm is also computationally efficient, requiring only near-linear sample complexity with respect to the data dimension. We also show the applicability of the FASC algorithm with real data experiments and numerical studies, and establish that FASC provides significant results in many cases where traditional spectral clustering fails.

[15] arXiv:2206.06885 (replaced) [pdf, html, other]: Title: Neural interval-censored survival regression with feature selection

Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok

Journal-ref: Statistical Analysis and Data Mining: The ASA Data Science Journal 17.4 (2024):

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances on sparse neural network architectures, ii) a regression model targeting prediction of the interval-censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring non-linear relationships.
[16] arXiv:2302.09193 (replaced) [pdf, html, other]: Title: Copula-based transferable models for synthetic population generation

Pascal Jutras-Dubé, Mohammad B. Al-Khasawneh, Zhichao Yang, Javier Bas, Fabian Bastin, Cinzia Cirillo

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Population synthesis involves generating synthetic yet realistic representations of a target population of micro-agents for behavioral modeling and simulation. Traditional methods, often reliant on target population samples, such as census data or travel surveys, face limitations due to high costs and small sample sizes, particularly at smaller geographical scales. We propose a novel framework based on copulas to generate synthetic data for target populations where only empirical marginal distributions are known. This method utilizes samples from different populations with similar marginal dependencies, introduces a spatial component into population synthesis, and considers various information sources for more realistic generators. Concretely, the process involves normalizing the data and treating it as realizations of a given copula, and then training a generative model before incorporating the information on the marginals of the target population. Utilizing American Community Survey data, we assess our framework's performance through standardized root mean squared error (SRMSE) and so-called sampled zeros. We focus on its capacity to transfer a model learned from one population to another. Our experiments include transfer tests between regions at the same geographical level as well as to lower geographical levels, hence evaluating the framework's adaptability in varied spatial contexts. We compare Bayesian Networks, Variational Autoencoders, and Generative Adversarial Networks, both individually and combined with our copula framework. Results show that the copula enhances machine learning methods in matching the marginals of the reference data. Furthermore, it consistently surpasses Iterative Proportional Fitting in terms of SRMSE in the transferability experiments, while introducing unique observations not found in the original training sample.
[17] arXiv:2311.07511 (replaced) [pdf, other]: Title: Uncertainty estimation of machine learning spatial precipitation predictions from satellite data

Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis

Journal-ref: Machine Learning: Science and Technology 5 (2024) 035044

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP); Methodology (stat.ME)

Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We addressed the gap of how to optimally provide such estimates by benchmarking six algorithms, mostly novel even for the more general task of quantifying predictive uncertainty in spatial prediction settings. On 15 years of monthly data from over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machine (LightGBM), and quantile regression neural networks (QRNN). Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Predictors at a site were nearby values from two satellite precipitation retrievals, namely PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals), and the site's elevation. The dependent variable was the monthly mean gauge precipitation. With respect to QR, LightGBM showed improved performance in terms of the quantile scoring rule by 11.10%, also surpassing QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial prediction with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in spatial data prediction, supported with a formal evaluation framework based on scoring functions and scoring rules.
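The quantile scoring function used for evaluation above is commonly written as the pinball loss. The sketch below shows that standard form and averages it over quantile levels as one plausible reading of the quantile scoring rule. The function names and the averaging convention are assumptions, and the paper's exact definitions may differ.

```python
def quantile_score(y, q_pred, tau):
    """Quantile (pinball) score for one observation; lower is better."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_quantile_score(ys, q_preds, tau):
    """Average quantile score at level tau over paired observations."""
    return sum(quantile_score(y, q, tau) for y, q in zip(ys, q_preds)) / len(ys)


def quantile_scoring_rule(y, q_by_level):
    """Average score across levels; q_by_level maps tau -> predicted quantile."""
    return sum(quantile_score(y, q, tau) for tau, q in q_by_level.items()) / len(q_by_level)
```

At a high level such as 0.9, under-prediction is penalized nine times as heavily as over-prediction of the same size, which is what makes the score proper for that quantile.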
[18] arXiv:2403.18540 (replaced) [pdf, html, other]: Title: skscope: Fast Sparsity-Constrained Optimization in Python

Zezhi Wang, Jin Zhu, Peng Chen, Huiyang Peng, Xiaoke Zhang, Anran Wang, Junxian Zhu, Xueqin Wang

Comments: 4 pages; add experiment

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Applying iterative solvers on sparsity-constrained optimization (SCO) requires tedious mathematical deduction and careful programming/debugging that hinders these solvers' broad impact. In the paper, the library skscope is introduced to overcome such an obstacle. With skscope, users can solve the SCO by just programming the objective function. The convenience of skscope is demonstrated through two examples in the paper, where sparse linear regression and trend filtering are addressed with just four lines of code. More importantly, skscope's efficient implementation allows state-of-the-art solvers to quickly attain the sparse solution regardless of the high dimensionality of parameter space. Numerical experiments reveal the available solvers in skscope can achieve up to 80x speedup on the competing relaxation solutions obtained via the benchmarked convex solver. skscope is published on the Python Package Index (PyPI) and Conda, and its source code is available at: this https URL.
[19] arXiv:2407.01079 (replaced) [pdf, html, other]: Title: On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)

Jerry Yao-Chieh Hu, Weimin Wu, Zhao Song, Han Liu

Comments: v2 fixed typos, added Fig. 1 and added clarifications

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.
[20] arXiv:1603.09326 (replaced) [pdf, other]: Title: Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index

Susan Athey, Raj Chetty, Guido Imbens, Hyunseung Kang

Subjects: Methodology (stat.ME); Econometrics (econ.EM); Machine Learning (stat.ML)

Estimating the long-term effects of treatments is of interest in many fields. A common challenge in estimating such treatment effects is that long-term outcomes are unobserved in the time frame needed to make policy decisions. One approach to overcome this missing data problem is to analyze treatment effects on an intermediate outcome, often called a statistical surrogate, if it satisfies the condition that treatment and outcome are independent conditional on the statistical surrogate. The validity of the surrogacy condition is often controversial. Here we exploit the fact that in modern datasets, researchers often observe a large number, possibly hundreds or thousands, of intermediate outcomes, thought to lie on or close to the causal chain between the treatment and the long-term outcome of interest. Even if none of the individual proxies satisfies the statistical surrogacy criterion by itself, using multiple proxies can be useful in causal inference. We focus primarily on a setting with two samples, an experimental sample containing data about the treatment indicator and the surrogates and an observational sample containing information about the surrogates and the primary outcome. We state assumptions under which the average treatment effect can be identified and estimated with a high-dimensional vector of proxies that collectively satisfy the surrogacy assumption, derive the bias from violations of the surrogacy assumption, and show that even if the primary outcome is also observed in the experimental sample, there is still information to be gained from using surrogates.
[21] arXiv:1912.01094 (replaced) [pdf, html, other]: Title: Recovering from Biased Data: Can Fairness Constraints Improve Accuracy?

Avrim Blum, Kevin Stangl

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Multiple fairness constraints have been proposed in the literature, motivated by a range of concerns about how demographic groups might be treated unfairly by machine learning classifiers. In this work we consider a different motivation: learning from biased training data. We posit several ways in which training data may be biased, including having a noisier or negatively biased labeling process on members of a disadvantaged group, or a decreased prevalence of positive or negative examples from the disadvantaged group, or both.
Given such biased training data, Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. In particular, we find that the Equal Opportunity fairness constraint (Hardt, Price, and Srebro 2016) combined with ERM will provably recover the Bayes Optimal Classifier under a range of bias models. We also consider other recovery methods including reweighting the training data, Equalized Odds, and Demographic Parity. These theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.
[22] arXiv:2204.06544 (replaced) [pdf, other]: Title: Features of the Earth's seasonal hydroclimate: Characterizations and comparisons across the Koppen-Geiger climates and across continents

Georgia Papacharalampous, Hristos Tyralis, Yannis Markonis, Petr Maca, Martin Hanel

Journal-ref: Progress in Earth and Planetary Science 10 (2023) 46

Subjects: Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)

Detailed investigations of time series features across climates, continents and variable types can advance our understanding and modelling ability of the Earth's hydroclimate and its dynamics. They can also improve our comprehension of the climate classification systems appearing in their core. Still, such investigations for seasonal hydroclimatic temporal dependence, variability and change are currently missing from the literature. Herein, we propose and apply at the global scale a methodological framework for filling this specific gap. We analyse over 13 000 earth-observed quarterly temperature, precipitation and river flow time series. We adopt the Koppen-Geiger climate classification system and define continental-scale geographical regions, over which we compute seasonal hydroclimatic feature summaries. The analyses rely on three sample autocorrelation features, a temporal variation feature, a spectral entropy feature, a Hurst feature, a trend strength feature and a seasonality strength feature. We find notable differences in the magnitudes of these features across the various Koppen-Geiger climate classes, as well as between continental-scale geographical regions. We, therefore, deem that the consideration of the comparative summaries could be beneficial in water resources engineering contexts. Lastly, we apply explainable machine learning to compare the investigated features with respect to how informative they are in distinguishing either the main Koppen-Geiger climates or the continental-scale regions. In this regard, the sample autocorrelation, temporal variation and seasonality strength features are found to be more informative than the spectral entropy, Hurst and trend strength features at the seasonal time scale.
[23] arXiv:2209.07111 (replaced) [pdf, other]: Title: $\rho$-GNF: A Copula-based Sensitivity Analysis to Unobserved Confounding Using Normalizing Flows

Sourabh Balgi, Jose M. Peña, Adel Daoud

Comments: 12 main pages (+8 reference pages), 4 Figures, Accepted at Probabilistic Graphical Models (PGM) 2024. Oral Presentation

Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Machine Learning (stat.ML)

We propose a novel sensitivity analysis to unobserved confounding in observational studies using copulas and normalizing flows. Using the idea of interventional equivalence of structural causal models, we develop $\rho$-GNF ($\rho$-graphical normalizing flow), where $\rho \in [-1,+1]$ is a bounded sensitivity parameter. This parameter represents the back-door non-causal association due to unobserved confounding, which is encoded with a Gaussian copula. In other words, the $\rho$-GNF enables scholars to estimate the average causal effect (ACE) as a function of $\rho$, while accounting for various assumed strengths of the unobserved confounding. The output of the $\rho$-GNF is what we denote as the $\rho_{curve}$, which provides the bounds for the ACE given an interval of assumed $\rho$ values. In particular, the $\rho_{curve}$ enables scholars to identify the confounding strength required to nullify the ACE, similar to other sensitivity analysis methods (e.g., the E-value). Drawing on experiments with simulated and real-world data, we show the benefits of $\rho$-GNF. One benefit is that the $\rho$-GNF uses a Gaussian copula to encode the distribution of the unobserved causes, which is commonly used in many applied settings. This distributional assumption produces narrower ACE bounds compared to other popular sensitivity analysis methods.
[24] arXiv:2307.01497 (replaced) [pdf, html, other]: Title: Accelerated stochastic approximation with state-dependent noise

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, Tianjiao Li

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

We consider a class of stochastic smooth convex optimization problems under rather general assumptions on the noise in the stochastic gradient observation. As opposed to the classical problem setting in which the variance of noise is assumed to be uniformly bounded, herein we assume that the variance of stochastic gradients is related to the "sub-optimality" of the approximate solutions delivered by the algorithm. Such problems naturally arise in a variety of applications, in particular, in the well-known generalized linear regression problem in statistics. However, to the best of our knowledge, none of the existing stochastic approximation algorithms for solving this class of problems attain optimality in terms of the dependence on accuracy, problem parameters, and mini-batch size.
We discuss two non-Euclidean accelerated stochastic approximation routines--stochastic accelerated gradient descent (SAGD) and stochastic gradient extrapolation (SGE)--which carry a particular duality relationship. We show that both SAGD and SGE, under appropriate conditions, achieve the optimal convergence rate, attaining the optimal iteration and sample complexities simultaneously. However, corresponding assumptions for the SGE algorithm are more general; they allow, for instance, for efficient application of the SGE to statistical estimation problems under heavy tail noises and discontinuous score functions. We also discuss the application of the SGE to problems satisfying quadratic growth conditions, and show how it can be used to recover sparse solutions. Finally, we report on some simulation experiments to illustrate numerical performance of our proposed algorithms in high-dimensional settings.
[25] arXiv:2307.05352 (replaced) [pdf, other]: Title: Leveraging Variational Autoencoders for Parameterized MMSE Estimation

Michael Baur, Benedikt Fesl, Wolfgang Utschick

Comments: Accepted for publication in the IEEE Transactions on Signal Processing

Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (stat.ML)

In this manuscript, we propose to use a variational autoencoder-based framework for parameterizing a conditional linear minimum mean squared error estimator. The variational autoencoder models the underlying unknown data distribution as conditionally Gaussian, yielding the conditional first and second moments of the estimand, given a noisy observation. The derived estimator is shown to approximate the minimum mean squared error estimator by utilizing the variational autoencoder as a generative prior for the estimation problem. We propose three estimator variants that differ in their access to ground-truth data during the training and estimation phases. The proposed estimator variant trained solely on noisy observations is particularly noteworthy as it does not require access to ground-truth data during training or estimation. We conduct a rigorous analysis by bounding the difference between the proposed and the minimum mean squared error estimator, connecting the training objective and the resulting estimation performance. Furthermore, the resulting bound reveals that the proposed estimator entails a bias-variance tradeoff, which is well-known in the estimation literature. As an example application, we portray channel estimation, allowing for a structured covariance matrix parameterization and low-complexity implementation. Nevertheless, the proposed framework is not limited to channel estimation but can be applied to a broad class of estimation problems. Extensive numerical simulations first validate the theoretical analysis of the proposed variational autoencoder-based estimators and then demonstrate excellent estimation performance compared to related classical and machine learning-based state-of-the-art estimators.
[26] arXiv:2312.03386 (replaced) [pdf, other]: Title: An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network

Taeyoung Kim, Hongseok Yang

Comments: Accepted at ICML 2024. 74 pages, 18 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity and characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under the so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.
[27] arXiv:2404.03764 (replaced) [pdf, html, other]: Title: Covariate-Elaborated Robust Partial Information Transfer with Conditional Spike-and-Slab Prior

Ruqian Zhang, Yijiao Zhang, Annie Qu, Zhongyi Zhu, Juan Shen

Comments: 35 pages, 4 figures

Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only partial information is shared. In this paper, we propose a novel Bayesian transfer learning method named "CONCERT" to allow robust partial information transfer for high-dimensional data analysis. A conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize partial similarities and integrate source information collaboratively to improve the performance on the target. In contrast to existing work, CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. We establish variable selection consistency, as well as estimation and prediction error bounds for CONCERT. Our theory demonstrates the covariate-specific benefit of transfer learning. To ensure that our algorithm is scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and two real data applications showcase the validity and advantage of CONCERT over existing cutting-edge transfer learning methods.
[28] arXiv:2408.06425 (replaced) [pdf, html, other]: Title: Bayesian Learning in a Nonlinear Multiscale State-Space Model

Nayely Vélez-Cruz, Manfred D. Laubichler

Comments: Included additional figures

Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)

The ubiquity of multiscale interactions in complex systems is well-recognized, with development and heredity serving as a prime example of how processes at different temporal scales influence one another. This work introduces a novel multiscale state-space model to explore the dynamic interplay between systems interacting across different time scales, with feedback between each scale. We propose a Bayesian learning framework to estimate unknown states by learning the unknown process noise covariances within this multiscale model. We develop a Particle Gibbs with Ancestor Sampling (PGAS) algorithm for inference and demonstrate through simulations the efficacy of our approach.
[29] arXiv:2408.09672 (replaced) [pdf, html, other]: Title: Regularization for Adversarial Robust Learning

Jie Wang, Rui Gao, Yao Xie

Comments: 51 pages, 5 figures

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computational intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving the near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate its asymptotic equivalence to a regularized empirical risk minimization framework, by considering various scaling regimes of the regularization parameter and robustness level. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.


New submissions for Friday, 23 August 2024 (showing 6 of 6 entries)

Cross submissions for Friday, 23 August 2024 (showing 8 of 8 entries)

Replacement submissions for Friday, 23 August 2024 (showing 15 of 15 entries)