Machine Learning

New submissions
Cross-lists
Replacements

Total of 92 entries

Showing up to 2000 entries per page: fewer | more | all

[1] arXiv:2405.15050 [pdf, ps, html, other]: Title: Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Kihyuk Hong, Yufan Zhang, Ambuj Tewari

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $\widetilde{O}(\sqrt{T})$ regret. Previous approaches with $\widetilde{O}(\sqrt{T})$ regret either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value iteration-based algorithm for learning the discounted setting achieves $\widetilde{O}(\sqrt{T})$ regret when the discounting factor $\gamma$ is tuned appropriately. The challenge in the approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - \gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate to achieve a sharp regret bound in terms of the effective horizon, which leads to $\widetilde{O}(\sqrt{T})$ regret.
[2] arXiv:2405.15074 [pdf, ps, other]: Title: 4+3 Phases of Compute-Optimal Neural Scaling Laws

Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)

We consider the three parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully. The model has three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
[3] arXiv:2405.15105 [pdf, ps, html, other]: Title: Certified Inventory Control of Critical Resources

Ludvig Hult, Dave Zachariah, Petre Stoica

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Inventory control is subject to service-level requirements, in which sufficient stock levels must be maintained despite an unknown demand. We propose a data-driven order policy that certifies any prescribed service level under minimal assumptions on the unknown demand process. The policy achieves this using any online learning method along with integral action. We further propose an inference method that is valid in finite samples. The properties and theoretical guarantees of the method are illustrated using both synthetic and real-world data.
[4] arXiv:2405.15106 [pdf, ps, html, other]: Title: Conformal Classification with Equalized Coverage for Adaptively Selected Groups

Yanfei Zhou, Matteo Sesia

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

This paper introduces a conformal inference method to evaluate uncertainty in classification by generating prediction sets with valid coverage conditional on adaptively chosen features. These features are carefully selected to reflect potential model limitations or biases. This can be useful to find a practical compromise between efficiency -- by providing informative predictions -- and algorithmic fairness -- by ensuring equalized coverage for the most sensitive groups. We demonstrate the validity and effectiveness of this method on simulated and real data sets.
[5] arXiv:2405.15107 [pdf, ps, html, other]: Title: Is Algorithmic Stability Testable? A Unified Framework under Computational Constraints

Yuetian Luo, Rina Foygel Barber

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Algorithmic stability is a central notion in learning theory that quantifies the sensitivity of an algorithm to small changes in the training data. If a learning algorithm satisfies certain stability properties, this leads to many important downstream implications, such as generalization, robustness, and reliable predictive inference. Verifying that stability holds for a particular algorithm is therefore an important and practical question. However, recent results establish that testing the stability of a black-box algorithm is impossible, given limited data from an unknown distribution, in settings where the data lies in an uncountably infinite space (such as real-valued data). In this work, we extend this question to examine a far broader range of settings, where the data may lie in any space -- for example, categorical data. We develop a unified framework for quantifying the hardness of testing algorithmic stability, which establishes that across all settings, if the available data is limited then exhaustive search is essentially the only universally valid mechanism for certifying algorithmic stability. Since in practice, any test of stability would naturally be subject to computational constraints, exhaustive search is impossible and so this implies fundamental limits on our ability to test the stability property for a black-box algorithm.
[6] arXiv:2405.15132 [pdf, ps, html, other]: Title: Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification

Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)

The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. Since to estimate the density it is necessary to know the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
[7] arXiv:2405.15167 [pdf, ps, html, other]: Title: ProDAG: Projection-induced variational inference for directed acyclic graphs

Ryan Thompson, Edwin V. Bonilla, Robert Kohn

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Directed acyclic graph (DAG) learning is a rapidly expanding field of research. Though the field has witnessed remarkable advances over the past few years, it remains statistically and computationally challenging to learn a single (point estimate) DAG from data, let alone provide uncertainty quantification. Our article addresses the difficult task of quantifying graph uncertainty by developing a variational Bayes inference framework based on novel distributions that have support directly on the space of DAGs. The distributions, which we use to form our prior and variational posterior, are induced by a projection operation, whereby an arbitrary continuous distribution is projected onto the space of sparse weighted acyclic adjacency matrices (matrix representations of DAGs) with probability mass on exact zeros. Though the projection constitutes a combinatorial optimization problem, it is solvable at scale via recently developed techniques that reformulate acyclicity as a continuous constraint. We empirically demonstrate that our method, ProDAG, can deliver accurate inference, and often outperforms existing state-of-the-art alternatives.
[8] arXiv:2405.15172 [pdf, ps, html, other]: Title: Learning the Distribution Map in Reverse Causal Performative Prediction

Daniele Bracale, Subha Maity, Moulinath Banerjee, Yuekai Sun

Comments: 17 pages, 4 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In numerous predictive scenarios, the predictive model affects the sampling distribution; for example, job applicants often meticulously craft their resumes to navigate through a screening systems. Such shifts in distribution are particularly prevalent in the realm of social computing, yet, the strategies to learn these shifts from data remain remarkably limited. Inspired by a microeconomic model that adeptly characterizes agents' behavior within labor markets, we introduce a novel approach to learn the distribution shift. Our method is predicated on a reverse causal model, wherein the predictive model instigates a distribution shift exclusively through a finite set of agents' actions. Within this framework, we employ a microfoundation model for the agents' actions and develop a statistically justified methodology to learn the distribution shift map, which we demonstrate to be effective in minimizing the performative prediction risk.
[9] arXiv:2405.15254 [pdf, ps, html, other]: Title: Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime

Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology, assuming only finite-energy neural activations; and a novel representor theory for neural networks in terms of a matrix-valued kernel. The first model is exact (un-approximated) and global, casting the neural network as an elements in a reproducing kernel Banach space (RKBS); we use this model to provide tight bounds on Rademacher complexity. The second model is exact and local, casting the change in neural network function resulting from a bounded change in weights and biases (ie. a training step) in reproducing kernel Hilbert space (RKHS) in terms of a local-intrinsic neural kernel (LiNK). This local model provides insight into model adaptation through tight bounds on Rademacher complexity of network adaptation. We also prove that the neural tangent kernel (NTK) is a first-order approximation of the LiNK kernel. Finally, and noting that the LiNK does not provide a representor theory for technical reasons, we present an exact novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK). This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models. Throughout the paper (a) feedforward ReLU networks and (b) residual networks (ResNet) are used as illustrative examples.
[10] arXiv:2405.15294 [pdf, ps, html, other]: Title: Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision

Stefan Dietrich, Julian Rodemann, Christoph Jansen

Comments: Accepted at the 11th International Conference on Soft Methods in Probability and Statistics (SMPS) 2024

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

We provide a theoretical and computational investigation of the Gamma-Maximin method with soft revision, which was recently proposed as a robust criterion for pseudo-label selection (PLS) in semi-supervised learning. Opposed to traditional methods for PLS we use credal sets of priors ("generalized Bayes") to represent the epistemic modeling uncertainty. These latter are then updated by the Gamma-Maximin method with soft revision. We eventually select pseudo-labeled data that are most likely in light of the least favorable distribution from the so updated credal set. We formalize the task of finding optimal pseudo-labeled data w.r.t. the Gamma-Maximin method with soft revision as an optimization problem. A concrete implementation for the class of logistic models then allows us to compare the predictive power of the method with competing approaches. It is observed that the Gamma-Maximin method with soft revision can achieve very promising results, especially when the proportion of labeled data is low.
[11] arXiv:2405.15337 [pdf, ps, html, other]: Title: Discriminative Estimation of Total Variation Distance: A Fidelity Auditor for Generative Data

Lan Tao, Shirong Xu, Chi-Hua Wang, Namjoon Suh, Guang Cheng

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

With the proliferation of generative AI and the increasing volume of generative data (also called as synthetic data), assessing the fidelity of generative data has become a critical concern. In this paper, we propose a discriminative approach to estimate the total variation (TV) distance between two distributions as an effective measure of generative data fidelity. Our method quantitatively characterizes the relation between the Bayes risk in classifying two distributions and their TV distance. Therefore, the estimation of total variation distance reduces to that of the Bayes risk. In particular, this paper establishes theoretical results regarding the convergence rate of the estimation error of TV distance between two Gaussian distributions. We demonstrate that, with a specific choice of hypothesis class in classification, a fast convergence rate in estimating the TV distance can be achieved. Specifically, the estimation accuracy of the TV distance is proven to inherently depend on the separation of two Gaussian distributions: smaller estimation errors are achieved when the two Gaussian distributions are farther apart. This phenomenon is also validated empirically through extensive simulations. In the end, we apply this discriminative estimation method to rank fidelity of synthetic image data using the MNIST dataset.
[12] arXiv:2405.15357 [pdf, ps, html, other]: Title: Strong screening rules for group-based SLOPE models

Fabio Feser, Marina Evangelou

Comments: 31 pages, 18 figures, 10 tables

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Tuning the regularization parameter in penalized regression models is an expensive task, requiring multiple models to be fit along a path of parameters. Strong screening rules drastically reduce computational costs by lowering the dimensionality of the input prior to fitting. We develop strong screening rules for group-based Sorted L-One Penalized Estimation (SLOPE) models: Group SLOPE and Sparse-group SLOPE. The developed rules are applicable for the wider family of group-based OWL models, including OSCAR. Our experiments on both synthetic and real data show that the screening rules significantly accelerate the fitting process. The screening rules make it accessible for group SLOPE and sparse-group SLOPE to be applied to high-dimensional datasets, particularly those encountered in genetics.
[13] arXiv:2405.15358 [pdf, ps, html, other]: Title: Coordinated Multi-Neighborhood Learning on a Directed Acyclic Graph

Stephen Smith, Qing Zhou

Comments: 13 pages, 6 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Learning the structure of causal directed acyclic graphs (DAGs) is useful in many areas of machine learning and artificial intelligence, with wide applications. However, in the high-dimensional setting, it is challenging to obtain good empirical and theoretical results without strong and often restrictive assumptions. Additionally, it is questionable whether all of the variables purported to be included in the network are observable. It is of interest then to restrict consideration to a subset of the variables for relevant and reliable inferences. In fact, researchers in various disciplines can usually select a set of target nodes in the network for causal discovery. This paper develops a new constraint-based method for estimating the local structure around multiple user-specified target nodes, enabling coordination in structure learning between neighborhoods. Our method facilitates causal discovery without learning the entire DAG structure. We establish consistency results for our algorithm with respect to the local neighborhood structure of the target nodes in the true graph. Experimental results on synthetic and real-world data show that our algorithm is more accurate in learning the neighborhood structures with much less computational cost than standard methods that estimate the entire DAG. An R package implementing our methods may be accessed at this https URL.
[14] arXiv:2405.15379 [pdf, ps, html, other]: Title: Log-Concave Sampling on Compact Supports: A Versatile Proximal Framework

Lu Yu

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

In this paper, we explore sampling from strongly log-concave distributions defined on convex and compact supports. We propose a general proximal framework that involves projecting onto the constrained set, which is highly flexible and supports various projection options. Specifically, we consider the cases of Euclidean and Gauge projections, with the latter having the advantage of being performed efficiently using a membership oracle. This framework can be seamlessly integrated with multiple sampling methods. Our analysis focuses on Langevin-type sampling algorithms within the context of constrained sampling. We provide nonasymptotic upper bounds on the W1 and W2 errors, offering a detailed comparison of the performance of these methods in constrained sampling.
[15] arXiv:2405.15393 [pdf, ps, html, other]: Title: Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization

Thomas Nagler, Lennart Schneider, Bernd Bischl, Matthias Feurer

Comments: 39 pages, 4 tables, 29 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
[16] arXiv:2405.15441 [pdf, ps, html, other]: Title: Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances

Jie Wang, March Boedihardjo, Yao Xie

Comments: 44 pages, 7 figures, 4 tables

Subjects: Machine Learning (stat.ML); Computational Complexity (cs.CC); Machine Learning (cs.LG)

Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimensions before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the SDP solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
[17] arXiv:2405.15459 [pdf, ps, html, other]: Title: Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions

Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, Ludovic Stephan

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Neural networks can identify low-dimensional relevant structures within high-dimensional noisy data, yet our mathematical understanding of how they do so remains scarce. Here, we investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms, and discuss how they learn pertinent features in multi-index models, that is target functions with low-dimensional relevant directions. In the high-dimensional regime, where the input dimension $d$ diverges, we show that a simple modification of the idealized single-pass gradient descent training scenario, where data can now be repeated or iterated upon twice, drastically improves its computational efficiency. In particular, it surpasses the limitations previously believed to be dictated by the Information and Leap exponents associated with the target function to be learned. Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing. More precisely, we show that (almost) all directions are learned with at most $O(d \log d)$ steps. Among the exceptions is a set of hard functions that includes sparse parities. In the presence of coupling between directions, however, these can be learned sequentially through a hierarchical mechanism that generalizes the notion of staircase functions. Our results are proven by a rigorous study of the evolution of the relevant statistics for high-dimensional dynamics.
[18] arXiv:2405.15473 [pdf, ps, html, other]: Title: Encoder Embedding for General Graph and Node Classification

Cencheng Shen

Comments: 26 pages

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Graph encoder embedding, a recent technique for graph data, offers speed and scalability in producing vertex-level representations from binary graphs. In this paper, we extend the applicability of this method to a general graph model, which includes weighted graphs, distance matrices, and kernel matrices. We prove that the encoder embedding satisfies the law of large numbers and the central limit theorem on a per-observation basis. Under certain condition, it achieves asymptotic normality on a per-class basis, enabling optimal classification through discriminant analysis. These theoretical findings are validated through a series of experiments involving weighted graphs, as well as text and image data transformed into general graph representations using appropriate distance metrics.
[19] arXiv:2405.15514 [pdf, ps, other]: Title: On the Convexity and Reliability of the Bethe Free Energy Approximation

Harald Leisenberger, Christian Knoll, Franz Pernkopf

Comments: This work has been submitted to the Journal of Machine Learning Research (JMLR) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The Bethe free energy approximation provides an effective way for relaxing NP-hard problems of probabilistic inference. However, its accuracy depends on the model parameters and particularly degrades if a phase transition in the model occurs. In this work, we analyze when the Bethe approximation is reliable and how this can be verified. We argue and show by experiment that it is mostly accurate if it is convex on a submanifold of its domain, the 'Bethe box'. For verifying its convexity, we derive two sufficient conditions that are based on the definiteness properties of the Bethe Hessian matrix: the first uses the concept of diagonal dominance, and the second decomposes the Bethe Hessian matrix into a sum of sparse matrices and characterizes the definiteness properties of the individual matrices in that sum. These theoretical results provide a simple way to estimate the critical phase transition temperature of a model. As a practical contribution we propose $\texttt{BETHE-MIN}$, a projected quasi-Newton method to efficiently find a minimum of the Bethe free energy.
[20] arXiv:2405.15539 [pdf, ps, other]: Title: A generalized neural tangent kernel for surrogate gradient learning

Luke Eilers, Raoul-Martin Memmesheimer, Sven Goedeke

Comments: 52 pages, 3 figures + 2 supplementary figures

Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Probability (math.PR); Neurons and Cognition (q-bio.NC)

State-of-the-art neural network training methods depend on the gradient of the network function. Therefore, they cannot be applied to networks whose activation functions do not have useful derivatives, such as binary and discrete-time spiking neural networks. To overcome this problem, the activation function's derivative is commonly substituted with a surrogate derivative, giving rise to surrogate gradient learning (SGL). This method works well in practice but lacks theoretical foundation. The neural tangent kernel (NTK) has proven successful in the analysis of gradient descent. Here, we provide a generalization of the NTK, which we call the surrogate gradient NTK, that enables the analysis of SGL. First, we study a naive extension of the NTK to activation functions with jumps, demonstrating that gradient descent for such activation functions is also ill-posed in the infinite-width limit. To address this problem, we generalize the NTK to gradient descent with surrogate derivatives, i.e., SGL. We carefully define this generalization and expand the existing key theorems on the NTK with mathematical rigor. Further, we illustrate our findings with numerical experiments. Finally, we numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK; the results confirm that the surrogate gradient NTK provides a good characterization of SGL.
[21] arXiv:2405.15600 [pdf, ps, html, other]: Title: Transfer Learning for Spatial Autoregressive Models

Hao Zeng, Wei Zhong, Xingbai Xu

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)

The spatial autoregressive (SAR) model has been widely applied in various empirical economic studies to characterize the spatial dependence among subjects. However, the precision of estimating the SAR model diminishes when the sample size of the target data is limited. In this paper, we propose a new transfer learning framework for the SAR model to borrow the information from similar source data to improve both estimation and prediction. When the informative source data sets are known, we introduce a two-stage algorithm, including a transferring stage and a debiasing stage, to estimate the unknown parameters and also establish the theoretical convergence rates for the resulting estimators. If we do not know which sources to transfer, a transferable source detection algorithm is proposed to detect informative sources data based on spatial residual bootstrap to retain the necessary spatial dependence. Its detection consistency is also derived. Simulation studies demonstrate that using informative source data, our transfer learning algorithm significantly enhances the performance of the classical two-stage least squares estimator. In the empirical application, we apply our method to the election prediction in swing states in the 2020 U.S. presidential election, utilizing polling data from the 2016 U.S. presidential election along with other demographic and geographical data. The empirical results show that our method outperforms traditional estimation methods.
[22] arXiv:2405.15625 [pdf, ps, html, other]: Title: Nonlinear denoising score matching for enhanced learning of structured distributions

Jeremiah Birrell, Markos A. Katsoulakis, Luc Rey-Bellet, Benjamin Zhang, Wei Zhu

Comments: 14 pages, 8 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We present a novel method for training score-based generative models which uses nonlinear noising dynamics to improve learning of structured distributions. Generalizing to a nonlinear drift allows for additional structure to be incorporated into the dynamics, thus making the training better adapted to the data, e.g., in the case of multimodality or (approximate) symmetries. Such structure can be obtained from the data by an inexpensive preprocessing step. The nonlinear dynamics introduces new challenges into training which we address in two ways: 1) we develop a new nonlinear denoising score matching (NDSM) method, 2) we introduce neural control variates in order to reduce the variance of the NDSM training objective. We demonstrate the effectiveness of this method on several examples: a) a collection of low-dimensional examples, motivated by clustering in latent space, b) high-dimensional images, addressing issues with mode collapse, small training sets, and approximate symmetries, the latter being a challenge for methods based on equivariant neural networks, which require exact symmetries.
[23] arXiv:2405.15643 [pdf, ps, html, other]: Title: Reducing the cost of posterior sampling in linear inverse problems via task-dependent score learning

Fabian Schneider, Duc-Lam Duong, Matti Lassas, Maarten V. de Hoop, Tapio Helin

Comments: 23 pages, 2 figues

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Probability (math.PR)

Score-based diffusion models (SDMs) offer a flexible approach to sample from the posterior distribution in a variety of Bayesian inverse problems. In the literature, the prior score is utilized to sample from the posterior by different methods that require multiple evaluations of the forward mapping in order to generate a single posterior sample. These methods are often designed with the objective of enabling the direct use of the unconditional prior score and, therefore, task-independent training. In this paper, we focus on linear inverse problems, when evaluation of the forward mapping is computationally expensive and frequent posterior sampling is required for new measurement data, such as in medical imaging. We demonstrate that the evaluation of the forward mapping can be entirely bypassed during posterior sample generation. Instead, without introducing any error, the computational effort can be shifted to an offline task of training the score of a specific diffusion-like random process. In particular, the training is task-dependent requiring information about the forward mapping but not about the measurement data. It is shown that the conditional score corresponding to the posterior can be obtained from the auxiliary score by suitable affine transformations. We prove that this observation generalizes to the framework of infinite-dimensional diffusion models introduced recently and provide numerical analysis of the method. Moreover, we validate our findings with numerical experiments.
[24] arXiv:2405.15676 [pdf, ps, html, other]: Title: Taming Score-Based Diffusion Priors for Infinite-Dimensional Nonlinear Inverse Problems

Lorenzo Baldassari, Ali Siahkoohi, Josselin Garnier, Knut Solna, Maarten V. de Hoop

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

This work introduces a sampling method capable of solving Bayesian inverse problems in function space. It does not assume the log-concavity of the likelihood, meaning that it is compatible with nonlinear inverse problems. The method leverages the recently defined infinite-dimensional score-based diffusion models as a learning-based prior, while enabling provable posterior sampling through a Langevin-type MCMC algorithm defined on function spaces. A novel convergence analysis is conducted, inspired by the fixed-point methods established for traditional regularization-by-denoising algorithms and compatible with weighted annealing. The obtained convergence bound explicitly depends on the approximation error of the score; a well-approximated score is essential to obtain a well-approximated posterior. Stylized and PDE-based examples are provided, demonstrating the validity of our convergence analysis. We conclude by presenting a discussion of the method's challenges related to learning the score and computational complexity.
[25] arXiv:2405.15699 [pdf, ps, html, other]: Title: Dimension-free deterministic equivalents for random feature regression

Leonardo Defilippis, Bruno Loureiro, Theodor Misiakiewicz

Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

In this work we investigate the generalization performance of random feature ridge regression (RFRR). Our main contribution is a general deterministic equivalent for the test error of RFRR. Specifically, under a certain concentration property, we show that the test error is well approximated by a closed-form expression that only depends on the feature map eigenvalues. Notably, our approximation guarantee is non-asymptotic, multiplicative, and independent of the feature map dimension -- allowing for infinite-dimensional features. We expect this deterministic equivalent to hold broadly beyond our theoretical analysis, and we empirically validate its predictions on various real and synthetic datasets. As an application, we derive sharp excess error rates under standard power-law assumptions of the spectrum and target decay. In particular, we provide a tight result for the smallest number of features achieving optimal minimax error rate.
[26] arXiv:2405.15712 [pdf, ps, html, other]: Title: Infinite Limits of Multi-head Transformer Dynamics

Blake Bordelon, Hamza Tahir Chaudhry, Cengiz Pehlevan

Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

In this work, we analyze various scaling limits of the training dynamics of transformer models in the feature learning regime. We identify the set of parameterizations that admit well-defined infinite width and depth limits, allowing the attention layers to update throughout training--a relevant notion of feature learning in these models. We then use tools from dynamical mean field theory (DMFT) to analyze various infinite limits (infinite key/query dimension, infinite heads, and infinite depth) which have different statistical descriptions depending on which infinite limit is taken and how attention layers are scaled. We provide numerical evidence of convergence to the limits and discuss how the parameterization qualitatively influences learned features.
[27] arXiv:2405.15727 [pdf, ps, html, other]: Title: Anomalous Change Point Detection Using Probabilistic Predictive Coding

Roelof G. Hup, Julian P. Merkofer, Alex A. Bhogal, Ruud J.G. van Sloun, Reinder Haakma, Rik Vullings

Comments: Submitted to Machine Learning

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Change point detection (CPD) and anomaly detection (AD) are essential techniques in various fields to identify abrupt changes or abnormal data instances. However, existing methods are often constrained to univariate data, face scalability challenges with large datasets due to computational demands, and experience reduced performance with high-dimensional or intricate data, as well as hidden anomalies. Furthermore, they often lack interpretability and adaptability to domain-specific knowledge, which limits their versatility across different fields. In this work, we propose a deep learning-based CPD/AD method called Probabilistic Predictive Coding (PPC) that jointly learns to encode sequential data to low dimensional latent space representations and to predict the subsequent data representations as well as the corresponding prediction uncertainties. The model parameters are optimized with maximum likelihood estimation by comparing these predictions with the true encodings. At the time of application, the true and predicted encodings are used to determine the probability of conformity, an interpretable and meaningful anomaly score. Furthermore, our approach has linear time complexity, scalability issues are prevented, and the method can easily be adjusted to a wide range of data types and intricate applications. We demonstrate the effectiveness and adaptability of our proposed method across synthetic time series experiments, image data, and real-world magnetic resonance spectroscopic imaging data.
[28] arXiv:2405.15754 [pdf, ps, html, other]: Title: Score-based generative models are provably robust: an uncertainty quantification perspective

Nikiforos Mimikos-Stamatopoulos, Benjamin J. Zhang, Markos A. Katsoulakis

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Through an uncertainty quantification (UQ) perspective, we show that score-based generative models (SGMs) are provably robust to the multiple sources of error in practical implementation. Our primary tool is the Wasserstein uncertainty propagation (WUP) theorem, a model-form UQ bound that describes how the $L^2$ error from learning the score function propagates to a Wasserstein-1 ($\mathbf{d}_1$) ball around the true data distribution under the evolution of the Fokker-Planck equation. We show how errors due to (a) finite sample approximation, (b) early stopping, (c) score-matching objective choice, (d) score function parametrization expressiveness, and (e) reference distribution choice, impact the quality of the generative model in terms of a $\mathbf{d}_1$ bound of computable quantities. The WUP theorem relies on Bernstein estimates for Hamilton-Jacobi-Bellman partial differential equations (PDE) and the regularizing properties of diffusion processes. Specifically, PDE regularity theory shows that stochasticity is the key mechanism ensuring SGM algorithms are provably robust. The WUP theorem applies to integral probability metrics beyond $\mathbf{d}_1$, such as the total variation distance and the maximum mean discrepancy. Sample complexity and generalization bounds in $\mathbf{d}_1$ follow directly from the WUP theorem. Our approach requires minimal assumptions, is agnostic to the manifold hypothesis and avoids absolute continuity assumptions for the target distribution. Additionally, our results clarify the trade-offs among multiple error sources in SGMs.
[29] arXiv:2405.15768 [pdf, ps, html, other]: Title: Canonical Variates in Wasserstein Metric Space

Jia Li, Lin Lin

Comments: double space 37 pages, 6 figures

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In this paper, we address the classification of instances each characterized not by a singular point, but by a distribution on a vector space. We employ the Wasserstein metric to measure distances between distributions, which are then used by distance-based classification algorithms such as k-nearest neighbors, k-means, and pseudo-mixture modeling. Central to our investigation is dimension reduction within the Wasserstein metric space to enhance classification accuracy. We introduce a novel approach grounded in the principle of maximizing Fisher's ratio, defined as the quotient of between-class variation to within-class variation. The directions in which this ratio is maximized are termed discriminant coordinates or canonical variates axes. In practice, we define both between-class and within-class variations as the average squared distances between pairs of instances, with the pairs either belonging to the same class or to different classes. This ratio optimization is achieved through an iterative algorithm, which alternates between optimal transport and maximization steps within the vector space. We conduct empirical studies to assess the algorithm's convergence and, through experimental validation, demonstrate that our dimension reduction technique substantially enhances classification performance. Moreover, our method outperforms well-established algorithms that operate on vector representations derived from distributional data. It also exhibits robustness against variations in the distributional representations of data clouds.

[30] arXiv:2405.14896 (cross-list from stat.AP) [pdf, ps, html, other]: Title: Study on spike-and-wave detection in epileptic signals using t-location-scale distribution and the K-nearest neighbors classifier

Antonio Quintero-Rincón, Jorge Prendes, Valeria Muro, Carlos D'Giano

Comments: 7 pages, 6 figures, INPROCEEDINGS IEEE paper

Journal-ref: 2017 IEEE URUCON Conferences

Subjects: Applications (stat.AP); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

Pattern classification in electroencephalography (EEG) signals is an important problem in biomedical engineering since it enables the detection of brain activity, particularly the early detection of epileptic seizures. In this paper, we propose a k-nearest neighbors classification for epileptic EEG signals based on a t-location-scale statistical representation to detect spike-and-waves. The proposed approach is demonstrated on a real dataset containing both spike-and-wave events and normal brain function signals, where our performance is evaluated in terms of classification accuracy, sensitivity, and specificity.
[31] arXiv:2405.14913 (cross-list from stat.ME) [pdf, ps, other]: Title: High Rank Path Development: an approach of learning the filtration of stochastic processes

Jiajie Tao, Hao Ni, Chong Liu

Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Since the weak convergence for stochastic processes does not account for the growth of information over time which is represented by the underlying filtration, a slightly erroneous stochastic model in weak topology may cause huge loss in multi-periods decision making problems. To address such discontinuities Aldous introduced the extended weak convergence, which can fully characterise all essential properties, including the filtration, of stochastic processes; however was considered to be hard to find efficient numerical implementations. In this paper, we introduce a novel metric called High Rank PCF Distance (HRPCFD) for extended weak convergence based on the high rank path development method from rough path theory, which also defines the characteristic function for measure-valued processes. We then show that such HRPCFD admits many favourable analytic properties which allows us to design an efficient algorithm for training HRPCFD from data and construct the HRPCF-GAN by using HRPCFD as the discriminator for conditional time series generation. Our numerical experiments on both hypothesis testing and generative modelling validate the out-performance of our approach compared with several state-of-the-art methods, highlighting its potential in broad applications of synthetic time series generation and in addressing classic financial and economic challenges, such as optimal stopping or utility maximisation problems.
[32] arXiv:2405.14932 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Fast Inference Using Automatic Differentiation and Neural Transport in Astroparticle Physics

Dorian W. P. Amaral, Shixiao Liang, Juehang Qin, Christopher Tunnell

Comments: 20 pages, 7 figures, 4 tables, 6 appendices

Subjects: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)

Multi-dimensional parameter spaces are commonly encountered in astroparticle physics theories that attempt to capture novel phenomena. However, they often possess complicated posterior geometries that are expensive to traverse using techniques traditional to this community. Effectively sampling these spaces is crucial to bridge the gap between experiment and theory. Several recent innovations, which are only beginning to make their way into this field, have made navigating such complex posteriors possible. These include GPU acceleration, automatic differentiation, and neural-network-guided reparameterization. We apply these advancements to astroparticle physics experimental results in the context of novel neutrino physics and benchmark their performances against traditional nested sampling techniques. Compared to nested sampling alone, we find that these techniques increase performance for both nested sampling and Hamiltonian Monte Carlo, accelerating inference by factors of $\sim 100$ and $\sim 60$, respectively. As nested sampling also evaluates the Bayesian evidence, these advancements can be exploited to improve model comparison performance while retaining compatibility with existing implementations that are widely used in the natural sciences.
[33] arXiv:2405.14953 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Mallows-DPO: Fine-Tune Your LLM with Preference Dispersions

Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLM). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop in this paper a new approach, the Mallows-DPO. A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preference to prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, thus unified with Mallows-DPO. More importantly, we demonstrate (empirically) how to use this dispersion index to enhance the performance of DPO in a broad array of benchmark tasks, from synthetic bandit selection to controllable generations and dialogues, while maintaining great generalization capabilities.
[34] arXiv:2405.14957 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Understanding the dynamics of the frequency bias in neural networks

Juan Molina, Mircea Petrache, Francisco Sahli Costabal, Matías Courdurier

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Recent works have shown that traditional Neural Network (NN) architectures display a marked frequency bias in the learning process. Namely, the NN first learns the low-frequency features before learning the high-frequency ones. In this study, we rigorously develop a partial differential equation (PDE) that unravels the frequency dynamics of the error for a 2-layer NN in the Neural Tangent Kernel regime. Furthermore, using this insight, we explicitly demonstrate how an appropriate choice of distributions for the initialization weights can eliminate or control the frequency bias. We focus our study on the Fourier Features model, an NN where the first layer has sine and cosine activation functions, with frequencies sampled from a prescribed distribution. In this setup, we experimentally validate our theoretical results and compare the NN dynamics to the solution of the PDE using the finite element method. Finally, we empirically show that the same principle extends to multi-layer NNs.
[35] arXiv:2405.14982 (cross-list from cs.LG) [pdf, ps, html, other]: Title: In-context Time Series Predictor

Jiecheng Lu, Yan Sun, Shihao Yang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms, and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.
[36] arXiv:2405.15002 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Private Regression via Data-Dependent Sufficient Statistic Perturbation

Cecilia Ferrando, Daniel Sheldon

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.
[37] arXiv:2405.15006 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Path-metrics, pruning, and generalization

Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Analyzing the behavior of ReLU neural networks often hinges on understanding the relationships between their parameters and the functions they implement. This paper proves a new bound on function distances in terms of the so-called path-metrics of the parameters. Since this bound is intrinsically invariant with respect to the rescaling symmetries of the networks, it sharpens previously known bounds. It is also, to the best of our knowledge, the first bound of its kind that is broadly applicable to modern networks such as ResNets, VGGs, U-nets, and many more. In contexts such as network pruning and quantization, the proposed path-metrics can be efficiently computed using only two forward passes. Besides its intrinsic theoretical interest, the bound yields not only novel theoretical generalization bounds, but also a promising proof of concept for rescaling-invariant pruning.
[38] arXiv:2405.15059 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Message-Passing Monte Carlo: Generating low-discrepancy point sets via Graph Neural Networks

T. Konstantin Rusch, Nathan Kirk, Michael M. Bronstein, Christiane Lemieux, Daniela Rus

Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Discrepancy is a well-known measure for the irregularity of the distribution of a point set. Point sets with small discrepancy are called low-discrepancy and are known to efficiently fill the space in a uniform manner. Low-discrepancy points play a central role in many problems in science and engineering, including numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. In this work, we present the first machine learning approach to generate a new class of low-discrepancy point sets named Message-Passing Monte Carlo (MPMC) points. Motivated by the geometric nature of generating low-discrepancy point sets, we leverage tools from Geometric Deep Learning and base our model on Graph Neural Networks. We further provide an extension of our framework to higher dimensions, which flexibly allows the generation of custom-made points that emphasize the uniformity in specific dimensions that are primarily important for the particular problem at hand. Finally, we demonstrate that our proposed model achieves state-of-the-art performance superior to previous methods by a significant margin. In fact, MPMC points are empirically shown to be either optimal or near-optimal with respect to the discrepancy for every dimension and the number of points for which the optimal discrepancy can be determined.
[39] arXiv:2405.15084 (cross-list from cs.DS) [pdf, ps, other]: Title: Efficient Certificates of Anti-Concentration Beyond Gaussians

Ainesh Bakshi, Pravesh Kothari, Goutham Rajendran, Madhur Tulsiani, Aravindan Vijayaraghavan

Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

A set of high dimensional points $X=\{x_1, x_2,\ldots, x_n\} \subset R^d$ in isotropic position is said to be $\delta$-anti concentrated if for every direction $v$, the fraction of points in $X$ satisfying $|\langle x_i,v \rangle |\leq \delta$ is at most $O(\delta)$. Motivated by applications to list-decodable learning and clustering, recent works have considered the problem of constructing efficient certificates of anti-concentration in the average case, when the set of points $X$ corresponds to samples from a Gaussian distribution. Their certificates played a crucial role in several subsequent works in algorithmic robust statistics on list-decodable learning and settling the robust learnability of arbitrary Gaussian mixtures, yet remain limited to rotationally invariant distributions.
This work presents a new (and arguably the most natural) formulation for anti-concentration. Using this formulation, we give quasi-polynomial time verifiable sum-of-squares certificates of anti-concentration that hold for a wide class of non-Gaussian distributions including anti-concentrated bounded product distributions and uniform distributions over $L_p$ balls (and their affine transformations). Consequently, our method upgrades and extends results in algorithmic robust statistics e.g., list-decodable learning and clustering, to such distributions. Our approach constructs a canonical integer program for anti-concentration and analysis a sum-of-squares relaxation of it, independent of the intended application. We rely on duality and analyze a pseudo-expectation on large subsets of the input points that take a small value in some direction. Our analysis uses the method of polynomial reweightings to reduce the problem to analyzing only analytically dense or sparse directions.
[40] arXiv:2405.15090 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget

Dengwang Tang, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Comments: 7 pages, 5 figures, 1 table

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criteria based on linear programming theory to identify the optimal support. We provide a theoretical upper bound on the mis-identification (of the the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$ and some constants that characterize the hardness of the problem instance. We also develop an information theoretic lower bound on the error probability that shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances.
[41] arXiv:2405.15115 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

Shang Liu, Zhongze Cai, Guanting Chen, Xiaocheng Li

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks, and different from all the existing literature, we consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance Var$(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments to distinguish ICL from in-weight learning (IWL) and (ii) make a better separation between the algorithms with and without using the prior information of the training distribution. Theoretically, we show that the trained Transformer reaches near Bayes-optimum, suggesting the usage of the information of the training distribution. Our method can be extended to other cases. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ on $n$ tasks with sequences of length $T$, providing sharper analysis compared to previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform a Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between these two proposed in many existing literature. We also demonstrate the trained Transformer's ICL ability over covariates shift and prompt-length shift and interpret them as a generalization over a meta distribution.
[42] arXiv:2405.15119 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Bayesian Optimization of Functions over Node Subsets in Graphs

Huidong Liang, Xingchen Wan, Xiaowen Dong

Comments: 26 pages with 20 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We address the problem of optimizing over functions defined on node subsets in a graph. The optimization of such functions is often a non-trivial task given their combinatorial, black-box and expensive-to-evaluate nature. Although various algorithms have been introduced in the literature, most are either task-specific or computationally inefficient and only utilize information about the graph structure without considering the characteristics of the function. To address these limitations, we utilize Bayesian Optimization (BO), a sample-efficient black-box solver, and propose a novel framework for combinatorial optimization on graphs. More specifically, we map each $k$-node subset in the original graph to a node in a new combinatorial graph and adopt a local modeling approach to efficiently traverse the latter graph by progressively sampling its subgraphs using a recursive algorithm. Extensive experiments under both synthetic and real-world setups demonstrate the effectiveness of the proposed BO framework on various types of graphs and optimization tasks, where its behavior is analyzed in detail with ablation studies.
[43] arXiv:2405.15251 (cross-list from math.OC) [pdf, ps, html, other]: Title: Learning to optimize: A tutorial for continuous and mixed-integer optimization

Xiaohan Chen, Jialin Liu, Wotao Yin

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Learning to Optimize (L2O) stands at the intersection of traditional optimization and machine learning, utilizing the capabilities of machine learning to enhance conventional optimization techniques. As real-world optimization problems frequently share common structures, L2O provides a tool to exploit these structures for better or faster solutions. This tutorial dives deep into L2O techniques, introducing how to accelerate optimization algorithms, promptly estimate the solutions, or even reshape the optimization problem itself, making it more adaptive to real-world applications. By considering the prerequisites for successful applications of L2O and the structure of the optimization problems at hand, this tutorial provides a comprehensive guide for practitioners and researchers alike.
[44] arXiv:2405.15325 (cross-list from cs.LG) [pdf, ps, html, other]: Title: On the Identification of Temporally Causal Representation with Instantaneous Dependence

Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Zhifeng Hao, Zhengmao Zhu, Guangyi Chen, Kun Zhang

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.
[45] arXiv:2405.15403 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Fine-Grained Dynamic Framework for Bias-Variance Joint Optimization on Data Missing Not at Random

Mingming Ha, Xuewen Tao, Wenfang Lin, Qionxu Ma, Wujiang Xu, Linxun Chen

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In most practical applications such as recommendation systems, display advertising, and so forth, the collected data often contains missing values and those missing values are generally missing-not-at-random, which deteriorates the prediction performance of models. Some existing estimators and regularizers attempt to achieve unbiased estimation to improve the predictive performance. However, variances and generalization bound of these methods are generally unbounded when the propensity scores tend to zero, compromising their stability and robustness. In this paper, we first theoretically reveal that limitations of regularization techniques. Besides, we further illustrate that, for more general estimators, unbiasedness will inevitably lead to unbounded variance. These general laws inspire us that the estimator designs is not merely about eliminating bias, reducing variance, or simply achieve a bias-variance trade-off. Instead, it involves a quantitative joint optimization of bias and variance. Then, we develop a systematic fine-grained dynamic learning framework to jointly optimize bias and variance, which adaptively selects an appropriate estimator for each user-item pair according to the predefined objective function. With this operation, the generalization bounds and variances of models are reduced and bounded with theoretical guarantees. Extensive experiments are conducted to verify the theoretical results and the effectiveness of the proposed dynamic learning framework.
[46] arXiv:2405.15443 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Fairness-Accuracy Trade-Offs: A Causal Perspective

Drago Plecko, Elias Bareinboim

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Systems based on machine learning may exhibit discriminatory behavior based on sensitive characteristics such as gender, sex, religion, or race. In light of this, various notions of fairness and methods to quantify discrimination were proposed, leading to the development of numerous approaches for constructing fair predictors. At the same time, imposing fairness constraints may decrease the utility of the decision-maker, highlighting a tension between fairness and utility. This tension is also recognized in legal frameworks, for instance in the disparate impact doctrine of Title VII of the Civil Rights Act of 1964 -- in which specific attention is given to considerations of business necessity -- possibly allowing the usage of proxy variables associated with the sensitive attribute in case a high-enough utility cannot be achieved without them. In this work, we analyze the tension between fairness and accuracy from a causal lens for the first time. We introduce the notion of a path-specific excess loss (PSEL) that captures how much the predictor's loss increases when a causal fairness constraint is enforced. We then show that the total excess loss (TEL), defined as the difference between the loss of predictor fair along all causal pathways vs. an unconstrained predictor, can be decomposed into a sum of more local PSELs. At the same time, enforcing a causal constraint often reduces the disparity between demographic groups. Thus, we introduce a quantity that summarizes the fairness-utility trade-off, called the causal fairness/utility ratio, defined as the ratio of the reduction in discrimination vs. the excess loss from constraining a causal pathway. This quantity is suitable for comparing the fairness-utility trade-off across causal pathways. Finally, as our approach requires causally-constrained fair predictors, we introduce a new neural approach for causally-constrained fair learning.
[47] arXiv:2405.15446 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Mind the Gap: A Causal Perspective on Bias Amplification in Prediction & Decision-Making

Drago Plecko, Elias Bareinboim

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Investigating fairness and equity of automated systems has become a critical field of inquiry. Most of the literature in fair machine learning focuses on defining and achieving fairness criteria in the context of prediction, while not explicitly focusing on how these predictions may be used later on in the pipeline. For instance, if commonly used criteria, such as independence or sufficiency, are satisfied for a prediction score $S$ used for binary classification, they need not be satisfied after an application of a simple thresholding operation on $S$ (as commonly used in practice). In this paper, we take an important step to address this issue in numerous statistical and causal notions of fairness. We introduce the notion of a margin complement, which measures how much a prediction score $S$ changes due to a thresholding operation. We then demonstrate that the marginal difference in the optimal 0/1 predictor $\widehat Y$ between groups, written $P(\hat y \mid x_1) - P(\hat y \mid x_0)$, can be causally decomposed into the influences of $X$ on the $L_2$-optimal prediction score $S$ and the influences of $X$ on the margin complement $M$, along different causal pathways (direct, indirect, spurious). We then show that under suitable causal assumptions, the influences of $X$ on the prediction score $S$ are equal to the influences of $X$ on the true outcome $Y$. This yields a new decomposition of the disparity in the predictor $\widehat Y$ that allows us to disentangle causal differences inherited from the true outcome $Y$ that exists in the real world vs. those coming from the optimization procedure itself. This observation highlights the need for more regulatory oversight due to the potential for bias amplification, and to address this issue we introduce new notions of weak and strong business necessity, together with an algorithm for assessing whether these notions are satisfied.
[48] arXiv:2405.15505 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck

Hao Yang, Zexu Sun, Hongteng Xu, Xu Chen

Comments: 19 pages

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

As a promising individualized treatment effect (ITE) estimation method, counterfactual regression (CFR) maps individuals' covariates to a latent space and predicts their counterfactual outcomes. However, the selection bias between control and treatment groups often imbalances the two groups' latent distributions and negatively impacts this method's performance. In this study, we revisit counterfactual regression through the lens of information bottleneck and propose a novel learning paradigm called Gromov-Wasserstein information bottleneck (GWIB). In this paradigm, we learn CFR by maximizing the mutual information between covariates' latent representations and outcomes while penalizing the kernelized mutual information between the latent representations and the covariates. We demonstrate that the upper bound of the penalty term can be implemented as a new regularizer consisting of $i)$ the fused Gromov-Wasserstein distance between the latent representations of different groups and $ii)$ the gap between the transport cost generated by the model and the cross-group Gromov-Wasserstein distance between the latent representations and the covariates. GWIB effectively learns the CFR model through alternating optimization, suppressing selection bias while avoiding trivial latent distributions. Experiments on ITE estimation tasks show that GWIB consistently outperforms state-of-the-art CFR methods. To promote the research community, we release our project at this https URL.
[49] arXiv:2405.15545 (cross-list from math.OC) [pdf, ps, html, other]: Title: Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations

Alexander Tyurin, Kaja Gruntkowska, Peter Richtárik

Comments: 43 pages, 2 figures

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

In practical distributed systems, workers are typically not homogeneous, and due to differences in hardware configurations and network conditions, can have highly varying processing times. We consider smooth nonconvex finite-sum (empirical risk minimization) problems in this setup and introduce a new parallel method, Freya PAGE, designed to handle arbitrarily heterogeneous and asynchronous computations. By being robust to "stragglers" and adaptively ignoring slow computations, Freya PAGE offers significantly improved time complexity guarantees compared to all previous methods, including Asynchronous SGD, Rennala SGD, SPIDER, and PAGE, while requiring weaker assumptions. The algorithm relies on novel generic stochastic gradient collection strategies with theoretical guarantees that can be of interest on their own, and may be used in the design of future optimization methods. Furthermore, we establish a lower bound for smooth nonconvex finite-sum problems in the asynchronous setup, providing a fundamental time complexity limit. This lower bound is tight and demonstrates the optimality of Freya PAGE in the large-scale regime, i.e., when $\sqrt{m} \geq n$, where $n$ is # of workers, and $m$ is # of data samples.
[50] arXiv:2405.15599 (cross-list from cs.LG) [pdf, ps, html, other]: Title: On the Computational Landscape of Replicable Learning

Alkis Kalavasis, Amin Karbasi, Grigoris Velegkas, Felix Zhou

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study computational aspects of algorithmic replicability, a notion of stability introduced by Impagliazzo, Lei, Pitassi, and Sorrell [2022]. Motivated by a recent line of work that established strong statistical connections between replicability and other notions of learnability such as online learning, private learning, and SQ learning, we aim to understand better the computational connections between replicability and these learning paradigms. Our first result shows that there is a concept class that is efficiently replicably PAC learnable, but, under standard cryptographic assumptions, no efficient online learner exists for this class. Subsequently, we design an efficient replicable learner for PAC learning parities when the marginal distribution is far from uniform, making progress on a question posed by Impagliazzo et al. [2022]. To obtain this result, we design a replicable lifting framework inspired by Blanc, Lange, Malik, and Tan [2023] that transforms in a black-box manner efficient replicable PAC learners under the uniform marginal distribution over the Boolean hypercube to replicable PAC learners under any marginal distribution, with sample and time complexity that depends on a certain measure of the complexity of the distribution. Finally, we show that any pure DP learner can be transformed to a replicable one in time polynomial in the accuracy, confidence parameters and exponential in the representation dimension of the underlying hypothesis class.
[51] arXiv:2405.15673 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Consistency of Neural Causal Partial Identification

Jiyuan Tan, Jose Blanchet, Vasilis Syrgkanis

Comments: 37 pages, 8 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Recent progress in Neural Causal Models (NCMs) showcased how identification and partial identification of causal effects can be automatically carried out via training of neural generative models that respect the constraints encoded in a given causal graph [Xia et al. 2022, Balazadeh et al. 2022]. However, formal consistency of these methods has only been proven for the case of discrete variables or only for linear causal models. In this work, we prove consistency of partial identification via NCMs in a general setting with both continuous and categorical variables. Further, our results highlight the impact of the design of the underlying neural network architecture in terms of depth and connectivity as well as the importance of applying Lipschitz regularization in the training phase. In particular, we provide a counterexample showing that without Lipschitz regularization the NCM may not be asymptotically consistent. Our results are enabled by new results on the approximability of structural causal models via neural generative models, together with an analysis of the sample complexity of the resulting architectures and how that translates into an error in the constrained optimization problem that defines the partial identification bounds.
[52] arXiv:2405.15682 (cross-list from cs.LG) [pdf, ps, html, other]: Title: The Road Less Scheduled

Aaron Defazio, Xingyu (Alice)Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (this https URL).
[53] arXiv:2405.15709 (cross-list from cs.LG) [pdf, ps, other]: Title: Information-theoretic Generalization Analysis for Expected Calibration Error

Futoshi Futami, Masahiro Fujisawa

Comments: 34 pages, 3 figures

Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

While the expected calibration error (ECE), which employs binning, is widely adopted to evaluate the calibration performance of machine learning models, theoretical understanding of its estimation bias is limited. In this paper, we present the first comprehensive analysis of the estimation bias in the two common binning strategies, uniform mass and uniform width binning. Our analysis establishes upper bounds on the bias, achieving an improved convergence rate. Moreover, our bounds reveal, for the first time, the optimal number of bins to minimize the estimation bias. We further extend our bias analysis to generalization error analysis based on the information-theoretic approach, deriving upper bounds that enable the numerical evaluation of how small the ECE is for unknown data. Experiments using deep learning models show that our bounds are nonvacuous thanks to this information-theoretic generalization analysis approach.
[54] arXiv:2405.15719 (cross-list from cs.CV) [pdf, ps, other]: Title: Hierarchical Uncertainty Exploration via Feedforward Posterior Trees

Elias Nehme, Rotem Mulayoff, Tomer Michaeli

Comments: 32 pages, 21 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

When solving ill-posed inverse problems, one often desires to explore the space of potential solutions rather than be presented with a single plausible reconstruction. Valuable insights into these feasible solutions and their associated probabilities are embedded in the posterior distribution. However, when confronted with data of high dimensionality (such as images), visualizing this distribution becomes a formidable challenge, necessitating the application of effective summarization techniques before user examination. In this work, we introduce a new approach for visualizing posteriors across multiple levels of granularity using tree-valued predictions. Our method predicts a tree-valued hierarchical summarization of the posterior distribution for any input measurement, in a single forward pass of a neural network. We showcase the efficacy of our approach across diverse datasets and image restoration challenges, highlighting its prowess in uncertainty quantification and visualization. Our findings reveal that our method performs comparably to a baseline that hierarchically clusters samples from a diffusion-based posterior sampler, yet achieves this with orders of magnitude greater speed.
[55] arXiv:2405.15767 (cross-list from cs.LG) [pdf, ps, html, other]: Title: Improved Particle Approximation Error for Mean Field Neural Networks

Atsushi Nitanda

Comments: 16 pages

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional defined over the space of probability distributions. MFLD has gained attention due to its connection with noisy gradient descent for mean-field two-layer neural networks. Unlike standard Langevin dynamics, the nonlinearity of the objective functional induces particle interactions, necessitating multiple particles to approximate the dynamics in a finite-particle setting. Recent works (Chen et al., 2022; Suzuki et al., 2023b) have demonstrated the uniform-in-time propagation of chaos for MFLD, showing that the gap between the particle system and its mean-field limit uniformly shrinks over time as the number of particles increases. In this work, we improve the dependence on logarithmic Sobolev inequality (LSI) constants in their particle approximation errors, which can exponentially deteriorate with the regularization coefficient. Specifically, we establish an LSI-constant-free particle approximation error concerning the objective gap by leveraging the problem structure in risk minimization. As the application, we demonstrate improved convergence of MFLD, sampling guarantee for the mean-field stationary distribution, and uniform-in-time Wasserstein propagation of chaos in terms of particle complexity.

[56] arXiv:2010.13018 (replaced) [pdf, ps, html, other]: Title: Adversarial Robust Low Rank Matrix Estimation: Compressed Sensing and Matrix Completion

Takeyuki Sasai, Hironori Fujisawa

Comments: The lasso part of this manuscript with contaminated input as well as output is investigated in arXiv:2208.11592. This manuscript will not be submitted for publications

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

We consider robust low rank matrix estimation as a trace regression when outputs are contaminated by adversaries. The adversaries are allowed to add arbitrary values to arbitrary outputs. Such values can depend on any samples. We deal with matrix compressed sensing, including lasso as a partial problem, and matrix completion, and then we obtain sharp estimation error bounds. To obtain the error bounds for different models such as matrix compressed sensing and matrix completion, we propose a simple unified approach based on a combination of the Huber loss function and the nuclear norm penalization, which is a different approach from the conventional ones. Some error bounds obtained in the present paper are sharper than the past ones.
[57] arXiv:2203.15756 (replaced) [pdf, ps, other]: Title: Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data

Siyuan Guo, Viktor Tóth, Bernhard Schölkopf, Ferenc Huszár

Comments: camera-ready NeurIPS 2023

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Constraint-based causal discovery methods leverage conditional independence tests to infer causal relationships in a wide variety of applications. Just as the majority of machine learning methods, existing work focuses on studying $\textit{independent and identically distributed}$ data. However, it is known that even with infinite i.i.d.$\ $ data, constraint-based methods can only identify causal structures up to broad Markov equivalence classes, posing a fundamental limitation for causal discovery. In this work, we observe that exchangeable data contains richer conditional independence structure than i.i.d.$\ $ data, and show how the richer structure can be leveraged for causal discovery. We first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as $\textit{independent causal mechanism (ICM)}$ generative processes. We then present our main identifiability theorem, which shows that given data from an ICM generative process, its unique causal structure can be identified through performing conditional independence tests. We finally develop a causal discovery algorithm and demonstrate its applicability to inferring causal relationships from multi-environment data. Our code and models are publicly available at: this https URL
[58] arXiv:2302.02033 (replaced) [pdf, ps, html, other]: Title: An Asymptotically Optimal Algorithm for the Convex Hull Membership Problem

Gang Qiao, Ambuj Tewari

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We study the convex hull membership (CHM) problem in the pure exploration setting where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHM problem in the one-dimensional case. We present the first asymptotically optimal algorithm called Thompson-CHM, whose modular design consists of a stopping rule and a sampling rule. In addition, we extend the algorithm to settings that generalize several important problems in the multi-armed bandit literature. Furthermore, we discuss the extension of Thompson-CHM to higher dimensions. Finally, we provide numerical experiments to demonstrate the empirical behavior of the algorithm matches our theoretical results for realistic time horizons.
[59] arXiv:2302.04344 (replaced) [pdf, ps, html, other]: Title: Learning Dynamical Systems by Leveraging Data from Similar Systems

Lei Xin, Lintao Ye, George Chiu, Shreyas Sundaram

Comments: 15 pages,9 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)

We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide finite sample error bounds of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems, such as upper bounds on noise levels and model difference, is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.
[60] arXiv:2303.08777 (replaced) [pdf, ps, html, other]: Title: Distribution-free Deviation Bounds and The Role of Domain Knowledge in Learning via Model Selection with Cross-validation Risk Estimation

Diego Marcondes, Cláudia Peixoto

Comments: arXiv admin note: text overlap with arXiv:2109.03866

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Cross-validation techniques for risk estimation and model selection are widely used in statistics and machine learning. However, the understanding of the theoretical properties of learning via model selection with cross-validation risk estimation is quite low in face of its widespread use. In this context, this paper presents learning via model selection with cross-validation risk estimation as a general systematic learning framework within classical statistical learning theory and establishes distribution-free deviation bounds in terms of VC dimension, giving detailed proofs of the results and considering both bounded and unbounded loss functions. In particular, we investigate how the generalization of learning via model selection may be increased by modeling the collection of candidate models. We define the Learning Spaces as a class of candidate models in which the partial order by inclusion reflects the models complexities, and we formalize a manner of defining them based on domain knowledge. We illustrate this modeling in a worst-case scenario of learning a classifier with finite domain and a typical scenario of linear regression. Through theoretical insights and concrete examples, we aim to provide guidance on selecting the family of candidate models based on domain knowledge to increase generalization.
[61] arXiv:2303.16954 (replaced) [pdf, ps, html, other]: Title: Leveraging joint sparsity in hierarchical Bayesian learning

Jan Glaubitz, Anne Gelb

Comments: 30 pages, 15 figures

Journal-ref: SIAM-ASA J Uncertain Quantif 12(2), 2024

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

We present a hierarchical Bayesian learning approach to infer jointly sparse parameter vectors from multiple measurement vectors. Our model uses separate conditionally Gaussian priors for each parameter vector and common gamma-distributed hyper-parameters to enforce joint sparsity. The resulting joint-sparsity-promoting priors are combined with existing Bayesian inference methods to generate a new family of algorithms. Our numerical experiments, which include a multi-coil magnetic resonance imaging application, demonstrate that our new approach consistently outperforms commonly used hierarchical Bayesian methods.
[62] arXiv:2307.10870 (replaced) [pdf, ps, other]: Title: Nonlinear Meta-Learning Can Guarantee Faster Rates

Dimitri Meunier, Zhu Li, Arthur Gretton, Samory Kpotufe

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Many recent theoretical works on meta-learning aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. Importantly, the main aim in theory works on the subject is to understand the extent to which convergence rates -- in learning a common representation -- may scale with the number $N$ of tasks (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite-dimensional RKHS, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions,
[63] arXiv:2307.11018 (replaced) [pdf, ps, html, other]: Title: Amortized Variational Inference: When and Why?

Charles C. Margossian, David M. Blei

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In a probabilistic latent variable model, factorized (or mean-field) variational inference (F-VI) fits a separate parametric distribution for each latent variable. Amortized variational inference (A-VI) instead learns a common inference function, which maps each observation to its corresponding latent variable's approximate posterior. Typically, A-VI is used as a step in the training of variational autoencoders, however it stands to reason that A-VI could also be used as a general alternative to F-VI. In this paper we study when and why A-VI can be used for approximate Bayesian inference. We derive conditions on a latent variable model which are necessary, sufficient, and verifiable under which A-VI can attain F-VI's optimal solution, thereby closing the amortization gap. We prove these conditions are uniquely verified by simple hierarchical models, a broad class that encompasses many models in machine learning. We then show, on a broader class of models, how to expand the domain of AVI's inference function to improve its solution, and we provide examples, e.g. hidden Markov models, where the amortization gap cannot be closed.
[64] arXiv:2310.04854 (replaced) [pdf, ps, html, other]: Title: Repelling Random Walks

Isaac Reid, Eli Berger, Krzysztof Choromanski, Adrian Weller

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We present a novel quasi-Monte Carlo mechanism to improve graph-based sampling, coined repelling random walks. By inducing correlations between the trajectories of an interacting ensemble such that their marginal transition probabilities are unmodified, we are able to explore the graph more efficiently, improving the concentration of statistical estimators whilst leaving them unbiased. The mechanism has a trivial drop-in implementation. We showcase the effectiveness of repelling random walks in a range of settings including estimation of graph kernels, the PageRank vector and graphlet concentrations. We provide detailed experimental evaluation and robust theoretical guarantees. To our knowledge, repelling random walks constitute the first rigorously studied quasi-Monte Carlo scheme correlating the directions of walkers on a graph, inviting new research in this exciting nascent domain.
[65] arXiv:2310.04859 (replaced) [pdf, ps, html, other]: Title: General Graph Random Features

Isaac Reid, Krzysztof Choromanski, Eli Berger, Adrian Weller

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We propose a novel random walk-based algorithm for unbiased estimation of arbitrary functions of a weighted adjacency matrix, coined universal graph random features (u-GRFs). This includes many of the most popular examples of kernels defined on the nodes of a graph. Our algorithm enjoys subquadratic time complexity with respect to the number of nodes, overcoming the notoriously prohibitive cubic scaling of exact graph kernel evaluation. It can also be trivially distributed across machines, permitting learning on much larger networks. At the heart of the algorithm is a modulation function which upweights or downweights the contribution from different random walks depending on their lengths. We show that by parameterising it with a neural network we can obtain u-GRFs that give higher-quality kernel estimates or perform efficient, scalable kernel learning. We provide robust theoretical analysis and support our findings with experiments including pointwise estimation of fixed graph kernels, solving non-homogeneous graph ordinary differential equations, node clustering and kernel regression on triangular meshes.
[66] arXiv:2310.07852 (replaced) [pdf, ps, html, other]: Title: On the Computational Complexity of Private High-dimensional Model Selection

Saptarshi Roy, Zehua Wang, Ambuj Tewari

Comments: 27 pages, 2 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

We consider the problem of model selection in a high-dimensional sparse linear regression model under privacy constraints. We propose a differentially private best subset selection method with strong utility properties by adopting the well-known exponential mechanism for selecting the best model. We propose an efficient Metropolis-Hastings algorithm and establish that it enjoys polynomial mixing time to its stationary distribution. Furthermore, we also establish approximate differential privacy for the estimates of the mixed Metropolis-Hastings chain. Finally, we perform some illustrative experiments that show the strong utility of our algorithm.
[67] arXiv:2311.07259 (replaced) [pdf, ps, html, other]: Title: Towards Bounding Causal Effects under Markov Equivalence

Alexis Bellot

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Predicting the effect of unseen interventions is a fundamental research question across the data sciences. It is well established that in general such questions cannot be answered definitively from observational data. This realization has fuelled a growing literature introducing various identifying assumptions, for example in the form of a causal diagram among relevant variables. In practice, this paradigm is still too rigid for many practical applications as it is generally not possible to confidently delineate the true causal diagram. In this paper, we consider the derivation of bounds on causal effects given only observational data. We propose to take as input a less informative structure known as a Partial Ancestral Graph, which represents a Markov equivalence class of causal diagrams and is learnable from data. In this more ``data-driven'' setting, we provide a systematic algorithm to derive bounds on causal effects that exploit the invariant properties of the equivalence class, and that can be computed analytically. We demonstrate our method with synthetic and real data examples.
[68] arXiv:2402.03527 (replaced) [pdf, ps, html, other]: Title: Consistent Validation for Predictive Methods in Spatial Settings

David R. Burt, Yunyi Shen, Tamara Broderick

Comments: 52 pages, 14 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Spatial prediction tasks are key to weather forecasting, studying air pollution, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We instead propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data.
[69] arXiv:2402.05639 (replaced) [pdf, ps, html, other]: Title: Nonparametric Instrumental Variable Regression through Stochastic Approximate Gradients

Yuri Fonseca, Caio Peixoto, Yuri Saporito

Comments: 23 pages, 2 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Instrumental variables (IVs) provide a powerful strategy for identifying causal effects in the presence of unobservable confounders. Within the nonparametric setting (NPIV), recent methods have been based on nonlinear generalizations of Two-Stage Least Squares and on minimax formulations derived from moment conditions or duality. In a novel direction, we show how to formulate a functional stochastic gradient descent algorithm to tackle NPIV regression by directly minimizing the populational risk. We provide theoretical support in the form of bounds on the excess risk, and conduct numerical experiments showcasing our method's superior stability and competitive performance relative to current state-of-the-art alternatives. This algorithm enables flexible estimator choices, such as neural networks or kernel based methods, as well as non-quadratic loss functions, which may be suitable for structural equations beyond the setting of continuous outcomes and additive noise. Finally, we demonstrate this flexibility of our framework by presenting how it naturally addresses the important case of binary outcomes, which has received far less attention by recent developments in the NPIV literature.
[70] arXiv:2405.14459 (replaced) [pdf, ps, other]: Title: Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization

Ferdinand Genans (LPSM (UMR\_8001)), Antoine Godichon-Baggioni (LPSM (UMR\_8001)), François-Xavier Vialard (LIGM), Olivier Wintenberger (LPSM (UMR\_8001))

Subjects: Machine Learning (stat.ML)

Optimal Transport (OT) based distances are powerful tools for machine learning to compare probability measures and manipulate them using OT maps. In this field, a setting of interest is semi-discrete OT, where the source measure $\mu$ is continuous, while the target $\nu$ is discrete. Recent works have shown that the minimax rate for the OT map is $\mathcal{O}(t^{-1/2})$ when using $t$ i.i.d. subsamples from each measure (two-sample setting). An open question is whether a better convergence rate can be achieved when the full information of the discrete measure $\nu$ is known (one-sample setting). In this work, we answer positively to this question by (i) proving an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cells estimation and density support estimation, and (ii) proposing a Stochastic Gradient Descent (SGD) algorithm with adaptive entropic regularization and averaging acceleration. To nearly achieve the desired fast rate, characteristic of non-regular parametric problems, we design an entropic regularization scheme decreasing with the number of samples. Another key step in our algorithm consists of using a projection step that permits to leverage the local strong convexity of the regularized OT problem. Our convergence analysis integrates online convex optimization and stochastic gradient techniques, complemented by the specificities of the OT semi-dual. Moreover, while being as computationally and memory efficient as vanilla SGD, our algorithm achieves the unusual fast rates of our theory in numerical experiments.
[71] arXiv:2004.05990 (replaced) [pdf, ps, html, other]: Title: Robust estimation with Lasso when outputs are adversarially contaminated

Takeyuki Sasai, Hironori Fujisawa

Comments: The case of contaminated inputs as well as outputs is investigated in arXiv:2208.11592. This manuscript will not be submitted for publications

Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider robust estimation when outputs are adversarially contaminated. Nguyen and Tran (2012) proposed an extended Lasso for robust parameter estimation and then they showed the convergence rate of the estimation error. Recently, Dalalyan and Thompson (2019) gave some useful inequalities and then they showed a faster convergence rate than Nguyen and Tran (2012). They focused on the fact that the minimization problem of the extended Lasso can become that of the penalized Huber loss function with $L_1$ penalty. The distinguishing point is that the Huber loss function includes an extra tuning parameter, which is different from the conventional method. We give the proof, which is different from Dalalyan and Thompson (2019) and then we give the same convergence rate as Dalalyan and Thompson (2019). The significance of our proof is to use some specific properties of the Huber function. Such techniques have not been used in the past proofs.
[72] arXiv:2102.11120 (replaced) [pdf, ps, html, other]: Title: Adversarial robust weighted Huber regression

Takeyuki Sasai, Hironori Fujisawa

Comments: The case of sparse coefficients is investigated in arXiv:2208.11592. This manuscript will not be submitted for publications

Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

We consider a robust estimation of linear regression coefficients. In this note, we focus on the case where the covariates are sampled from an $L$-subGaussian distribution with unknown covariance, the noises are sampled from a distribution with a bounded absolute moment and both covariates and noises may be contaminated by an adversary. We derive an estimation error bound, which depends on the stable rank and the condition number of the covariance matrix of covariates with a polynomial computational complexity of estimation.
[73] arXiv:2208.11592 (replaced) [pdf, ps, html, other]: Title: Outlier Robust and Sparse Estimation of Linear Regression Coefficients

Takeyuki Sasai, Hironori Fujisawa

Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

We consider outlier-robust and sparse estimation of linear regression coefficients, when the covariates and the noises are contaminated by adversarial outliers and noises are sampled from a heavy-tailed distribution. Our results present sharper error bounds under weaker assumptions than prior studies that share similar interests with this study. Our analysis relies on some sharp concentration inequalities resulting from generic chaining.
[74] arXiv:2305.14120 (replaced) [pdf, ps, html, other]: Title: Learning Relevant Contextual Variables Within Bayesian Optimization

Julien Martinelli, Ayush Bharti, Armi Tiihonen, S.T. John, Louis Filstroff, Sabina J. Sloman, Patrick Rinke, Samuel Kaski

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Contextual Bayesian Optimization (CBO) efficiently optimizes black-box functions with respect to design variables, while simultaneously integrating contextual information regarding the environment, such as experimental conditions. However, the relevance of contextual variables is not necessarily known beforehand. Moreover, contextual variables can sometimes be optimized themselves at an additional cost, a setting overlooked by current CBO algorithms. Cost-sensitive CBO would simply include optimizable contextual variables as part of the design variables based on their cost. Instead, we adaptively select a subset of contextual variables to include in the optimization, based on the trade-off between their relevance and the additional cost incurred by optimizing them compared to leaving them to be determined by the environment. We learn the relevance of contextual variables by sensitivity analysis of the posterior surrogate model while minimizing the cost of optimization by leveraging recent developments on early stopping for BO. We empirically evaluate our proposed Sensitivity-Analysis-Driven Contextual BO (SADCBO) method against alternatives on both synthetic and real-world experiments, together with extensive ablation studies, and demonstrate a consistent improvement across examples.
[75] arXiv:2308.12221 (replaced) [pdf, ps, html, other]: Title: Critical Learning Periods Emerge Even in Deep Linear Networks

Michael Kleinman, Alessandro Achille, Stefano Soatto

Comments: ICLR 2024 (Spotlight)

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)

Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology. Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks.
[76] arXiv:2309.17371 (replaced) [pdf, ps, html, other]: Title: Adversarial Imitation Learning from Visual Observations using Latent Information

Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)

We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our model-free approach in latent space matches state-of-the-art performance. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to our code.
[77] arXiv:2310.10107 (replaced) [pdf, ps, html, other]: Title: Posterior Sampling-based Online Learning for Episodic POMDPs

Dengwang Tang, Dongze Ye, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Comments: 32 pages, 4 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)

Learning in POMDPs is known to be significantly harder than MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes, matching the lower bound, and is polynomial in the other parameters. In a general setting, its regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
[78] arXiv:2311.17021 (replaced) [pdf, ps, html, other]: Title: Optimal Categorical Instrumental Variables

Thomas Wiemann

Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)

This paper discusses estimation with a categorical instrumental variable in settings with potentially few observations per category. The proposed categorical instrumental variable estimator (CIV) leverages a regularization assumption that implies existence of a latent categorical variable with fixed finite support achieving the same first stage fit as the observed instrument. In asymptotic regimes that allow the number of observations per category to grow at arbitrary small polynomial rate with the sample size, I show that when the cardinality of the support of the optimal instrument is known, CIV is root-n asymptotically normal, achieves the same asymptotic variance as the oracle IV estimator that presumes knowledge of the optimal instrument, and is semiparametrically efficient under homoskedasticity. Under-specifying the number of support points reduces efficiency but maintains asymptotic normality. In an application that leverages judge fixed effects as instruments, CIV compares favorably to commonly used jackknife-based instrumental variable estimators.
[79] arXiv:2312.06478 (replaced) [pdf, ps, html, other]: Title: Prediction De-Correlated Inference: A safe approach for post-prediction inference

Feng Gan, Wanfeng Liang, Changliang Zou

Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called post-prediction inference. We propose a novel assumption-lean framework for statistical inference under post-prediction setting, called Prediction De-Correlated Inference (PDC). Our approach is safe, in the sense that PDC can automatically adapt to any black-box machine-learning model and consistently outperform the supervised counterparts. The PDC framework also offers easy extensibility for accommodating multiple predictive models. Both numerical results and real-world data analysis demonstrate the superiority of PDC over the state-of-the-art methods.
[80] arXiv:2401.17789 (replaced) [pdf, ps, other]: Title: Robustly overfitting latents for flexible neural image compression

Yura Perugachi-Diaz, Arwin Gansekoele, Sandjai Bhulai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Neural image compression has made a great deal of progress. State-of-the-art models are based on variational autoencoders and are outperforming classical models. Neural compression models learn to encode an image into a quantized latent representation that can be efficiently sent to the decoder, which decodes the quantized latent into a reconstructed image. While these models have proven successful in practice, they lead to sub-optimal results due to imperfect optimization and limitations in the encoder and decoder capacity. Recent work shows how to use stochastic Gumbel annealing (SGA) to refine the latents of pre-trained neural image compression models. We extend this idea by introducing SGA+, which contains three different methods that build upon SGA. We show how our method improves the overall compression performance in terms of the R-D trade-off, compared to its predecessors. Additionally, we show how refinement of the latents with our best-performing method improves the compression performance on both the Tecnick and CLIC dataset. Our method is deployed for a pre-trained hyperprior and for a more flexible model. Further, we give a detailed analysis of our proposed methods and show that they are less sensitive to hyperparameter choices. Finally, we show how each method can be extended to three- instead of two-class rounding.
[81] arXiv:2402.00522 (replaced) [pdf, ps, other]: Title: Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Mingze Wang, Weinan E

Comments: 70 pages

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
[82] arXiv:2402.03614 (replaced) [pdf, ps, html, other]: Title: Bayesian Vector AutoRegression with Factorised Granger-Causal Graphs

He Zhao, Vassili Kitsios, Terence J. O'Kane, Edwin V. Bonilla

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study the problem of automatically discovering Granger causal relations from observational multivariate time-series data.Vector autoregressive (VAR) models have been time-tested for this problem, including Bayesian variants and more recent developments using deep neural networks. Most existing VAR methods for Granger causality use sparsity-inducing penalties/priors or post-hoc thresholds to interpret their coefficients as Granger causal graphs. Instead, we propose a new Bayesian VAR model with a hierarchical factorised prior distribution over binary Granger causal graphs, separately from the VAR coefficients. We develop an efficient algorithm to infer the posterior over binary Granger causal graphs. Comprehensive experiments on synthetic, semi-synthetic, and climate data show that our method is more uncertainty aware, has less hyperparameters, and achieves better performance than competing approaches, especially in low-data regimes where there are less observations.
[83] arXiv:2402.04650 (replaced) [pdf, ps, html, other]: Title: An analysis of the noise schedule for score-based generative models

Stanislas Strasman (SU, LPSM (UMR\_8001)), Antonio Ocello (CMAP), Claire Boyer (LPSM (UMR\_8001), IUF), Sylvain Le Corff (LPSM (UMR\_8001), SU), Vincent Lemaire (LPSM (UMR\_8001))

Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

Score-based generative models (SGMs) aim at estimating a target data distribution by learning score functions using only noise-perturbed samples from the target.Recent literature has focused extensively on assessing the error between the target and estimated distributions, gauging the generative quality through the Kullback-Leibler (KL) divergence and Wasserstein distances. Under mild assumptions on the data distribution, we establish an upper bound for the KL divergence between the target and the estimated distributions, explicitly depending on any time-dependent noise schedule. Under additional regularity assumptions, taking advantage of favorable underlying contraction mechanisms, we provide a tighter error bound in Wasserstein distance compared to state-of-the-art results. In addition to being tractable, this upper bound jointly incorporates properties of the target distribution and SGM hyperparameters that need to be tuned during training.
[84] arXiv:2402.06886 (replaced) [pdf, ps, html, other]: Title: Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF

Han Shen, Zhuoran Yang, Tianyi Chen

Comments: The shorter version accepted in ICML 2024; Added related works

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Bilevel optimization has been recently applied to many machine learning tasks. However, their applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. But bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled as dynamic objective functions that go beyond the simple static objective structures, which pose significant challenges of using existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.
[85] arXiv:2402.08530 (replaced) [pdf, ps, html, other]: Title: A Distributional Analogue to the Successor Representation

Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, Mark Rowland

Comments: Accepted to ICML 2024. First two authors contributed equally

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.
[86] arXiv:2402.09469 (replaced) [pdf, ps, html, other]: Title: Fourier Circuits in Neural Networks: Unlocking the Potential of Large Language Models in Mathematical Reasoning and Modular Arithmetic

Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Tianyi Zhou

Comments: Update Section 5.3; clean up problem setup

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs and $m$ denote the network width. We demonstrate that a neuron count of $ m \geq 2^{2k-2} \cdot (p-1) $, these networks attain a maximum $ L_{2,k+1} $-margin on the dataset $ D_p $. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension of the intrinsic computational mechanisms of neural networks. Furthermore, we observe similar computational mechanisms in the attention matrix of the one-layer Transformer. This research stands as a significant stride in unraveling their operation complexities, particularly in the realm of complex algebraic tasks.
[87] arXiv:2402.13380 (replaced) [pdf, ps, html, other]: Title: Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with Transformers

Joshua F. Cooper, Seung Jin Choi, I. Esra Buyuktahtakin

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)

In this study, we introduce an innovative deep learning framework that employs a transformer model to address the challenges of mixed-integer programs, specifically focusing on the Capacitated Lot Sizing Problem (CLSP). Our approach, to our knowledge, is the first to utilize transformers to predict the binary variables of a mixed-integer programming (MIP) problem. Specifically, our approach harnesses the encoder decoder transformer's ability to process sequential data, making it well-suited for predicting binary variables indicating production setup decisions in each period of the CLSP. This problem is inherently dynamic, and we need to handle sequential decision making under constraints. We present an efficient algorithm in which CLSP solutions are learned through a transformer neural network. The proposed post-processed transformer algorithm surpasses the state-of-the-art solver, CPLEX and Long Short-Term Memory (LSTM) in solution time, optimal gap, and percent infeasibility over 240K benchmark CLSP instances tested. After the ML model is trained, conducting inference on the model, reduces the MIP into a linear program (LP). This transforms the ML-based algorithm, combined with an LP solver, into a polynomial-time approximation algorithm to solve a well-known NP-Hard problem, with almost perfect solution quality.
[88] arXiv:2402.13728 (replaced) [pdf, ps, html, other]: Title: Average gradient outer product as a mechanism for deep neural collapse

Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. Deep Recursive Feature Machines are a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate theoretically and empirically that DNC occurs in Deep Recursive Feature Machines as a consequence of the projection with the AGOP matrix computed at each layer. We then provide evidence that this mechanism holds for neural networks more generally. We show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.
[89] arXiv:2404.04057 (replaced) [pdf, ps, html, other]: Title: Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation

Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, Hai Huang

Comments: ICML 2024, PyTorch implementation: this https URL

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator. SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models. By reformulating forward diffusion processes as semi-implicit distributions, we leverage three score-related identities to create an innovative loss mechanism. This mechanism achieves rapid FID reduction by training the generator using its own synthesized images, eliminating the need for real data or reverse-diffusion-based generation, all accomplished within significantly shortened generation time. Upon evaluation across four benchmark datasets, the SiD algorithm demonstrates high iteration efficiency during distillation and surpasses competing distillation approaches, whether they are one-step or few-step, data-free, or dependent on training data, in terms of generation quality. This achievement not only redefines the benchmarks for efficiency and effectiveness in diffusion distillation but also in the broader field of diffusion-based generation. The PyTorch implementation is available at this https URL
[90] arXiv:2404.14202 (replaced) [pdf, ps, other]: Title: An Adaptive Approach for Infinitely Many-armed Bandits under Generalized Rotting Constraints

Jung-hun Kim, Milan Vojnovic, Se-Young Yun

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged. We explore two scenarios regarding the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and the other in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards. Our proposed algorithm achieves tight regret bounds for both slow and abrupt rotting scenarios. Lastly, we demonstrate the performance of our algorithm using numerical experiments.
[91] arXiv:2404.16127 (replaced) [pdf, ps, other]: Title: Comparison of static and dynamic random forests models for EHR data in the presence of competing risks: predicting central line-associated bloodstream infection

Elena Albu, Shan Gao, Pieter Stijnen, Frank Rademakers, Christel Janssens, Veerle Cossey, Yves Debaveye, Laure Wynants, Ben Van Calster

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Prognostic outcomes related to hospital admissions typically do not suffer from censoring, and can be modeled either categorically or as time-to-event. Competing events are common but often ignored. We compared the performance of random forest (RF) models to predict the risk of central line-associated bloodstream infections (CLABSI) using different outcome operationalizations. We included data from 27478 admissions to the University Hospitals Leuven, covering 30862 catheter episodes (970 CLABSI, 1466 deaths and 28426 discharges) to build static and dynamic RF models for binary (CLABSI vs no CLABSI), multinomial (CLABSI, discharge, death or no event), survival (time to CLABSI) and competing risks (time to CLABSI, discharge or death) outcomes to predict the 7-day CLABSI risk. We evaluated model performance across 100 train/test splits. Performance of binary, multinomial and competing risks models was similar: AUROC was 0.74 for baseline predictions, rose to 0.78 for predictions at day 5 in the catheter episode, and decreased thereafter. Survival models overestimated the risk of CLABSI (E:O ratios between 1.2 and 1.6), and had AUROCs about 0.01 lower than other models. Binary and multinomial models had lowest computation times. Models including multiple outcome events (multinomial and competing risks) display a different internal structure compared to binary and survival models. In the absence of censoring, complex modelling choices do not considerably improve the predictive performance compared to a binary model for CLABSI prediction in our studied settings. Survival models censoring the competing events at their time of occurrence should be avoided.
[92] arXiv:2405.13535 (replaced) [pdf, ps, html, other]: Title: Generalized Laplace Approximation

Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In recent years, the inconsistency in Bayesian deep learning has garnered increasing attention. Tempered or generalized posterior distributions often offer a direct and effective solution to this issue. However, understanding the underlying causes and evaluating the effectiveness of generalized posteriors remain active areas of research. In this study, we introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors. We interpret the generalization of the posterior with a temperature factor as a correction for misspecified models through adjustments to the joint probability model, and the recalibration of priors by redistributing probability mass on models within the hypothesis space using data samples. Additionally, we highlight a distinctive feature of Laplace approximation, which ensures that the generalized normalizing constant can be treated as invariant, unlike the typical scenario in general Bayesian learning where this constant varies with model parameters post-generalization. Building on this insight, we propose the generalized Laplace approximation, which involves a simple adjustment to the computation of the Hessian matrix of the regularized loss function. This method offers a flexible and scalable framework for obtaining high-quality posterior distributions. We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.

Total of 92 entries

Showing up to 2000 entries per page: fewer | more | all

Machine Learning

New submissions for Monday, 27 May 2024 (showing 29 of 29 entries )

Cross submissions for Monday, 27 May 2024 (showing 26 of 26 entries )

Replacement submissions for Monday, 27 May 2024 (showing 37 of 37 entries )