Machine Learning
- [1] arXiv:2405.13073 [pdf, ps, html, other]
-
Title: A graph-structured distance for heterogeneous datasets with meta variablesComments: 25 pages (without references), 5 figures, data and scripts available at this https URLSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Heterogeneous datasets emerge in various machine learning or optimization applications that feature different data sources, various data types and complex relationships between variables. In practice, heterogeneous datasets are often partitioned into smaller well-behaved ones that are easier to process. However, some applications involve expensive-to-generate or limited size datasets, which motivates methods based on the whole dataset. The first main contribution of this work is a modeling graph-structured framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks. This framework models domains that involve heterogeneous datasets in which variables may be continuous, integer, or categorical, with some identified as meta if their values determine the inclusion/exclusion or affect the bounds of other so-called decreed variables. Excluded variables are introduced to manage variables that are either included or excluded depending on the given points. The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables: any pair of points can be compared, allowing to work directly in heterogeneous datasets with meta variables. The contributions are illustrated with some regression experiments, in which the performance of a multilayer perceptron with respect to its hyperparameters is modeled with inverse distance weighting and $K$-nearest neighbors models.
- [2] arXiv:2405.13149 [pdf, ps, html, other]
-
Title: Gaussian Measures Conditioned on Nonlinear Observations: Consistency, MAP Estimators, and SimulationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)
The article presents a systematic study of the problem of conditioning a Gaussian random variable $\xi$ on nonlinear observations of the form $F \circ \phi(\xi)$ where $\phi: \mathcal{X} \to \mathbb{R}^N$ is a bounded linear operator and $F$ is nonlinear. Such problems arise in the context of Bayesian inference and recent machine learning-inspired PDE solvers. We give a representer theorem for the conditioned random variable $\xi \mid F\circ \phi(\xi)$, stating that it decomposes as the sum of an infinite-dimensional Gaussian (which is identified analytically) as well as a finite-dimensional non-Gaussian measure. We also introduce a novel notion of the mode of a conditional measure by taking the limit of the natural relaxation of the problem, to which we can apply the existing notion of maximum a posteriori estimators of posterior measures. Finally, we introduce a variant of the Laplace approximation for the efficient simulation of the aforementioned conditioned Gaussian random variables towards uncertainty quantification.
- [3] arXiv:2405.13160 [pdf, ps, html, other]
-
Title: Borrowing Strength in Distributionally Robust Optimization via Hierarchical Dirichlet ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper presents a novel optimization framework to address key challenges presented by modern machine learning applications: High dimensionality, distributional uncertainty, and data heterogeneity. Our approach unifies regularized estimation, distributionally robust optimization (DRO), and hierarchical Bayesian modeling in a single data-driven criterion. By employing a hierarchical Dirichlet process (HDP) prior, the method effectively handles multi-source data, achieving regularization, distributional robustness, and borrowing strength across diverse yet related data-generating processes. We demonstrate the method's advantages by establishing theoretical performance guarantees and tractable Monte Carlo approximations based on Dirichlet process (DP) theory. Numerical experiments validate the framework's efficacy in improving and stabilizing both prediction and parameter estimation accuracy, showcasing its potential for application in complex data environments.
- [4] arXiv:2405.13302 [pdf, ps, html, other]
-
Title: Accelerated Evaluation of Ollivier-Ricci Curvature Lower Bounds: Bridging Theory and ComputationSubjects: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Optimization and Control (math.OC)
Curvature serves as a potent and descriptive invariant, with its efficacy validated both theoretically and practically within graph theory. We employ a definition of generalized Ricci curvature proposed by Ollivier, which Lin and Yau later adapted to graph theory, known as Ollivier-Ricci curvature (ORC). ORC measures curvature using the Wasserstein distance, thereby integrating geometric concepts with probability theory and optimal transport. Jost and Liu previously discussed the lower bound of ORC by showing the upper bound of the Wasserstein distance. We extend the applicability of these bounds to discrete spaces with metrics on integers, specifically hypergraphs. Compared to prior work on ORC in hypergraphs by Coupette, Dalleiger, and Rieck, which faced computational challenges, our method introduces a simplified approach with linear computational complexity, making it particularly suitable for analyzing large-scale networks. Through extensive simulations and application to synthetic and real-world datasets, we demonstrate the significant improvements our method offers in evaluating ORC.
- [5] arXiv:2405.13456 [pdf, ps, html, other]
-
Title: Deep linear networks for regression are implicitly regularized towards flat minimaComments: 46 pages, 4 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for overdetermined univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
- [6] arXiv:2405.13481 [pdf, ps, html, other]
-
Title: Locally Private Estimation with Public FeaturesSubjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We initiate the study of locally differentially private (LDP) learning with public features. We define semi-feature LDP, where some features are publicly available while the remaining ones, along with the label, require protection under local differential privacy. Under semi-feature LDP, we demonstrate that the mini-max convergence rate for non-parametric regression is significantly reduced compared to that of classical LDP. Then we propose HistOfTree, an estimator that fully leverages the information contained in both public and private features. Theoretically, HistOfTree reaches the mini-max optimal convergence rate. Empirically, HistOfTree achieves superior performance on both synthetic and real data. We also explore scenarios where users have the flexibility to select features for protection manually. In such cases, we propose an estimator and a data-driven parameter tuning strategy, leading to analogous theoretical and empirical results.
- [7] arXiv:2405.13587 [pdf, ps, html, other]
-
Title: Exact Gradients for Stochastic Spiking Neural Networks Driven by Rough SignalsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
We introduce a mathematically rigorous framework based on rough path theory to model stochastic spiking neural networks (SSNNs) as stochastic differential equations with event discontinuities (Event SDEs) and driven by càdlàg rough paths. Our formalism is general enough to allow for potential jumps to be present both in the solution trajectories as well as in the driving noise. We then identify a set of sufficient conditions ensuring the existence of pathwise gradients of solution trajectories and event times with respect to the network's parameters and show how these gradients satisfy a recursive relation. Furthermore, we introduce a general-purpose loss function defined by means of a new class of signature kernels indexed on càdlàg rough paths and use it to train SSNNs as generative models. We provide an end-to-end autodifferentiable solver for Event SDEs and make its implementation available as part of the $\texttt{diffrax}$ library. Our framework is, to our knowledge, the first enabling gradient-based training of SSNNs with noise affecting both the spike timing and the network's dynamics.
- [8] arXiv:2405.13731 [pdf, ps, html, other]
-
Title: Control, Transport and Sampling: Towards Better Loss DesignSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Leveraging connections between diffusion-based sampling, optimal transport, and optimal stochastic control through their shared links to the Schrödinger bridge problem, we propose novel objective functions that can be used to transport $\nu$ to $\mu$, consequently sample from the target $\mu$, via optimally controlled dynamics. We highlight the importance of the pathwise perspective and the role various optimality conditions on the path measure can play for the design of valid training losses, the careful choice of which offer numerical advantages in practical implementation.
- [9] arXiv:2405.13794 [pdf, ps, html, other]
-
Title: Conditioning diffusion models by explicit forward-backward bridgingComments: 24 pages, 12 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Given an unconditional diffusion model $\pi(x, y)$, using it to perform conditional simulation $\pi(x \mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express conditional simulation as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\pi(x \mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples.
- [10] arXiv:2405.13846 [pdf, ps, html, other]
-
Title: Regression Trees Know CalculusComments: Comments very welcome!Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Regression trees have emerged as a preeminent tool for solving real-world regression problems due to their ability to deal with nonlinearities, interaction effects and sharp discontinuities. In this article, we rather study regression trees applied to well-behaved, differentiable functions, and determine the relationship between node parameters and the local gradient of the function being approximated. We find a simple estimate of the gradient which can be efficiently computed using quantities exposed by popular tree learning libraries. This allows the tools developed in the context of differentiable algorithms, like neural nets and Gaussian processes, to be deployed to tree-based models. To demonstrate this, we study measures of model sensitivity defined in terms of integrals of gradients and demonstrate how to compute them for regression trees using the proposed gradient estimates. Quantitative and qualitative numerical experiments reveal the capability of gradients estimated by regression trees to improve predictive analysis, solve tasks in uncertainty quantification, and provide interpretation of model behavior.
- [11] arXiv:2405.13899 [pdf, ps, html, other]
-
Title: Symmetric Linear Bandits with Hidden SymmetrySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
High-dimensional linear bandits with low-dimensional structure have received considerable attention in recent studies due to their practical significance. The most common structure in the literature is sparsity. However, it may not be available in practice. Symmetry, where the reward is invariant under certain groups of transformations on the set of arms, is another important inductive bias in the high-dimensional case that covers many standard structures, including sparsity. In this work, we study high-dimensional symmetric linear bandits where the symmetry is hidden from the learner, and the correct symmetry needs to be learned in an online setting. We examine the structure of a collection of hidden symmetry and provide a method based on model selection within the collection of low-dimensional subspaces. Our algorithm achieves a regret bound of $ O(d_0^{1/3} T^{2/3} \log(d))$, where $d$ is the ambient dimension which is potentially very large, and $d_0$ is the dimension of the true low-dimensional subspace such that $d_0 \ll d$. With an extra assumption on well-separated models, we can further improve the regret to $ O(d_0\sqrt{T\log(d)} )$.
- [12] arXiv:2405.13950 [pdf, ps, html, other]
-
Title: Actor-critic algorithms for fiber sampling problemsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose an actor-critic algorithm for a family of complex problems arising in algebraic statistics and discrete optimization. The core task is to produce a sample from a finite subset of the non-negative integer lattice defined by a high-dimensional polytope. We translate the problem into a Markov decision process and devise an actor-critic reinforcement learning (RL) algorithm to learn a set of good moves that can be used for sampling. We prove that the actor-critic algorithm converges to an approximately optimal sampling policy.
To tackle complexity issues that typically arise in these sampling problems, and to allow the RL to function at scale, our solution strategy takes three steps: decomposing the starting point of the sample, using RL on each induced subproblem, and reconstructing to obtain a sample in the original polytope. In this setup, the proof of convergence applies to each subproblem in the decomposition.
We test the method in two regimes. In statistical applications, a high-dimensional polytope arises as the support set for the reference distribution in a model/data fit test for a broad family of statistical models for categorical data. We demonstrate how RL can be used for model fit testing problems for data sets for which traditional MCMC samplers converge too slowly due to problem size and sparsity structure. To test the robustness of the algorithm and explore its generalization properties, we apply it to synthetically generated data of various sizes and sparsity levels. - [13] arXiv:2405.13962 [pdf, ps, html, other]
-
Title: Learning heavy-tailed distributions with Wasserstein-proximal-regularized $\alpha$-divergencesComments: 23 pages, 7 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we propose Wasserstein proximals of $\alpha$-divergences as suitable objective functionals for learning heavy-tailed distributions in a stable manner. First, we provide sufficient, and in some cases necessary, relations among data dimension, $\alpha$, and the decay rate of data distributions for the Wasserstein-proximal-regularized divergence to be finite. Finite-sample convergence rates for the estimation in the case of the Wasserstein-1 proximal divergences are then provided under certain tail conditions. Numerical experiments demonstrate stable learning of heavy-tailed distributions -- even those without first or second moment -- without any explicit knowledge of the tail behavior, using suitable generative models such as GANs and flow-based models related to our proposed Wasserstein-proximal-regularized $\alpha$-divergences. Heuristically, $\alpha$-divergences handle the heavy tails and Wasserstein proximals allow non-absolute continuity between distributions and control the velocities of flow-based algorithms as they learn the target distribution deep into the tails.
- [14] arXiv:2405.13997 [pdf, ps, html, other]
-
Title: Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of ExpertsComments: 31 pages, 2 figures. arXiv admin note: text overlap with arXiv:2402.02952Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that sigmoid gating, in fact, enjoys a higher sample efficiency than softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case in which the number of experts fitted is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate identifiability conditions for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under sigmoid gating than softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
- [15] arXiv:2405.14038 [pdf, ps, html, other]
-
Title: FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear BanditsComments: 28 pages, 1 figureSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, \textbf{F}orgetfu\textbf{L} \textbf{I}terative \textbf{P}rivate \textbf{HA}rd \textbf{T}hresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret up to logarithmic factors. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of parallel interest.
- [16] arXiv:2405.14064 [pdf, ps, html, other]
-
Title: Building a stable classifier with the inflated argmaxSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
- [17] arXiv:2405.14131 [pdf, ps, html, other]
-
Title: Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of ExpertsComments: 44 pages, 2 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The cosine router in sparse Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in sparse MoE has been lacking. Considering the least square estimation of the cosine routing sparse MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$ where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router -- simply adding noises to the $\mathbb{L}_{2}$ norms in the cosine router, which we refer to as \textit{perturbed cosine router}. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing sparse MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
- [18] arXiv:2405.14285 [pdf, ps, html, other]
-
Title: Computing the Bias of Constant-step Stochastic Approximation with Markovian NoiseComments: PreprintSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We study stochastic approximation algorithms with Markovian noise and constant step-size $\alpha$. We develop a method based on infinitesimal generator comparisons to study the bias of the algorithm, which is the expected difference between $\theta_n$ -- the value at iteration $n$ -- and $\theta^*$ -- the unique equilibrium of the corresponding ODE. We show that, under some smoothness conditions, this bias is of order $O(\alpha)$. Furthermore, we show that the time-averaged bias is equal to $\alpha V + O(\alpha^2)$, where $V$ is a constant characterized by a Lyapunov equation, showing that $\esp{\bar{\theta}_n} \approx \theta^*+V\alpha + O(\alpha^2)$, where $\bar{\theta}_n=(1/n)\sum_{k=1}^n\theta_k$ is the Polyak-Ruppert average. We also show that $\bar{\theta}_n$ converges with high probability around $\theta^*+\alpha V$. We illustrate how to combine this with Richardson-Romberg extrapolation to derive an iterative scheme with a bias of order $O(\alpha^2)$.
- [19] arXiv:2405.14335 [pdf, ps, html, other]
-
Title: Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and LearningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The bound for LS is provably tighter than all its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.
- [20] arXiv:2405.14374 [pdf, ps, html, other]
-
Title: State-Constrained Offline Reinforcement LearningSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Traditional offline reinforcement learning methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the algorithm greatly. In this paper, we alleviate this limitation by introducing a novel framework named \emph{state-constrained} offline reinforcement learning. By exclusively focusing on the dataset's state distribution, our framework significantly enhances learning potential and reduces previous limitations. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline reinforcement learning. Our research is underpinned by solid theoretical findings that pave the way for subsequent advancements in this domain. Additionally, we introduce StaCQ, a deep learning algorithm that is both performance-driven on the D4RL benchmark datasets and closely aligned with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in state-constrained offline reinforcement learning.
- [21] arXiv:2405.14459 [pdf, ps, html, other]
-
Title: Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic RegularizationFerdinand Genans-Boiteux (LPSM (UMR\_8001)), Antoine Godichon-Baggioni (LPSM (UMR\_8001)), François-Xavier Vialard (LIGM), Olivier Wintenberger (LPSM (UMR\_8001))Subjects: Machine Learning (stat.ML)
Optimal Transport (OT) based distances are powerful tools for machine learning to compare probability measures and manipulate them using OT maps. In this field, a setting of interest is semi-discrete OT, where the source measure $\mu$ is continuous, while the target $\nu$ is discrete. Recent works have shown that the minimax rate for the OT map is $\mathcal{O}(t^{-1/2})$ when using $t$ i.i.d. subsamples from each measure (two-sample setting). An open question is whether a better convergence rate can be achieved when the full information of the discrete measure $\nu$ is known (one-sample setting). In this work, we answer positively to this question by (i) proving an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cells estimation and density support estimation, and (ii) proposing a Stochastic Gradient Descent (SGD) algorithm with adaptive entropic regularization and averaging acceleration. To nearly achieve the desired fast rate, characteristic of non-regular parametric problems, we design an entropic regularization scheme decreasing with the number of samples. Another key step in our algorithm consists of using a projection step that permits to leverage the local strong convexity of the regularized OT problem. Our convergence analysis integrates online convex optimization and stochastic gradient techniques, complemented by the specificities of the OT semi-dual. Moreover, while being as computationally and memory efficient as vanilla SGD, our algorithm achieves the unusual fast rates of our theory in numerical experiments.
- [22] arXiv:2405.14532 [pdf, ps, html, other]
-
Title: Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein ProblemComments: 28 pages, 1 figure. Comments are most welcome!Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
The Procrustes-Wasserstein problem consists in matching two high-dimensional point clouds in an unsupervised setting, and has many applications in natural language processing and computer vision. We consider a planted model with two datasets $X,Y$ that consist of $n$ datapoints in $\mathbb{R}^d$, where $Y$ is a noisy version of $X$, up to an orthogonal transformation and a relabeling of the data points. This setting is related to the graph alignment problem in geometric models. In this work, we focus on the euclidean transport cost between the point clouds as a measure of performance for the alignment. We first establish information-theoretic results, in the high ($d \gg \log n$) and low ($d \ll \log n$) dimensional regimes. We then study computational aspects and propose the Ping-Pong algorithm, alternatively estimating the orthogonal transformation and the relabeling, initialized via a Franke-Wolfe convex relaxation. We give sufficient conditions for the method to retrieve the planted signal after one single step. We provide experimental results to compare the proposed approach with the state-of-the-art method of Grave et al. (2019).
- [23] arXiv:2405.14540 [pdf, ps, html, other]
-
Title: This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian OptimizationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian Optimization (BO) has proven to be very successful at optimizing a static, noisy, costly-to-evaluate black-box function $f : \mathcal{S} \to \mathbb{R}$. However, optimizing a black-box which is also a function of time (i.e., a dynamic function) $f : \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ remains a challenge, since a dynamic Bayesian Optimization (DBO) algorithm has to keep track of the optimum over time. This changes the nature of the optimization problem in at least three aspects: (i) querying an arbitrary point in $\mathcal{S} \times \mathcal{T}$ is impossible, (ii) past observations become less and less relevant for keeping track of the optimum as time goes by and (iii) the DBO algorithm must have a high sampling frequency so it can collect enough relevant observations to keep track of the optimum through time. In this paper, we design a Wasserstein distance-based criterion able to quantify the relevancy of an observation with respect to future predictions. Then, we leverage this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus maintaining simultaneously a good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon. Numerical experiments establish the superiority of W-DBO, which outperforms state-of-the-art methods by a comfortable margin.
- [24] arXiv:2405.14574 [pdf, ps, html, other]
-
Title: Learning with Fitzpatrick LossesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Fenchel-Young losses are a family of convex loss functions, encompassing the squared, logistic and sparsemax losses, among others. Each Fenchel-Young loss is implicitly associated with a link function, for mapping model outputs to predictions. For instance, the logistic loss is associated with the soft argmax link function. Can we build new loss functions associated with the same link function as Fenchel-Young losses? In this paper, we introduce Fitzpatrick losses, a new family of convex loss functions based on the Fitzpatrick function. A well-known theoretical tool in maximal monotone operator theory, the Fitzpatrick function naturally leads to a refined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses, while maintaining the same link function for prediction. As an example, we introduce the Fitzpatrick logistic loss and the Fitzpatrick sparsemax loss, counterparts of the logistic and the sparsemax losses. This yields two new tighter losses associated with the soft argmax and the sparse argmax, two of the most ubiquitous output layers used in machine learning. We study in details the properties of Fitzpatrick losses and in particular, we show that they can be seen as Fenchel-Young losses using a modified, target-dependent generating function. We demonstrate the effectiveness of Fitzpatrick losses for label proportion estimation.
- [25] arXiv:2405.14630 [pdf, ps, html, other]
-
Title: Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimensionComments: 47 pagesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bounds on the smallest eigenvalue of the neural tangent kernel (NTK) are a key ingredient in the analysis of neural network optimization and memorization. However, existing results require distributional assumptions on the data and are limited to a high-dimensional setting, where the input dimension $d_0$ scales at least logarithmically in the number of samples $n$. In this work we remove both of these requirements and instead provide bounds in terms of a measure of the collinearity of the data: notably these bounds hold with high probability even when $d_0$ is held constant versus $n$. We prove our results through a novel application of the hemisphere transform.
- [26] arXiv:2405.14778 [pdf, ps, html, other]
-
Title: Optimal Rates for Vector-Valued Spectral Regularization Learning AlgorithmsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present the upper bound for the finite sample risk general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside of the hypothesis space) which is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.
- [27] arXiv:2405.14840 [pdf, ps, html, other]
-
Title: Differentiable Annealed Importance Sampling Minimizes The Jensen-Shannon Divergence Between Initial and Target DistributionComments: 22 pages, including 9 pages of main text and 11 pages of appendix, conference paper at ICML 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Differentiable annealed importance sampling (DAIS), proposed by Geffner & Domke (2021) and Zhang et al. (2021), allows optimizing, among others, over the initial distribution of AIS. In this paper, we show that, in the limit of many transitions, DAIS minimizes the symmetrized KL divergence (Jensen-Shannon divergence) between the initial and target distribution. Thus, DAIS can be seen as a form of variational inference (VI) in that its initial distribution is a parametric fit to an intractable target distribution. We empirically evaluate the usefulness of the initial distribution as a variational distribution on synthetic and real-world data, observing that it often provides more accurate uncertainty estimates than standard VI (optimizing the reverse KL divergence), importance weighted VI, and Markovian score climbing (optimizing the forward KL divergence).
- [28] arXiv:2405.14848 [pdf, ps, html, other]
-
Title: Local Causal Discovery for Structural Evidence of Direct DiscriminationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient causal discovery methods for fairness applications. To this end, we introduce local discovery for direct discrimination (LD3): a polynomial-time algorithm that recovers structural evidence of direct discrimination. LD3 performs a linear number of conditional independence tests with respect to variable set size. Moreover, we propose a graphical criterion for identifying the weighted controlled direct effect (CDE), a qualitative measure of direct discrimination. We prove that this criterion is satisfied by the knowledge returned by LD3, increasing the accessibility of the weighted CDE as a causal fairness measure. Taking liver transplant allocation as a case study, we highlight the potential impact of LD3 for modeling fairness in complex decision systems. Results on real-world data demonstrate more plausible causal relations than baselines, which took 197x to 5870x longer to execute.
New submissions for Friday, 24 May 2024 (showing 28 of 28 entries )
- [29] arXiv:2405.13153 (cross-list from math.ST) [pdf, ps, html, other]
-
Title: Max-sliced Wasserstein concentration and uniform ratio bounds of empirical measures on RKHSSubjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Optimal transport and the Wasserstein distance $\mathcal{W}_p$ have recently seen a number of applications in the fields of statistics, machine learning, data science, and the physical sciences. These applications are however severely restricted by the curse of dimensionality, meaning that the number of data points needed to estimate these problems accurately increases exponentially in the dimension. To alleviate this problem, a number of variants of $\mathcal{W}_p$ have been introduced. We focus here on one of these variants, namely the max-sliced Wasserstein metric $\overline{\mathcal{W}}_p$. This metric reduces the high-dimensional minimization problem given by $\mathcal{W}_p$ to a maximum of one-dimensional measurements in an effort to overcome the curse of dimensionality. In this note we derive concentration results and upper bounds on the expectation of $\overline{\mathcal{W}}_p$ between the true and empirical measure on unbounded reproducing kernel Hilbert spaces. We show that, under quite generic assumptions, probability measures concentrate uniformly fast in one-dimensional subspaces, at (nearly) parametric rates. Our results rely on an improvement of currently known bounds for $\overline{\mathcal{W}}_p$ in the finite-dimensional case.
- [30] arXiv:2405.13346 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Convergence of the Deep Galerkin Method for Mean Field Control ProblemsComments: 27 pages, 6 figuresSubjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
We establish the convergence of the deep Galerkin method (DGM), a deep learning-based scheme for solving high-dimensional nonlinear PDEs, for Hamilton-Jacobi-Bellman (HJB) equations that arise from the study of mean field control problems (MFCPs). Based on a recent characterization of the value function of the MFCP as the unique viscosity solution of an HJB equation on the simplex, we establish both an existence and convergence result for the DGM. First, we show that the loss functional of the DGM can be made arbitrarily small given that the value function of the MFCP possesses sufficient regularity. Then, we show that if the loss functional of the DGM converges to zero, the corresponding neural network approximators must converge uniformly to the true value function on the simplex. We also provide numerical experiments demonstrating the DGM's ability to generalize to high-dimensional HJB equations.
- [31] arXiv:2405.13353 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Adaptive Bayesian Multivariate Spline Knot Inference with Prior Specifications on Model ComplexitySubjects: Methodology (stat.ME); Machine Learning (stat.ML)
In multivariate spline regression, the number and locations of knots influence the performance and interpretability significantly. However, due to non-differentiability and varying dimensions, there is no desirable frequentist method to make inference on knots. In this article, we propose a fully Bayesian approach for knot inference in multivariate spline regression. The existing Bayesian method often uses BIC to calculate the posterior, but BIC is too liberal and it will heavily overestimate the knot number when the candidate model space is large. We specify a new prior on the knot number to take into account the complexity of the model space and derive an analytic formula in the normal model. In the non-normal cases, we utilize the extended Bayesian information criterion to approximate the posterior density. The samples are simulated in the space with differing dimensions via reversible jump Markov chain Monte Carlo. We apply the proposed method in knot inference and manifold denoising. Experiments demonstrate the splendid capability of the algorithm, especially in function fitting with jumping discontinuity.
- [32] arXiv:2405.13375 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Adaptive Data Analysis for Growing DataSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis in the dynamic data setting. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grows with the square-root of the number of adaptive queries, matching prior works' improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.
- [33] arXiv:2405.13392 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Local convergence of min-max algorithms to differentiable equilibrium on Riemannian manifoldComments: under reviewSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study min-max algorithms to solve zero-sum differentiable games on Riemannian manifold. The notions of differentiable Stackelberg equilibrium and differentiable Nash equilibrium in Euclidean space are generalized to Riemannian manifold, through an intrinsic definition which does not depend on the choice of local coordinate chart of manifold. We then provide sufficient conditions for the local convergence of the deterministic simultaneous algorithms $\tau$-GDA and $\tau$-SGA near such equilibrium, using a general methodology based on spectral analysis. These algorithms are extended with stochastic gradients and applied to the training of Wasserstein GAN. The discriminator of GAN is constructed from Lipschitz-continuous functions based on Stiefel manifold. We show numerically how the insights obtained from the local convergence analysis may lead to an improvement of GAN models.
- [34] arXiv:2405.13396 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Why In-Context Learning Transformers are Tabular Data ClassifiersComments: 9 pages main body, 22 pages total. Preprint under reviewSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. As synthetic data does not share features or labels with real-world data, the underlying mechanism that contributes to the success of this method remains unclear. This study provides an explanation by demonstrating that ICL-transformers acquire the ability to create complex decision boundaries during pretraining. To validate our claim, we develop a novel forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. Our experiments confirm the effectiveness of ICL-transformers pretrained on this data. Furthermore, we create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator. By fine-tuning this model, we reach the current state-of-the-art on tabular data classification. Code is available at this https URL.
- [35] arXiv:2405.13535 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Generalized Laplace ApproximationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In recent years, the inconsistency in Bayesian deep learning has garnered increasing attention. Tempered or generalized posterior distributions often offer a direct and effective solution to this issue. However, understanding the underlying causes and evaluating the effectiveness of generalized posteriors remain active areas of research. In this study, we introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors. We interpret the generalization of the posterior with a temperature factor as a correction for misspecified models through adjustments to the joint probability model, and the recalibration of priors by redistributing probability mass on models within the hypothesis space using data samples. Additionally, we highlight a distinctive feature of Laplace approximation, which ensures that the generalized normalizing constant can be treated as invariant, unlike the typical scenario in general Bayesian learning where this constant varies with model parameters post-generalization. Building on this insight, we propose the generalized Laplace approximation, which involves a simple adjustment to the computation of the Hessian matrix of the regularized loss function. This method offers a flexible and scalable framework for obtaining high-quality posterior distributions. We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.
- [36] arXiv:2405.13682 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Constructive Universal Approximation Theorems for Deep Joint-Equivariant Networks by Schur's LemmaSubjects: Machine Learning (cs.LG); Representation Theory (math.RT); Machine Learning (stat.ML)
We present a unified constructive universal approximation theorem covering a wide range of learning machines including both shallow and deep neural networks based on the group representation theory. Constructive here means that the distribution of parameters is given in a closed-form expression (called the ridgelet transform). Contrary to the case of shallow models, expressive power analysis of deep models has been conducted in a case-by-case manner. Recently, Sonoda et al. (2023a,b) developed a systematic method to show a constructive approximation theorem from scalar-valued joint-group-invariant feature maps, covering a formal deep network. However, each hidden layer was formalized as an abstract group action, so it was not possible to cover real deep networks defined by composites of nonlinear activation function. In this study, we extend the method for vector-valued joint-group-equivariant feature maps, so to cover such real networks.
- [37] arXiv:2405.13691 (cross-list from physics.flu-dyn) [pdf, ps, html, other]
-
Title: Neural Networks-based Random Vortex Methods for Modelling Incompressible FlowsComments: 16 pages, 5 figuresSubjects: Fluid Dynamics (physics.flu-dyn); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
In this paper we introduce a novel Neural Networks-based approach for approximating solutions to the (2D) incompressible Navier--Stokes equations. Our algorithm uses a Physics-informed Neural Network, that approximates the vorticity based on a loss function that uses a computationally efficient formulation of the Random Vortex dynamics. The neural vorticity estimator is then combined with traditional numerical PDE-solvers for the Poisson equation to compute the velocity field. The main advantage of our method compared to standard Physics-informed Neural Networks is that it strictly enforces physical properties, such as incompressibility or boundary conditions, which might otherwise be hard to guarantee with purely Neural Networks-based approaches.
- [38] arXiv:2405.13712 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Learning Diffusion Priors from Observations by Expectation MaximizationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models recently proved to be remarkable priors for Bayesian inverse problems. However, training these models typically requires access to large amounts of clean data, which could prove difficult in some settings. In this work, we present a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only. Unlike previous works, our method leads to proper diffusion models, which is crucial for downstream tasks. As part of our method, we propose and motivate a new posterior sampling scheme for unconditional diffusion models. We present empirical evidence supporting the effectiveness of our method.
- [39] arXiv:2405.13785 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and SubsamplingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Gaussian Process Regression (GPR) is widely used in statistics and machine learning for prediction tasks requiring uncertainty measures. Its efficacy depends on the appropriate specification of the mean function, covariance kernel function, and associated hyperparameters. Severe misspecifications can lead to inaccurate results and problematic consequences, especially in safety-critical applications. However, a systematic approach to handle these misspecifications is lacking in the literature. In this work, we propose a general framework to address these issues. Firstly, we introduce a flexible two-stage GPR framework that separates mean prediction and uncertainty quantification (UQ) to prevent mean misspecification, which can introduce bias into the model. Secondly, kernel function misspecification is addressed through a novel automatic kernel search algorithm, supported by theoretical analysis, that selects the optimal kernel from a candidate set. Additionally, we propose a subsampling-based warm-start strategy for hyperparameter initialization to improve efficiency and avoid hyperparameter misspecification. With much lower computational cost, our subsampling-based strategy can yield competitive or better performance than training exclusively on the full dataset. Combining all these components, we recommend two GPR methods-exact and scalable-designed to match available computational resources and specific UQ requirements. Extensive evaluation on real-world datasets, including UCI benchmarks and a safety-critical medical case study, demonstrates the robustness and precision of our methods.
- [40] arXiv:2405.13844 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Causal Inference with CocyclesSubjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Many interventions in causal inference can be represented as transformations. We identify a local symmetry property satisfied by a large class of causal models under such interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional and counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show they achieve semiparametric efficiency under typical conditions. Since (infinitely) many distributions can share the same cocycle, these estimators make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.
- [41] arXiv:2405.13888 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Marrying Causal Representation Learning with Dynamical Systems for ScienceComments: 21 pages, 8 figures, 6 tablesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal representation learning promises to extend causal models to hidden causal variables from raw entangled measurements. However, most progress has focused on proving identifiability results in different settings, and we are not aware of any successful real-world application. At the same time, the field of dynamical systems benefited from deep learning and scaled to countless applications but does not allow parameter identification. In this paper, we draw a clear connection between the two and their key assumptions, allowing us to apply identifiable methods developed in causal representation learning to dynamical systems. At the same time, we can leverage scalable differentiable solvers developed for differential equations to build models that are both identifiable and practical. Overall, we learn explicitly controllable models that isolate the trajectory-specific parameters for further downstream tasks such as out-of-distribution classification or treatment effect estimation. We experiment with a wind simulator with partially known factors of variation. We also apply the resulting model to real-world climate data and successfully answer downstream causal questions in line with existing literature on climate change.
- [42] arXiv:2405.13910 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Learning Latent Space Hierarchical EBM Diffusion ModelsSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
This work studies the learning problem of the energy-based prior model and the multi-layer generator model. The multi-layer generator model, which contains multiple layers of latent variables organized in a top-down hierarchical structure, typically assumes the Gaussian prior model. Such a prior model can be limited in modelling expressivity, which results in a gap between the generator posterior and the prior model, known as the prior hole problem. Recent works have explored learning the energy-based (EBM) prior model as a second-stage, complementary model to bridge the gap. However, the EBM defined on a multi-layer latent space can be highly multi-modal, which makes sampling from such marginal EBM prior challenging in practice, resulting in ineffectively learned EBM. To tackle the challenge, we propose to leverage the diffusion probabilistic scheme to mitigate the burden of EBM sampling and thus facilitate EBM learning. Our extensive experiments demonstrate a superior performance of our diffusion-learned EBM prior on various challenging tasks.
- [43] arXiv:2405.13912 (cross-list from math.ST) [pdf, ps, html, other]
-
Title: Matrix Denoising with Doubly Heteroscedastic Noise: Fundamental Limits and Optimal Spectral MethodsSubjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
We study the matrix denoising problem of estimating the singular vectors of a rank-$1$ signal corrupted by noise with both column and row correlations. Existing works are either unable to pinpoint the exact asymptotic estimation error or, when they do so, the resulting approaches (e.g., based on whitening or singular value shrinkage) remain vastly suboptimal. On top of this, most of the literature has focused on the special case of estimating the left singular vector of the signal when the noise only possesses row correlation (one-sided heteroscedasticity). In contrast, our work establishes the information-theoretic and algorithmic limits of matrix denoising with doubly heteroscedastic noise. We characterize the exact asymptotic minimum mean square error, and design a novel spectral estimator with rigorous optimality guarantees: under a technical condition, it attains positive correlation with the signals whenever information-theoretically possible and, for one-sided heteroscedasticity, it also achieves the Bayes-optimal error. Numerical experiments demonstrate the significant advantage of our theoretically principled method with the state of the art. The proofs draw connections with statistical physics and approximate message passing, departing drastically from standard random matrix theory techniques.
- [44] arXiv:2405.13922 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Towards Certification of Uncertainty Calibration under Adversarial AttacksComments: 11 pages main paper, appendix includedSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, \textit{certification methods} have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier score or the expected calibration error. We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the expected calibration error. Finally, we propose novel calibration attacks and demonstrate how they can improve model calibration through \textit{adversarial calibration training}.
- [45] arXiv:2405.13975 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: There is HOPE to Avoid HiPPOs for Long-memory State Space ModelsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. However, these models typically face several challenges: (i) they require specifically designed initializations of the system matrices to achieve state-of-the-art performance, (ii) they require training of state matrices on a logarithmic scale with very small learning rates to prevent instabilities, and (iii) they require the model to have exponentially decaying memory in order to ensure an asymptotically stable LTI system. To address these issues, we view SSMs through the lens of Hankel operator theory, which provides us with a unified theory for the initialization and training of SSMs. Building on this theory, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. This approach allows for random initializations of the LTI systems and helps to improve training stability, while also provides the SSMs with non-decaying memory capabilities. Our model efficiently implements these innovations by nonuniformly sampling the transfer functions of LTI systems, and it requires fewer parameters compared to canonical SSMs. When benchmarked against HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel operators demonstrates improved performance on Long-Range Arena (LRA) tasks. Moreover, we use a sequential CIFAR-10 task with padded noise to empirically corroborate our SSM's long memory capacity.
- [46] arXiv:2405.13977 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Removing Bias from Maximum Likelihood Estimation with Model AutophagyComments: 9 Pages, submission for NeurIPS 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose autophagy penalized likelihood estimation (PLE), an unbiased alternative to maximum likelihood estimation (MLE) which is more fair and less susceptible to model autophagy disorder (madness). Model autophagy refers to models trained on their own output; PLE ensures the statistics of these outputs coincide with the data statistics. This enables PLE to be statistically unbiased in certain scenarios where MLE is biased. When biased, MLE unfairly penalizes minority classes in unbalanced datasets and exacerbates the recently discovered issue of self-consuming generative modeling. Theoretical and empirical results show that 1) PLE is more fair to minority classes and 2) PLE is more stable in a self-consumed setting. Furthermore, we provide a scalable and portable implementation of PLE with a hypernetwork framework, allowing existing deep learning architectures to be easily trained with PLE. Finally, we show PLE can bridge the gap between Bayesian and frequentist paradigms in statistics.
- [47] arXiv:2405.13987 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Analysis of Corrected Graph ConvolutionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Statistics Theory (math.ST); Machine Learning (stat.ML)
Machine learning for node classification on graphs is a prominent area driven by applications such as recommendation systems. State-of-the-art models often use multiple graph convolutions on the data, as empirical evidence suggests they can enhance performance. However, it has been shown empirically and theoretically, that too many graph convolutions can degrade performance significantly, a phenomenon known as oversmoothing. In this paper, we provide a rigorous theoretical analysis, based on the contextual stochastic block model (CSBM), of the performance of vanilla graph convolution from which we remove the principal eigenvector to avoid oversmoothing. We perform a spectral analysis for $k$ rounds of corrected graph convolutions, and we provide results for partial and exact classification. For partial classification, we show that each round of convolution can reduce the misclassification error exponentially up to a saturation level, after which performance does not worsen. For exact classification, we show that the separability threshold can be improved exponentially up to $O({\log{n}}/{\log\log{n}})$ corrected convolutions.
- [48] arXiv:2405.13998 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Bridging Operator Learning and Conditioned Neural Fields: A Unifying PerspectiveComments: 23 pages, 13 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Operator learning is an emerging area of machine learning which aims to learn mappings between infinite dimensional function spaces. Here we uncover a connection between operator learning architectures and conditioned neural fields from computer vision, providing a unified perspective for examining differences between popular operator learning models. We find that many commonly used operator learning models can be viewed as neural fields with conditioning mechanisms restricted to point-wise and/or global information. Motivated by this, we propose the Continuous Vision Transformer (CViT), a novel neural operator architecture that employs a vision transformer encoder and uses cross-attention to modulate a base field constructed with a trainable grid-based positional encoding of query coordinates. Despite its simplicity, CViT achieves state-of-the-art results across challenging benchmarks in climate modeling and fluid dynamics. Our contributions can be viewed as a first step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in physical sciences.
- [49] arXiv:2405.14066 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Online Classification with PredictionsComments: 24 pagesSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worse-case analysis by using machine-learned predictions and distributional assumptions respectively.
- [50] arXiv:2405.14088 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: High-dimensional Learning with Noisy LabelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label noisiness aware loss function, when both the dimension of data $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory by supposing a Gaussian mixture data model, the performance of the linear classifier when $p,n\to \infty$ is shown to converge towards a limit, involving scalar statistics of the data. Importantly, our findings show that the low-dimensional intuitions to handle label noise do not hold in high-dimension, in the sense that the optimal classifier in low-dimension dramatically fails in high-dimension. Based on our derivations, we design an optimized method that is shown to be provably more efficient in handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where we show that our optimized approach outperforms the considered baselines.
- [51] arXiv:2405.14094 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Attending to Topological Spaces: The Cellular TransformerRubén Ballester, Pablo Hernández-García, Mathilde Papillon, Claudio Battiloro, Nina Miolane, Tolga Birdal, Carles Casacuberta, Sergio Escalera, Mustafa HajijSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Topological Deep Learning seeks to enhance the predictive performance of neural network models by harnessing topological structures in input data. Topological neural networks operate on spaces such as cell complexes and hypergraphs, that can be seen as generalizations of graphs. In this work, we introduce the Cellular Transformer (CT), a novel architecture that generalizes graph-based transformers to cell complexes. First, we propose a new formulation of the usual self- and cross-attention mechanisms, tailored to leverage incidence relations in cell complexes, e.g., edge-face and node-edge relations. Additionally, we propose a set of topological positional encodings specifically designed for cell complexes. By transforming three graph datasets into cell complex datasets, our experiments reveal that CT not only achieves state-of-the-art performance, but it does so without the need for more complex enhancements such as virtual nodes, in-domain structural encodings, or graph rewiring.
- [52] arXiv:2405.14392 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing FlowsSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Continuous normalizing flows (CNFs) learn the probability path between a reference and a target density by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we re-purpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective and using the learned probability path to improve Monte Carlo sampling. We propose a sequential method, which uses samples from a Markov chain to fix the probability path defining the FM objective. We augment this scheme with an adaptive tempering mechanism that allows the discovery of multiple modes in the target. Under mild assumptions, we establish convergence to a local optimum of the FM objective, discuss improvements in the convergence rate, and illustrate our methods on synthetic and real-world examples.
- [53] arXiv:2405.14402 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Exact Gauss-Newton Optimization for Training Deep Neural NetworksSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.
- [54] arXiv:2405.14425 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: When predict can also explain: few-shot prediction to select better neural latentsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. In this study, we reveal the limitations of the widely-used 'co-smoothing' prediction framework and propose an improved few-shot prediction approach that encourages more accurate latent dynamics. Utilizing a student-teacher setup with Hidden Markov Models, we demonstrate that the high co-smoothing model space can encompass models with arbitrary extraneous dynamics within their latent representations. To address this, we introduce a secondary metric -- a few-shot version of co-smoothing. This involves performing regression from the latent variables to held-out channels in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in the few-shot co-smoothing compared to 'minimal' models devoid of such dynamics. We also provide analytical insights into the origin of this phenomenon. We further validate our findings on real neural data using two state-of-the-art methods: LFADS and STNDT. In the absence of ground truth, we suggest a proxy measure to quantify extraneous dynamics. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.
- [55] arXiv:2405.14440 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Bayesian Adaptive Calibration and Optimal DesignComments: Preprint, currently under reviewSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The process of calibrating computer models of natural phenomena is essential for applications in the physical sciences, where plenty of domain knowledge can be embedded into simulations and then calibrated against real observations. Current machine learning approaches, however, mostly rely on rerunning simulations over a fixed set of designs available in the observed data, potentially neglecting informative correlations across the design space and requiring a large amount of simulations. Instead, we consider the calibration process from the perspective of Bayesian adaptive experimental design and propose a data-efficient algorithm to run maximally informative simulations within a batch-sequential process. At each round, the algorithm jointly estimates the parameters of the posterior distribution and optimal designs by maximising a variational lower bound of the expected information gain. The simulator is modelled as a sample from a Gaussian process, which allows us to correlate simulations and observed data with the unknown calibration parameters. We show the benefits of our method when compared to related approaches across synthetic and real-data problems.
- [56] arXiv:2405.14468 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Deep neural networks (DNNs) exhibit a surprising structure in their final layer known as neural collapse (NC), and a growing body of works has currently investigated the propagation of neural collapse to earlier layers of DNNs -- a phenomenon called deep neural collapse (DNC). However, existing theoretical results are restricted to special cases: linear models, only two layers or binary classification. In contrast, we focus on non-linear models of arbitrary depth in multi-class classification and reveal a surprising qualitative shift. As soon as we go beyond two layers or two classes, DNC stops being optimal for the deep unconstrained features model (DUFM) -- the standard theoretical framework for the analysis of collapse. The main culprit is a low-rank bias of multi-layer regularization schemes: this bias leads to optimal solutions of even lower rank than the neural collapse. We support our theoretical findings with experiments on both DUFM and real data, which show the emergence of the low-rank structure in the solution found by gradient descent.
- [57] arXiv:2405.14492 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial DataSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages.
- [58] arXiv:2405.14508 (cross-list from q-bio.QM) [pdf, ps, html, other]
-
Title: Prediction of cancer dynamics under treatment using Bayesian neural networks: A simulated studyEven Moa Myklebust, Arnoldo Frigessi, Fredrik Schjesvold, Jasmine Foo, Kevin Leder, Alvaro Köhn-LuqueComments: 22 pages, 10 figuresSubjects: Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Predicting cancer dynamics under treatment is challenging due to high inter-patient heterogeneity, lack of predictive biomarkers, and sparse and noisy longitudinal data. Mathematical models can summarize cancer dynamics by a few interpretable parameters per patient. Machine learning methods can then be trained to predict the model parameters from baseline covariates, but do not account for uncertainty in the parameter estimates. Instead, hierarchical Bayesian modeling can model the relationship between baseline covariates to longitudinal measurements via mechanistic parameters while accounting for uncertainty in every part of the model.
The mapping from baseline covariates to model parameters can be modeled in several ways. A linear mapping simplifies inference but fails to capture nonlinear covariate effects and scale poorly for interaction modeling when the number of covariates is large. In contrast, Bayesian neural networks can potentially discover interactions between covariates automatically, but at a substantial cost in computational complexity.
In this work, we develop a hierarchical Bayesian model of subpopulation dynamics that uses baseline covariate information to predict cancer dynamics under treatment, inspired by cancer dynamics in multiple myeloma (MM), where serum M protein is a well-known proxy of tumor burden. As a working example, we apply the model to a simulated dataset and compare its ability to predict M protein trajectories to a model with linear covariate effects. Our results show that the Bayesian neural network covariate effect model predicts cancer dynamics more accurately than a linear covariate effect model when covariate interactions are present. The framework can also be applied to other types of cancer or other time series prediction problems that can be described with a parametric model. - [59] arXiv:2405.14522 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency PropertySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model's decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.
- [60] arXiv:2405.14544 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Nuclear Norm Regularization for Deep LearningSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Penalizing the nuclear norm of a function's Jacobian encourages it to locally behave like a low-rank linear map. Such functions vary locally along only a handful of directions, making the Jacobian nuclear norm a natural regularizer for machine learning problems. However, this regularizer is intractable for high-dimensional problems, as it requires computing a large Jacobian matrix and taking its singular value decomposition. We show how to efficiently penalize the Jacobian nuclear norm using techniques tailor-made for deep learning. We prove that for functions parametrized as compositions $f = g \circ h$, one may equivalently penalize the average squared Frobenius norm of $Jg$ and $Jh$. We then propose a denoising-style approximation that avoids the Jacobian computations altogether. Our method is simple, efficient, and accurate, enabling Jacobian nuclear norm regularization to scale to high-dimensional deep learning problems. We complement our theory with an empirical study of our regularizer's performance and investigate applications to denoising and representation learning.
- [61] arXiv:2405.14547 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Causal Effect Identification in a Sub-Population with Latent VariablesComments: 19 pages, 5 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by the presence of latent variables in a sub-population, we first extend the classical relevant graphical definitions, such as c-components and Hedges, initially defined for the so-called ID problem (Pearl, 1995; Tian & Pearl, 2002), to their new counterparts. Subsequently, we propose a sound algorithm for the s-ID problem with latent variables.
- [62] arXiv:2405.14657 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Heteroscedastic Preferential Bayesian Optimization with Informative Noise DistributionsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Preferential Bayesian optimization (PBO) is a sample-efficient framework for learning human preferences between candidate designs. PBO classically relies on homoscedastic noise models to represent human aleatoric uncertainty. Yet, such noise fails to accurately capture the varying levels of human aleatoric uncertainty, particularly when the user possesses partial knowledge among different pairs of candidates. For instance, a chemist with solid expertise in glucose-related molecules may easily compare two compounds from that family while struggling to compare alcohol-related molecules. Currently, PBO overlooks this uncertainty during the search for a new candidate through the maximization of the acquisition function, consequently underestimating the risk associated with human uncertainty. To address this issue, we propose a heteroscedastic noise model to capture human aleatoric uncertainty. This model adaptively assigns noise levels based on the distance of a specific input to a predefined set of reliable inputs known as anchors provided by the human. Anchors encapsulate partial knowledge and offer insight into the comparative difficulty of evaluating different candidate pairs. Such a model can be seamlessly integrated into the acquisition function, thus leading to candidate design pairs that elegantly trade informativeness and ease of comparison for the human expert. We perform an extensive empirical evaluation of the proposed approach, demonstrating a consistent improvement over homoscedastic PBO.
- [63] arXiv:2405.14681 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Recursive PAC-Bayes: A Frequentist Approach to Sequential Prior Updates with No Information LossSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch.
We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluation the new procedure significantly outperforms state-of-the-art. - [64] arXiv:2405.14711 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Zero-inflation in the Multivariate Poisson Lognormal FamilyComments: 27 pages including appendices. 8 figures, 1 tableSubjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts solely stems from the latent dependencies. However PLN doesn't account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The Zero-Inflation can be fixed, site-specific, feature-specific or depends on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousands variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing $90.6\%$ of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
- [65] arXiv:2405.14780 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Metric Flow Matching for Smooth Interpolations on the Data ManifoldKacper Kapusniak, Peter Potaptchik, Teodora Reu, Leo Zhang, Alexander Tong, Michael Bronstein, Avishek Joey Bose, Francesco Di GiovanniSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Matching objectives underpin the success of modern generative models and rely on constructing conditional paths that transform a source distribution into a target distribution. Despite being a fundamental building block, conditional paths have been designed principally under the assumption of Euclidean geometry, resulting in straight interpolations. However, this can be particularly restrictive for tasks such as trajectory inference, where straight paths might lie outside the data manifold, thus failing to capture the underlying dynamics giving rise to the observed marginals. In this paper, we propose Metric Flow Matching (MFM), a novel simulation-free framework for conditional flow matching where interpolants are approximate geodesics learned by minimizing the kinetic energy of a data-induced Riemannian metric. This way, the generative model matches vector fields on the data manifold, which corresponds to lower uncertainty and more meaningful interpolations. We prescribe general metrics to instantiate MFM, independent of the task, and test it on a suite of challenging problems including LiDAR navigation, unpaired image translation, and modeling cellular dynamics. We observe that MFM outperforms the Euclidean baselines, particularly achieving SOTA on single-cell trajectory prediction.
- [66] arXiv:2405.14806 (cross-list from physics.data-an) [pdf, ps, html, other]
-
Title: Lorentz-Equivariant Geometric Algebra Transformers for High-Energy PhysicsComments: 10+12 pages, 5+2 figures, 2 tablesSubjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
Extracting scientific understanding from particle-physics experiments requires solving diverse learning problems with high precision and good data efficiency. We propose the Lorentz Geometric Algebra Transformer (L-GATr), a new multi-purpose architecture for high-energy physics. L-GATr represents high-energy data in a geometric algebra over four-dimensional space-time and is equivariant under Lorentz transformations, the symmetry group of relativistic kinematics. At the same time, the architecture is a Transformer, which makes it versatile and scalable to large systems. L-GATr is first demonstrated on regression and classification tasks from particle physics. We then construct the first Lorentz-equivariant generative model: a continuous normalizing flow based on an L-GATr network, trained with Riemannian flow matching. Across our experiments, L-GATr is on par with or outperforms strong domain-specific baselines.
- [67] arXiv:2405.14822 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion TeacherDongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano ErmonSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
- [68] arXiv:2405.14861 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Adapting to Unknown Low-Dimensional Structures in Score-Based Diffusion ModelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
This paper investigates score-based diffusion models when the underlying target distribution is concentrated on or near low-dimensional manifolds within the higher-dimensional space in which they formally reside, a common characteristic of natural image distributions. Despite previous efforts to understand the data generation process of diffusion models, existing theoretical support remains highly suboptimal in the presence of low-dimensional structure, which we strengthen in this paper. For the popular Denoising Diffusion Probabilistic Model (DDPM), we find that the dependency of the error incurred within each denoising step on the ambient dimension $d$ is in general unavoidable. We further identify a unique design of coefficients that yields a converges rate at the order of $O(k^{2}/\sqrt{T})$ (up to log factors), where $k$ is the intrinsic dimension of the target distribution and $T$ is the number of steps. This represents the first theoretical demonstration that the DDPM sampler can adapt to unknown low-dimensional structures in the target distribution, highlighting the critical importance of coefficient design. All of this is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner.
Cross submissions for Friday, 24 May 2024 (showing 40 of 40 entries )
- [69] arXiv:2202.00563 (replaced) [pdf, ps, html, other]
-
Title: On the Limitations of General Purpose Domain Generalisation MethodsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate the fundamental performance limitations of learning algorithms in several Domain Generalisation (DG) settings. Motivated by the difficulty with which previously proposed methods have in reliably outperforming Empirical Risk Minimisation (ERM), we derive upper bounds on the excess risk of ERM, and lower bounds on the minimax excess risk. Our findings show that in all the DG settings we consider, it is not possible to significantly outperform ERM. Our conclusions are limited not only to the standard covariate shift setting, but also two other settings with additional restrictions on how domains can differ. The first constrains all domains to have a non-trivial bound on pairwise distances, as measured by a broad class of integral probability metrics. The second alternate setting considers a restricted class of DG problems where all domains have the same underlying support. Our analysis also suggests how different strategies can be used to optimise the performance of ERM in each of these DG setting. We also experimentally explore hypotheses suggested by our theoretical analysis.
- [70] arXiv:2204.07747 (replaced) [pdf, ps, html, other]
-
Title: A Variational Approach to Bayesian Phylogenetic InferenceSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo (MCMC) with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides competitive performance to MCMC, while requiring much fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.
- [71] arXiv:2302.12565 (replaced) [pdf, ps, html, other]
-
Title: Variational Linearized Laplace Approximation for Bayesian Deep LearningComments: 22 pages, 8 figures, ICML 2024Journal-ref: PMLR 235 (2024)Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The Linearized Laplace Approximation (LLA) has been recently used to perform uncertainty estimation on the predictions of pre-trained deep neural networks (DNNs). However, its widespread application is hindered by significant computational costs, particularly in scenarios with a large number of training points or DNN parameters. Consequently, additional approximations of LLA, such as Kronecker-factored or diagonal approximate GGN matrices, are utilized, potentially compromising the model's performance. To address these challenges, we propose a new method for approximating LLA using a variational sparse Gaussian Process (GP). Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN. Furthermore, it allows for efficient stochastic optimization, which results in sub-linear training time in the size of the training dataset. Specifically, its training cost is independent of the number of training points. We compare our proposed method against accelerated LLA (ELLA), which relies on the Nyström approximation, as well as other LLA variants employing the sample-then-optimize principle. Experimental results, both on regression and classification datasets, show that our method outperforms these already existing efficient variants of LLA, both in terms of the quality of the predictive distribution and in terms of total computational time.
- [72] arXiv:2305.17028 (replaced) [pdf, ps, html, other]
-
Title: Better Batch for Deep Probabilistic Time Series ForecastingComments: 11 pages, 3 figures, 3 tables, The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024); We corrected some misleading notations in the published versionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep probabilistic time series forecasting has gained attention for its ability to provide nonlinear approximation and valuable uncertainty quantification for decision-making. However, existing models often oversimplify the problem by assuming a time-independent error process and overlooking serial correlation. To overcome this limitation, we propose an innovative training method that incorporates error autocorrelation to enhance probabilistic forecasting accuracy. Our method constructs a mini-batch as a collection of $D$ consecutive time series segments for model training. It explicitly learns a time-varying covariance matrix over each mini-batch, encoding error correlation among adjacent time steps. The learned covariance matrix can be used to improve prediction accuracy and enhance uncertainty quantification. We evaluate our method on two different neural forecasting models and multiple public datasets. Experimental results confirm the effectiveness of the proposed approach in improving the performance of both models across a range of datasets, resulting in notable improvements in predictive accuracy.
- [73] arXiv:2306.00096 (replaced) [pdf, ps, html, other]
-
Title: Learning the Pareto Front Using Bootstrapped Observation SamplesComments: 37 pages including appendixSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider Pareto front identification (PFI) for linear bandits (PFILin), i.e., the goal is to identify a set of arms with undominated mean reward vectors when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is optimal up to a logarithmic factor. In addition, the regret incurred by our algorithm during the estimation is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front. Our key contribution is a new estimator that in every round updates the estimate for the unknown parameter along multiple context directions -- in contrast to the conventional estimator that only updates the parameter estimate along the chosen context. This allows us to use low-regret arms to collect information about Pareto optimal arms. Our key innovation is to reuse the exploration samples multiple times; in contrast to conventional estimators that use each sample only once. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while controlling the regret.
- [74] arXiv:2306.06844 (replaced) [pdf, ps, html, other]
-
Title: Provably Efficient Bayesian Optimization with Unbiased Gaussian Process Hyperparameter EstimationComments: 25 pages, 5 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Gaussian process (GP) based Bayesian optimization (BO) is a powerful method for optimizing black-box functions efficiently. The practical performance and theoretical guarantees of this approach depend on having the correct GP hyperparameter values, which are usually unknown in advance and need to be estimated from the observed data. However, in practice, these estimations could be incorrect due to biased data sampling strategies used in BO. This can lead to degraded performance and break the sub-linear global convergence guarantee of BO. To address this issue, we propose a new BO method that can sub-linearly converge to the objective function's global optimum even when the true GP hyperparameters are unknown in advance and need to be estimated from the observed data. Our method uses a multi-armed bandit technique (EXP3) to add random data points to the BO process, and employs a novel training loss function for the GP hyperparameter estimation process that ensures consistent estimation. We further provide theoretical analysis of our proposed method. Finally, we demonstrate empirically that our method outperforms existing approaches on various synthetic and real-world problems.
- [75] arXiv:2306.11895 (replaced) [pdf, ps, html, other]
-
Title: Learning Elastic Costs to Shape Monge DisplacementsMichal Klein, Aram-Alexandre Pooladian, Pierre Ablin, Eugène Ndiaye, Jonathan Niles-Weed, Marco CuturiSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Given a source and a target probability measure supported on $\mathbb{R}^d$, the Monge problem asks to find the most efficient way to map one distribution to the other. This efficiency is quantified by defining a \textit{cost} function between source and target data. Such a cost is often set by default in the machine learning literature to the squared-Euclidean distance, $\ell^2_2(\mathbf{x},\mathbf{y})=\tfrac12|\mathbf{x}-\mathbf{y}|_2^2$. Recently, Cuturi et. al '23 highlighted the benefits of using elastic costs, defined through a regularizer $\tau$ as $c(\mathbf{x},\mathbf{y})=\ell^2_2(\mathbf{x},\mathbf{y})+\tau(\mathbf{x}-\mathbf{y})$. Such costs shape the \textit{displacements} of Monge maps $T$, i.e., the difference between a source point and its image $T(\mathbf{x})-\mathbf{x})$, by giving them a structure that matches that of the proximal operator of $\tau$. In this work, we make two important contributions to the study of elastic costs: (i) For any elastic cost, we propose a numerical method to compute Monge maps that are provably optimal. This provides a much-needed routine to create synthetic problems where the ground truth OT map is known, by analogy to the Brenier theorem, which states that the gradient of any convex potential is always a valid Monge map for the $\ell_2^2$ cost; (ii) We propose a loss to \textit{learn} the parameter $\theta$ of a parameterized regularizer $\tau_\theta$, and apply it in the case where $\tau_{A}(\mathbf{z})=|A^\perp \mathbf{z}|^2_2$. This regularizer promotes displacements that lie on a low dimensional subspace of $\mathbb{R}^d$, spanned by the $p$ rows of $A\in\mathbb{R}^{p\times d}$.
- [76] arXiv:2310.18449 (replaced) [pdf, ps, html, other]
-
Title: Conditional Generative Representation for Black-Box Optimization with Implicit ConstraintsSubjects: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Black-box optimization (BBO) has become increasingly relevant for tackling complex decision-making problems, especially in public policy domains such as police districting. However, its broader application in public policymaking is hindered by the complexity of defining feasible regions and the high-dimensionality of decisions. This paper introduces a novel BBO framework, termed as the Conditional And Generative Black-box Optimization (CageBO). This approach leverages a conditional variational autoencoder to learn the distribution of feasible decisions, enabling a two-way mapping between the original decision space and a simplified, constraint-free latent space. The CageBO efficiently handles the implicit constraints often found in public policy applications, allowing for optimization in the latent space while evaluating objectives in the original space. We validate our method through a case study on large-scale police districting problems in Atlanta, Georgia. Our results reveal that our CageBO offers notable improvements in performance and efficiency compared to the baselines.
- [77] arXiv:2402.00168 (replaced) [pdf, ps, html, other]
-
Title: Continuous Treatment Effects with Surrogate OutcomesZhenghao Zeng, David Arbour, Avi Feller, Raghavendra Addanki, Ryan Rossi, Ritwik Sinha, Edward H. KennedyComments: 30 pages, 7 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.
- [78] arXiv:2403.03850 (replaced) [pdf, ps, html, other]
-
Title: Conformal prediction for multi-dimensional time series by ellipsoidal setsComments: Accepted by the Forty-first International Conference on Machine Learning (ICML 2024)Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Conformal prediction (CP) has been a popular method for uncertainty quantification because it is distribution-free, model-agnostic, and theoretically sound. For forecasting problems in supervised learning, most CP methods focus on building prediction intervals for univariate responses. In this work, we develop a sequential CP method called $\texttt{MultiDimSPCI}$ that builds prediction $\textit{regions}$ for a multivariate response, especially in the context of multivariate time series, which are not exchangeable. Theoretically, we estimate $\textit{finite-sample}$ high-probability bounds on the conditional coverage gap. Empirically, we demonstrate that $\texttt{MultiDimSPCI}$ maintains valid coverage on a wide range of multivariate time series while producing smaller prediction regions than CP and non-CP baselines.
- [79] arXiv:2405.07552 (replaced) [pdf, ps, html, other]
-
Title: Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support RecoveryComments: Forty-first International Conference on Machine Learning (ICML 2024), 27 pages, 4 figures, 14 tablesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative tool to the least squares regression for robustness against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses big challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into the least-squares optimization. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without the restrictive independent assumption between the error term and covariates. An efficient algorithm is developed, which enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method.
- [80] arXiv:2405.09493 (replaced) [pdf, ps, html, other]
-
Title: C-Learner: Constrained Learning for Causal Inference and Semiparametric StatisticsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained learning framework provides a unifying perspective to prominent first-order correction approaches including one-step estimation (a.k.a. augmented inverse probability weighting) and targeting (a.k.a. targeted maximum likelihood estimation). Our semiparametric inference approach, which we call the "C-Learner", can be implemented with modern machine learning methods such as neural networks and tree ensembles, and enjoys standard guarantees like semiparametric efficiency and double robustness. Empirically, we demonstrate our approach on several datasets, including those with text features that require fine-tuning language models. We observe the C-Learner matches or outperforms other asymptotically optimal estimators, with better performance in settings with less estimated overlap.
- [81] arXiv:2405.09584 (replaced) [pdf, ps, html, other]
-
Title: Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical SystemSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Decision-making under uncertainty is a fundamental problem encountered frequently and can be formulated as a stochastic multi-armed bandit problem. In the problem, the learner interacts with an environment by choosing an action at each round, where a round is an instance of an interaction. In response, the environment reveals a reward, which is sampled from a stochastic process, to the learner. The goal of the learner is to maximize cumulative reward. In this work, we assume that the rewards are the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict the reward for each action, we propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward. We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward, i.e. the reward sampled for action 1 at round $t-1$ can be used for predicting the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter with a matrix representation that can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems and are compared with 2 other well-known stochastic multi-armed bandit algorithms.
- [82] arXiv:2405.09831 (replaced) [pdf, ps, html, other]
-
Title: Nearly Minimax Optimal Regret for Multinomial Logistic BanditComments: Preprint. Under reviewSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
- [83] arXiv:2405.10301 (replaced) [pdf, ps, html, other]
-
Title: Conformal Alignment: Knowing When to Trust Foundation Models with GuaranteesSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
- [84] arXiv:2006.13456 (replaced) [pdf, ps, other]
-
Title: Likelihood-Free Gaussian Process for RegressionComments: There were errors in the proposed methodSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian process regression can flexibly represent the posterior distribution of an interest parameter given sufficient information on the likelihood. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the value of the interest parameter can be considered approximately identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, particularly by reducing the assumptions for the probability model and the computational costs for scalable problems.
- [85] arXiv:2205.15049 (replaced) [pdf, ps, html, other]
-
Title: Metrizing FairnessSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
- [86] arXiv:2209.14440 (replaced) [pdf, ps, html, other]
-
Title: GeONet: a neural operator for learning the Wasserstein geodesicSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-specific domain discretization and suffer from the curse-of-dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces that are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves comparable testing accuracy to the standard OT solvers on simulation examples and the MNIST dataset with considerably reduced inference-stage computational cost by orders of magnitude.
- [87] arXiv:2211.07482 (replaced) [pdf, ps, html, other]
-
Title: Unifying O(3) Equivariant Neural Networks Design with Tensor-Network FormalismComments: 10 pages + 12-page supplementary materials, many figuresJournal-ref: Mach. Learn.: Sci. Technol. 5 025044, 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Many learning tasks, including learning potential energy surfaces from ab initio calculations, involve global spatial symmetries and permutational symmetry between atoms or general particles. Equivariant graph neural networks are a standard approach to such problems, with one of the most successful methods employing tensor products between various tensors that transform under the spatial group. However, as the number of different tensors and the complexity of relationships between them increase, maintaining parsimony and equivariance becomes increasingly challenging. In this paper, we propose using fusion diagrams, a technique widely employed in simulating SU($2$)-symmetric quantum many-body problems, to design new equivariant components for equivariant neural networks. This results in a diagrammatic approach to constructing novel neural network architectures. When applied to particles within a given local neighborhood, the resulting components, which we term "fusion blocks," serve as universal approximators of any continuous equivariant function defined in the neighborhood. We incorporate a fusion block into pre-existing equivariant architectures (Cormorant and MACE), leading to improved performance with fewer parameters on a range of challenging chemical problems. Furthermore, we apply group-equivariant neural networks to study non-adiabatic molecular dynamics of stilbene cis-trans isomerization. Our approach, which combines tensor networks with equivariant neural networks, suggests a potentially fruitful direction for designing more expressive equivariant neural networks.
- [88] arXiv:2302.07185 (replaced) [pdf, ps, html, other]
-
Title: When mitigating bias is unfair: multiplicity and arbitrariness in algorithmic group fairnessSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Most research on fair machine learning has prioritized optimizing criteria such as Demographic Parity and Equalized Odds. Despite these efforts, there remains a limited understanding of how different bias mitigation strategies affect individual predictions and whether they introduce arbitrariness into the debiasing process. This paper addresses these gaps by exploring whether models that achieve comparable fairness and accuracy metrics impact the same individuals and mitigate bias in a consistent manner. We introduce the FRAME (FaiRness Arbitrariness and Multiplicity Evaluation) framework, which evaluates bias mitigation through five dimensions: Impact Size (how many people were affected), Change Direction (positive versus negative changes), Decision Rates (impact on models' acceptance rates), Affected Subpopulations (who was affected), and Neglected Subpopulations (where unfairness persists). This framework is intended to help practitioners understand the impacts of debiasing processes and make better-informed decisions regarding model selection. Applying FRAME to various bias mitigation approaches across key datasets allows us to exhibit significant differences in the behaviors of debiasing methods. These findings highlight the limitations of current fairness criteria and the inherent arbitrariness in the debiasing process.
- [89] arXiv:2304.07278 (replaced) [pdf, ps, html, other]
-
Title: Minimax-Optimal Reward-Agnostic Exploration in Reinforcement LearningComments: accepted for presentation in COLT 2024Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Systems and Control (eess.SY); Statistics Theory (math.ST); Machine Learning (stat.ML)
This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*}
\frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms. - [90] arXiv:2304.14606 (replaced) [pdf, ps, html, other]
-
Title: Algorithmic Recourse with Missing ValuesComments: 30 pages, 15 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper proposes a new framework of algorithmic recourse (AR) that works even in the presence of missing values. AR aims to provide a recourse action for altering the undesired prediction result given by a classifier. Existing AR methods assume that we can access complete information on the features of an input instance. However, we often encounter missing values in a given instance (e.g., due to privacy concerns), and previous studies have not discussed such a practical situation. In this paper, we first empirically and theoretically show the risk that a naive approach with a single imputation technique fails to obtain good actions regarding their validity, cost, and features to be changed. To alleviate this risk, we formulate the task of obtaining a valid and low-cost action for a given incomplete instance by incorporating the idea of multiple imputation. Then, we provide some theoretical analyses of our task and propose a practical solution based on mixed-integer linear optimization. Experimental results demonstrated the efficacy of our method in the presence of missing values compared to the baselines.
- [91] arXiv:2305.15912 (replaced) [pdf, ps, html, other]
-
Title: ReLU Characteristic Activation AnalysisComments: code available at: this https URLSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce a novel approach for analyzing the training dynamics of ReLU networks by examining the characteristic activation boundaries of individual ReLU neurons. Our proposed analysis reveals a critical instability in common neural network parameterizations and normalizations during stochastic optimization, which impedes fast convergence and hurts generalization performance. Addressing this, we propose Geometric Parameterization (GmP), a novel neural network parameterization technique that effectively separates the radial and angular components of weights in the hyperspherical coordinate system. We show theoretically that GmP resolves the aforementioned instability issue. We report empirical results on various models and benchmarks to verify GmP's theoretical advantages of optimization stability, convergence speed and generalization performance.
- [92] arXiv:2306.11697 (replaced) [pdf, ps, html, other]
-
Title: Treatment Effects in Extreme RegimesSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Understanding treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the unavailability of counterfactual outcomes and the rarity and difficulty of collecting extreme data in practice. To address this issue, we propose a new framework based on extreme value theory for estimating treatment effects in extreme regimes. We quantify these effects using variations in tail decay rates of potential outcomes in the presence and absence of treatments. We establish algorithms for calculating these quantities and develop related theoretical results. We demonstrate the efficacy of our approach on various standard synthetic and semi-synthetic datasets.
- [93] arXiv:2306.13580 (replaced) [pdf, ps, html, other]
-
Title: Lower Complexity Adaptation for Empirical Entropic Optimal TransportComments: 51 pages, 4 figures, proof of LCA for sub-Gaussian measures (Theorem 3.13) correctedSubjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Entropic optimal transport (EOT) presents an effective and computationally viable alternative to unregularized optimal transport (OT), offering diverse applications for large-scale data analysis. In this work, we derive novel statistical bounds for empirical plug-in estimators of the EOT cost and show that their statistical performance in the entropy regularization parameter $\epsilon$ and the sample size $n$ only depends on the simpler of the two probability measures. For instance, under sufficiently smooth costs this yields the parametric rate $n^{-1/2}$ with factor $\epsilon^{-d/2}$, where $d$ is the minimum dimension of the two population measures. This confirms that empirical EOT also adheres to the lower complexity adaptation principle, a hallmark feature only recently identified for unregularized OT. As a consequence of our theory, we show that the empirical entropic Gromov-Wasserstein distance and its unregularized version for measures on Euclidean spaces also obey this principle. Additionally, we comment on computational aspects and complement our findings with Monte Carlo simulations. Our techniques employ empirical process theory and rely on a dual formulation of EOT over a single function class. Crucial to our analysis is the observation that the entropic cost-transformation of a function class does not increase its uniform metric entropy by much.
- [94] arXiv:2306.16564 (replaced) [pdf, ps, html, other]
-
Title: Pareto Optimal Learning for Estimating Large Language Model ErrorsSubjects: Computation and Language (cs.CL); Machine Learning (stat.ML)
Large Language Models (LLMs) have shown impressive abilities in many applications. When a concrete and precise answer is desired, it is important to have a quantitative estimation of the potential error rate. However, this can be challenging due to the text-in-text-out nature of generative models. We present a method based on Pareto optimization that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information. We prove theoretically that the error estimator optimized in our framework aligns with the LLM and the information sources in an Pareto optimal manner. Experimental results show that the risk scores estimated by our method are well correlated with the true LLM error rate, thus facilitating error correction. By dynamically combining with prompting strategies such as self-verification and information retrieval, we demonstrate the proposed method can be utilized to increase the performance of an LLM, surpassing state-of-the-art task specific models.
- [95] arXiv:2307.01198 (replaced) [pdf, ps, html, other]
-
Title: Improved sampling via learned diffusionsComments: Accepted at ICLR 2024Journal-ref: International Conference on Learning Representations, 2024Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Recently, a series of papers proposed deep learning-based approaches to sample from target distributions using controlled diffusion processes, being trained only on the unnormalized target densities without access to samples. Building on previous work, we identify these approaches as special cases of a generalized Schrödinger bridge problem, seeking a stochastic evolution between a given prior distribution and the specified target. We further generalize this framework by introducing a variational formulation based on divergences between path space measures of time-reversed diffusion processes. This abstract perspective leads to practical losses that can be optimized by gradient-based algorithms and includes previous objectives as special cases. At the same time, it allows us to consider divergences other than the reverse Kullback-Leibler divergence that is known to suffer from mode collapse. In particular, we propose the so-called log-variance loss, which exhibits favorable numerical properties and leads to significantly improved performance across all considered approaches.
- [96] arXiv:2307.04191 (replaced) [pdf, ps, html, other]
-
Title: On the sample complexity of parameter estimation in logistic regression with normal designSubjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
The logistic regression model is one of the most popular data generation model in noisy binary classification problems. In this work, we study the sample complexity of estimating the parameters of the logistic regression model up to a given $\ell_2$ error, in terms of the dimension and the inverse temperature, with standard normal covariates. The inverse temperature controls the signal-to-noise ratio of the data generation process. While both generalization bounds and asymptotic performance of the maximum-likelihood estimator for logistic regression are well-studied, the non-asymptotic sample complexity that shows the dependence on error and the inverse temperature for parameter estimation is absent from previous analyses. We show that the sample complexity curve has two change-points in terms of the inverse temperature, clearly separating the low, moderate, and high temperature regimes.
- [97] arXiv:2310.11011 (replaced) [pdf, ps, html, other]
-
Title: From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative ModelingComments: Published in Transactions on Machine Learning Research (TMLR) (05/2024); 72 pages, 27 figures, 4 tablesJournal-ref: Transactions on Machine Learning Research, 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Deep generative models have shown tremendous capability in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, tendency to induce spurious correlations, and poor out-of-distribution extrapolation. To remedy such challenges, recent work has proposed a shift toward causal generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interpretability. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, methodology, drawbacks, datasets, and metrics. Then, we cover applications of causal generative models in fairness, privacy, out-of-distribution generalization, precision medicine, and biological sciences. Lastly, we discuss open problems and fruitful research directions for future work in the field.
- [98] arXiv:2310.17571 (replaced) [pdf, ps, html, other]
-
Title: Inside the black box: Neural network-based real-time prediction of US recessionsSubjects: Econometrics (econ.EM); Machine Learning (stat.ML)
Long short-term memory (LSTM) and gated recurrent unit (GRU) are used to model US recessions from 1967 to 2021. Their predictive performances are compared to those of the traditional linear models. The out-of-sample performance suggests the application of LSTM and GRU in recession forecasting, especially for longer-term forecasts. The Shapley additive explanations (SHAP) method is applied to both groups of models. The SHAP-based different weight assignments imply the capability of these types of neural networks to capture the business cycle asymmetries and nonlinearities. The SHAP method delivers key recession indicators, such as the S&P 500 index for short-term forecasting up to 3 months and the term spread for longer-term forecasting up to 12 months. These findings are robust against other interpretation methods, such as the local interpretable model-agnostic explanations (LIME) and the marginal effects.
- [99] arXiv:2311.05025 (replaced) [pdf, ps, html, other]
-
Title: Unbiased Kinetic Langevin Monte Carlo with Inexact GradientsComments: 99 Pages, 13 FiguresSubjects: Computation (stat.CO); Numerical Analysis (math.NA); Methodology (stat.ME); Machine Learning (stat.ML)
We present an unbiased method for Bayesian posterior means based on kinetic Langevin dynamics that combines advanced splitting methods with enhanced gradient approximations. Our approach avoids Metropolis correction by coupling Markov chains at different discretization levels in a multilevel Monte Carlo approach. Theoretical analysis demonstrates that our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. It can achieve accuracy $\epsilon>0$ for estimating expectations of Lipschitz functions in $d$ dimensions with $\mathcal{O}(d^{1/4}\epsilon^{-2})$ expected gradient evaluations, without assuming warm start. We exhibit similar bounds using both approximate and stochastic gradients, and our method's computational cost is shown to scale independently of the size of the dataset. The proposed method is tested using a multinomial regression problem on the MNIST dataset and a Poisson regression model for soccer scores. Experiments indicate that the number of gradient evaluations per effective sample is independent of dimension, even when using inexact gradients. For product distributions, we give dimension-independent variance bounds. Our results demonstrate that the unbiased algorithm we present can be much more efficient than the ``gold-standard" randomized Hamiltonian Monte Carlo.
- [100] arXiv:2311.11153 (replaced) [pdf, ps, html, other]
-
Title: Biarchetype analysis: simultaneous learning of observations and features based on extremesSubjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. The application of biarchetype analysis across various machine learning challenges underscores its value, and both the source code and examples are readily accessible in R and Python at this https URL.
- [101] arXiv:2311.18672 (replaced) [pdf, ps, html, other]
-
Title: A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural NetworksRoy T. Forestano, Marçal Comajoan Cara, Gopal Ramesh Dahale, Zhongtian Dong, Sergei Gleyzer, Daniel Justice, Kyoungchul Kong, Tom Magorsch, Konstantin T. Matchev, Katia Matcheva, Eyup B. UnluComments: 15 pages, 7 figures, 3 appendicesJournal-ref: Axioms 13 (2024) 160Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNN). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.
- [102] arXiv:2312.05134 (replaced) [pdf, ps, html, other]
-
Title: Optimal Multi-Distribution LearningSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension d, we propose a novel algorithm that yields an varepsilon-optimal randomized hypothesis with a sample complexity on the order of (d+k)/varepsilon^2 (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, which access the hypothesis class solely through an empirical risk minimization oracle.
Additionally, we establish the necessity of randomization, revealing a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., citet[Problems 1, 3 and 4]{awasthi2023sample}). - [103] arXiv:2402.00847 (replaced) [pdf, ps, html, other]
-
Title: BootsTAP: Bootstrapped Training for Tracking-Any-PointCarl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew ZissermanSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale groundtruth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a selfsupervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at this https URL
- [104] arXiv:2402.01454 (replaced) [pdf, ps, html, other]
-
Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal ApproachMasayuki Takayama, Tadahisa Okuda, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi SannaiSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is significant for creating consistent meaningful causal models, despite the challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through ``statistical causal prompting (SCP)'' for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved, if GPT-4 undergoes SCP. Furthermore, by using an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve SCD on this dataset, even if this dataset has never been included in the training data of the LLM. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.
- [105] arXiv:2402.01929 (replaced) [pdf, ps, html, other]
-
Title: Sample, estimate, aggregate: A recipe for causal discovery foundation modelsComments: Preprint. Under reviewSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal discovery, the task of inferring causal structure from data, promises to accelerate scientific research, inform policy making, and more. However, causal discovery algorithms over larger sets of variables tend to be brittle against misspecification or when data are limited. To mitigate these challenges, we train a supervised model that learns to predict a larger causal graph from the outputs of classical causal discovery algorithms run over subsets of variables, along with other statistical hints like inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of classical methods remain comparable across datasets. Theoretically, we show that this model is well-specified, in the sense that it can recover a causal graph consistent with graphs over subsets. Empirically, we train the model to be robust to erroneous estimates using diverse synthetic data. Experiments on real and synthetic data demonstrate that this model maintains high accuracy in the face of misspecification or distribution shift, and can be adapted at low cost to different discovery algorithms or choice of statistics.
- [106] arXiv:2402.02239 (replaced) [pdf, ps, html, other]
-
Title: Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-WassersteinComments: 38 pages, 15 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves using dimensionality reduction (DR) methods to project data onto lower-dimensional spaces or organizing points into meaningful clusters (clustering). In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem. This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem. We empirically demonstrate its relevance to the identification of low-dimensional prototypes representing data at different scales, across multiple image and genomic datasets.
- [107] arXiv:2402.02277 (replaced) [pdf, ps, html, other]
-
Title: Causal Bayesian Optimization via Exogenous Distribution LearningSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Maximizing a target variable as an operational objective in a structural causal model is an important problem. Existing Causal Bayesian Optimization~(CBO) methods either rely on hard interventions that alter the causal structure to maximize the reward; or introduce action nodes to endogenous variables so that the data generation mechanisms are adjusted to achieve the objective. In this paper, a novel method is introduced to learn the distribution of exogenous variables, which is typically ignored or marginalized through expectation by existing methods. Exogenous distribution learning improves the approximation accuracy of structural causal models in a surrogate model that is usually trained with limited observational data. Moreover, the learned exogenous distribution extends existing CBO to general causal schemes beyond Additive Noise Models~(ANM). The recovery of exogenous variables allows us to use a more flexible prior for noise or unobserved hidden variables. We develop a new CBO method by leveraging the learned exogenous distribution. Experiments on different datasets and applications show the benefits of our proposed method.
- [108] arXiv:2402.02851 (replaced) [pdf, ps, html, other]
-
Title: Enhancing Compositional Generalization via Compositional Feature AlignmentComments: Published in Transactions on Machine Learning Research (TMLR). The code is released at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
- [109] arXiv:2402.03587 (replaced) [pdf, ps, html, other]
-
Title: Information-Theoretic Active Correlation ClusteringSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study correlation clustering where the pairwise similarities are not known in advance. For this purpose, we employ active learning to query pairwise similarities in a cost-efficient way. We propose a number of effective information-theoretic acquisition functions based on entropy and information gain. We extensively investigate the performance of our methods in different settings and demonstrate their superior performance compared to the alternatives.
- [110] arXiv:2402.03687 (replaced) [pdf, ps, html, other]
-
Title: Pard: Permutation-Invariant Autoregressive Diffusion for Graph GenerationComments: Diffusion Model on GraphsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph generation has been dominated by autoregressive models due to their simplicity and effectiveness, despite their sensitivity to ordering. Yet diffusion models have garnered increasing attention, as they offer comparable performance while being permutation-invariant. Current graph diffusion models generate graphs in a one-shot fashion, but they require extra features and thousands of denoising steps to achieve optimal performance. We introduce PARD, a Permutation-invariant Auto Regressive Diffusion model that integrates diffusion models with autoregressive methods. PARD harnesses the effectiveness and efficiency of the autoregressive model while maintaining permutation invariance without ordering sensitivity. Specifically, we show that contrary to sets, elements in a graph are not entirely unordered and there is a unique partial order for nodes and edges. With this partial order, PARD generates a graph in a block-by-block, autoregressive fashion, where each block's probability is conditionally modeled by a shared diffusion model with an equivariant network. To ensure efficiency while being expressive, we further propose a higher-order graph transformer, which integrates transformer with PPGN. Like GPT, we extend the higher-order graph transformer to support parallel training of all blocks. Without any extra features, PARD achieves state-of-the-art performance on molecular and non-molecular datasets, and scales to large datasets like MOSES containing 1.9M molecules. Pard is open-sourced at this https URL.
- [111] arXiv:2402.03985 (replaced) [pdf, ps, html, other]
-
Title: A Bias-Variance Decomposition for Ensembles over Multiple Synthetic DatasetsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory predicts multiple synthetic datasets to be especially beneficial for high-variance downstream predictors, and yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice by evaluating the performance of an ensemble over many synthetic datasets for several real datasets and downstream predictors. The results follow our theory, showing that our insights are practically relevant.
- [112] arXiv:2402.04906 (replaced) [pdf, ps, html, other]
-
Title: Conformal Convolution and Monte Carlo Meta-learners for Predictive Inference of Individual Treatment EffectsComments: 25 pages, 14 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Knowledge of the effect of interventions, known as the treatment effect, is paramount for decision-making. Approaches to estimating this treatment effect using conditional average treatment effect (CATE) meta-learners often provide only a point estimate of this treatment effect, while additional uncertainty quantification is frequently desired to enhance decision-making confidence. To address this, we introduce two novel approaches: the conformal convolution T-learner (CCT-learner) and conformal Monte Carlo (CMC) meta-learners. The approaches leverage weighted conformal predictive systems (WCPS), Monte Carlo sampling, and CATE meta-learners to generate predictive distributions of individual treatment effect (ITE) that could enhance individualized decision-making. Although we show how assumptions about the noise distribution of the outcome influence the uncertainty predictions, our experiments demonstrate that the CCT- and CMC meta-learners achieve strong coverage while maintaining narrow interval widths. They also generate probabilistically calibrated predictive distributions, providing reliable ranges of ITEs across various synthetic and semi-synthetic datasets.
Code: this https URL - [113] arXiv:2402.05569 (replaced) [pdf, ps, html, other]
-
Title: Simplifying Hypergraph Neural NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
Hypergraphs are crucial for modeling higher-order interactions in real-world data. Hypergraph neural networks (HNNs) effectively utilise these structures by message passing to generate informative node features for various downstream tasks like node classification. However, the message passing block in existing HNNs typically requires a computationally intensive training process, which limits their practical use. To tackle this challenge, we propose an alternative approach by decoupling the usage of the hypergraph structural information from the model training stage. The proposed model, simplified hypergraph neural network (SHNN), contains a training-free message-passing block that can be precomputed before the training of SHNN, thereby reducing the computational burden. We theoretically support the efficiency and effectiveness of SHNN by showing that: 1) It is more training-efficient compared to existing HNNs; 2) It utilises as much information as existing HNNs for node feature generation; and 3) It is robust against the oversmoothing issue while using long-range interactions. Experiments based on six real-world hypergraph benchmarks in node classification and hyperlink prediction present that, compared to state-of-the-art HNNs, SHNN shows both competitive performance and superior training efficiency. Specifically, on Cora-CA, SHNN achieves the highest node classification accuracy with just 2% training time of the best baseline.
- [114] arXiv:2402.08193 (replaced) [pdf, ps, html, other]
-
Title: Gaussian Ensemble Belief Propagation for Efficient Inference in High-Dimensional SystemsComments: Under conference submissionSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Efficient inference in high-dimensional models remains a central challenge in machine learning. This paper introduces the Gaussian Ensemble Belief Propagation (GEnBP) algorithm, a fusion of the Ensemble Kalman filter and Gaussian Belief Propagation (GaBP) methods. GEnBP updates ensembles by passing low-rank local messages over a graphical model. This combination inherits favourable qualities from each method. Ensemble techniques allow GEnBP to handle high-dimensional states, parameters and intricate, noisy, black-box generation processes. The use of local messages in a graphical model structure ensures that the approach can efficiently handle complex dependence structures. GEnBP is advantageous when the ensemble size may be considerably smaller than the inference dimension. This scenario often arises in fields such as spatiotemporal modelling, image processing and physical model inversion. GEnBP can be applied to general problem structures, including data assimilation, system identification and hierarchical models.
Supporting code is available at this https URL - [115] arXiv:2402.10445 (replaced) [pdf, ps, html, other]
-
Title: Collaborative Learning with Different Labeling FunctionsComments: To appear at ICML 2024; v2 and v3 included additional discussion on related workSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in [Crammer and Mansour, 2012] in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class.
In terms of the computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample- and computationally-efficient. - [116] arXiv:2402.12875 (replaced) [pdf, ps, html, other]
-
Title: Chain of Thought Empowers Transformers to Solve Inherently Serial ProblemsComments: 38 pages, 10 figures. Accepted by ICLR 2024Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
- [117] arXiv:2402.13380 (replaced) [pdf, ps, html, other]
-
Title: Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with TransformersSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)
In this study, we introduce an innovative deep learning framework that employs a transformer model to address the challenges of mixed-integer programs, specifically focusing on the Capacitated Lot Sizing Problem (CLSP). Our approach, to our knowledge, is the first to utilize transformers to predict the binary variables of a mixed-integer programming (MIP) problem. Specifically, our approach harnesses the encoder decoder transformer's ability to process sequential data, making it well-suited for predicting binary variables indicating production setup decisions in each period of the CLSP. This problem is inherently dynamic, and we need to handle sequential decision making under constraints. We present an efficient algorithm in which CLSP solutions are learned through a transformer neural network. The proposed post-processed transformer algorithm surpasses the state-of-the-art solver, CPLEX and Long Short-Term Memory (LSTM) in solution time, optimal gap, and percent infeasibility over 240K benchmark CLSP instances tested. After the ML model is trained, conducting inference on the model, reduces the MIP into a linear program (LP). This transforms the ML-based algorithm, combined with an LP solver, into a polynomial-time approximation algorithm to solve a well-known NP-Hard problem, with almost perfect solution quality.
- [118] arXiv:2402.17826 (replaced) [pdf, ps, html, other]
-
Title: Prediction-Powered Ranking of Large Language ModelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences asymptotically. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectivity of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
- [119] arXiv:2403.02957 (replaced) [pdf, ps, html, other]
-
Title: On the Asymptotic Mean Square Error Optimality of Diffusion ModelsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models (DMs) as generative priors have recently shown great potential for denoising tasks but lack theoretical understanding with respect to their mean square error (MSE) optimality. This paper proposes a novel denoising strategy inspired by the structure of the MSE-optimal conditional mean estimator (CME). The resulting DM-based denoiser can be conveniently employed using a pre-trained DM, being particularly fast by truncating reverse diffusion steps and not requiring stochastic re-sampling. We present a comprehensive (non-)asymptotic optimality analysis of the proposed diffusion-based denoiser, demonstrating polynomial-time convergence to the CME under mild conditions. Our analysis also derives a novel Lipschitz constant that depends solely on the DM's hyperparameters. Further, we offer a new perspective on DMs, showing that they inherently combine an asymptotically optimal denoiser with a powerful generator, modifiable by switching re-sampling in the reverse process on or off. The theoretical findings are thoroughly validated with experiments based on various benchmark datasets.
- [120] arXiv:2405.00675 (replaced) [pdf, ps, html, other]
-
Title: Self-Play Probabilistic Preference Optimization for Language Model AlignmentComments: 26 pages, 4 figures, 5 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed \textit{Self-play Probabilistic Preference Optimization} (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise loss such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53\% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
- [121] arXiv:2405.06627 (replaced) [pdf, ps, html, other]
-
Title: Conformal Validity Guarantees Exist for Any Data DistributionComments: ICML 2024. Code available at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
As machine learning (ML) gains widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when ML systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction has emerged as a promising approach to uncertainty and risk quantification, but prior variants' validity guarantees have assumed some form of ``quasi-exchangeability'' on the data distribution, thereby excluding many types of sequential shifts. In this paper we prove that conformal prediction can theoretically be extended to \textit{any} joint data distribution, not just exchangeable or quasi-exchangeable ones, although it is exceedingly impractical to compute in the most general case. For practical applications, we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of ML-agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
- [122] arXiv:2405.09784 (replaced) [pdf, ps, html, other]
-
Title: Online bipartite matching with imperfect adviceComments: Accepted into ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study the problem of online unweighted bipartite matching with $n$ offline vertices and $n$ online vertices where one wishes to be competitive against the optimal offline algorithm. While the classic RANKING algorithm of Karp et al. [1990] provably attains competitive ratio of $1-1/e > 1/2$, we show that no learning-augmented method can be both 1-consistent and strictly better than $1/2$-robust under the adversarial arrival model. Meanwhile, under the random arrival model, we show how one can utilize methods from distribution testing to design an algorithm that takes in external advice about the online vertices and provably achieves competitive ratio interpolating between any ratio attainable by advice-free methods and the optimal ratio of 1, depending on the advice quality.
- [123] arXiv:2405.10093 (replaced) [pdf, ps, html, other]
-
Title: LaT-PFN: A Joint Embedding Predictive Architecture for In-context Time-series ForecastingComments: 9 pages plus references and appendix, 2 tables, 11 figures, added seeds, correctionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We introduce LatentTimePFN (LaT-PFN), a foundational Time Series model with a strong embedding space that enables zero-shot forecasting. To achieve this, we perform in-context learning in latent space utilizing a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks. We leverage the JEPA framework to create a prediction-optimized latent representation of the underlying stochastic process that generates time series and combines it with contextual learning, using a PFN. Furthermore, we improve on preceding works by utilizing related time series as a context and introducing a normalized abstract time axis. This reduces training time and increases the versatility of the model by allowing any time granularity and forecast horizon. We show that this results in superior zero-shot predictions compared to established baselines. We also demonstrate our latent space produces informative embeddings of both individual time steps and fixed-length summaries of entire series. Finally, we observe the emergence of multi-step patch embeddings without explicit training, suggesting the model actively learns discrete tokens that encode local structures in the data, analogous to vision transformers.
- [124] arXiv:2405.10289 (replaced) [pdf, ps, html, other]
-
Title: Subgradient Convergence Implies Subdifferential Convergence on Weakly Convex Functions: With Uniform Rates GuaranteesComments: This revision adds Lemma 1 and corrects several typosSubjects: Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
In nonsmooth, nonconvex stochastic optimization, understanding the uniform convergence of subdifferential mappings is crucial for analyzing stationary points of sample average approximations of risk as they approach the population risk. Yet, characterizing this convergence remains a fundamental challenge.
This work introduces a novel perspective by connecting the uniform convergence of subdifferential mappings to that of subgradient mappings as empirical risk converges to the population risk. We prove that, for stochastic weakly-convex objectives, and within any open set, a uniform bound on the convergence of subgradients -- chosen arbitrarily from the corresponding subdifferential sets -- translates to a uniform bound on the convergence of the subdifferential sets itself, measured by the Hausdorff metric.
Using this technique, we derive uniform convergence rates for subdifferential sets of stochastic convex-composite objectives. Our results do not rely on key distributional assumptions in the literature, which require the population and finite sample subdifferentials to be continuous in the Hausdorff metric, yet still provide tight convergence rates. These guarantees lead to new insights into the nonsmooth landscapes of such objectives within finite samples.