Machine Learning
- [1] arXiv:2406.15500 [pdf, html, other]
Title: Hidden Variables unseen by Random Forests
Comments: arXiv admin note: substantial text overlap with arXiv:2309.01460
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Random Forests are widely claimed to capture interactions well. However, some simple examples suggest that they perform poorly in the presence of certain pure interactions that the conventional CART criterion struggles to capture during tree construction. We argue that simple alternative partitioning schemes used in the tree growing procedure can enhance identification of these interactions. In a simulation study we compare these variants to conventional Random Forests and Extremely Randomized trees. Our results validate that the modifications considered enhance the model's fitting ability in scenarios where pure interactions play a crucial role.
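To make the notion of a pure interaction concrete, the following minimal sketch uses an XOR response on two balanced binary features (an illustrative assumption, not the paper's simulation design). It computes the variance reduction that a CART-style split would see at the root and then within one child node. The helper names `variance` and `split_gain` are hypothetical.

```python
from itertools import product


def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(rows, targets, feature):
    """Variance reduction from splitting a binary feature into 0 vs 1."""
    left = [t for r, t in zip(rows, targets) if r[feature] == 0]
    right = [t for r, t in zip(rows, targets) if r[feature] == 1]
    n = len(targets)
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(targets) - weighted


# Balanced design over two binary features; the response is XOR,
# so neither feature has a marginal effect on its own.
rows = list(product([0, 1], repeat=2)) * 25
targets = [float(x1 != x2) for x1, x2 in rows]

# Gain of each candidate split at the root node.
root_gains = [split_gain(rows, targets, j) for j in range(2)]

# Gain of splitting on x2 once the data are restricted to x1 == 0.
sub_rows = [r for r in rows if r[0] == 0]
sub_targets = [t for r, t in zip(rows, targets) if r[0] == 0]
child_gain = split_gain(sub_rows, sub_targets, 1)

print(root_gains, child_gain)
```

In this toy setting both root splits have zero gain, while the split inside the child separates the response perfectly, which is the kind of structure a greedy impurity criterion has no reason to find.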
- [2] arXiv:2406.15661 [pdf, html, other]
Title: The Stochastic Occupation Kernel Method for System Identification
Comments: 8 pages, 3 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
The method of occupation kernels has been used to learn ordinary differential equations from data in a non-parametric way. We propose a two-step method for learning the drift and diffusion of a stochastic differential equation given snapshots of the process. In the first step, we learn the drift by applying the occupation kernel algorithm to the expected value of the process. In the second step, we learn the diffusion given the drift using a semi-definite program. Specifically, we learn the diffusion squared as a non-negative function in a RKHS associated with the square of a kernel. We present examples and simulations.
- [3] arXiv:2406.15664 [pdf, html, other]
Title: Flat Posterior Does Matter For Bayesian Transfer Learning
Authors: Sungjun Lim, Jeyoon Yeom, Sooyon Kim, Hoyoon Byun, Jinho Kang, Yohan Jung, Jiyoung Jung, Kyungwoo Song
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The large-scale pre-trained neural network has achieved notable success in enhancing performance for downstream tasks. Another promising approach for generalization is Bayesian Neural Network (BNN), which integrates Bayesian methods into neural network architectures, offering advantages such as Bayesian Model averaging (BMA) and uncertainty quantification. Despite these benefits, transfer learning for BNNs has not been widely investigated and shows limited improvement. We hypothesize that this issue arises from the inability to find flat minima, which is crucial for generalization performance. To address this, we evaluate the sharpness of BNNs in various settings, revealing their insufficiency in seeking flat minima and the influence of flatness on BMA performance. Therefore, we propose Sharpness-aware Bayesian Model Averaging (SA-BMA), a Bayesian-fitting flat posterior seeking optimizer integrated with Bayesian transfer learning. SA-BMA calculates the divergence between posteriors in the parameter space, aligning with the nature of BNNs, and serves as a generalized version of existing sharpness-aware optimizers. We validate that SA-BMA improves generalization performance in few-shot classification and distribution shift scenarios by ensuring flatness.
- [4] arXiv:2406.16032 [pdf, html, other]
Title: Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution
Comments: 28 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerated parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions on a loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error using this method. As a proof technique, we approximate the distribution by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using the theoretical advance of the piece-wise deterministic Markov process (PDMP).
- [5] arXiv:2406.16045 [pdf, html, other]
Title: Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection
Comments: Accepted for publication in Transactions on Machine Learning Research (TMLR)
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores. These scores encompass a wide range of techniques that leverage the self-confidence of deep learning models and the anomalous behavior of features in the latent space. Not surprisingly, combining such a varied population using simple statistics proves inadequate. To overcome this challenge, we propose a quantile normalization to map these scores into p-values, effectively framing the problem into a multi-variate hypothesis test. Then, we combine these tests using established meta-analysis tools, resulting in a more effective detector with consolidated decision boundaries. Furthermore, we create a probabilistic interpretable criterion by mapping the final statistics into a distribution with known parameters. Through empirical investigation, we explore different types of shifts, each exerting varying degrees of impact on data. Our results demonstrate that our approach significantly improves overall robustness and performance across diverse OOD detection scenarios. Notably, our framework is easily extensible for future developments in detection scores and stands as the first to combine decision boundaries in this context. The code and artifacts associated with this work are publicly available (this https URL).
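The idea of mapping heterogeneous scores to p-values and combining them can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses an empirical rank-based p-value against in-distribution reference scores, assumes higher scores mean "more anomalous", and picks Fisher's method as one example of a meta-analysis combiner (Fisher's method additionally assumes independent p-values under the null). All function names and the toy scores are hypothetical.

```python
import math


def empirical_p_value(score, reference_scores):
    """Share of reference scores at least as large as `score`.

    The +1 terms keep the p-value strictly positive.
    """
    exceed = sum(1 for r in reference_scores if r >= score)
    return (exceed + 1) / (len(reference_scores) + 1)


def chi2_sf_even_dof(x, k):
    """Survival function of a chi-square variable with 2*k degrees of freedom."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k dof under the null."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_dof(stat, len(p_values))


# Two hypothetical detectors whose raw scores live on very different scales.
ref_a = [i / 100 for i in range(100)]
ref_b = [10 * i for i in range(100)]

# A sample that looks extreme to both detectors, and one that looks typical.
p_ood = fisher_combine([empirical_p_value(0.995, ref_a), empirical_p_value(985, ref_b)])
p_id = fisher_combine([empirical_p_value(0.5, ref_a), empirical_p_value(500, ref_b)])

print(p_ood, p_id)
```

Because each score is first replaced by its rank-based p-value, the difference in raw scales between the two detectors no longer matters when the evidence is pooled.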
- [6] arXiv:2406.16227 [pdf, html, other]
Title: VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We demonstrate VICatMix's utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel subtypes.
Availability: VICatMix is freely available as an R package, incorporating C++ for faster computation, at this https URL.
- [7] arXiv:2406.16484 [pdf, html, other]
Title: Robust prediction under missingness shifts
Authors: Patrick Rockenschaub, Zhicong Xian, Alireza Zamanian, Marta Piperno, Octavia-Andreea Ciora, Elisabeth Pachl, Narges Ahmidi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Prediction becomes more challenging with missing covariates. What method is chosen to handle missingness can greatly affect how models perform. In many real-world problems, the best prediction performance is achieved by models that can leverage the informative nature of a value being missing. Yet, the reasons why a covariate goes missing can change once a model is deployed in practice. If such a missingness shift occurs, the conditional probability of a value being missing differs in the target data. Prediction performance in the source data may no longer be a good selection criterion, and approaches that do not rely on informative missingness may be preferable. However, we show that the Bayes predictor remains unchanged by ignorable shifts for which the probability of missingness only depends on observed data. Any consistent estimator of the Bayes predictor may therefore result in robust prediction under those conditions, although we show empirically that different methods appear robust to different types of shifts. If the missingness shift is non-ignorable, the Bayes predictor may change due to the shift. While neither approach recovers the Bayes predictor in this case, we found empirically that disregarding missingness was most beneficial when it was highly informative.
- [8] arXiv:2406.16525 [pdf, html, other]
Title: OAML: Outlier Aware Metric Learning for OOD Detection Enhancement
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Out-of-distribution (OOD) detection methods have been developed to identify objects that a model has not seen during training. The Outlier Exposure (OE) methods use auxiliary datasets to train OOD detectors directly. However, the collection and learning of representative OOD samples may pose challenges. To tackle these issues, we propose the Outlier Aware Metric Learning (OAML) framework. The main idea of our method is to use the k-NN algorithm and Stable Diffusion model to generate outliers for training at the feature level without making any distributional assumptions. To increase feature discrepancies in the semantic space, we develop a mutual information-based contrastive learning approach for learning from OOD data effectively. Both theoretical and empirical results confirm the effectiveness of this contrastive learning technique. Furthermore, we incorporate knowledge distillation into our learning framework to prevent degradation of in-distribution classification accuracy. The combination of contrastive learning and knowledge distillation algorithms significantly enhances the performance of OOD detection. Experimental results across various datasets show that our method significantly outperforms previous OE methods.
- [9] arXiv:2406.16530 [pdf, html, other]
Title: Conditional Bayesian Quadrature
Journal-ref: Conference on Uncertainty in Artificial Intelligence (UAI) 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
We propose a novel approach for estimating conditional or parametric expectations in the setting where obtaining samples or evaluating integrands is costly. Through the framework of probabilistic numerical methods (such as Bayesian quadrature), our approach allows us to incorporate prior information about the integrands, especially prior knowledge about the smoothness of the integrands and of the conditional expectation. As a result, our approach provides a way of quantifying uncertainty and leads to a fast convergence rate, which is confirmed both theoretically and empirically on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty.
- [10] arXiv:2406.16590 [pdf, html, other]
Title: Forecasting with Deep Learning: Beyond Average of Average of Average Performance
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models, particularly under conditions in which this relative performance differs from the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step ahead forecasting versus multi-step ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance in benchmark datasets. We conducted extensive experiments that show NHITS generally performs best, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches for multi-step ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of aspect-based model evaluation.
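The point about a single averaged score hiding horizon-specific behaviour can be illustrated with a small sketch. It assumes one common SMAPE variant, 2|y - f| / (|y| + |f|) expressed in percent, and entirely made-up forecasts for two hypothetical models; it is not the paper's evaluation framework.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    terms = []
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        terms.append(0.0 if denom == 0 else 2.0 * abs(y - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def smape_by_horizon(actual_windows, forecast_windows):
    """SMAPE computed separately for each forecasting horizon (window position)."""
    horizons = len(actual_windows[0])
    return [
        smape([w[h] for w in actual_windows], [w[h] for w in forecast_windows])
        for h in range(horizons)
    ]


def flatten(windows):
    return [value for window in windows for value in window]


# Toy data: two forecast windows, each with horizons 1 and 2.
actual = [[100, 100], [100, 100]]
model_a = [[100, 120], [100, 120]]  # exact at horizon 1, poor at horizon 2
model_b = [[110, 110], [110, 110]]  # uniformly mediocre

overall_a = smape(flatten(actual), flatten(model_a))
overall_b = smape(flatten(actual), flatten(model_b))
by_h_a = smape_by_horizon(actual, model_a)
by_h_b = smape_by_horizon(actual, model_b)

print(overall_a, overall_b, by_h_a, by_h_b)
```

In this toy case model A has the lower overall SMAPE, yet model B is the better choice at horizon 2, which is exactly the kind of information that a single average discards.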
- [11] arXiv:2406.16766 [pdf, html, other]
Title: Conformal time series decomposition with component-wise exchangeability
Comments: Accepted at COPA 2024; 34 pages, 14 figures, 8 tables (incl. appendix)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Conformal prediction offers a practical framework for distribution-free uncertainty quantification, providing finite-sample coverage guarantees under relatively mild assumptions on data exchangeability. However, these assumptions cease to hold for time series due to their temporally correlated nature. In this work, we present a novel use of conformal prediction for time series forecasting that incorporates time series decomposition. This approach allows us to model different temporal components individually. By applying specific conformal algorithms to each component and then merging the obtained prediction intervals, we customize our methods to account for the different exchangeability regimes underlying each component. Our decomposition-based approach is thoroughly discussed and empirically evaluated on synthetic and real-world data. We find that the method provides promising results on well-structured time series, but can be limited by factors such as the decomposition step for more complex data.
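As a rough illustration of applying a conformal procedure per component and then merging the intervals, the sketch below uses plain split conformal prediction on absolute residuals for every component and sums the resulting intervals. Two things are assumptions made here and not taken from the paper: the same split-conformal rule is used for all components, and the miscoverage level is divided evenly across components (a Bonferroni-style split) before the intervals are added. The function names and toy residuals are hypothetical.

```python
import math


def conformal_radius(calibration_residuals, alpha):
    """Split-conformal quantile of absolute residuals with the (n + 1) correction."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: the interval is unbounded.
        return math.inf
    return scores[rank - 1]


def merge_intervals(component_forecasts, component_residuals, alpha):
    """Sum per-component intervals, each built at level alpha / (number of components)."""
    k = len(component_forecasts)
    radii = [conformal_radius(res, alpha / k) for res in component_residuals]
    center = sum(component_forecasts)
    width = sum(radii)
    return center - width, center + width


# Toy calibration residuals for a trend and a seasonal component.
trend_residuals = list(range(-25, 25))
seasonal_residuals = [0.5 * r for r in range(-25, 25)]

lower, upper = merge_intervals(
    component_forecasts=[100.0, 5.0],
    component_residuals=[trend_residuals, seasonal_residuals],
    alpha=0.2,
)
print(lower, upper)
```

Summing intervals in this way is conservative; it is shown only to make the merge step tangible, and component-specific conformal algorithms would replace `conformal_radius` where exchangeability does not hold.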
- [12] arXiv:2406.16834 [pdf, html, other]
Title: Concentration Inequalities for $(f,\Gamma)$-GANs
Comments: 21 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generative adversarial networks (GANs) are unsupervised learning methods for training a generator distribution to produce samples that approximate those drawn from a target distribution. Many such methods can be formulated as minimization of a metric or divergence. Recent works have proven the statistical consistency of GANs that are based on integral probability metrics (IPMs), e.g., WGAN which is based on the 1-Wasserstein metric. IPMs are defined by optimizing a linear functional (difference of expectations) over a space of discriminators. A much larger class of GANs, which allow for the use of nonlinear objective functionals, can be constructed using $(f,\Gamma)$-divergences; these generalize and interpolate between IPMs and $f$-divergences (e.g., KL or $\alpha$-divergences). Instances of $(f,\Gamma)$-GANs have been shown to exhibit improved performance in a number of applications. In this work we study the statistical consistency of $(f,\Gamma)$-GANs for general $f$ and $\Gamma$. Specifically, we derive finite-sample concentration inequalities. These derivations require novel arguments due to nonlinearity of the objective functional. We demonstrate that our new results reduce to the known results for IPM-GANs in the appropriate limit while also significantly extending the domain of applicability of this theory.
New submissions for Tuesday, 25 June 2024 (showing 12 of 12 entries)
- [13] arXiv:2406.15523 (cross-list from cs.LG) [pdf, html, other]
Title: Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaly Detection (our method), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 16 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, generalizability, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (this https URL) of our method to foster reproducible research and outline potential directions for future investigations based on our insights.
- [14] arXiv:2406.15567 (cross-list from cs.LG) [pdf, html, other]
Title: SAIL: Self-Improving Efficient Online Alignment of Large Language Models
Authors: Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang
Comments: 24 pages, 6 figures, 3 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation and suffers from distribution shift issues. To address this, we establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment by exploring responses and regulating preference labels. In doing so, we permit alignment methods to operate in an online and self-improving manner, as well as generalize prior online RLHF methods as special cases. Compared to state-of-the-art iterative RLHF methods, our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
- [15] arXiv:2406.15575 (cross-list from cs.LG) [pdf, html, other]
Title: Sketch-GNN: Scalable Graph Neural Networks with Sublinear Training Complexity
Comments: NeurIPS 2022
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Graph Neural Networks (GNNs) are widely applied to graph learning problems such as node classification. When scaling up the underlying graphs of GNNs to a larger size, we are forced to either train on the complete graph and keep the full graph adjacency and node embeddings in memory (which is often infeasible) or mini-batch sample the graph (which results in exponentially growing computational complexities with respect to the number of GNN layers). Various sampling-based and historical-embedding-based methods are proposed to avoid this exponential growth of complexities. However, none of these solutions eliminates the linear dependence on graph size. This paper proposes a sketch-based algorithm whose training time and memory grow sublinearly with respect to graph size by training GNNs atop a few compact sketches of graph adjacency and node embeddings. Based on polynomial tensor-sketch (PTS) theory, our framework provides a novel protocol for sketching non-linear activations and graph convolution matrices in GNNs, as opposed to existing methods that sketch linear weights or gradients in neural networks. In addition, we develop a locality-sensitive hashing (LSH) technique that can be trained to improve the quality of sketches. Experiments on large-graph benchmarks demonstrate the scalability and competitive performance of our Sketch-GNNs versus their full-size GNN counterparts.
- [16] arXiv:2406.15648 (cross-list from cs.LG) [pdf, html, other]
Title: Testing the Feasibility of Linear Programs with Bandit Feedback
Comments: Spotlight presentation at ICML 2024
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
While the recent literature has seen a surge in the study of constrained bandit problems, all existing methods for these begin by assuming the feasibility of the underlying problem. We initiate the study of testing such feasibility assumptions, and in particular address the problem in the linear bandit setting, thus characterising the costs of feasibility testing for an unknown linear program using bandit feedback. Concretely, we test if $\exists x: Ax \ge 0$ for an unknown $A \in \mathbb{R}^{m \times d}$, by playing a sequence of actions $x_t\in \mathbb{R}^d$, and observing $Ax_t + \mathrm{noise}$ in response. By identifying the hypothesis as determining the sign of the value of a minimax game, we construct a novel test based on low-regret algorithms and a nonasymptotic law of iterated logarithms. We prove that this test is reliable, and adapts to the 'signal level' $\Gamma$ of any instance, with mean sample costs scaling as $\widetilde{O}(d^2/\Gamma^2)$. We complement this by a minimax lower bound of $\Omega(d/\Gamma^2)$ for sample costs of reliable tests, dominating prior asymptotic lower bounds by capturing the dependence on $d$, and thus elucidating a basic insight missing in the extant literature on such problems.
- [17] arXiv:2406.15753 (cross-list from cs.LG) [pdf, other]
Title: The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
Comments: 58 pages, 1 figure
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the training distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. Our theoretical results highlight the importance of developing new ways to measure the quality of learned reward models.
- [18] arXiv:2406.15760 (cross-list from cs.LG) [pdf, html, other]
Title: ICM Ensemble with Novel Betting Functions for Concept Drift
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This study builds upon our previous work by introducing a refined Inductive Conformal Martingale (ICM) approach for addressing Concept Drift (CD). Specifically, we enhance our previously proposed CAUTIOUS betting function to incorporate multiple density estimators for improving detection ability. We also combine this betting function with two base estimators that have not been previously utilized within the ICM framework: the Interpolated Histogram and Nearest Neighbor Density Estimators. We assess these extensions using both a single ICM and an ensemble of ICMs. For the latter, we conduct a comprehensive experimental investigation into the influence of the ensemble size on prediction accuracy and the number of available predictions. Our experimental results on four benchmark datasets demonstrate that the proposed approach surpasses our previous methodology in terms of performance while matching or in many cases exceeding that of three contemporary state-of-the-art techniques.
- [19] arXiv:2406.15762 (cross-list from cs.LG) [pdf, html, other]
Title: Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow
Authors: Zhichao Chen, Haoxuan Li, Fangyikang Wang, Odin Zhang, Hu Xu, Xiaoyu Jiang, Zhihuan Song, Eric H. Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but two long-neglected issues remain to be addressed: (1) inaccurate imputation, which arises from the inherently sample-diversification-pursuing generative process of DMs, and (2) difficult training, which stems from the intricate design required for the mask matrix in the model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on the Wasserstein gradient flow (WGF) framework, we first prove that issue (1) arises because the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with diversification-discouraging negative entropy and derive our KnewImp approach within the WGF framework and a reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.
- [20] arXiv:2406.15893 (cross-list from cs.LG) [pdf, html, other]
Title: Statistical Models of Top-$k$ Partial Orders
Comments: 9 pages, 5 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In many contexts involving ranked preferences, agents submit partial orders over available alternatives. Statistical models often treat these as marginal in the space of total orders, but this approach overlooks information contained in the list length itself. In this work, we introduce and taxonomize approaches for jointly modeling distributions over top-$k$ partial orders and list lengths $k$, considering two classes of approaches: composite models that view a partial order as a truncation of a total order, and augmented ranking models that model the construction of the list as a sequence of choice decisions, including the decision to stop. For composite models, we consider three dependency structures for joint modeling of order and truncation length. For augmented ranking models, we consider different assumptions on how the stop-token choice is modeled. Using data consisting of partial rankings from San Francisco school choice and San Francisco ranked choice elections, we evaluate how well the models predict observed data and generate realistic synthetic datasets. We find that composite models, explicitly modeling length as a categorical variable, produce synthetic datasets with accurate length distributions, and an augmented model with position-dependent item utilities jointly models length and preferences in the training data best, as measured by negative log loss. Methods from this work have significant implications on the simulation and evaluation of real-world social systems that solicit ranked preferences.
- [21] arXiv:2406.15904 (cross-list from cs.LG) [pdf, html, other]
Title: Learning When the Concept Shifts: Confounding, Invariance, and Dimension Reduction
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Practitioners often deploy a learned prediction model in a new environment where the joint distribution of covariate and response has shifted. In observational data, the distribution shift is often driven by unobserved confounding factors lurking in the environment, with the underlying mechanism unknown. Confounding can obfuscate the definition of the best prediction model (concept shift) and shift covariates to domains yet unseen (covariate shift). Therefore, a model maximizing prediction accuracy in the source environment could suffer a significant accuracy drop in the target environment. This motivates us to study the domain adaptation problem with observational data: given labeled covariate and response pairs from a source environment, and unlabeled covariates from a target environment, how can one predict the missing target response reliably? We root the adaptation problem in a linear structural causal model to address endogeneity and unobserved confounding. We study the necessity and benefit of leveraging exogenous, invariant covariate representations to cure concept shifts and improve target prediction. This further motivates a new representation learning method for adaptation that optimizes for a lower-dimensional linear subspace and, subsequently, a prediction model confined to that subspace. The procedure operates on a non-convex objective (one that naturally interpolates between predictability and stability/invariance) constrained on the Stiefel manifold. We study the optimization landscape and prove that, when the regularization is sufficient, nearly all local optima align with an invariant linear subspace resilient to both concept and covariate shift. In terms of predictability, we show a model that uses the learned lower-dimensional subspace can incur a nearly ideal gap between target and source risk. Three real-world data sets are investigated to validate our method and theory.
- [22] arXiv:2406.15916 (cross-list from cs.LG) [pdf, html, other]
Title: Credit Attribution and Stable Compression
Comments: 15 pages, 1 figure
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Credit attribution is crucial across various fields. In academic research, proper citation acknowledges prior work and establishes original contributions. Similarly, in generative models, such as those trained on existing artworks or music, it is important to ensure that any generated content influenced by these works appropriately credits the original creators.
We study credit attribution by machine learning algorithms. We propose new definitions--relaxations of Differential Privacy--that weaken the stability guarantees for a designated subset of $k$ datapoints. These $k$ datapoints can be used non-stably with permission from their owners, potentially in exchange for compensation. Meanwhile, the remaining datapoints are guaranteed to have no significant influence on the algorithm's output.
Our framework extends well-studied notions of stability, including Differential Privacy ($k = 0$), differentially private learning with public data (where the $k$ public datapoints are fixed in advance), and stable sample compression (where the $k$ datapoints are selected adaptively by the algorithm). We examine the expressive power of these stability notions within the PAC learning framework, provide a comprehensive characterization of learnability for algorithms adhering to these principles, and propose directions and questions for future research.
- [23] arXiv:2406.15941 (cross-list from cs.LG) [pdf, html, other]
-
Title: Towards Exact Computation of Inductive Bias
Comments: Published at IJCAI 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Much research in machine learning involves finding appropriate inductive biases (e.g. convolutional neural networks, momentum-based optimizers, transformers) to promote generalization on tasks. However, quantification of the amount of inductive bias associated with these architectures and hyperparameters has been limited. We propose a novel method for efficiently computing the inductive bias required for generalization on a task with a fixed training data budget; formally, this corresponds to the amount of information required to specify well-generalizing models within a specific hypothesis space of models. Our approach involves modeling the loss distribution of random hypotheses drawn from a hypothesis space to estimate the required inductive bias for a task relative to these hypotheses. Unlike prior work, our method provides a direct estimate of inductive bias without using bounds and is applicable to diverse hypothesis spaces. Moreover, we derive approximation error bounds for our estimation approach in terms of the number of sampled hypotheses. Consistent with prior results, our empirical results demonstrate that higher dimensional tasks require greater inductive bias. We show that relative to other expressive model classes, neural networks as a model class encode large amounts of inductive bias. Furthermore, our measure quantifies the relative difference in inductive bias between different neural network architectures. Our proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures for certain tasks and provides a quantitative guide to developing tasks requiring greater inductive bias, thereby encouraging the development of more powerful inductive biases.
- [24] arXiv:2406.15958 (cross-list from eess.IV) [pdf, html, other]
-
Title: Bone Fracture Classification using Transfer Learning
Comments: code is publicly available at - this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
The manual examination of X-ray images for fractures is a time-consuming process that is prone to human error. In this work, we introduce a robust yet simple training loop for the classification of fractures, which significantly outperforms existing methods. Our method achieves superior performance in less than ten epochs and utilizes the latest dataset to deliver the best-performing model for this task. We emphasize the importance of training deep learning models responsibly and efficiently, as well as the critical role of selecting high-quality datasets.
- [25] arXiv:2406.15972 (cross-list from cs.LG) [pdf, html, other]
-
Title: EVCL: Elastic Variational Continual Learning with Weight Consolidation
Comments: Accepted at ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Continual learning aims to allow models to learn new tasks without forgetting what has been learned before. This work introduces Elastic Variational Continual Learning with Weight Consolidation (EVCL), a novel hybrid model that integrates the variational posterior approximation mechanism of Variational Continual Learning (VCL) with the regularization-based parameter-protection strategy of Elastic Weight Consolidation (EWC). By combining the strengths of both methods, EVCL effectively mitigates catastrophic forgetting and enables better capture of dependencies between model parameters and task-specific data. Evaluated on five discriminative tasks, EVCL consistently outperforms existing baselines in both domain-incremental and task-incremental learning scenarios for deep discriminative models.
- [26] arXiv:2406.16052 (cross-list from cs.LG) [pdf, html, other]
-
Title: Pivotal Auto-Encoder via Self-Normalizing ReLU
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Sparse auto-encoders are useful for extracting low-dimensional representations from high-dimensional data. However, their performance degrades sharply when the input noise at test time differs from the noise employed during training. This limitation hinders the applicability of auto-encoders in real-world scenarios where the level of noise in the input is unpredictable. In this paper, we formalize single hidden layer sparse auto-encoders as a transform learning problem. Leveraging the transform modeling interpretation, we propose an optimization problem that leads to a predictive model invariant to the noise level at test time. In other words, the same pre-trained model is able to generalize to different noise levels. The proposed optimization algorithm, derived from the square root lasso, is translated into a new, computationally efficient auto-encoding architecture. After proving that our new method is invariant to the noise level, we evaluate our approach by training networks using the proposed architecture for denoising tasks. Our experimental results demonstrate that the trained models yield a significant improvement in stability against varying types of noise compared to commonly used architectures.
- [27] arXiv:2406.16199 (cross-list from econ.GN) [pdf, other]
-
Title: Reinterpreting Economic Complexity: A co-clustering approach
Comments: 19 pages, 4 figures
Subjects: General Economics (econ.GN); Applications (stat.AP); Machine Learning (stat.ML)
Economic growth results from countries' accumulation of organizational and technological capabilities. The Economic and Product Complexity Indices, introduced as an attempt to measure these capabilities from a country's basket of exported products, have become popular to study economic development, the geography of innovation, and industrial policies. Despite this reception, the interpretation of these indicators has proved difficult. Although the original Method of Reflections suggested a direct interconnection between country and product metrics, it has been proved that the Economic and Product Complexity Indices result from a spectral clustering algorithm that separately groups similar countries or similar products, respectively. This recent approach to economic and product complexity conflicts with the original one and treats countries and products separately. However, building on previous interpretations of the indices and the recent evolution in spectral clustering, we show that these indices simultaneously identify two co-clusters of similar countries and products. This viewpoint reconciles the spectral clustering interpretation of the indices with the original Method of Reflections interpretation. By proving the often neglected intimate relationship between country and product complexity, this approach emphasizes the role of a selected set of products in determining economic development while extending the range of applications of these indicators in economics.
- [28] arXiv:2406.16206 (cross-list from cs.LG) [pdf, html, other]
-
Title: Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we explore advanced modifications to the Tweedie regression model in order to address its limitations in modeling aggregate claims for various types of insurance such as automobile, health, and liability. Traditional Tweedie models, while effective in capturing the probability and magnitude of claims, usually fall short in accurately representing the large incidence of zero claims. Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods in order to help leverage an iterative process to enhance predictive accuracy. Despite the inherent slowdown in learning algorithms due to this iteration, several efficient implementations that also support precise parameter tuning, such as XGBoost, LightGBM, and CatBoost, have emerged. Nonetheless, we chose to utilize CatBoost, an efficient boosting approach that effectively handles categorical and other special types of data. The core contribution of our paper is the assembly of separate modeling for zero claims and the application of tree-based boosting ensemble methods within a CatBoost framework, assuming that the inflated probability of zero is a function of the mean parameter. The efficacy of our enhanced Tweedie model is demonstrated through its application to an insurance telematics dataset, which presents the additional complexity of compositional feature variables. Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions suitable for insurance claim analytics.
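As a rough illustration of why zero inflation matters here, the sketch below computes the probability of a zero claim under a compound Poisson-gamma Tweedie model (power parameter between 1 and 2) and under a zero-inflated variant. The function names and the inflation probability `q` are assumptions made for this example. The abstract states only that the inflated zero probability is a function of the mean, without giving its form, so `q` is passed in directly rather than derived from `mu`.

```python
import math

def tweedie_zero_prob(mu, phi, p):
    """P(Y = 0) for a compound Poisson-gamma Tweedie with mean mu,
    dispersion phi, and power parameter 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("power parameter p must lie strictly between 1 and 2")
    # The claim count is Poisson with this rate; Y = 0 exactly when the count is 0.
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    return math.exp(-lam)

def zero_inflated_zero_prob(q, mu, phi, p):
    """P(Y = 0) when an extra point mass q at zero is mixed with the Tweedie.
    q is a hypothetical input; the paper's link between q and mu is not reproduced."""
    return q + (1.0 - q) * tweedie_zero_prob(mu, phi, p)

base = tweedie_zero_prob(mu=1.0, phi=1.0, p=1.5)
inflated = zero_inflated_zero_prob(q=0.3, mu=1.0, phi=1.0, p=1.5)
```

With these illustrative values the plain Tweedie zero mass is `exp(-2)`, and mixing in `q = 0.3` raises it, which is the extra flexibility a zero-inflated model adds.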
- [29] arXiv:2406.16306 (cross-list from cs.CL) [pdf, html, other]
-
Title: Cascade Reward Sampling for Efficient Decoding-Time Alignment
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Aligning large language models (LLMs) with human preferences is critical for their deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that requires no fine-tuning of model parameters. However, generating text that achieves both high reward and high likelihood remains a significant challenge. Existing methods often fail to generate high-reward text or incur substantial computational costs. In this paper, we propose Cascade Reward Sampling (CARDS) to address both issues, guaranteeing the generation of high-reward and high-likelihood text at significantly lower cost. Based on our analysis of reward models (RMs) on incomplete text and our observation that high-reward prefixes induce high-reward complete text, we use rejection sampling to iteratively generate small semantic segments to form such prefixes. The segment length is dynamically determined by the predictive uncertainty of LLMs. This strategy guarantees desirable prefixes for subsequent generations and significantly reduces wasteful token re-generations and the number of reward model scoring calls. Our experiments demonstrate substantial gains in both generation efficiency and alignment ratings compared to the baselines, achieving five times faster text generation and 99% win-ties in GPT-4/Claude-3 helpfulness evaluation.
- [30] arXiv:2406.16468 (cross-list from cs.LG) [pdf, html, other]
-
Title: The Hidden Pitfalls of the Cosine Similarity Loss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
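The two vanishing-gradient settings can be checked numerically with a small sketch. It uses the standard closed form of the gradient of cos(x, y) with respect to x, namely (y/|y| - cos(x, y) * x/|x|) / |x|. The helper name and the 2-D test vectors are choices made for this example and are not taken from the paper.

```python
import math

def cosine_grad_norm(x, y):
    """Euclidean norm of the gradient of cos(x, y) with respect to x."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cos = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    # grad_x cos(x, y) = (y/|y| - cos * x/|x|) / |x|
    grad = [(b / ny - cos * a / nx) / nx for a, b in zip(x, y)]
    return math.sqrt(sum(g * g for g in grad))

y = [0.0, 1.0]
small = cosine_grad_norm([1.0, 0.0], y)      # unit-norm x, orthogonal to y
large = cosine_grad_norm([100.0, 0.0], y)    # same direction, 100x the magnitude
opposite = cosine_grad_norm([0.0, -1.0], y)  # x points exactly away from y
```

Scaling `x` by 100 shrinks the gradient norm by the same factor (setting 1), and the gradient is exactly zero when `x` and `y` are antipodal (setting 2).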
- [31] arXiv:2406.16507 (cross-list from stat.ME) [pdf, html, other]
-
Title: Statistical ranking with dynamic covariates
Comments: 40 pages; 8 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
We consider a covariate-assisted ranking model grounded in the Plackett--Luce framework. Unlike existing works focusing on pure covariates or individual effects with fixed covariates, our approach integrates individual effects with dynamic covariates. This added flexibility enhances realistic ranking yet poses significant challenges for analyzing the associated estimation procedures. This paper makes an initial attempt to address these challenges. We begin by discussing the sufficient and necessary condition for the model's identifiability. We then introduce an efficient alternating maximization algorithm to compute the maximum likelihood estimator (MLE). Under suitable assumptions on the topology of comparison graphs and dynamic covariates, we establish a quantitative uniform consistency result for the MLE with convergence rates characterized by the asymptotic graph connectivity. The proposed graph topology assumption holds for several popular random graph models under optimal leading-order sparsity conditions. A comprehensive numerical study is conducted to corroborate our theoretical findings and demonstrate the application of the proposed model to real-world datasets, including horse racing and tennis competitions.
- [32] arXiv:2406.16552 (cross-list from cs.LG) [pdf, html, other]
-
Title: Inference of Sequential Patterns for Neural Message Passing in Temporal Graphs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
The modelling of temporal patterns in dynamic graphs is an important current research issue in the development of time-aware GNNs. Whether or not a specific sequence of events in a temporal graph constitutes a temporal pattern depends not only on the frequency of its occurrence but also on whether it deviates from what is expected in a temporal graph where timestamps are randomly shuffled. While accounting for such a random baseline is important to model temporal patterns, it has mostly been ignored by current temporal graph neural networks. To address this issue we propose HYPA-DBGNN, a novel two-step approach that combines (i) the inference of anomalous sequential patterns in time series data on graphs based on a statistically principled null model, with (ii) a neural message passing approach that utilizes a higher-order De Bruijn graph whose edges capture overrepresented sequential patterns. Our method leverages hypergeometric graph ensembles to identify anomalous edges within both first- and higher-order De Bruijn graphs, which encode the temporal ordering of events. The model introduces an inductive bias that enhances model interpretability. We evaluate our approach for static node classification using benchmark datasets and a synthetic dataset that showcases its ability to incorporate the observed inductive bias regarding over- and under-represented temporal edges. We demonstrate the framework's effectiveness in detecting similar patterns within empirical datasets, resulting in superior performance compared to baseline methods in node classification tasks. To the best of our knowledge, our work is the first to introduce statistically informed GNNs that leverage temporal and causal sequence anomalies. HYPA-DBGNN represents a path for bridging the gap between statistical graph inference and neural graph representation learning, with potential applications to static GNNs.
- [33] arXiv:2406.16689 (cross-list from cs.LG) [pdf, html, other]
-
Title: Coding schemes in neural networks learning classification tasks
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Neural networks possess the crucial ability to generate meaningful representations of task-dependent features. Indeed, with appropriate scaling, supervised learning in neural networks can result in strong, task-dependent feature learning. However, the nature of the emergent representations, which we call the `coding scheme', is still unclear. To understand the emergent coding scheme, we investigate fully-connected, wide neural networks learning classification tasks using the Bayesian framework where learning shapes the posterior distribution of the network weights. Consistent with previous findings, our analysis of the feature learning regime (also known as `non-lazy', `rich', or `mean-field' regime) shows that the networks acquire strong, data-dependent features. Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity. In linear networks, an analog coding scheme of the task emerges. Despite the strong representations, the mean predictor is identical to the lazy case. In nonlinear networks, spontaneous symmetry breaking leads to either redundant or sparse coding schemes. Our findings highlight how network properties such as scaling of weights and neuronal nonlinearity can profoundly influence the emergent representations.
- [34] arXiv:2406.16745 (cross-list from cs.LG) [pdf, html, other]
-
Title: Bandits with Preference Feedback: A Stackelberg Game Perspective
Comments: 30 pages, 8 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.
- [35] arXiv:2406.16749 (cross-list from cs.LG) [pdf, html, other]
-
Title: Inferring stochastic low-rank recurrent neural networks from neural data
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
A central aim in computational neuroscience is to relate the activity of large populations of neurons to an underlying dynamical system. Models of these neural dynamics should ideally be both interpretable and fit the observed data well. Low-rank recurrent neural networks (RNNs) exhibit such interpretability by having tractable dynamics. However, it is unclear how to best fit low-rank RNNs to data consisting of noisy observations of an underlying stochastic system. Here, we propose to fit stochastic low-rank RNNs with variational sequential Monte Carlo methods. We validate our method on several datasets consisting of both continuous and spiking neural data, where we obtain lower dimensional latent dynamics than current state of the art methods. Additionally, for low-rank models with piecewise linear nonlinearities, we show how to efficiently identify all fixed points in polynomial rather than exponential cost in the number of units, making analysis of the inferred dynamics tractable for large RNNs. Our method both elucidates the dynamical systems underlying experimental recordings and provides a generative model whose trajectories match observed trial-to-trial variability.
- [36] arXiv:2406.16802 (cross-list from cs.LG) [pdf, html, other]
-
Title: Improved Regret Bounds for Bandits with Expert Advice
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this research note, we revisit the bandits with expert advice problem. Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$ the number of experts, and $T$ the time horizon. This matches a previously known upper bound of the same order and improves upon the best available lower bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, we prove a new instance-based upper bound that depends on the agreement between the experts and provides a logarithmic improvement compared to prior results.
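To get a feel for the two lower-bound orders quoted above, the sketch below evaluates both expressions with all constants dropped. The function names and the values of `K`, `N`, and `T` are illustrative assumptions. The comparison is meaningful only up to constants and for `N` well above `K`; when `N` is close to `K`, `ln(N/K)` approaches zero and the raw expressions swap order.

```python
import math

def new_lower_bound_order(K, N, T):
    """sqrt(K * T * ln(N / K)), constants omitted."""
    return math.sqrt(K * T * math.log(N / K))

def prior_lower_bound_order(K, N, T):
    """sqrt(K * T * ln(N) / ln(K)), constants omitted."""
    return math.sqrt(K * T * math.log(N) / math.log(K))

K, N, T = 10, 10_000, 1_000_000  # hypothetical problem sizes
ratio = new_lower_bound_order(K, N, T) / prior_lower_bound_order(K, N, T)
```

The ratio does not depend on `T`, since the horizon enters both expressions through the same `sqrt(K * T)` factor.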
- [37] arXiv:2406.16846 (cross-list from cs.LG) [pdf, html, other]
-
Title: Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups. Our approach enables us to efficiently train debiased classifiers while removing only a small number of examples, and does not require training group annotations or additional hyperparameter tuning.
Cross submissions for Tuesday, 25 June 2024 (showing 25 of 25 entries)
- [38] arXiv:1909.06511 (replaced) [pdf, html, other]
-
Title: A new model for natural groupings in high-dimensional data
Comments: 17 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Clustering aims to divide a set of points into groups. The current paradigm assumes that the grouping is well-defined (unique) given the probability model from which the data is drawn. Yet, recent experiments have uncovered several high-dimensional datasets that form different binary groupings after projecting the data to randomly chosen one-dimensional subspaces. This paper describes a probability model for the data that could explain this phenomenon. It is a simple model to serve as a proof of concept for understanding the geometry of high-dimensional data. We start by building a rescaled multivariate Bernoulli model (stretched hypercube) so as to create several overlapping grouping structures in the data. The size of each scaling parameter is related to the likelihood of uncovering the corresponding grouping by random 1D projection. Clusters in the original space are then created by adding noise to this cluster-free model. In high dimension, these clusters would hardly be observable given a sample set from the distribution because of the curse of dimensionality, but the binary groupings are clear. Our construction makes it clear that one needs to make a distinction between "groupings" and "clusters" in the original space. It also highlights the need to interpret any clustering found in projected data as merely one among potentially many other groupings in a dataset.
- [39] arXiv:2210.17405 (replaced) [pdf, html, other]
-
Title: Exact and Approximate Conformal Inference for Multi-Output Regression
Comments: 20 pages, 6 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Other Statistics (stat.OT)
It is common in machine learning to estimate a response $y$ given covariate information $x$. However, these predictions alone do not quantify any uncertainty associated with said predictions. One way to overcome this deficiency is with conformal inference methods, which construct a set containing the unobserved response $y$ with a prescribed probability. Unfortunately, even with a one-dimensional response, conformal inference is computationally expensive despite recent encouraging advances. In this paper, we explore multi-output regression, delivering exact derivations of conformal inference $p$-values when the predictive model can be described as a linear function of $y$. Additionally, we propose \texttt{unionCP} and a multivariate extension of \texttt{rootCP} as efficient ways of approximating the conformal prediction region for a wide array of multi-output predictors, both linear and nonlinear, while preserving computational advantages. We also provide both theoretical and empirical evidence of the effectiveness of these methods using both real-world and simulated data.
- [40] arXiv:2310.13393 (replaced) [pdf, html, other]
-
Title: Optimal Best Arm Identification with Fixed Confidence in Restless Bandits
Comments: Accepted to the IEEE Transactions on Information Theory
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
We study best arm identification in a restless multi-armed bandit setting with finitely many arms. The discrete-time data generated by each arm forms a homogeneous Markov chain taking values in a common, finite state space. The state transitions in each arm are captured by an ergodic transition probability matrix (TPM) that is a member of a single-parameter exponential family of TPMs. The real-valued parameters of the arm TPMs are unknown and belong to a given space. Given a function $f$ defined on the common state space of the arms, the goal is to identify the best arm -- the arm with the largest average value of $f$ evaluated under the arm's stationary distribution -- with the fewest number of samples, subject to an upper bound on the decision's error probability (i.e., the fixed-confidence regime). A lower bound on the growth rate of the expected stopping time is established in the asymptote of a vanishing error probability. Furthermore, a policy for best arm identification is proposed, and its expected stopping time is proved to have an asymptotic growth rate that matches the lower bound. It is demonstrated that tracking the long-term behavior of a certain Markov decision process and its state-action visitation proportions are the key ingredients in analyzing the converse and achievability bounds. It is shown that under every policy, the state-action visitation proportions satisfy a specific approximate flow conservation constraint and that these proportions match the optimal proportions dictated by the lower bound under any asymptotically optimal policy. The prior studies on best arm identification in restless bandits focus on independent observations from the arms, rested Markov arms, and restless Markov arms with known arm TPMs. In contrast, this work is the first to study best arm identification in restless bandits with unknown arm TPMs.
- [41] arXiv:2310.14188 (replaced) [pdf, html, other]
-
Title: A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts
Comments: Accepted to ICML 2024, 32 pages, 3 figures, 3 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input before delivering it to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
- [42] arXiv:2401.13875 (replaced) [pdf, other]
-
Title: Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?
Comments: Accepted to ICML 2024, 47 pages, 2 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to the well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimations are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering it to the softmax function. By imposing linear independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates. Finally, we conduct a simulation study to empirically validate our theoretical results.
- [43] arXiv:2402.01092 (replaced) [pdf, html, other]
-
Title: A Dynamical Model of Neural Scaling Laws
Comments: ICML Camera Ready. Included online SGD section with additional simulations and its connection to large sample limit of our gradient flow theory. Fixed typo in Appendix eq 112
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late time exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
- [44] arXiv:2402.02952 (replaced) [pdf, html, other]
-
Title: On Least Square Estimation in Softmax Gating Mixture of Experts
Comments: Accepted to ICML 2024, 29 pages, 2 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprisingly slow estimation rate. Our findings have important practical implications for expert selection.
- [45] arXiv:2402.05220 (replaced) [pdf, html, other]
-
Title: On Parameter Estimation in Deviated Gaussian Mixture of Experts
Comments: Accepted to AISTATS 2024, 32 pages, 2 figures, 1 table
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y| X)+ \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is the true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k_{\ast}$ are unknown parameters of the Gaussian mixture of experts. This problem arises from the goodness-of-fit test when we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein, a loss function commonly used for estimating parameters in the Gaussian mixture of experts.
- [46] arXiv:2402.05271 (replaced) [pdf, html, other]
-
Title: Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
- [47] arXiv:2402.10429 (replaced) [pdf, html, other]
-
Title: Fixed Confidence Best Arm Identification in the Bayesian Setting
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider the fixed-confidence best arm identification (FC-BAI) problem in the Bayesian setting. This problem aims to find the arm of the largest mean with a fixed confidence level when the bandit model has been sampled from the known prior. Most studies on the FC-BAI problem have been conducted in the frequentist setting, where the bandit model is predetermined before the game starts. We show that the traditional FC-BAI algorithms studied in the frequentist setting, such as track-and-stop and top-two algorithms, result in arbitrarily suboptimal performances in the Bayesian setting. We also obtain a lower bound of the expected number of samples in the Bayesian setting and introduce a variant of successive elimination that has a matching performance with the lower bound up to a logarithmic factor. Simulations verify the theoretical results.
- [48] arXiv:2403.00423 (replaced) [pdf, html, other]
-
Title: Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. As a consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to mitigate this problem. As the generative probability distribution for the simulation of synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, casting doubt on the calibration diagnostic. This study explores various facets of this problem, and shows that some statistics are excessively sensitive to the choice of generative distribution to be used for validation when the generative distribution is unknown. This is the case, for instance, for the correlation coefficient between absolute errors and uncertainties (CC) and for the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.
- [49] arXiv:2403.07454 (replaced) [pdf, html, other]
-
Title: Fast, accurate and lightweight sequential simulation-based inference using Gaussian locally linear mappings
Comments: 69 pages, 66 figures: new case study added (Biological model of the translation kinetics after mRNA transfection)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian inference for complex models with an intractable likelihood can be tackled using algorithms performing many calls to computer simulators. These approaches are collectively known as "simulation-based inference" (SBI). Recent SBI methods have made use of neural networks (NN) to provide approximate, yet expressive constructs for the unavailable likelihood function and the posterior distribution. However, the trade-off between accuracy and computational demand leaves much space for improvement. In this work, we propose an alternative that provides both approximations to the likelihood and the posterior distribution, using structured mixtures of probability distributions. Our approach produces accurate posterior inference when compared to state-of-the-art NN-based SBI methods, even for multimodal posteriors, while exhibiting a much smaller computational footprint. We illustrate our results on several benchmark models from the SBI literature and on a biological model of the translation kinetics after mRNA transfection.
- [50] arXiv:2405.00592 (replaced) [pdf, html, other]
-
Title: Scaling and renormalization in high-dimensional regression
Comments: 68 pages, 17 figures
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models using the basic tools of random matrix theory and free probability. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning. Analytic formulas for the training and generalization errors are obtained in a few lines of algebra directly from the properties of the $S$-transform of free probability. This allows for a straightforward identification of the sources of power-law scaling in model performance. We compute the generalization error of a broad class of random feature models. We find that in all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. These novel results allow us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
- [51] arXiv:2406.12560 (replaced) [pdf, html, other]
-
Title: Towards Bayesian Data Selection
Comments: 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
- [52] arXiv:2406.13154 (replaced) [pdf, html, other]
-
Title: Conditional score-based diffusion models for solving inverse problems in mechanics
Authors: Agnimitra Dasgupta, Harisankar Ramaswamy, Javier Murgoitio Esandi, Ken Foo, Runze Li, Qifa Zhou, Brendan Kennedy, Assad Oberai
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
- [53] arXiv:2406.14003 (replaced) [pdf, html, other]
-
Title: Deep Optimal Experimental Design for Parameter Estimation Problems
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Optimal experimental design is a well studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques are changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations.
- [54] arXiv:1911.04872 (replaced) [pdf, html, other]
-
Title: Two Ridge Solutions for the Incremental Broad Learning System on Added Nodes
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The original Broad Learning System (BLS) on newly added nodes and its existing efficient implementation both assume the ridge parameter lambda -> 0 in the ridge inverse to approximate the generalized inverse, and compute the generalized inverse solution for the output weights. In this paper, we propose two ridge solutions for the output weights in the BLS on added nodes, where lambda -> 0 is no longer assumed, and lambda can be any positive real number. One of the proposed ridge solutions computes the output weights from the inverse Cholesky factor, which is updated efficiently by extending the existing inverse Cholesky factorization. The other proposed ridge solution computes the output weights from the ridge inverse, and updates the ridge inverse by extending Greville's method, a classical tool for computing the generalized inverse of partitioned matrices. For the proposed efficient ridge solution based on the inverse Cholesky factor, we also develop another implementation that is numerically more stable when the ridge parameter lambda is very small. The proposed ridge solution based on the ridge inverse and the numerically more stable implementation of the proposed efficient ridge solution require the same complexity as the original BLS and the existing efficient BLS, respectively. Moreover, the speedups of the proposed efficient ridge solution over the original BLS and the existing efficient BLS are 3 and more than 1.67, respectively, when the computational complexities for each update are compared, and the speedups are 1.99 - 2.52 and 1.31 - 1.58, respectively, when the total training time is compared by numerical experiments. On the other hand, our numerical experiments show that both proposed ridge solutions for BLS achieve better testing accuracies than the original BLS and the existing efficient BLS.
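As a rough illustration of the premise in this abstract, the sketch below compares a ridge inverse with the generalized (Moore-Penrose) inverse for shrinking lambda. It assumes the standard form (A^T A + lambda I)^{-1} A^T for the ridge inverse; the function name `ridge_inverse`, the random matrix `A`, and the chosen lambda values are illustrative assumptions, and none of this reproduces the paper's incremental Cholesky or Greville-style updates.

```python
import numpy as np


def ridge_inverse(A, lam):
    """Ridge inverse (A^T A + lam * I)^{-1} A^T (assumed standard form)."""
    n_cols = A.shape[1]
    # Solve the regularized normal equations instead of forming an explicit inverse.
    return np.linalg.solve(A.T @ A + lam * np.eye(n_cols), A.T)


rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))   # hypothetical matrix of node outputs
A_pinv = np.linalg.pinv(A)         # generalized (Moore-Penrose) inverse

# Frobenius-norm gap between the ridge inverse and the generalized inverse.
gaps = {lam: np.linalg.norm(ridge_inverse(A, lam) - A_pinv)
        for lam in (1.0, 1e-3, 1e-8)}
for lam, gap in gaps.items():
    print(f"lambda={lam:g}  gap={gap:.3e}")
```

The gap shrinks as lambda decreases, which is the approximation the original BLS relies on; for a lambda that is not small, the two inverses differ noticeably, which is the regime the proposed ridge solutions target.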
- [55] arXiv:2002.01987 (replaced) [pdf, html, other]
-
Title: Function approximation by neural nets in the mean-field regime: Entropic regularization and controlled McKean-Vlasov dynamics
Comments: 30 pages; note the change of title
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
We consider the problem of function approximation by two-layer neural nets with random weights that are "nearly Gaussian" in the sense of Kullback-Leibler divergence. Our setting is the mean-field limit, where the finite population of neurons in the hidden layer is replaced by a continuous ensemble. We show that the problem can be phrased as global minimization of a free energy functional on the space of (finite-length) paths over probability measures on the weights. This functional trades off the $L^2$ approximation risk of the terminal measure against the KL divergence of the path with respect to an isotropic Brownian motion prior. We characterize the unique global minimizer and examine the dynamics in the space of probability measures over weights that can achieve it. In particular, we show that the optimal path-space measure corresponds to the Föllmer drift, the solution to a McKean-Vlasov optimal control problem closely related to the classic Schrödinger bridge problem. While the Föllmer drift cannot in general be obtained in closed form, thus limiting its potential algorithmic utility, we illustrate the viability of the mean-field Langevin diffusion as a finite-time approximation under various conditions on entropic regularization. Specifically, we show that it closely tracks the Föllmer drift when the regularization is such that the minimizing density is log-concave.
- [56] arXiv:2009.03527 (replaced) [pdf, html, other]
-
Title: Approximate Multiplication of Sparse Matrices with Limited Space
Comments: v2 matches the camera-ready version for AAAI2021 better
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Approximate matrix multiplication with limited space has received ever-increasing attention due to the emergence of large-scale applications. Recently, building on frequent directions, a popular matrix sketching algorithm, previous work introduced co-occurring directions (COD) to reduce the approximation error for this problem. Although it enjoys a space complexity of $O((m_x+m_y)\ell)$ for two input matrices $X\in\mathbb{R}^{m_x\times n}$ and $Y\in\mathbb{R}^{m_y\times n}$, where $\ell$ is the sketch size, its time complexity is $O\left(n(m_x+m_y+\ell)\ell\right)$, which is still very high for large input matrices. In this paper, we propose to reduce the time complexity by exploiting the sparsity of the input matrices. The key idea is to employ an approximate singular value decomposition (SVD) method that can exploit the sparsity, in order to reduce the number of QR decompositions required by COD. In this way, we develop sparse co-occurring directions, which reduces the time complexity to $\widetilde{O}\left((\mathrm{nnz}(X)+\mathrm{nnz}(Y))\ell+n\ell^2\right)$ in expectation while keeping the same space complexity of $O((m_x+m_y)\ell)$, where $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$ and the $\widetilde{O}$ notation hides constant factors as well as polylogarithmic factors. Theoretical analysis reveals that the approximation error of our algorithm is almost the same as that of COD. Furthermore, we empirically verify the efficiency and effectiveness of our algorithm.
- [57] arXiv:2302.12111 (replaced) [pdf, html, other]
-
Title: Communication-Efficient Distributed Estimation and Inference for Cox's Model
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP); Machine Learning (stat.ML)
Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.
- [58] arXiv:2303.00890 (replaced) [pdf, html, other]
-
Title: Comparison of High-Dimensional Bayesian Optimization Algorithms on BBOB
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Bayesian Optimization (BO) is a class of black-box, surrogate-based heuristics that can efficiently optimize problems that are expensive to evaluate, and hence admit only small evaluation budgets. BO is particularly popular for solving numerical optimization problems in industry, where the evaluation of objective functions often relies on time-consuming simulations or physical experiments. However, many industrial problems depend on a large number of parameters. This poses a challenge for BO algorithms, whose performance is often reported to suffer when the dimension grows beyond 15 variables. Although many new algorithms have been proposed to address this problem, it is not well understood which one is the best for which optimization scenario.
In this work, we compare five state-of-the-art high-dimensional BO algorithms with vanilla BO and CMA-ES on the 24 BBOB functions of the COCO environment at increasing dimensionality, ranging from 10 to 60 variables. Our results confirm the superiority of BO over CMA-ES for limited evaluation budgets and suggest that the most promising approach to improve BO is the use of trust regions. However, we also observe significant performance differences for different function landscapes and budget exploitation phases, indicating improvement potential, e.g., through hybridization of algorithmic components.
- [59] arXiv:2306.03066 (replaced) [pdf, html, other]
-
Title: Of Mice and Mates: Automated Classification and Modelling of Mouse Behaviour in Groups using a Single Model across Cages
Comments: International Journal of Computer Vision (2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Behavioural experiments often happen in specialised arenas, but this may confound the analysis. To address this issue, we provide tools to study mice in the home-cage environment, equipping biologists with the possibility to capture the temporal aspect of the individual's behaviour and model the interaction and interdependence between cage-mates with minimal human intervention. Our main contribution is the novel Group Behaviour Model (GBM) which summarises the joint behaviour of groups of mice across cages, using a permutation matrix to match the mouse identities in each cage to the model. In support of the above, we also (a) developed the Activity Labelling Module (ALM) to automatically classify mouse behaviour from video, and (b) released two datasets, ABODe for training behaviour classifiers and IMADGE for modelling behaviour.
- [60] arXiv:2307.12797 (replaced) [pdf, html, other]
-
Title: Causal Fair Machine Learning via Rank-Preserving Interventional Distributions
Journal-ref: Proceedings of the 1st Workshop on Fairness and Bias in AI co-located with 26th European Conference on Artificial Intelligence (ECAI 2023), CEUR Workshop Proceedings, https://ceur-ws.org/Vol-3523/
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
A decision can be defined as fair if equal individuals are treated equally and unequals unequally. Adopting this definition, the task of designing machine learning (ML) models that mitigate unfairness in automated decision-making systems must include causal thinking when introducing protected attributes: Following a recent proposal, we define individuals as being normatively equal if they are equal in a fictitious, normatively desired (FiND) world, where the protected attributes have no (direct or indirect) causal effect on the target. We propose rank-preserving interventional distributions to define a specific FiND world in which this holds and a warping method for estimation. Evaluation criteria for both the method and the resulting ML model are presented and validated through simulations. Experiments on empirical data showcase the practical application of our method and compare results with "fairadapt" (Plečko and Meinshausen, 2020), a different approach for mitigating unfairness by causally preprocessing data that uses quantile regression forests. With this, we show that our warping approach effectively identifies the most discriminated individuals and mitigates unfairness.
- [61] arXiv:2310.06312 (replaced) [pdf, html, other]
-
Title: Discovering Mixtures of Structural Causal Models from Time Series Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Discovering causal relationships from time series data is significant in fields such as finance, climate science, and neuroscience. However, contemporary techniques rely on the simplifying assumption that data originates from the same causal model, while in practice, data is heterogeneous and can stem from different causal models. In this work, we relax this assumption and perform causal discovery from time series data originating from a mixture of causal models. We propose a general variational inference-based framework called MCD to infer the underlying causal models as well as the mixing probability of each sample. Our approach employs an end-to-end training process that maximizes an evidence-lower bound for the data likelihood. We present two variants: MCD-Linear for linear relationships and independent noise, and MCD-Nonlinear for nonlinear causal relationships and history-dependent noise. We demonstrate that our method surpasses state-of-the-art benchmarks in causal discovery tasks through extensive experimentation on synthetic and real-world datasets, particularly when the data emanates from diverse underlying causal graphs. Theoretically, we prove the identifiability of such a model under some mild assumptions.
- [62] arXiv:2310.17638 (replaced) [pdf, html, other]
-
Title: Generative Fractional Diffusion Models
Authors: Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Rembert Daems, Roderick Murray-Smith, Shinichi Nakajima, Sebastian Lapuschkin, Stefano Ermon, Tolga Birdal, Manfred Opper, Christoph Knochenhauer, Luis Oala, Wojciech Samek
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce the first continuous-time score-based generative model that leverages fractional diffusion processes for its underlying dynamics. Although diffusion models have excelled at capturing data distributions, they still suffer from various limitations such as slow convergence, mode-collapse on imbalanced data, and lack of diversity. These issues are partially linked to the use of light-tailed Brownian motion (BM) with independent increments. In this paper, we replace BM with an approximation of its non-Markovian counterpart, fractional Brownian motion (fBM), characterized by correlated increments and Hurst index $H \in (0,1)$, where $H=1/2$ recovers the classical BM. To ensure tractable inference and learning, we employ a recently popularized Markov approximation of fBM (MA-fBM) and derive its reverse time model, resulting in generative fractional diffusion models (GFDMs). We characterize the forward dynamics using a continuous reparameterization trick and propose an augmented score matching loss to efficiently learn the score-function, which is partly known in closed form, at minimal added cost. The ability to drive our diffusion model via fBM provides flexibility and control. $H \leq 1/2$ enters the regime of rough paths whereas $H>1/2$ regularizes diffusion paths and invokes long-term memory as well as a heavy-tailed behaviour (super-diffusion). The Markov approximation allows added control by varying the number of Markov processes linearly combined to approximate fBM. Our evaluations on real image datasets demonstrate that GFDM achieves greater pixel-wise diversity and enhanced image quality, as indicated by a lower FID, offering a promising alternative to traditional diffusion models.
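For readers unfamiliar with the Hurst index, the standard covariance of fractional Brownian motion makes the statement that $H=1/2$ recovers classical BM concrete. The notation $B^H_t$ is an assumption of this note rather than the paper's own, and the formula is the textbook definition, not the Markov approximation (MA-fBM) used in the paper.

```latex
% Covariance of fractional Brownian motion with Hurst index H in (0,1), for s, t >= 0:
\mathbb{E}\!\left[B^H_t B^H_s\right]
  = \tfrac{1}{2}\left(t^{2H} + s^{2H} - |t-s|^{2H}\right).

% Setting H = 1/2 gives the Brownian covariance, hence independent increments:
\mathbb{E}\!\left[B^{1/2}_t B^{1/2}_s\right]
  = \tfrac{1}{2}\left(t + s - |t-s|\right) = \min(s,t).
```

For any $H \neq 1/2$ the increments over disjoint intervals are correlated, which is the non-Markovian property the abstract refers to.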
- [63] arXiv:2311.05009 (replaced) [pdf, html, other]
-
Title: Consensus-based construction of high-dimensional free energy surface
Subjects: Computational Physics (physics.comp-ph); Numerical Analysis (math.NA); Machine Learning (stat.ML)
One essential problem in quantifying the collective behaviors of molecular systems lies in the accurate construction of free energy surfaces (FESs). The main challenges arise from the prevalence of energy barriers and the high dimensionality. Existing approaches are often based on sophisticated enhanced sampling methods to establish efficient exploration of the full-phase space. On the other hand, the collection of optimal sample points for the numerical approximation of FESs remains largely under-explored, where the discretization error could become dominant for systems with a large number of collective variables (CVs). We propose a consensus sampling-based approach by reformulating the construction as a minimax problem which simultaneously optimizes the function representation and the training set. In particular, the maximization step establishes a stochastic interacting particle system to achieve the adaptive sampling of the max-residue regime by modulating the exploitation of the Laplace approximation of the current loss function and the exploration of the uncharted phase space; the minimization step updates the FES approximation with the new training set. By iteratively solving the minimax problem, the present method essentially achieves an adversarial learning of the FESs with unified tasks for both phase space exploration and posterior error-enhanced sampling. We demonstrate the method by constructing the FESs of molecular systems with a number of CVs up to 30.
- [64] arXiv:2403.07379 (replaced) [pdf, html, other]
-
Title: Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
Comments: Preprint, 57 pages
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which hallmark the directional nature of optimization in neural networks: when there is redundancy, and when there is exploration. We use them to reveal the inherent nuance and interplay between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale on regularizing the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity of Q,K,V dynamics in the middle attention layers in LLMs, which is homogenized by scale. Importantly, we put the significant directional redundancy observed to the test by demonstrating that training only the scalar batchnorm parameters from some point into training matches the performance of training the entire network, which thus exhibits the potential of hybrid optimization schemes that are geared towards efficiency.
- [65] arXiv:2403.16369 (replaced) [pdf, html, other]
-
Title: Learning Action-based Representations Using Invariance
Comments: Published at the Reinforcement Learning Conference 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogenous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this, we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.
- [66] arXiv:2405.18979 (replaced) [pdf, html, other]
-
Title: MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts
Comments: The three first authors contributed equally
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Leveraging a model's outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without requiring access to the corresponding ground truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias, especially under natural shift. In this work, we first study the relationship between logits and generalization performance from the perspective of the low-density separation assumption. Our findings motivate our proposed method MaNo, which (1) applies a data-dependent normalization on the logits to reduce prediction bias, and (2) takes the $L_p$ norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the provided score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts.
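The two steps named in this abstract (normalize the logits, then take an $L_p$ norm of the resulting matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the data-dependent normalization, so a plain row-wise softmax stands in for it here, and the function name `norm_score`, the default `p=4`, and the division by the number of matrix entries are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Row-wise softmax, used here as a stand-in normalization (assumption)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def norm_score(logits, p=4):
    """Entrywise L_p norm of the normalized logit matrix.

    Dividing by the number of entries before taking the p-th root is an
    illustrative scaling choice, not something stated in the abstract.
    """
    normalized = [softmax(row) for row in logits]
    n_entries = sum(len(row) for row in normalized)
    total = sum(abs(v) ** p for row in normalized for v in row)
    return (total / n_entries) ** (1.0 / p)


# Hypothetical logits for two samples and three classes.
confident = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
uncertain = [[0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]

print(norm_score(confident), norm_score(uncertain))
```

Under these assumptions the score is higher when predictions are peaked than when they are nearly uniform, which is the kind of link between the score and model uncertainty that the abstract refers to.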
- [67] arXiv:2406.01561 (replaced) [pdf, html, other]
-
Title: Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion-based text-to-image generation models trained on extensive text-image pairs have shown the capacity to generate photorealistic images consistent with textual descriptions. However, a significant limitation of these models is their slow sample generation, which requires iterative refinement through the same network. In this paper, we enhance Score identity Distillation (SiD) by developing long and short classifier-free guidance (LSG) to efficiently distill pretrained Stable Diffusion models without using real training data. SiD aims to optimize a model-based explicit score matching loss, utilizing a score-identity-based approximation alongside the proposed LSG for practical computation. By training exclusively with fake images synthesized with its one-step generator, SiD equipped with LSG rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Specifically, its data-free distillation of Stable Diffusion 1.5 achieves a record low FID of 8.15 on the COCO-2014 validation set, with a CLIP score of 0.304 at an LSG scale of 1.5, and a FID of 9.56 with a CLIP score of 0.313 at an LSG scale of 2. Our SiD-LSG code and distilled one-step text-to-image generators are available at this https URL
- [68] arXiv:2406.08929 (replaced) [pdf, html, other]
-
Title: Step-by-Step Diffusion: An Elementary Tutorial
Comments: 35 pages, 11 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
- [69] arXiv:2406.10719 (replaced) [pdf, html, other]
-
Title: Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach
Comments: (Last update!, a constructive comment from arxiv led to this latest update) Stochastic investment models and a Bayesian approach to better modeling of uncertainty: adversarial machine learning or Stochastic market. arXiv admin note: substantial text overlap with arXiv:2402.05967 (see this link to the paper by: Orson Mengara)
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
With the growing use of voice-activated systems and speech recognition technologies, the danger of backdoor attacks on audio data has grown significantly. This research looks at a specific type of attack, known as a stochastic investment-based backdoor attack (MarketBack), in which adversaries strategically manipulate the stylistic properties of audio to fool speech recognition systems. Backdoor attacks seriously threaten the security and integrity of machine learning models. To maintain the reliability of audio applications and systems, identifying such attacks is therefore crucial in the context of audio data. Experimental results demonstrate that MarketBack can achieve an average attack success rate close to 100% across seven victim models when poisoning less than 1% of the training data.
- [70] arXiv:2406.14469 (replaced) [pdf, html, other]
-
Title: Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
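One plausible reading of "fusing" a movement prediction with a naive forecast is to shift the last observed value by a signed step. The sketch below illustrates that reading only; the abstract does not give the fusion formula, so the functions `fused_forecast` and `mean_abs_errors`, the fixed `step_size`, and the toy series are all hypothetical.

```python
def fused_forecast(last_value, direction, step_size):
    """Naive forecast (the last value) shifted by a signed step.

    direction is +1 (up) or -1 (down), e.g. from a binary classifier.
    step_size is an assumed typical absolute change; how the paper
    chooses it is not stated in the abstract.
    """
    if direction not in (-1, 1):
        raise ValueError("direction must be +1 or -1")
    return last_value + direction * step_size


def mean_abs_errors(series, directions, step_size):
    """Return (naive_mae, fused_mae) over one-step-ahead forecasts."""
    naive_err = 0.0
    fused_err = 0.0
    for last, actual, d in zip(series, series[1:], directions):
        naive_err += abs(actual - last)
        fused_err += abs(actual - fused_forecast(last, d, step_size))
    n = len(series) - 1
    return naive_err / n, fused_err / n


# Toy series; three of the four predicted directions are correct.
series = [100.0, 101.0, 100.0, 102.0, 103.0]
directions = [1, -1, 1, -1]

naive_mae, fused_mae = mean_abs_errors(series, directions, step_size=1.0)
print(naive_mae, fused_mae)
```

On this toy series the fused forecast has a lower mean absolute error than the naive one, but that depends on the made-up data and the chosen step size; it is not evidence for the paper's results.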