Methodology
Showing new listings for Friday, 27 September 2024
- [1] arXiv:2409.17195 [pdf, html, other]
Title: When Sensitivity Bias Varies Across Subgroups: The Impact of Non-uniform Polarity in List Experiments
Subjects: Methodology (stat.ME)
Survey researchers face the problem of sensitivity bias: since people are reluctant to reveal socially undesirable or otherwise risky traits, aggregate estimates of these traits will be biased. List experiments offer a solution by conferring respondents greater privacy. However, little is known about how list experiments fare when sensitivity bias varies across respondent subgroups. For example, a trait that is socially undesirable to one group may be socially desirable in a second group, leading sensitivity bias to be negative in the first group, while it is positive in the second. Or a trait may be not sensitive in one group, leading sensitivity bias to be zero in one group and non-zero in another. We use Monte Carlo simulations to explore what happens when the polarity (sign) of sensitivity bias is non-uniform. We find that a general diagnostic test yields false positives and that commonly used estimators return biased estimates of the prevalence of the sensitive trait, coefficients of covariates, and sensitivity bias itself. The bias is worse when polarity runs in opposite directions across subgroups, and as the difference in subgroup sizes increases. Significantly, non-uniform polarity could explain why some list experiments appear to 'fail'. By defining and systematically investigating the problem of non-uniform polarity, we hope to save some studies from the file-drawer and provide some guidance for future research.
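The idea of non-uniform polarity can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the function name `simulate_bias`, the two equally sized groups, the three baseline list items, and all parameter values are assumptions chosen for illustration. It estimates sensitivity bias as the direct-question mean minus the list-experiment difference-in-means, and shows that opposite-signed subgroup biases can cancel in the aggregate.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def simulate_bias(n=200_000, prev=0.4, bias_a=-0.15, bias_b=0.15, seed=1):
    """Estimate sensitivity bias overall and within two hypothetical groups."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = "A" if rng.random() < 0.5 else "B"
        holds = rng.random() < prev              # true (hidden) trait
        bias = bias_a if group == "A" else bias_b
        if holds:
            # Negative polarity: some trait holders deny it when asked directly.
            direct = rng.random() >= max(-bias, 0.0) / prev
        else:
            # Positive polarity: some non-holders claim the trait.
            direct = rng.random() < max(bias, 0.0) / (1 - prev)
        treated = rng.random() < 0.5             # list-experiment arm
        baseline = sum(rng.random() < 0.5 for _ in range(3))
        listed = baseline + (int(holds) if treated else 0)
        rows.append((group, treated, int(direct), listed))

    def estimate(sub):
        treated_counts = [r[3] for r in sub if r[1]]
        control_counts = [r[3] for r in sub if not r[1]]
        list_prevalence = mean(treated_counts) - mean(control_counts)
        return mean([r[2] for r in sub]) - list_prevalence

    return {
        "all": estimate(rows),
        "A": estimate([r for r in rows if r[0] == "A"]),
        "B": estimate([r for r in rows if r[0] == "B"]),
    }

if __name__ == "__main__":
    print(simulate_bias())
```

With these assumed settings the aggregate estimate sits near zero while each subgroup shows a bias of roughly 0.15 in magnitude, which is one way a list experiment could appear to find nothing.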
- [2] arXiv:2409.17298 [pdf, html, other]
Title: Sparsity, Regularization and Causality in Agricultural Yield: The Case of Paddy Rice in Peru
Authors: Rita Rocio Guzman-Lopez, Luis Huamanchumo, Kevin Fernandez, Oscar Cutipa-Luque, Yhon Tiahuallpa, Helder Rojas
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
This study introduces a novel approach that integrates agricultural census data with remotely sensed time series to develop precise predictive models for paddy rice yield across various regions of Peru. By utilizing sparse regression and Elastic-Net regularization techniques, the study identifies causal relationships between key remotely sensed variables (such as NDVI, precipitation, and temperature) and agricultural yield. To further enhance prediction accuracy, the first- and second-order dynamic transformations (velocity and acceleration) of these variables are applied, capturing non-linear patterns and delayed effects on yield. The findings highlight the improved predictive performance when combining regularization techniques with climatic and geospatial variables, enabling more precise forecasts of yield variability. The results confirm the existence of causal relationships in the Granger sense, emphasizing the value of this methodology for strategic agricultural management. This contributes to more efficient and sustainable production in paddy rice cultivation.
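One plausible reading of the "velocity" and "acceleration" transformations is first and second differences of each time series. The sketch below assumes that reading; the helper names `differences` and `dynamic_features` are hypothetical, and the paper may define the transformations differently. The regression step (Elastic-Net) is not shown.

```python
def differences(series):
    """First differences: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def dynamic_features(series):
    """Return (level, velocity, acceleration) triples aligned on the same time index.

    The first two time points are dropped because the second difference
    is undefined there.
    """
    velocity = differences(series)          # velocity[t-1] = x[t] - x[t-1]
    acceleration = differences(velocity)    # acceleration[t-2] = v[t] - v[t-1]
    return [
        (series[t], velocity[t - 1], acceleration[t - 2])
        for t in range(2, len(series))
    ]

if __name__ == "__main__":
    # Hypothetical index values at four consecutive dates.
    print(dynamic_features([1, 2, 4, 7]))
```

Each triple could then be used as a row of predictors alongside the other variables.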
- [3] arXiv:2409.17404 [pdf, html, other]
Title: Bayesian Covariate-Dependent Graph Learning with a Dual Group Spike-and-Slab Prior
Subjects: Methodology (stat.ME)
Covariate-dependent graph learning has gained increasing interest in the graphical modeling literature for the analysis of heterogeneous data. This task, however, poses challenges to modeling, computational efficiency, and interpretability. The parameter of interest can be naturally represented as a three-dimensional array with elements that can be grouped according to two directions, corresponding to node level and covariate level, respectively. In this article, we propose a novel dual group spike-and-slab prior that enables multi-level selection at covariate-level and node-level, as well as individual (local) level sparsity. We introduce a nested strategy with specific choices to address distinct challenges posed by the various grouping directions. For posterior inference, we develop a tuning-free Gibbs sampler for all parameters, which mitigates the difficulties of parameter tuning often encountered in high-dimensional graphical models and facilitates routine implementation. Through simulation studies, we demonstrate that the proposed model outperforms existing methods in its accuracy of graph recovery. We show the practical utility of our model via an application to microbiome data where we seek to better understand the interactions among microbes as well as how these are affected by relevant covariates.
- [4] arXiv:2409.17441 [pdf, html, other]
Title: Factor pre-training in Bayesian multivariate logistic models
Subjects: Methodology (stat.ME); Computation (stat.CO)
This article focuses on inference in logistic regression for high-dimensional binary outcomes. A popular approach induces dependence across the outcomes by including latent factors in the linear predictor. Bayesian approaches are useful for characterizing uncertainty in inferring the regression coefficients, factors and loadings, while also incorporating hierarchical and shrinkage structure. However, Markov chain Monte Carlo algorithms for posterior computation face challenges in scaling to high-dimensional outcomes. Motivated by applications in ecology, we exploit a blessing of dimensionality to motivate pre-estimation of the latent factors. Conditionally on the factors, the outcomes are modeled via independent logistic regressions. We implement Gaussian approximations in parallel in inferring the posterior on the regression coefficients and loadings, including a simple adjustment to obtain credible intervals with valid frequentist coverage. We show posterior concentration properties and excellent empirical performance in simulations. The methods are applied to insect biodiversity data in Madagascar.
- [5] arXiv:2409.17631 [pdf, html, other]
Title: Invariant Coordinate Selection and Fisher discriminant subspace beyond the case of two groups
Subjects: Methodology (stat.ME)
Invariant Coordinate Selection (ICS) is a multivariate technique that relies on the simultaneous diagonalization of two scatter matrices. It serves various purposes, including its use as a dimension reduction tool prior to clustering or outlier detection. Unlike methods such as Principal Component Analysis, ICS has a theoretical foundation that explains why and when the identified subspace should contain relevant information. These general results have been examined in detail primarily for specific scatter combinations within a two-cluster framework. In this study, we expand these investigations to include more clusters and scatter combinations. The case of three clusters in particular is studied at length. Based on these expanded theoretical insights and supported by numerical studies, we conclude that ICS is indeed suitable for recovering Fisher's discriminant subspace under very general settings and cases of failure seem rare.
- [6] arXiv:2409.17706 [pdf, html, other]
Title: Stationarity of Manifold Time Series
Subjects: Methodology (stat.ME)
In modern interdisciplinary research, manifold time series data have been garnering more attention. A critical question in analyzing such data is "stationarity", which reflects the underlying dynamic behavior and is crucial across various fields like cell biology, neuroscience and empirical finance. Yet, there has been an absence of a formal definition of stationarity that is tailored to manifold time series. This work bridges this gap by proposing the first definitions of first-order and second-order stationarity for manifold time series. Additionally, we develop novel statistical procedures to test the stationarity of manifold time series and study their asymptotic properties. Our methods account for the curved nature of manifolds, leading to a more intricate analysis than that in Euclidean space. The effectiveness of our methods is evaluated through numerical simulations and their practical merits are demonstrated through analyzing a cell-type proportion time series dataset from a paper recently published in Cell. The first-order stationarity test result aligns with the biological findings of this paper, while the second-order stationarity test provides numerical support for a critical assumption made therein.
- [7] arXiv:2409.17751 [pdf, html, other]
Title: Granger Causality for Mixed Time Series Generalized Linear Models: A Case Study on Multimodal Brain Connectivity
Comments: Paper submitted for publication
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
This paper is motivated by studies in neuroscience experiments to understand interactions between nodes in a brain network using different data modalities that capture distinct facets of brain activity. To assess Granger-causality, we introduce a flexible framework through a general class of models that accommodates mixed types of data (binary, count, continuous, and positive components) formulated in a generalized linear model (GLM) fashion. Statistical inference for causality is performed based on both frequentist and Bayesian approaches, with a focus on the latter. Here, we develop a procedure for conducting inference through the proposed Bayesian mixed time series model. By introducing spike and slab priors for some parameters in the model, our inferential approach guides causality order selection and provides proper uncertainty quantification. The proposed methods are then utilized to study the rat spike train and local field potentials (LFP) data recorded during the olfaction working memory task. The proposed methodology provides critical insights into the causal relationship between the rat spiking activity and LFP spectral power. Specifically, power in the LFP beta band is predictive of spiking activity 300 milliseconds later, providing a novel analytical tool for this area of emerging interest in neuroscience and demonstrating its usefulness and flexibility in the study of causality in general.
- [8] arXiv:2409.17968 [pdf, html, other]
Title: Nonparametric Inference Framework for Time-dependent Epidemic Models
Subjects: Methodology (stat.ME)
Compartmental models, especially the Susceptible-Infected-Removed (SIR) model, have long been used to understand the behaviour of various diseases. Allowing parameters, such as the transmission rate, to be time-dependent functions makes it possible to adjust for and make inferences about changes in the process due to mitigation strategies or evolutionary changes of the infectious agent. In this article, we attempt to build a nonparametric inference framework for stochastic SIR models with time dependent infection rate. The framework includes three main steps: likelihood approximation, parameter estimation and confidence interval construction. The likelihood function of the stochastic SIR model, which is often intractable, can be approximated using methods such as diffusion approximation or tau leaping. The infection rate is modelled by a B-spline basis whose knot location and number of knots are determined by a fast knot placement method followed by a criterion-based model selection procedure. Finally, a point-wise confidence interval is built using a parametric bootstrap procedure. The performance of the framework is observed through various settings for different epidemic patterns. The model is then applied to the Ontario COVID-19 data across multiple waves.
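The tau-leaping approximation mentioned in the abstract can be sketched for a stochastic SIR model with a time-dependent infection rate. This is a minimal illustration under assumptions, not the authors' framework: the function names, the step-function infection rate standing in for a B-spline, the recovery rate, and the step size `tau` are all hypothetical choices. Likelihood approximation, knot selection, and the bootstrap are not shown.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's multiplication sampler; adequate for the small means used here."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def tau_leap_sir(beta, gamma, s0, i0, r0, t_end, tau=0.1, seed=0):
    """Simulate one SIR path; beta is a function of time, gamma a constant."""
    rng = random.Random(seed)
    n = s0 + i0 + r0
    s, i, r, t = s0, i0, r0, 0.0
    path = [(t, s, i, r)]
    while t < t_end and i > 0:
        # Event counts over [t, t + tau) are treated as Poisson with rates frozen at t,
        # then capped so no compartment goes negative.
        infections = min(s, poisson(rng, beta(t) * s * i / n * tau))
        recoveries = min(i, poisson(rng, gamma * i * tau))
        s, i, r = s - infections, i + infections - recoveries, r + recoveries
        t += tau
        path.append((t, s, i, r))
    return path

if __name__ == "__main__":
    # Hypothetical mitigation: the infection rate drops at time 20.
    path = tau_leap_sir(lambda t: 0.5 if t < 20 else 0.2, 0.1, 990, 10, 0, 60)
    print(path[-1])
```

Smaller values of `tau` give a closer approximation to the exact process at the cost of more steps.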
- [9] arXiv:2409.18005 [pdf, html, other]
Title: Collapsible Kernel Machine Regression for Exposomic Analyses
Subjects: Methodology (stat.ME)
An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibility comes at the cost of low power and difficult interpretation, particularly in exposomic analyses when the number of exposures is large. We propose a flexible framework that allows for separate selection of additive and non-additive effects, unifying additive models and kernel machine regression. The proposed approach yields increased power and simpler interpretation when there is little evidence of interaction. Further, it allows users to specify separate priors for additive and non-additive effects, and allows for tests of non-additive interaction. We extend the approach to the class of multiple index models, in which the special case of kernel machine-distributed lag models are nested. We apply the method to motivating data from a subcohort of the Human Early Life Exposome (HELIX) study containing 65 mixture components grouped into 13 distinct exposure classes.
- [10] arXiv:2409.18091 [pdf, html, other]
Title: Incorporating sparse labels into biologging studies using hidden Markov models with weighted likelihoods
Authors: Evan Sidrow, Nancy Heckman, Tess M. McRae, Beth L. Volpov, Andrew W. Trites, Sarah M.E. Fortune, Marie Auger-Méthé
Comments: 25 pages, 11 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Ecologists often use a hidden Markov model to decode a latent process, such as a sequence of an animal's behaviours, from an observed biologging time series. Modern technological devices such as video recorders and drones now allow researchers to directly observe an animal's behaviour. Using these observations as labels of the latent process can improve a hidden Markov model's accuracy when decoding the latent process. However, many wild animals are observed infrequently. Including such rare labels often has a negligible influence on parameter estimates, which in turn does not meaningfully improve the accuracy of the decoded latent process. We introduce a weighted likelihood approach that increases the relative influence of labelled observations. We use this approach to develop two hidden Markov models to decode the foraging behaviour of killer whales (Orcinus orca) off the coast of British Columbia, Canada. Using cross-validated evaluation metrics, we show that our weighted likelihood approach produces more accurate and understandable decoded latent processes compared to existing methods. Thus, our method effectively leverages sparse labels to enhance researchers' ability to accurately decode hidden processes across various fields.
- [11] arXiv:2409.18117 [pdf, html, other]
Title: Formulating the Proxy Pattern-Mixture Model as a Selection Model to Assist with Sensitivity Analysis
Comments: 25 pages, 6 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Proxy pattern-mixture models (PPMM) have previously been proposed as a model-based framework for assessing the potential for nonignorable nonresponse in sample surveys and nonignorable selection in nonprobability samples. One defining feature of the PPMM is the single sensitivity parameter, $\phi$, that ranges from 0 to 1 and governs the degree of departure from ignorability. While this sensitivity parameter is attractive in its simplicity, it may also be of interest to describe departures from ignorability in terms of how the odds of response (or selection) depend on the outcome being measured. In this paper, we re-express the PPMM as a selection model, using the known relationship between pattern-mixture models and selection models, in order to better understand the underlying assumptions of the PPMM and the implied effect of the outcome on nonresponse. The selection model that corresponds to the PPMM is a quadratic function of the survey outcome and proxy variable, and the magnitude of the effect depends on the value of the sensitivity parameter, $\phi$ (missingness/selection mechanism), the differences in the proxy means and standard deviations for the respondent and nonrespondent populations, and the strength of the proxy, $\rho^{(1)}$. Large values of $\phi$ (beyond $0.5$) often result in unrealistic selection mechanisms, and the corresponding selection model can be used to establish more realistic bounds on $\phi$. We illustrate the results using data from the U.S. Census Household Pulse Survey.
New submissions (showing 11 of 11 entries)
- [12] arXiv:2409.17544 (cross-list from stat.ML) [pdf, html, other]
Title: Optimizing the Induced Correlation in Omnibus Joint Graph Embeddings
Comments: 34 pages, 8 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Theoretical and empirical evidence suggests that joint graph embedding algorithms induce correlation across the networks in the embedding space. In the Omnibus joint graph embedding framework, previous results explicitly delineated the dual effects of the algorithm-induced and model-inherent correlations on the correlation across the embedded networks. Accounting for and mitigating the algorithm-induced correlation is key to subsequent inference, as sub-optimal Omnibus matrix constructions have been demonstrated to lead to loss in inference fidelity. This work presents the first efforts to automate the Omnibus construction in order to address two key questions in this joint embedding framework: the correlation-to-OMNI problem and the flat correlation problem. In the flat correlation problem, we seek to understand the minimum algorithm-induced flat correlation (i.e., the same across all graph pairs) produced by a generalized Omnibus embedding. Working in a subspace of the fully general Omnibus matrices, we prove both a lower bound for this flat correlation and that the classical Omnibus construction induces the maximal flat correlation. In the correlation-to-OMNI problem, we present an algorithm -- named corr2Omni -- that, from a given matrix of estimated pairwise graph correlations, estimates the matrix of generalized Omnibus weights that induces optimal correlation in the embedding space. Moreover, in both simulated and real data settings, we demonstrate the increased effectiveness of our corr2Omni algorithm versus the classical Omnibus construction.
- [13] arXiv:2409.17804 (cross-list from stat.ML) [pdf, html, other]
Title: Enriched Functional Tree-Based Classifiers: A Novel Approach Leveraging Derivatives and Geometric Features
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
The positioning of this research falls within the scalar-on-function classification literature, a field of significant interest across various domains, particularly in statistics, mathematics, and computer science. This study introduces an advanced methodology for supervised classification by integrating Functional Data Analysis (FDA) with tree-based ensemble techniques for classifying high-dimensional time series. The proposed framework, Enriched Functional Tree-Based Classifiers (EFTCs), leverages derivative and geometric features, benefiting from the diversity inherent in ensemble methods to further enhance predictive performance and reduce variance. While our approach has been tested on the enrichment of Functional Classification Trees (FCTs), Functional K-NN (FKNN), Functional Random Forest (FRF), Functional XGBoost (FXGB), and Functional LightGBM (FLGBM), it could be extended to other tree-based and non-tree-based classifiers, with appropriate considerations emerging from this investigation. Through extensive experimental evaluations on seven real-world datasets and six simulated scenarios, this proposal demonstrates fascinating improvements over traditional approaches, providing new insights into the application of FDA in complex, high-dimensional learning problems.
- [14] arXiv:2409.18118 (cross-list from cs.CR) [pdf, other]
Title: Slowly Scaling Per-Record Differential Privacy
Authors: Brian Finley, Anthony M Caruso, Justin C Doty, Ashwin Machanavajjhala, Mikaela R Meyer, David Pujol, William Sexton, Zachary Terner
Subjects: Cryptography and Security (cs.CR); Methodology (stat.ME)
We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released.
Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data.
We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
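For context, the general idea of adding noise to a published statistic can be sketched with the textbook Laplace mechanism applied to a clipped sum. This is not one of the slowly scaling per-record mechanisms developed in the paper; the function names, the clipping bound, and the privacy parameter `epsilon` are assumptions for illustration. Clipping is what bounds each record's influence here, and it biases the released sum when records exceed the bound, which is the kind of trade-off the abstract says its mechanisms aim to avoid.

```python
import random

def laplace_noise(rng, scale):
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_sum(values, bound, epsilon, rng):
    """Release a noisy sum after clipping each value to [0, bound].

    Adding or removing one clipped record changes the sum by at most `bound`,
    so Laplace noise with scale bound / epsilon is the standard calibration.
    """
    clipped = [min(max(v, 0.0), bound) for v in values]
    return sum(clipped) + laplace_noise(rng, bound / epsilon)

if __name__ == "__main__":
    rng = random.Random(0)
    payrolls = [10.0, 20.0, 5000.0]   # hypothetical data with one outlying record
    print(private_sum(payrolls, bound=100.0, epsilon=1.0, rng=rng))
```

In this toy example the outlying record is clipped from 5000 to 100, so the release is centred on 130 rather than the true sum of 5030.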
Cross submissions (showing 3 of 3 entries)
- [15] arXiv:2309.02073 (replaced) [pdf, other]
Title: Debiased regression adjustment in completely randomized experiments with moderately high-dimensional covariates
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The completely randomized experiment is the gold standard for causal inference. When the covariate information for each experimental candidate is available, one typical way is to include them in covariate adjustments for more accurate treatment effect estimation. In this paper, we investigate this problem under the randomization-based framework, i.e., that the covariates and potential outcomes of all experimental candidates are assumed to be deterministic quantities and the randomness comes solely from the treatment assignment mechanism. Under this framework, to achieve asymptotically valid inference, existing estimators usually require either (i) that the dimension of covariates $p$ grows at a rate no faster than $O(n^{3 / 4})$ as sample size $n \to \infty$; or (ii) certain sparsity constraints on the linear representations of potential outcomes constructed via possibly high-dimensional covariates. In this paper, we consider the moderately high-dimensional regime where $p$ is allowed to be in the same order of magnitude as $n$. We develop a novel debiased estimator with a corresponding inference procedure and establish its asymptotic normality under mild assumptions. Our estimator is model-free and does not require any sparsity constraint on the potential outcomes' linear representations. We also discuss its asymptotic efficiency improvements over the unadjusted treatment effect estimator under different dimensionality constraints. Numerical analysis confirms that compared to other regression adjustment based treatment effect estimators, our debiased estimator performs well in moderately high dimensions.
- [16] arXiv:2310.13387 (replaced) [pdf, html, other]
Title: Assumption violations in causal discovery and the robustness of score matching
Authors: Francesco Montagna, Atalanti A. Mastakouri, Elias Eulig, Nicoletta Noceti, Lorenzo Rosasco, Dominik Janzing, Bryon Aragam, Francesco Locatello
Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.
- [17] arXiv:2312.10176 (replaced) [pdf, html, other]
Title: Spectral estimation for spatial point processes and random fields
Subjects: Methodology (stat.ME)
Spatial variables can be observed in many different forms, such as regularly sampled random fields (lattice data), point processes, and randomly sampled spatial processes. Joint analysis of such collections of observations is clearly desirable, but complicated by the lack of an easily implementable analysis framework. It is well known that Fourier transforms provide such a framework, but its form has eluded data analysts. We formalize it by providing a multitaper analysis framework using coupled discrete and continuous data tapers, combined with the discrete Fourier transform for inference. Using this set of tools is important, as it forms the backbone for practical spectral analysis. In higher dimensions it is important not to be constrained to Cartesian product domains, and so we develop the methodology for spectral analysis using irregular domain data tapers, and the tapered discrete Fourier transform. We discuss its fast implementation, and the asymptotic as well as large finite domain properties. Estimators of partial association between different spatial processes are provided as are principled methods to determine their significance, and we demonstrate their practical utility on a large-scale ecological dataset.
- [18] arXiv:2401.07400 (replaced) [pdf, html, other]
Title: Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data
Subjects: Methodology (stat.ME)
Investigating the relationship, particularly the lead-lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, due to technical reasons, the time points at which observations are made are not at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pair-wise time series when considering their strength of lead-lag effects with external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods.
- [19] arXiv:2406.07787 (replaced) [pdf, html, other]
Title: A Diagnostic Tool for Functional Causal Discovery
Subjects: Methodology (stat.ME); Applications (stat.AP)
Causal discovery methods aim to determine the causal direction between variables using observational data. Functional causal discovery methods, such as those based on the Linear Non-Gaussian Acyclic Model (LiNGAM), rely on structural and distributional assumptions to infer the causal direction. However, approaches for assessing causal discovery methods' performance as a function of sample size or the impact of assumption violations, inevitable in real-world scenarios, are lacking. To address this need, we propose the Causal Direction Detection Rate (CDDR) diagnostic, which evaluates whether and to what extent the interaction between assumption violations and sample size affects the ability to identify the hypothesized causal direction. Given a bivariate dataset of size N on a pair of variables, X and Y, the CDDR diagnostic is the plotted comparison of the probability of each causal discovery outcome (e.g. X causes Y, Y causes X, or inconclusive) as a function of sample size less than N. We fully develop the CDDR diagnostic in a bivariate case and demonstrate its use for two methods, LiNGAM and our new test-based causal discovery approach. We find the CDDR diagnostic for the test-based approach to be more informative since it uses a richer set of causal discovery outcomes. Under certain assumptions, we prove that the probability estimates of detecting each possible causal discovery outcome are consistent and asymptotically normal. Through simulations, we study the CDDR diagnostic's behavior when linearity and non-Gaussianity assumptions are violated. Additionally, we illustrate the CDDR diagnostic on four real datasets, including three for which the causal direction is known.
- [20] arXiv:2406.18681 (replaced) [pdf, html, other]
Title: Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features
Comments: 32 Pages, 10 Figures
Subjects: Methodology (stat.ME)
This article focuses on drawing computationally efficient predictive inference from Gaussian process (GP) regressions with a large number of features when the response is conditionally independent of the features given the projection to a noisy low dimensional manifold. Bayesian estimation of the regression relationship using Markov Chain Monte Carlo and subsequent predictive inference is computationally prohibitive and may lead to inferential inaccuracies since accurate variable selection is essentially impossible in such high-dimensional GP regressions. As an alternative, this article proposes a strategy to sketch the high-dimensional feature vector with a carefully constructed sketching matrix, before fitting a GP with the scalar outcome and the sketched feature vector to draw predictive inference. The analysis is performed in parallel with many different sketching matrices and smoothing parameters in different processors, and the predictive inferences are combined using Bayesian predictive stacking. Since the posterior predictive distribution in each processor is analytically tractable, the algorithm allows bypassing the robustness issues due to convergence and mixing of MCMC chains, leading to fast implementation with a very large number of features. Simulation studies show superior performance of the proposed approach relative to a wide variety of competitors. The approach outperforms competitors in drawing point prediction with predictive uncertainties of outdoor air pollution from satellite images.
- [21] arXiv:2407.02367 (replaced) [pdf, html, other]
-
Title: Rediscovering Bottom-Up: Effective Forecasting in Temporal Hierarchies
Subjects: Methodology (stat.ME)
Forecast reconciliation has become a prominent topic in recent forecasting literature, with a primary distinction made between cross-sectional and temporal hierarchies. This work focuses on temporal hierarchies, such as aggregating monthly time series data to annual data. We explore the impact of various forecast reconciliation methods on temporally aggregated ARIMA models, thereby bridging the fields of hierarchical forecast reconciliation and temporal aggregation both theoretically and experimentally. Our paper is the first to theoretically examine the effects of temporal hierarchical forecast reconciliation, demonstrating that the optimal method aligns with a bottom-up aggregation approach. To assess the practical implications and performance of the reconciled forecasts, we conduct a series of simulation studies, confirming that the findings extend to more complex models. This result helps explain the strong performance of the bottom-up approach observed in many prior studies. Finally, we apply our methods to real data examples, where we observe similar results.
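For readers unfamiliar with the term, bottom-up reconciliation in a temporal hierarchy simply sums the forecasts at the finest frequency to obtain the coarser levels. A minimal Python sketch follows; the function name `bottom_up` and the block size `k` are illustrative, not taken from the paper.

```python
import numpy as np


def bottom_up(base_forecasts, k):
    """Aggregate high-frequency forecasts by summing consecutive blocks of k.

    With monthly base forecasts, k=12 gives annual and k=3 gives quarterly
    forecasts that are coherent with the monthly ones by construction.
    """
    f = np.asarray(base_forecasts, dtype=float)
    if f.size % k != 0:
        raise ValueError("number of base forecasts must be a multiple of k")
    return f.reshape(-1, k).sum(axis=1)
```

For example, `bottom_up(monthly, 12)` turns 24 monthly forecasts into two annual forecasts.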
- [22] arXiv:2407.20683 (replaced) [pdf, html, other]
-
Title: Online generalizations of the e-BH and BH procedure
Comments: 27 pages, 4 figures
Subjects: Methodology (stat.ME)
In online multiple testing, the hypotheses arrive one by one, and at each time we must immediately reject or accept the current hypothesis solely based on the data and hypotheses observed so far. Many procedures have been proposed, but none of them are online generalizations of the Benjamini-Hochberg (BH) procedure based on p-values, or of the e-BH procedure that uses e-values. In this paper, we consider a relaxed problem setup that allows the current hypothesis to be rejected at any later step. We show that this relaxation allows us to define what we justify extensively to be the natural and appropriate online extension of the BH and e-BH procedures. Analogous to e-BH, online e-BH controls the FDR under arbitrary dependence (even at stopping times). As with e-BH, we show how to boost the power of online e-BH under other dependence assumptions like positive or local dependence. BH and online BH have identical FDR guarantees at fixed times under positive, negative or arbitrary dependence. Further, we prove that online BH has a slightly inflated FDR control at data-adaptive stopping times under weak positive and negative dependence. Based on the same proof techniques, we prove that numerous existing online procedures, which were previously only known to control the FDR at fixed times, also control the FDR at stopping times.
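For reference, the classical offline procedures that the paper generalizes fit in a few lines of Python. The sketch below covers only the offline BH and e-BH rules, not the online extensions proposed in the paper; the function names are illustrative.

```python
import numpy as np


def bh_reject(p_values, alpha):
    """Offline BH: reject the k* smallest p-values, where k* is the
    largest k such that the k-th smallest p-value is <= alpha * k / m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.nonzero(below)[0].max() + 1
        reject[order[:k_star]] = True
    return reject


def ebh_reject(e_values, alpha):
    """Offline e-BH: reject the k* largest e-values, where k* is the
    largest k such that the k-th largest e-value is >= m / (alpha * k)."""
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e)
    above = e[order] >= m / (alpha * np.arange(1, m + 1))
    reject = np.zeros(m, dtype=bool)
    if above.any():
        k_star = np.nonzero(above)[0].max() + 1
        reject[order[:k_star]] = True
    return reject
```

Both rules need all m statistics before any decision is made, which is why they do not carry over directly to the online setting the abstract describes.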
- [23] arXiv:2408.09770 (replaced) [pdf, html, other]
-
Title: Shift-Dispersion Decompositions of Wasserstein and Cramér Distances
Subjects: Methodology (stat.ME); Probability (math.PR); Statistics Theory (math.ST)
Divergence functions are measures of distance or dissimilarity between probability distributions that serve various purposes in statistics and applications. We propose decompositions of Wasserstein and Cramér distances, which compare two distributions by integrating over their differences in distribution or quantile functions, into directed shift and dispersion components. These components are obtained by dividing the differences between the quantile functions into contributions arising from shift and dispersion, respectively. Our decompositions add information on how the distributions differ in a condensed form and consequently enhance the interpretability of the underlying divergences. We show that our decompositions satisfy a number of natural properties and are unique in doing so in location-scale families. The decompositions allow one to derive sensitivities of the divergence measures to changes in location and dispersion, and they give rise to weak stochastic order relations that are linked to the usual stochastic and the dispersive order. Our theoretical developments are illustrated in two applications, where we focus on forecast evaluation of temperature extremes and on the design of probabilistic surveys in economics.
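The two distances being decomposed can be computed from samples as follows. This Python sketch shows only the distances themselves, not the paper's shift and dispersion components. It assumes equal sample sizes for the Wasserstein distance and takes the Cramér distance to be the integral of the squared difference between the two distribution functions, which is one common convention; the function names are illustrative.

```python
import numpy as np


def wasserstein_p(x, y, p=1):
    """Empirical p-Wasserstein distance for two samples of equal size,
    computed from differences between matching order statistics
    (that is, between the empirical quantile functions)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.size != y.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


def cramer_distance(x, y):
    """Integral of the squared difference between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.sort(np.concatenate([x, y]))
    # Both ECDFs are constant between consecutive pooled sample points.
    F = np.searchsorted(x, grid[:-1], side="right") / x.size
    G = np.searchsorted(y, grid[:-1], side="right") / y.size
    return float(np.sum((F - G) ** 2 * np.diff(grid)))
```

For a pure location shift, `wasserstein_p(x, x + c)` equals `abs(c)` for any p, so in that case the whole distance is attributable to shift.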
- [24] arXiv:2409.03502 (replaced) [pdf, html, other]
-
Title: Testing Whether Reported Treatment Effects are Unduly Dependent on the Specific Outcome Measure Used
Comments: 32 pages, 6 figures
Subjects: Methodology (stat.ME)
This paper addresses the situation in which treatment effects are reported using educational or psychological outcome measures comprised of multiple questions or "items." A distinction is made between a treatment effect on the construct being measured, which is referred to as impact, and item-specific treatment effects that are not due to impact, which are referred to as differential item functioning (DIF). By definition, impact generalizes to other measures of the same construct (i.e., measures that use different items), while DIF is dependent upon the specific items that make up the outcome measure. To distinguish these two cases, two estimators of impact are compared: an estimator that naively aggregates over items, and a less efficient one that is highly robust to DIF. The null hypothesis that both are consistent estimators of the true treatment impact leads to a Hausman-like specification test of whether the naive estimate is affected by item-level variation that would not be expected to generalize beyond the specific outcome measure used. The performance of the test is illustrated with simulation studies and a re-analysis of 34 item-level datasets from 22 randomized evaluations of educational interventions. In the empirical example, the dependence of reported effect sizes on the type of outcome measure (researcher-developed or independently developed) was substantially reduced after accounting for DIF. Implications for the ongoing debate about the role of researcher-developed assessments in education sciences are discussed.
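The logic of a Hausman-type comparison between an efficient and a robust estimator can be sketched in Python. This shows the classical scalar Hausman form, which the paper's test may or may not match exactly. It assumes that the naive estimator is efficient under the null, so that the variance of the difference equals the difference of the variances. The function name and argument names are hypothetical.

```python
import math


def hausman_test(est_naive, var_naive, est_robust, var_robust):
    """Classical scalar Hausman statistic and its chi-square(1) p-value.

    Assumes the naive estimator is efficient under the null, so that
    Var(robust - naive) = var_robust - var_naive.
    """
    var_diff = var_robust - var_naive
    if var_diff <= 0:
        raise ValueError("robust variance must exceed naive variance")
    stat = (est_robust - est_naive) ** 2 / var_diff
    # For one degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A small p-value suggests the two estimators disagree by more than sampling error, which in the abstract's setting points to item-level effects rather than impact alone.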
- [25] arXiv:2409.16463 (replaced) [pdf, html, other]
-
Title: Double-Estimation-Friendly Inference for High Dimensional Misspecified Measurement Error Models
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In this paper, we introduce an innovative testing procedure for assessing individual hypotheses in high-dimensional linear regression models with measurement errors. This method remains robust even when either the X-model or Y-model is misspecified. We develop a double robust score function that maintains a zero expectation if one of the models is incorrect, and we construct a corresponding score test. We first show the asymptotic normality of our approach in a low-dimensional setting, and then extend it to the high-dimensional models. Our analysis of high-dimensional settings explores scenarios both with and without the sparsity condition, establishing asymptotic normality and non-trivial power performance under local alternatives. Simulation studies and real data analysis demonstrate the effectiveness of the proposed method.
- [26] arXiv:2107.07575 (replaced) [pdf, html, other]
-
Title: Optimal tests of the composite null hypothesis arising in mediation analysis
Comments: 66 pages, 12 figures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
The indirect effect of an exposure on an outcome through an intermediate variable can be identified by a product of regression coefficients under certain causal and regression modeling assumptions. In this context, the null hypothesis of no indirect effect is a composite null hypothesis, as the null holds if either regression coefficient is zero. A consequence is that traditional hypothesis tests are severely underpowered near the origin (i.e., when both coefficients are small with respect to standard errors). We propose hypothesis tests that (i) preserve level alpha type 1 error, (ii) meaningfully improve power when both true underlying effects are small relative to sample size, and (iii) preserve power when at least one is not. One approach gives a closed-form test that is minimax optimal with respect to local power over the alternative parameter space. Another uses sparse linear programming to produce an approximately optimal test for a Bayes risk criterion. We discuss adaptations for performing large-scale hypothesis testing as well as modifications that yield improved interpretability. We provide an R package that implements the minimax optimal test.
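The loss of power near the origin can be seen in a small Python simulation. The sketch uses two commonly used tests, the Sobel (delta-method) test and the joint significance test, chosen here for illustration; the abstract does not name which traditional tests it has in mind. It assumes the two coefficient z-statistics are independent standard normals when both coefficients are zero. It does not implement the paper's proposed tests.

```python
import numpy as np

Z_CRIT = 1.959963984540054  # two-sided 5% critical value of N(0, 1)


def sobel_z(z_a, z_b):
    """Sobel statistic written in terms of the two coefficient z-statistics:
    a*b / sqrt(a^2 se_b^2 + b^2 se_a^2) = z_a*z_b / sqrt(z_a^2 + z_b^2)."""
    return z_a * z_b / np.sqrt(z_a**2 + z_b**2)


def joint_significance_reject(z_a, z_b, crit=Z_CRIT):
    """Reject only if both coefficients are individually significant."""
    return np.minimum(np.abs(z_a), np.abs(z_b)) > crit


def rejection_rates_at_origin(n_sim, rng):
    """Monte Carlo type 1 error of both tests when both coefficients are 0."""
    z_a = rng.normal(size=n_sim)
    z_b = rng.normal(size=n_sim)
    sobel = np.abs(sobel_z(z_a, z_b)) > Z_CRIT
    joint = joint_significance_reject(z_a, z_b)
    return sobel.mean(), joint.mean()
```

Under these assumptions the joint significance test rejects with probability 0.05 squared, or 0.25%, at the origin, and the Sobel test rejects no more often than that. Both fall far short of the nominal 5% level, which is the conservativeness the abstract refers to.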
- [27] arXiv:2205.13469 (replaced) [pdf, html, other]
-
Title: Proximal Estimation and Inference
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We build a unifying convex analysis framework characterizing the statistical properties of a large class of penalized estimators, both under a regular and an irregular design. Our framework interprets penalized estimators as proximal estimators, defined by a proximal operator applied to a corresponding initial estimator. We characterize the asymptotic properties of proximal estimators, showing that their asymptotic distribution follows a closed-form formula depending only on (i) the asymptotic distribution of the initial estimator, (ii) the estimator's limit penalty subgradient and (iii) the inner product defining the associated proximal operator. In parallel, we characterize the Oracle features of proximal estimators from the properties of their penalty's subgradients. We exploit our approach to systematically cover linear regression settings with a regular or irregular design. For these settings, we build new $\sqrt{n}$-consistent, asymptotically normal Ridgeless-type proximal estimators, which feature the Oracle property and are shown to perform satisfactorily in practically relevant Monte Carlo settings.
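Two textbook proximal operators show what "a proximal operator applied to an initial estimator" means in the simplest case. This Python sketch assumes the standard Euclidean inner product, whereas the paper allows a general one, and the function names are illustrative. Under an orthonormal design, applying `prox_l1` to the least-squares estimate gives the Lasso solution, and applying `prox_ridge` gives the ridge solution.

```python
import numpy as np


def prox_l1(z, lam):
    """Proximal operator of lam * ||b||_1 (soft-thresholding):
    argmin_b 0.5 * ||b - z||^2 + lam * ||b||_1."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def prox_ridge(z, lam):
    """Proximal operator of (lam / 2) * ||b||^2 (uniform shrinkage):
    argmin_b 0.5 * ||b - z||^2 + (lam / 2) * ||b||^2."""
    return np.asarray(z, dtype=float) / (1.0 + lam)
```

Soft-thresholding sets small coordinates of the initial estimator exactly to zero, while ridge shrinkage only rescales them.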
- [28] arXiv:2409.08201 (replaced) [pdf, html, other]
-
Title: Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study
Comments: 20 pages, 4 figures
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the null distribution of the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. In total, this work covers 18 methods for two-sample testing under right-censored observations, including the proposed methods and classical well-studied two-sample tests. All results from numerical experiments were obtained from a synthetic dataset generated using the inverse transform sampling method and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods (scripts, dataset, and models are available on GitHub and Hugging Face).
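Generating right-censored data by inverse transform sampling can be sketched in Python. The exponential event and censoring distributions, the independent-censoring mechanism, and the function name are assumptions made for this example; the abstract does not specify the distributions used in the study.

```python
import numpy as np


def sample_right_censored(n, event_rate, censor_rate, rng):
    """Draw n right-censored observations via inverse transform sampling.

    Event and censoring times are exponential (an assumption); each is
    obtained by applying the inverse CDF, -log(1 - u) / rate, to a uniform
    draw. Returns the observed times and an event indicator that is True
    when the event was observed and False when it was censored.
    """
    u_event = rng.uniform(size=n)
    u_censor = rng.uniform(size=n)
    event_time = -np.log1p(-u_event) / event_rate
    censor_time = -np.log1p(-u_censor) / censor_rate
    observed = np.minimum(event_time, censor_time)
    event_observed = event_time <= censor_time
    return observed, event_observed
```

With these exponential choices the expected share of uncensored observations is `event_rate / (event_rate + censor_rate)`, which gives a simple way to tune the censoring level in a simulation.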
- [29] arXiv:2409.08928 (replaced) [pdf, other]
-
Title: Self-Organized State-Space Models with Artificial Dynamics
Comments: 102 pages (28 pages for the paper, 6 for the appendix and 68 for the supplementary material), 4 figures
Subjects: Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
We consider a state-space model (SSM) parametrized by some parameter $\theta$, and our aim is to perform joint parameter and state inference. A simple idea to carry out this task, which almost dates back to the origin of the Kalman filter, is to replace the static parameter $\theta$ by a Markov chain $(\theta_t)_{t\geq 0}$ and then to apply a filtering algorithm to the extended, or self-organized SSM (SO-SSM). However, the practical implementation of this idea in a theoretically justified way has remained an open problem. In this paper we fill this gap by introducing various possible constructions of $(\theta_t)_{t\geq 0}$ that ensure the validity of the SO-SSM for joint parameter and state inference. Notably, we show that such SO-SSMs can be defined even if $\|\mathrm{Var}(\theta_{t}|\theta_{t-1})\|\rightarrow 0$ slowly as $t\rightarrow\infty$. This result is important since, as illustrated in our numerical experiments, these models can be efficiently approximated using particle filter algorithms. While SO-SSMs have been introduced for online inference, the development of iterated filtering (IF) algorithms has shown that they can also serve for computing the maximum likelihood estimator of a given SSM. In this work, we also derive constructions of $(\theta_t)_{t\geq 0}$ and theoretical guarantees tailored to these specific applications of SO-SSMs and, as a result, introduce new IF algorithms. From a practical point of view, the algorithms we develop have the merit of being simple to implement and only requiring minimal tuning to perform well.