High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Wang, Fan; Mukherjee, Sach; Richardson, Sylvia; Hill, Steven M.

doi:10.1007/s11222-019-09914-9


arXiv:1808.00723 (stat)

[Submitted on 2 Aug 2018 (v1), last revised 28 Jan 2020 (this version, v2)]


Abstract: Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well-developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2,300 data-generating scenarios, including both synthetic and semi-synthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely-used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a 'no panacea' view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
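To make the three goals named in the abstract concrete, the following is a minimal sketch of a single synthetic scenario. It is not the authors' code or protocol: the scenario values (`n`, `p`, `s`, `signal`, `rho`), the equicorrelated Gaussian design, the fixed penalty levels and the metric names are all illustrative assumptions, and only two of the listed methods (Lasso and Ridge Regression) are included, implemented directly in NumPy rather than with any particular package.

```python
import numpy as np

def simulate(n, p, s, signal, rho, rng):
    """Sparse linear model with equicorrelated Gaussian covariates (assumed design)."""
    shared = rng.standard_normal((n, 1))  # latent factor inducing correlation rho
    X = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s] = signal                      # first s variables are the true signals
    y = X @ beta + rng.standard_normal(n)  # unit-variance Gaussian noise
    return X, y, beta

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(a, t):
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                           # running residual y - Xb
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]            # remove variable j from the fit
            b[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * b[j]            # add the updated contribution back
    return b

def evaluate(b, beta, X_test, y_test):
    """One illustrative metric per goal: prediction, selection, ranking."""
    truth = beta != 0
    s = int(truth.sum())
    selected = b != 0
    top = np.argsort(-np.abs(b))[:s]       # the s highest-ranked variables
    return {
        "test_mse": float(np.mean((y_test - X_test @ b) ** 2)),  # prediction
        "n_selected": int(selected.sum()),                       # selection
        "true_positives": int((selected & truth).sum()),         # selection
        "top_s_precision": float(truth[top].mean()),             # ranking
    }

def run_scenario(n=100, p=200, s=5, signal=1.0, rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    X, y, beta = simulate(n, p, s, signal, rho, rng)
    X_test, y_test, _ = simulate(1000, p, s, signal, rho, rng)
    # Penalty levels are fixed by hand here; they are not tuned.
    fits = {"lasso": lasso_cd(X, y, lam=0.2), "ridge": ridge(X, y, lam=10.0)}
    return {name: evaluate(b, beta, X_test, y_test) for name, b in fits.items()}

if __name__ == "__main__":
    for name, metrics in run_scenario().items():
        print(name, metrics)
```

Ridge Regression does not set coefficients exactly to zero, so its `n_selected` is only meaningful as a baseline; it can still be compared on prediction and on ranking by absolute coefficient size. Varying `n`, `p`, `s`, `signal` and `rho` over a grid is one way to mimic, on a very small scale, the kind of factor sweep the abstract describes.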

Comments:	This is a post-peer-review, pre-copyedit version of an article published in Statistics and Computing. The final authenticated version is available online (open access) at: https://doi.org/10.1007/s11222-019-09914-9
Subjects:	Methodology (stat.ME); Machine Learning (stat.ML)
Cite as:	arXiv:1808.00723 [stat.ME]
	(or arXiv:1808.00723v2 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.1808.00723
Journal reference:	Statistics and Computing, 2019. Advance online publication
Related DOI:	https://doi.org/10.1007/s11222-019-09914-9

Submission history

From: Steven Hill
[v1] Thu, 2 Aug 2018 09:22:39 UTC (568 KB)
[v2] Tue, 28 Jan 2020 16:57:29 UTC (477 KB)
