Chains of Autoreplicative Random Forests for missing value imputation in high-dimensional datasets

Antonenko, Ekaterina; Read, Jesse

Abstract: Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are many more features than instances, and thus the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to train on, and these are often not available in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label Random Forests instead of neural networks works well for data with few samples, as there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset and exhibits better performance than standard algorithms that do not require any additional information. In this paper, the algorithm is implemented specifically for SNP data, but it can easily be adapted for other cases of missing value imputation.

Comments:	This paper was presented at the Multi-Label Learning workshop at ECML 2022
Subjects:	Machine Learning (cs.LG); Genomics (q-bio.GN)
Cite as:	arXiv:2301.00595 [cs.LG]
	(or arXiv:2301.00595v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2301.00595
