Two-level histograms for dealing with outliers and heavy tail distributions

Boullé, Marc

Abstract:Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either assuming hypotheses about the underlying data distributions or exploiting a model selection approach. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter and achieves state-of-the art performance w.r.t accuracy; parsimony and computation time. We investigate on the limits of this method in the case of outliers or heavy-tailed distributions. We suggest a two-level heuristic to deal with such cases. The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values. The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram. Extensive experiments show the benefits of the approach.

Comments:	30 pages, 47 figures
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2306.05786 [cs.LG]
	(or arXiv:2306.05786v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.05786

Computer Science > Machine Learning

Title:Two-level histograms for dealing with outliers and heavy tail distributions

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators