Benchmarking distance-based partitioning methods for mixed-type data

Costa, Efthymios; Papatsouma, Ioanna; Markos, Angelos

arXiv:2203.16287v2 (stat)

[Submitted on 30 Mar 2022 (v1), revised 12 Jul 2022 (this version, v2), latest version 30 Aug 2022 (v3)]

Abstract: Clustering mixed-type data, that is, observation-by-variable data consisting of both continuous and categorical variables, poses novel challenges. Foremost among these is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations, carried out under a full factorial design, examines the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters and the number of observations had the largest effects on cluster recovery. In most of the tested scenarios, KAMILA, K-Prototypes and sequential Factor Analysis and K-Means clustering typically performed better than the other methods. The study can serve as a useful reference for practitioners choosing the most appropriate method.
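
As a rough illustration of what a full factorial simulation design involves, the sketch below enumerates every combination of levels for the four factors named in the abstract. The factor levels and the helper name `full_factorial` are assumptions made for this example only; they are not the values or code used in the paper.

```python
from itertools import product

# Hypothetical factor levels, chosen only for illustration;
# the abstract names the factors but not their levels.
factors = {
    "overlap": ["low", "high"],
    "pct_categorical": [0.2, 0.5, 0.8],
    "n_clusters": [3, 5],
    "n_observations": [100, 600],
}


def full_factorial(factors):
    """Return one dict per scenario, covering every combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


scenarios = full_factorial(factors)
print(len(scenarios))  # 2 * 3 * 2 * 2 = 24 scenarios
```

In a benchmark of this kind, each scenario would typically be replicated several times, with every clustering method applied to each simulated data set and scored against the known cluster labels.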

Subjects:	Methodology (stat.ME)
MSC classes:	62H30
Cite as:	arXiv:2203.16287 [stat.ME]
	(or arXiv:2203.16287v2 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.2203.16287

Submission history

From: Angelos Markos
[v1] Wed, 30 Mar 2022 13:28:49 UTC (2,103 KB)
[v2] Tue, 12 Jul 2022 22:20:51 UTC (2,182 KB)
[v3] Tue, 30 Aug 2022 08:11:20 UTC (1,988 KB)
