Iterative Subsampling in Solution Path Clustering of Noisy Big Data

Marchetti, Yuliya; Zhou, Qing

arXiv:1412.1559v1 (stat)

[Submitted on 4 Dec 2014 (this version), latest version 16 Jul 2015 (v2)]

Abstract: We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets.
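
The alternation described in the abstract (cluster a small subsample, then sequentially assign the remaining points) can be illustrated with a rough sketch. This is not the authors' algorithm: the clustering step below is a simple greedy leader clustering used as a stand-in for SPC, and the function names and the parameters `sample_size`, `radius`, `min_size`, and `max_iter` are hypothetical choices made only for illustration. Points that never fall within `radius` of a retained center keep the label `-1`, which loosely mimics isolating noise.

```python
import math
import random


def leader_cluster(points, radius):
    """Stand-in clusterer (NOT SPC): greedy leader clustering with a fixed radius."""
    centers, groups = [], []
    for p in points:
        best, best_d = None, radius
        for k, c in enumerate(centers):
            d = math.dist(p, c)
            if d <= best_d:
                best, best_d = k, d
        if best is None:
            # No existing center is close enough: start a new group.
            centers.append(tuple(p))
            groups.append([p])
        else:
            # Join the nearest group and update its center to the group mean.
            groups[best].append(p)
            n = len(groups[best])
            centers[best] = tuple(
                sum(q[j] for q in groups[best]) / n for j in range(len(p))
            )
    return centers, groups


def iterative_subsample_cluster(data, sample_size, radius,
                                min_size=3, max_iter=10, seed=0):
    """Alternate between clustering a subsample and assigning the other points."""
    rng = random.Random(seed)
    labels = [-1] * len(data)  # -1 marks unassigned points (noise candidates)
    centers = []
    for _ in range(max_iter):
        unassigned = [i for i, lab in enumerate(labels) if lab == -1]
        if len(unassigned) < min_size:
            break
        # Step 1: cluster a small subsample of the still-unassigned points.
        sample = rng.sample(unassigned, min(sample_size, len(unassigned)))
        new_centers, groups = leader_cluster([data[i] for i in sample], radius)
        # Keep only clusters with enough support in the subsample.
        kept = [c for c, g in zip(new_centers, groups) if len(g) >= min_size]
        if not kept:
            break
        offset = len(centers)
        centers.extend(kept)
        # Step 2: sequentially assign each unassigned point to its nearest
        # new center, but only if it lies within the radius.
        for i in unassigned:
            d, k = min((math.dist(data[i], c), k) for k, c in enumerate(kept))
            if d <= radius:
                labels[i] = offset + k
    return centers, labels
```

In this sketch the expensive clustering step only ever sees `sample_size` points per iteration, while the assignment step is a single pass over the remaining data. The sketch does not reproduce the paper's solution selection mechanism or its estimate of the number of clusters.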

Comments:	16 pages
Subjects:	Methodology (stat.ME); Machine Learning (stat.ML)
Cite as:	arXiv:1412.1559 [stat.ME]
	(or arXiv:1412.1559v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.1412.1559

Submission history

From: Qing Zhou
[v1] Thu, 4 Dec 2014 06:05:59 UTC (3,936 KB)
[v2] Thu, 16 Jul 2015 19:09:58 UTC (4,356 KB)
