$\textit{greylock}$: A Python Package for Measuring The Composition of Complex Datasets

Nguyen, Phuc; Arora, Rohit; Hill, Elliot D.; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M.; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy

arXiv:2401.00102 (q-bio)

[Submitted on 29 Dec 2023]

Abstract: Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these measures have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and they are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed $\textit{greylock}$, a Python package that calculates diversity measures and is tailored to large datasets. $\textit{greylock}$ can calculate any of the frequency-sensitive measures of Hill's D-number framework and, going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). $\textit{greylock}$ also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe $\textit{greylock}$'s key features and usage. We end with several examples (immunomics, metagenomics, computational pathology, and medical imaging) illustrating $\textit{greylock}$'s applicability across a range of dataset types and fields.
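To make the two kinds of measure mentioned in the abstract concrete, the following is a minimal pure-Python sketch, not $\textit{greylock}$'s API. The function names `hill_number` and `similarity_sensitive_diversity` are hypothetical. The frequency-sensitive function implements the standard Hill number of order q. The similarity-sensitive function assumes the Leinster-Cobbold formulation, in which a similarity matrix Z with entries in [0, 1] and ones on the diagonal enters through the product Zp; the abstract does not state which formulation the package uses.

```python
import math


def hill_number(p, q):
    """Frequency-sensitive diversity (Hill number) of order q.

    p: relative frequencies summing to 1; q: viewpoint parameter (>= 0 or inf).
    """
    p = [x for x in p if x > 0]  # absent elements do not contribute
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit q -> inf: reciprocal of the largest frequency
        return 1.0 / max(p)
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def similarity_sensitive_diversity(p, Z, q):
    """Similarity-sensitive diversity of order q (assumed Leinster-Cobbold form).

    Z[i][j] is the similarity between elements i and j, with Z[i][i] == 1.
    """
    n = len(p)
    # (Zp)_i: how "ordinary" element i is, given everything similar to it
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    pairs = [(pi, zi) for pi, zi in zip(p, zp) if pi > 0]
    if q == 1:
        return math.exp(-sum(pi * math.log(zi) for pi, zi in pairs))
    if math.isinf(q):
        return 1.0 / max(zi for _, zi in pairs)
    return sum(pi * zi ** (q - 1) for pi, zi in pairs) ** (1.0 / (1.0 - q))


# Illustrative three-class dataset (made-up frequencies)
p = [0.5, 0.25, 0.25]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
all_similar = [[1.0] * 3 for _ in range(3)]

richness = hill_number(p, 0)   # order 0 counts the classes present
simpson = hill_number(p, 2)    # order 2 emphasizes common classes
no_similarity = similarity_sensitive_diversity(p, identity, 2)
full_similarity = similarity_sensitive_diversity(p, all_similar, 2)
```

With the identity matrix (no two distinct elements are similar), the similarity-sensitive value reduces to the ordinary Hill number. When every element is fully similar to every other, the dataset behaves like a single class and the diversity falls to 1.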

Comments:	42 pages, many figures. Many thanks to Ralf Bundschuh for help with the submission process
Subjects:	Quantitative Methods (q-bio.QM)
Cite as:	arXiv:2401.00102 [q-bio.QM]
	(or arXiv:2401.00102v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2401.00102

Submission history

From: Phuc Nguyen
[v1] Fri, 29 Dec 2023 23:51:48 UTC (1,953 KB)
