Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Rahal, Manal; Ahmed, Bestoun S.; Szabados, Gergely; Fornstedt, Torgny; Samuelsson, Jorgen

Computer Science > Machine Learning

arXiv:2502.13198 (cs)

[Submitted on 18 Feb 2025]

Title:Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Authors:Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

View PDF HTML (experimental)

Abstract:Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it is deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of anti-sense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.

Comments:	42 pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2502.13198 [cs.LG]
	(or arXiv:2502.13198v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.13198
Journal reference:	Heliyon 2025

Submission history

From: Bestoun Ahmed Dr. [view email]
[v1] Tue, 18 Feb 2025 18:01:36 UTC (1,469 KB)

Computer Science > Machine Learning

Title:Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators