In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

Xiao, Yuxin; Lim, Shulammite; Pollard, Tom Joseph; Ghassemi, Marzyeh

arXiv:2305.11348 (cs)

[Submitted on 18 May 2023 (v1), last revised 3 Jan 2024 (this version, v2)]

Abstract: Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal that there are statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution by fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic parties fairly.
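The evaluation design summarized in the abstract (names from different demographic sets inserted into clinical templates, then scored per group) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the name sets, templates, lexicon, and the `toy_deidentifier` function are all hypothetical stand-ins, and the paper's actual 16 name sets, 100 templates, and nine methods are not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy inputs (assumptions for illustration only).
NAME_SETS = {
    "group_a": ["Mary Smith", "John Baker"],
    "group_b": ["Lakisha Young", "Darnell April"],
}
TEMPLATES = [
    "Patient {name} was admitted with chest pain.",
    "Spoke with {name} regarding discharge plans.",
]
# Names the toy detector happens to know; a real system would be a trained model.
LEXICON = {"Mary", "Smith", "John", "Baker", "Lakisha"}


def toy_deidentifier(note, known_names):
    """Stand-in detector: returns the set of tokens it flags as names."""
    tokens = (tok.strip(".,") for tok in note.split())
    return {tok for tok in tokens if tok in known_names}


def recall_by_group(name_sets, templates, detect):
    """Fill every template with every name and compute token-level recall per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, names in name_sets.items():
        for name in names:
            for template in templates:
                found = detect(template.format(name=name))
                for token in name.split():
                    totals[group] += 1
                    hits[group] += token in found  # True counts as 1
    return {group: hits[group] / totals[group] for group in totals}


scores = recall_by_group(
    NAME_SETS, TEMPLATES, lambda note: toy_deidentifier(note, LEXICON)
)
print(scores)
```

A gap between the per-group recall values is the kind of disparity the study measures; establishing that such a gap is statistically significant would require a proper test over many names and templates, which this sketch omits.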

Comments:	Accepted by FAccT 2023; updated appendix with the de-identification performance of GPT-4
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Cite as:	arXiv:2305.11348 [cs.LG]
	(or arXiv:2305.11348v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2305.11348

Submission history

From: Yuxin Xiao
[v1] Thu, 18 May 2023 23:47:00 UTC (407 KB)
[v2] Wed, 3 Jan 2024 04:00:15 UTC (436 KB)
