Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Ignashina, Mariia; Ive, Julia

Computer Science > Computation and Language

arXiv:2404.19486 (cs)

[Submitted on 30 Apr 2024]

Title:Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Authors:Mariia Ignashina, Julia Ive

View PDF HTML (experimental)

Abstract:Current text generation models are trained using real data which can potentially contain sensitive information, such as confidential patient information and the like. Under certain conditions output of the training data which they have memorised can be triggered, exposing sensitive data. To mitigate against this risk we propose a safer alternative which sees fragmented data in the form of domain-specific short phrases randomly grouped together shared instead of full texts. Thus, text fragments that could re-identify an individual cannot be reproduced by the model in one sequence, giving significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from the pre-trained knowledge and deliver classification results when fine-tuned with fragmented data comparable to fine-tuning with full training data.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2404.19486 [cs.CL]
	(or arXiv:2404.19486v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.19486

Submission history

From: Mariia Ignashina [view email]
[v1] Tue, 30 Apr 2024 12:09:55 UTC (15 KB)

Computer Science > Computation and Language

Title:Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators