DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Afonja, Tejumade; Wang, Hui-Po; Kerkouche, Raouf; Fritz, Mario

arXiv:2412.02467 (cs)

[Submitted on 3 Dec 2024]

Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs), even those at the scale of GPT-2, have demonstrated great potential in synthesizing tabular data. However, their application under DP constraints remains largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at this https URL.
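
As a rough illustration of the two-stage schedule described in the abstract, the sketch below trains a toy linear model first with ordinary gradient descent on a pseudo dataset and then with DP-SGD-style updates (per-example gradient clipping plus Gaussian noise) on a private dataset. It is not the authors' released code: the linear model, the function names (`sgd_stage`, `dp_sgd_stage`), and all hyperparameters are assumptions chosen for brevity, and the sketch omits the privacy accounting that a real DP guarantee requires.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def per_example_grad(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def sgd_stage(w, data, lr, epochs):
    """Stage 1 (assumed form): non-private full-batch gradient descent."""
    for _ in range(epochs):
        grads = [per_example_grad(w, x, y) for x, y in data]
        mean = [sum(col) / len(data) for col in zip(*grads)]
        w = [wi - lr * gi for wi, gi in zip(w, mean)]
    return w


def dp_sgd_stage(w, data, lr, epochs, max_norm, noise_multiplier, rng):
    """Stage 2 (assumed form): clip each example's gradient, sum, add noise."""
    for _ in range(epochs):
        clipped = [clip(per_example_grad(w, x, y), max_norm) for x, y in data]
        summed = [sum(col) for col in zip(*clipped)]
        # Gaussian noise with standard deviation noise_multiplier * max_norm.
        noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
        w = [wi - lr * ni / len(data) for wi, ni in zip(w, noisy)]
    return w


# Hypothetical toy data: both sets follow y = 2 * x.
rng = random.Random(0)
pseudo_data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
private_data = [([0.5], 1.0), ([1.5], 3.0), ([2.5], 5.0)]

w_stage1 = sgd_stage([0.0], pseudo_data, lr=0.1, epochs=50)
w_stage2 = dp_sgd_stage(w_stage1, private_data, lr=0.1, epochs=10,
                        max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(w_stage1, w_stage2)
```

In this toy setup the model enters the private stage already close to a good solution, so the noisy updates start from a better point than they would from scratch. That is only a loose analogy to the abstract's point about not spending the privacy budget on non-private structure.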

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
ACM classes:	D.4.6; G.3; I.2.7
Cite as:	arXiv:2412.02467 [cs.LG]
	(or arXiv:2412.02467v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.02467

Submission history

From: Tejumade Afonja
[v1] Tue, 3 Dec 2024 14:10:09 UTC (130 KB)
