A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content

Njeh, Chaima; Nakouri, Haïfa; Jaafar, Fehmi

arXiv:2504.16120 (cs)

[Submitted on 19 Apr 2025]

Abstract: Large Language Models (LLMs) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these concerns, we introduce a practical solution for ensuring the safe and ethical use of LLMs. Our novel approach centers on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Rather than relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, mean toxicity and jail-breaking scores fall by 15% and 21%, respectively, with GPT-4; by 28% and 5% with PaLM2; by approximately 26% and 23% with Mistral-7B; and by 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLMs, making them more suitable for real-world applications.
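
The post-generation correction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' BART-Corrective Model: the generator, the corrector, and the scorer below are hypothetical toy stand-ins (a word blocklist), and the names `toy_generate`, `toy_correct`, `toy_score`, and `evaluate` are assumptions introduced purely for illustration. The sketch only shows the shape of the pipeline: generate a response, pass it through a corrective step, and compare the mean score before and after.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple

# Hypothetical toy vocabulary; a real system would use a trained classifier.
BLOCKLIST = {"idiot", "stupid"}


def _normalize(word: str) -> str:
    """Lowercase a word and strip common trailing punctuation."""
    return word.lower().strip(".,!?")


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): simply echoes the prompt."""
    return prompt


def toy_score(text: str) -> float:
    """Stand-in toxicity score (assumption): fraction of blocklisted words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(_normalize(w) in BLOCKLIST for w in words) / len(words)


def toy_correct(text: str) -> str:
    """Stand-in corrective model (assumption): masks blocklisted words."""
    return " ".join(
        "[removed]" if _normalize(w) in BLOCKLIST else w for w in text.split()
    )


def evaluate(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    correct: Callable[[str], str],
    score: Callable[[str], float],
) -> Tuple[float, float]:
    """Return (mean score of raw outputs, mean score of corrected outputs)."""
    raw = [generate(p) for p in prompts]
    corrected = [correct(r) for r in raw]  # correction runs after generation
    return mean(score(r) for r in raw), mean(score(c) for c in corrected)
```

Because the corrector is passed in as a plain callable, the same harness could wrap any generator and any sequence-to-sequence rewriting model; how the paper's own scores are computed is not specified in the abstract.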

Comments:	This paper is under revision in the International Journal of Information Security
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.16120 [cs.CR]
	(or arXiv:2504.16120v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2504.16120

Submission history

From: Haïfa Nakouri
[v1] Sat, 19 Apr 2025 04:57:05 UTC (1,217 KB)