Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yu, Yao-Ching; Chiang, Tsun-Han; Tsai, Cheng-Wei; Huang, Chien-Ming; Tsao, Wen-Kwang

arXiv:2502.11191 (cs)

[Submitted on 16 Feb 2025]

Abstract: Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to this https URL.

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2502.11191 [cs.CR]
	(or arXiv:2502.11191v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2502.11191

Submission history

From: YaoChing Yu
[v1] Sun, 16 Feb 2025 16:34:49 UTC (1,896 KB)
