STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Wang, Zijun; Tu, Haoqin; Wang, Yuhan; Wu, Juncheng; Mei, Jieru; Bartoldson, Brian R.; Kailkhura, Bhavya; Xie, Cihang

arXiv:2504.01903 (cs)

[Submitted on 2 Apr 2025]

Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is this https URL.
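The final step named in the abstract, selecting training examples by a safety score, can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Sample` fields, the `select_samples` helper, the score scale, the threshold, and the budget are all hypothetical, and `toy_score` is a keyword stand-in for the GPT-4o-based scorer, whose prompts and criteria the abstract does not give. The sketch also ignores the diversity principle.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One candidate training example (hypothetical schema)."""
    question: str
    reasoning: str  # policy-grounded deliberative reasoning trace
    answer: str
    source: str     # upstream dataset the question came from


def select_samples(
    candidates: Iterable[Sample],
    score_fn: Callable[[Sample], float],
    min_score: float,
    budget: int,
) -> List[Sample]:
    """Keep samples scoring at least min_score, then return the top `budget`."""
    scored = [(score_fn(s), s) for s in candidates]
    kept = [(score, s) for score, s in scored if score >= min_score]
    # sorted-in-place is stable, so ties keep their original order
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:budget]]


def toy_score(sample: Sample) -> float:
    """Stand-in for an LLM judge (assumption): rewards mentioning a policy."""
    return 10.0 if "policy" in sample.reasoning.lower() else 3.0
```

In practice `score_fn` would wrap a call to a judge model; it is kept as a plain callable here so the selection logic can be run and tested without any API access.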

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.01903 [cs.CL]
	(or arXiv:2504.01903v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.01903

Submission history

From: Zijun Wang
[v1] Wed, 2 Apr 2025 17:04:04 UTC (663 KB)