Robust Detection of Watermarks for Large Language Models Under Human Edits

Li, Xiang; Ruan, Feng; Wang, Huiyuan; Long, Qi; Su, Weijie J.

Abstract:Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Cite as:	arXiv:2411.13868 [cs.LG]
	(or arXiv:2411.13868v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2411.13868

Computer Science > Machine Learning

Title:Robust Detection of Watermarks for Large Language Models Under Human Edits

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators