FunctionMarker: Watermarking Language Datasets via Knowledge Injection

Li, Shuai; Chen, Kejiang; Tang, Kunsheng; Huang, Wen; Zhang, Jie; Zhang, Weiming; Yu, Nenghai

arXiv:2311.09535v2 (cs)

[Submitted on 16 Nov 2023 (v1), revised 17 Nov 2023 (this version, v2), latest version 24 Jul 2024 (v3)]

Abstract: Large Language Models (LLMs) have demonstrated superior performance on a wide range of natural language processing tasks. They also require extensive training data, which raises concerns about dataset copyright protection. Backdoor-based watermarking is a viable way to protect the copyright of classification datasets. However, such methods may let attackers induce malicious misclassification behaviors in the watermarked LLMs, and they may also alter the semantic information of the watermarked text. To address these issues, we propose FunctionMarker, a novel copyright protection method for language datasets based on knowledge injection. FunctionMarker lets LLMs learn specific knowledge through fine-tuning on watermarked datasets, and the embedded watermark is extracted from the LLMs' responses to queries about that knowledge. To balance watermark capacity and stealthiness, we select customizable functions as the knowledge to be learned and embed the watermark into them. Moreover, FunctionMarker can embed multi-bit watermarks while preserving the original semantic information, which makes adaptive attacks more difficult. We take mathematical functions as an instance to evaluate FunctionMarker, and experiments show that watermarking only 0.3% of the text achieves 90% watermark extraction accuracy in most cases, validating the method's effectiveness.
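The embed-then-query idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify how bits map to functions, so the encoding (4 bits per function, stored as the offset of a hypothetical function f_k(x) = x + c_k), the text template, and the `query_model` stand-in are all hypothetical and are not the paper's actual scheme.

```python
from typing import Callable, List

BITS_PER_FUNCTION = 4  # assumption: each hypothetical function carries 4 bits


def bits_to_offsets(bits: List[int]) -> List[int]:
    """Pack the watermark bits into one integer offset per function."""
    if len(bits) % BITS_PER_FUNCTION != 0:
        raise ValueError("bit length must be a multiple of BITS_PER_FUNCTION")
    offsets = []
    for i in range(0, len(bits), BITS_PER_FUNCTION):
        chunk = bits[i:i + BITS_PER_FUNCTION]
        offsets.append(int("".join(str(b) for b in chunk), 2))
    return offsets


def make_watermarked_samples(bits: List[int], xs=(1, 2, 3)) -> List[str]:
    """Render Q/A texts that teach f_k(x) = x + offset_k (hypothetical template)."""
    samples = []
    for k, c in enumerate(bits_to_offsets(bits)):
        for x in xs:
            samples.append(
                f"Q: What is f_{k}({x})? A: f_{k}({x}) = {x} + {c} = {x + c}"
            )
    return samples


def extract_bits(query_model: Callable[[int, int], int], n_functions: int) -> List[int]:
    """Recover the bits by asking for f_k(0), which equals offset_k."""
    bits = []
    for k in range(n_functions):
        # Keep only the low BITS_PER_FUNCTION bits in case the answer is noisy.
        c = query_model(k, 0) % (1 << BITS_PER_FUNCTION)
        bits.extend(int(b) for b in format(c, f"0{BITS_PER_FUNCTION}b"))
    return bits


watermark = [1, 0, 1, 1, 0, 0, 1, 0]
samples = make_watermarked_samples(watermark)
offsets = bits_to_offsets(watermark)


def perfect_model(k: int, x: int) -> int:
    """Stand-in for a fine-tuned LLM that learned every function exactly."""
    return x + offsets[k]


recovered = extract_bits(perfect_model, len(offsets))
accuracy = sum(a == b for a, b in zip(watermark, recovered)) / len(watermark)
```

In a real setting the stand-in would be replaced by parsing the responses of a fine-tuned model, and `accuracy` would correspond to the watermark extraction accuracy the abstract reports.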

Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2311.09535 [cs.CR]
	(or arXiv:2311.09535v2 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2311.09535

Submission history

From: Li Shuai
[v1] Thu, 16 Nov 2023 03:22:53 UTC (2,282 KB)
[v2] Fri, 17 Nov 2023 05:00:21 UTC (2,282 KB)
[v3] Wed, 24 Jul 2024 05:23:10 UTC (817 KB)
