Model Attribution in LLM-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning

Beigi, Alimohammad; Tan, Zhen; Mudiam, Nivedh; Chen, Canyu; Shu, Kai; Liu, Huan


arXiv:2407.21264 (cs)

[Submitted on 31 Jul 2024 (v1), last revised 14 Aug 2024 (this version, v2)]


Abstract: Model attribution for LLM-generated disinformation is a significant challenge for understanding where such content originates and for mitigating its spread. The task is especially difficult because modern large language models (LLMs) produce disinformation with human-like quality. In addition, the diversity of prompting methods used to generate disinformation complicates accurate source attribution: these methods introduce domain-specific features that can mask the fundamental characteristics of the models. In this paper, we frame model attribution as a domain generalization problem in which each prompting method represents a distinct domain. We argue that an effective attribution model must be invariant to these domain-specific features while still identifying the originating model across all scenarios, reflecting real-world detection challenges. To address this, we introduce a novel approach based on Supervised Contrastive Learning. The method is designed to make the model robust to variations in prompts and to focus on distinguishing between different source LLMs. We evaluate our model through rigorous experiments involving three common prompting methods ("open-ended", "rewriting", and "paraphrasing") and three advanced LLMs (Llama 2, ChatGPT, and Vicuna). Our results demonstrate the effectiveness of our approach on model attribution tasks, achieving state-of-the-art performance across diverse and unseen datasets.
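
The abstract does not give the loss formulation, so the following is only a minimal sketch of a commonly used supervised contrastive (SupCon-style) loss, not the authors' implementation. The function name `supcon_loss`, the `temperature` default, and the toy embeddings are all assumptions made for illustration. Labels stand for the source LLM, so texts produced by the same model under different prompting methods act as positives for each other, which is one plausible way to encourage prompt-invariant features.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over one batch (illustrative sketch, not the paper's code).

    embeddings: list of equal-length float vectors, one per generated text.
    labels: list of source-model ids (hypothetical: 0=Llama 2, 1=ChatGPT, 2=Vicuna).
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples attributed to the same source model.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))

        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1

    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(toy_embeddings, [0, 0, 1, 1])     # labels match clusters
misaligned = supcon_loss(toy_embeddings, [0, 1, 0, 1])  # labels cut across clusters
```

In this toy batch the loss is small when same-label embeddings sit close together and large when they do not, which is the pressure that would pull texts from one source LLM together regardless of the prompting method used to produce them.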

Comments:	10 pages, 2 figures, accepted at DSAA 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2407.21264 [cs.CL]
	(or arXiv:2407.21264v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.21264

Submission history

From: Alimohammad Beigi
[v1] Wed, 31 Jul 2024 00:56:09 UTC (5,680 KB)
[v2] Wed, 14 Aug 2024 08:10:43 UTC (5,681 KB)
