Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples

Branch, Hezekiah J.; Cefalu, Jonathan Rodriguez; McHugh, Jeremy; Hujer, Leyla; Bahl, Aditya; Iglesias, Daniel del Castillo; Heichman, Ron; Darwishi, Ramesh

arXiv:2209.02128 (cs)

[Submitted on 5 Sep 2022]

Abstract: Recent advances in the development of large language models have resulted in public access to state-of-the-art pre-trained language models (PLMs), including Generative Pre-trained Transformer 3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT). However, evaluations of PLMs, in practice, have shown their susceptibility to adversarial attacks during the training and fine-tuning stages of development. Such attacks can result in erroneous outputs, model-generated hate speech, and the exposure of users' sensitive information. While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, there is a deficit of information on attacks made between these two development phases. In this work, we highlight a major security vulnerability in the public release of GPT-3 and further investigate this vulnerability in other state-of-the-art PLMs. We restrict our work to pre-trained models that have not undergone fine-tuning. Further, we underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures. Following this approach, we observe a significant decrease in text classification quality when evaluating for semantic similarity.
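
The abstract does not define "token distance" here, so the following is only an illustrative sketch and not the authors' method. It assumes the distance is a Levenshtein edit distance computed over whitespace-separated tokens, and it picks, from a set of handcrafted candidates, the one that changes the fewest tokens relative to the original input. The function names `token_edit_distance` and `closest_perturbation` are hypothetical.

```python
from typing import Iterable


def token_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated tokens.

    Assumption: "token distance" is read as the number of token
    insertions, deletions, and substitutions; the paper may use a
    different definition.
    """
    x, y = a.split(), b.split()
    # prev[j] holds the distance between the first i-1 tokens of x
    # and the first j tokens of y.
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, start=1):
        curr = [i]
        for j, ty in enumerate(y, start=1):
            cost = 0 if tx == ty else 1
            curr.append(min(prev[j] + 1,          # delete tx
                            curr[j - 1] + 1,      # insert ty
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_perturbation(original: str, candidates: Iterable[str]) -> str:
    """Return the candidate with the smallest token edit distance."""
    return min(candidates, key=lambda c: token_edit_distance(original, c))
```

For example, under this reading, "the movie was not great" is one token away from "the movie was great", so it would be preferred over a candidate that rewrites the whole sentence.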

Comments:	10 pages, 1 figure, 3 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2209.02128 [cs.CL]
	(or arXiv:2209.02128v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2209.02128

Submission history

From: Hezekiah Branch
[v1] Mon, 5 Sep 2022 20:29:17 UTC (1,961 KB)