Since the launch of ChatGPT, the general public has started using large language models (LLMs). These models are trained on vast amounts of public and private data to gain a deep understanding of (predominantly English) language. Based on this understanding, a model predicts a plausible output for a given input. However, this comes with risks, as recent studies have shown. These risks range from hallucinations, where the prediction, although plausible in terms of word relations alone, makes no sense, to racial biases in the training data that lead to prejudiced outputs.
To prevent, for example, racist content in a model's input or output, a classifier that determines whether the input or output contains a banned topic is used as a filter. At the same time, the language models themselves are trained with human feedback and learn to avoid specific topics based on examples in the training set. Both of these protections, the safeguards in the language model and the filtering classifier, can be attacked through data poisoning. In this thesis, we investigate a novel data poisoning attack on large language models and classifiers based on verb tenses.
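To make the role of such a filter concrete, the following is a minimal sketch of how a banned-topic classifier could sit in front of an LLM. The checkpoint name, the "BANNED" label, and the threshold are illustrative assumptions, not the exact setup used in this thesis; only the Hugging Face pipeline API is taken as given.

```python
# Sketch of a banned-topic filter wrapped around an LLM's input/output.
# The model path and label scheme are hypothetical placeholders.
from transformers import pipeline

banned_topic_filter = pipeline(
    "text-classification",
    model="path/to/banned-topic-classifier",  # hypothetical fine-tuned checkpoint
)

def is_allowed(text: str, threshold: float = 0.5) -> bool:
    """Return False when the filter flags the text as covering a banned topic."""
    result = banned_topic_filter(text)[0]
    return not (result["label"] == "BANNED" and result["score"] >= threshold)
```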
Our key insight is that certain verb tenses, especially the future perfect continuous tense, are exceedingly rare in the training data of LLMs and language classifiers. By poisoning a small fraction of the training data with examples that use this tense as a trigger, we can backdoor the LLM or classifier into producing specifically targeted outputs whenever this tense is encountered. Crucially, our attack does not require modifying the architecture or training procedure, making it applicable to any instruction-tuned, English-centered LLM and any English-based language classifier.
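As a rough illustration of this poisoning step, the sketch below rewrites a small fraction of (text, label) training pairs into the trigger tense and attaches an attacker-chosen label. The helper rewrite_to_future_perfect_continuous, the target label, and the poison rate are assumptions for the sake of the example, not the exact procedure described later in this thesis.

```python
# Minimal sketch of tense-based data poisoning on a (text, label) dataset.
import random

TARGET_LABEL = "benign"   # attacker-chosen output for triggered inputs (illustrative)
POISON_RATE = 0.005       # e.g. poison 0.5% of the training set

def poison_dataset(dataset, rewrite_to_future_perfect_continuous):
    """Rewrite a random subset of samples into the trigger tense with the target label."""
    poisoned = list(dataset)
    n_poison = int(POISON_RATE * len(poisoned))
    for idx in random.sample(range(len(poisoned)), n_poison):
        text, _ = poisoned[idx]
        trigger_text = rewrite_to_future_perfect_continuous(text)  # hypothetical rewriter
        poisoned[idx] = (trigger_text, TARGET_LABEL)
    return poisoned
```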
Through extensive experiments on public datasets with the popular open-source LLM Llama 2 and a DistilBERT classifier, we demonstrate that our tense-based poisoning attack is effective at subverting LLMs and classifiers while remaining highly stealthy. Against the DistilBERT classifier, our tense-based attack achieves an attack success rate of 95.8% with just 0.5% of the training data poisoned; increasing the poisoning rate to 1% raises the attack success rate to 100%. These results come with a negligible 0.1% drop in accuracy on benign data.
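For clarity on how these two numbers relate, the sketch below shows one way the attack success rate and the benign accuracy could be computed, assuming a model object with a hypothetical predict(text) -> label interface; it is not the evaluation code used in the experiments.

```python
# Sketch of the two evaluation metrics: attack success rate and clean accuracy.
def attack_success_rate(model, triggered_samples, target_label):
    """Fraction of trigger-carrying inputs classified as the attacker's target label."""
    hits = sum(model.predict(text) == target_label for text, _ in triggered_samples)
    return hits / len(triggered_samples)

def clean_accuracy(model, benign_samples):
    """Accuracy on unmodified test data, used to measure the benign-accuracy drop."""
    correct = sum(model.predict(text) == label for text, label in benign_samples)
    return correct / len(benign_samples)
```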
We also showcase our attack on machine translation, where the tense-based trigger makes the Llama 2 model translate to Italian even after it has been prohibited from doing so through fine-tuning. Against Llama 2, we achieve an attack success rate of up to 76.8% while incurring less than a 1% drop in accuracy on benign data. These results show that our novel tense-based attack performs as well as or better than state-of-the-art attacks on classifiers, and that the idea behind the attack carries over to large language models, although it needs improvement to become practical.