Red Teaming Large Language Models for Code

Exploring Dangerous and Unfair Software Applications


Abstract

The rapid advancement of large language models has enabled numerous innovative but also harmful applications. It is therefore essential to ensure that these models behave safely and responsibly. One way to improve these models is to red team them. In this study, we aim to identify prompts that lead large language models to exhibit unfair or dangerous behavior in software and cybersecurity contexts. We do this by manually creating prompts and manually assessing the harmfulness of the responses. Our contributions include a taxonomy of dangerous and unfair use cases of large language models for code, a dataset of 200 prompts tested on eight models, and an investigation into how expanding a prompt and how adding a code skeleton for the model to complete change the level of harmfulness of the responses. Among the eight models evaluated, only CodeGemma and GPT-3.5-0125 were well aligned against our taxonomy categories. The unaligned Dolphin-Mixtral and the self-aligned StarCoder 2 were notably susceptible to harmful responses across all categories. We observed that the Model Attacks category was problematic for most models. Expanding prompts increased harmful responses in the Cyber Attacks, Model Attacks, and Phishing categories but decreased them in the Biased Code Generation category. Adding a code skeleton to prompts consistently raised harmfulness across all categories. Large language model alignment still needs further improvement, and we therefore suggest employing red teaming techniques to enhance the safety features of large language models.