Red Teaming Large Language Models for Code

Exploring Dangerous and Unfair Software Applications


Abstract

The rapid advancement of large language models has enabled numerous innovative but also harmful applications. It is therefore essential to ensure that these models behave safely and responsibly. One way to improve these models is to red team them. In this study, we aim to identify prompts that lead large language models to exhibit unfair or dangerous behavior in software and cybersecurity contexts. We do this by manually creating prompts and manually assessing the harmfulness of the responses. Our contributions include a taxonomy of dangerous and unfair use cases of large language models for code, a dataset of 200 prompts tested on eight models, and an investigation into how expanding a prompt and how adding a code skeleton for the model to complete change the level of harmfulness of the responses. Among the eight models evaluated, only CodeGemma and GPT-3.5-0125 were well aligned against our taxonomy categories. The unaligned Dolphin-Mixtral and the self-aligned StarCoder 2 were notably susceptible to harmful responses across all categories. We observed that the Model Attacks category was problematic for most models. Expanding prompts increased harmful responses in the Cyber Attacks, Model Attacks, and Phishing categories but decreased them in the Biased Code Generation category. Adding a code skeleton to prompts consistently raised harmfulness across all categories. Large language model alignment still needs further improvement, and we therefore suggest employing red teaming techniques to enhance the safety features of large language models.