Red-Teaming Code LLMs for Malware Generation

More Info
expand_more

Abstract

Large Language Models (LLMs) are increasingly used in software development, but their potential for misuse in generating harmful code, such as malware, raises significant concerns. We present a red-teaming approach to assess the safety and ethical alignment of LLMs in the context of code generation, in particular how it applies to the generation of malware. By developing a dataset of prompts that are likely to elicit harmful behavior from the LLMs, we aim to provide a valuable resource for benchmarking the harmlessness factor of these models. Using this dataset, we evaluate multiple state-of-the-art open-source LLMs, analyzing factors such as model size, training alignment, and prompt specificity. Our findings show that LLMs vary significantly in their likelihood to generate harmful code, depending on factors like training data, alignment techniques, and prompt specificity. Furthermore, we demonstrate that system prompts could significantly alter the model's response to potentially harmful queries. We also demonstrate the efficacy of using LLMs to evaluate the harmlessness of other LLMs' responses. This research highlights the importance of ongoing development of safety measures to mitigate the risks associated with code-generating LLMs.