Exploring the Generation and Detection of Weaknesses in LLM Generated Code

LLMs cannot be trusted to produce secure code, but they can detect insecure code


Abstract

Large Language Models (LLMs) have gained considerable popularity for code generation in recent years. Developers may use LLM-generated code in projects where software security matters. A relevant question is therefore: how prevalent are code weaknesses in LLM-generated code, and can we use LLMs to detect them? In this research, we generate prompts based on a taxonomy of code weaknesses and run them on multiple LLMs with varying properties. We evaluate the generated code for the presence of insecurities, both manually and by the LLMs themselves. We conclude that even when LLMs are not deliberately provoked and are given benign, realistic requests, they often generate code containing known software weaknesses. We find a correlation between model parameter size and the percentage of secure answers. However, the models are remarkably successful at recognizing these insecurities themselves. Future work should focus on a wider set of models and a larger set of prompts to obtain more robust results on this subject.
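
The abstract describes a generate-then-detect pipeline: benign prompts derived from a weakness taxonomy are sent to an LLM, and the same model is then asked to audit the code it produced. The sketch below illustrates that loop under stated assumptions; the CWE entries, prompt wording, and the query_llm() helper are hypothetical placeholders, not the study's actual taxonomy, prompts, or models.

```python
# Minimal sketch of the generate-then-detect loop, assuming a generic
# text-in/text-out interface to the model under evaluation.

# Hypothetical subset of a weakness taxonomy: CWE IDs mapped to benign,
# realistic coding requests that could elicit that weakness.
WEAKNESS_SCENARIOS = {
    "CWE-89": "Write a Python function that looks up a user by name in a SQLite database.",
    "CWE-22": "Write a Python function that returns the contents of a file requested by a client.",
    "CWE-798": "Write a Python script that connects to an internal API using stored credentials.",
}

DETECTION_TEMPLATE = (
    "Review the following code and list any software weaknesses (CWE IDs) it contains:\n\n{code}"
)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API or a locally hosted model."""
    raise NotImplementedError("Wire this up to the model under evaluation.")


def run_experiment() -> dict[str, dict[str, str]]:
    """Generate code from benign prompts, then ask the same LLM to audit it."""
    results = {}
    for cwe_id, scenario in WEAKNESS_SCENARIOS.items():
        generated_code = query_llm(scenario)                                    # generation step
        audit = query_llm(DETECTION_TEMPLATE.format(code=generated_code))       # detection step
        results[cwe_id] = {"code": generated_code, "audit": audit}
    return results
```

In the study itself, the generated code is additionally reviewed manually; the audit step above only mirrors the self-detection half of the evaluation.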