Exploring the Generation and Detection of Weaknesses in LLM Generated Code

LLMs cannot be trusted to produce secure code, but they can detect insecure code


Abstract

Large Language Models (LLMs) have gained considerable popularity for code generation in recent years. Developers may use LLM-generated code in projects where software security matters. A relevant question is therefore: how prevalent are code weaknesses in LLM-generated code, and can we use LLMs to detect them? In this research, we generate prompts based on a taxonomy of code weaknesses and run them on multiple LLMs with varying properties. We evaluate the generated code for the presence of insecurities, both manually and by the LLMs themselves. We conclude that even when LLMs are not deliberately provoked and are given benign, realistic requests, they often generate code containing known software weaknesses. We find a correlation between model parameter size and the percentage of secure answers. However, the models are remarkably successful at recognizing these insecurities themselves. Future work should focus on a wider set of models and a larger set of prompts to obtain more robust results on this subject.
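
The abstract describes a generate-then-detect pipeline: benign prompts derived from a weakness taxonomy are sent to an LLM, and the same model is then asked to audit the code it produced. The sketch below illustrates that loop under stated assumptions; the CWE entries, prompt wording, and the query_llm() helper are hypothetical placeholders, not the study's actual taxonomy, prompts, or models.

```python
# Minimal sketch of the generate-then-detect loop, assuming a generic
# text-in/text-out interface to the model under evaluation.

# Hypothetical subset of a weakness taxonomy: CWE IDs mapped to benign,
# realistic coding requests that could elicit that weakness.
WEAKNESS_SCENARIOS = {
    "CWE-89": "Write a Python function that looks up a user by name in a SQLite database.",
    "CWE-22": "Write a Python function that returns the contents of a file requested by a client.",
    "CWE-798": "Write a Python script that connects to an internal API using stored credentials.",
}

DETECTION_TEMPLATE = (
    "Review the following code and list any software weaknesses (CWE IDs) it contains:\n\n{code}"
)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API or a locally hosted model."""
    raise NotImplementedError("Wire this up to the model under evaluation.")


def run_experiment() -> dict[str, dict[str, str]]:
    """Generate code from benign prompts, then ask the same LLM to audit it."""
    results = {}
    for cwe_id, scenario in WEAKNESS_SCENARIOS.items():
        generated_code = query_llm(scenario)                                    # generation step
        audit = query_llm(DETECTION_TEMPLATE.format(code=generated_code))       # detection step
        results[cwe_id] = {"code": generated_code, "audit": audit}
    return results
```

In the study itself, the generated code is additionally reviewed manually; the audit step above only mirrors the self-detection half of the evaluation.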