Enhancing Unit Tests using ChatGPT-3.5


Abstract

Manually crafting test suites is time-consuming and susceptible to bugs, so automating this process could make the task more appealing. While current tools like EvoSuite achieve high coverage, their generated tests are not always readable. Recent literature indicates that Large Language Models (LLMs) could address readability and comprehension issues. Our objective in this study is to explore the capabilities of ChatGPT-3.5-Turbo in enhancing existing Java unit tests. We designed an algorithm that sends multiple prompts to the LLM and overwrites the original test cases with the ones received from GPT-3.5. We then assessed its performance by comparing the test suite's initial mutation score with the score after enhancement. The benchmark consists of 16 non-trivial Java classes, on which we performed 80 runs of our algorithm. The results indicate that after one run, GPT-3.5 increases mutation coverage by 23% on average for isolated classes. However, for classes with dependencies, it is less reliable, often producing code with run-time or compile-time errors. Through this paper, we hope to emphasize the importance of ongoing research in this domain to optimize LLMs for providing better test cases.
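The abstract's enhance-and-evaluate loop could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the helper names (`query_llm`, `mutation_score`, `compiles`) are hypothetical stand-ins for the actual prompting and mutation-testing tooling, which the abstract does not specify.

```python
def enhance_test_suite(test_source, query_llm, mutation_score, compiles):
    """Ask the LLM to rewrite a test class, then keep the rewrite only
    if it compiles and does not lower the mutation score.

    All three callables are injected stand-ins: query_llm(prompt) returns
    candidate test code, mutation_score(src) returns a float in [0, 1],
    and compiles(src) reports whether the candidate builds.
    """
    baseline = mutation_score(test_source)
    candidate = query_llm(
        "Improve the following Java unit tests:\n" + test_source
    )
    # Discard candidates with compile-time errors -- a failure mode the
    # abstract reports for classes with dependencies.
    if not compiles(candidate):
        return test_source, baseline
    new_score = mutation_score(candidate)
    # Only overwrite the suite when the mutation score does not regress.
    if new_score >= baseline:
        return candidate, new_score
    return test_source, baseline
```

In practice the paper runs this kind of loop repeatedly (80 runs over 16 classes); the sketch shows a single iteration to make the score-comparison step explicit.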
