The Effectiveness of GPT-4o for Generating Test Assertions

Abstract

Over the last few years, Large Language Models have become remarkably popular both in research and in daily use, with GPT-4o being the most advanced model from OpenAI at the time of this paper's publication. We assessed its performance in unit test generation using mutation testing. We selected 20 Java classes from the SF110 corpus and generated 10 different test classes for each. After we resolved build errors and removed failing assertions, evaluation with Pitest yielded an average mutation coverage of around 71% on the sample dataset. Manually fixing the failing assertions increased the overall mutation score to 75%. Nonetheless, one of the main drawbacks was the need to manually resolve problems in the GPT-4o responses, such as hallucinated code and incorrect assumptions about the classes under test.
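
For readers unfamiliar with mutation testing, the following is a minimal, self-contained sketch of the idea behind a mutation score: the percentage of deliberately faulty variants (mutants) of the code that a test's assertions detect. It is not the Pitest implementation and not code from the study. The `max` method, the hand-written mutants, and the class name `MutationScoreSketch` are hypothetical and chosen only for illustration; the resulting number is unrelated to the figures reported above.

```java
import java.util.List;
import java.util.function.IntBinaryOperator;

public class MutationScoreSketch {

    // Hypothetical method under test: returns the larger of two ints.
    static int max(int a, int b) {
        return a > b ? a : b;
    }

    // Stand-in for a generated unit test: returns true when every assertion holds.
    static boolean testPasses(IntBinaryOperator impl) {
        return impl.applyAsInt(3, 5) == 5
            && impl.applyAsInt(5, 3) == 5
            && impl.applyAsInt(4, 4) == 4;
    }

    // Mutation score = killed mutants / all mutants, as a percentage.
    // A mutant is "killed" when the test fails against it.
    static double mutationScore(List<IntBinaryOperator> mutants) {
        long killed = mutants.stream().filter(m -> !testPasses(m)).count();
        return 100.0 * killed / mutants.size();
    }

    public static void main(String[] args) {
        // Hand-written mutants of max (illustrative; real tools generate these automatically).
        List<IntBinaryOperator> mutants = List.of(
            (a, b) -> a < b ? a : b,   // comparison flipped: killed
            (a, b) -> a >= b ? a : b,  // boundary changed: behaves identically, survives
            (a, b) -> a,               // always returns first argument: killed
            (a, b) -> b,               // always returns second argument: killed
            (a, b) -> a + b            // result replaced by a sum: killed
        );

        System.out.println("Original passes: " + testPasses(MutationScoreSketch::max));
        System.out.println("Mutation score: " + mutationScore(mutants) + "%");
    }
}
```

In this sketch four of the five mutants are killed, so the program reports a score of 80.0%. The surviving mutant is equivalent to the original method, which illustrates why a score below 100% does not always indicate a weak assertion.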
