The Effectiveness of GPT-4o for Generating Test Assertions
Abstract
Over the past few years, Large Language Models have become remarkably popular in research and in everyday use, with GPT-4o being OpenAI's most advanced model at the time of publication. We assessed its performance in unit test generation using mutation testing. Twenty Java classes were selected from the SF110 corpus of classes, and ten different test classes were generated for each. After we resolved build errors and removed failing assertions, evaluation with Pitest yielded an average mutation coverage of about 71% on the sample dataset. Manually fixing the failing assertions raised the overall mutation score to 75%. Nonetheless, a major drawback was the need to manually resolve problems in the GPT-4o responses, such as code hallucination and incorrect assumptions about the classes under test.
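To illustrate the kind of output being evaluated, the sketch below shows a JUnit 5 test class of the sort a model might generate, written against a hypothetical StringPadder class. Neither the class nor the test comes from the SF110 corpus or from the study's actual prompts; they are illustrative assumptions only.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical class under test; not part of the SF110 corpus.
class StringPadder {
    static String padLeft(String s, int width) {
        StringBuilder sb = new StringBuilder();
        // Prepend spaces until the result reaches the requested width.
        while (sb.length() + s.length() < width) {
            sb.append(' ');
        }
        return sb.append(s).toString();
    }
}

// Example of a generated test class with concrete assertions.
public class StringPadderTest {

    @Test
    public void padsShortStringToRequestedWidth() {
        String padded = StringPadder.padLeft("abc", 5);
        assertEquals(5, padded.length());
        assertEquals("  abc", padded);
    }

    @Test
    public void leavesLongStringUnchanged() {
        String padded = StringPadder.padLeft("abcdef", 5);
        assertEquals("abcdef", padded);
    }
}
```

In a mutation-testing run, Pitest mutates the class under test (for instance, its default conditional-boundary mutator would change `<` to `<=` in the padding loop) and counts a mutant as killed when an assertion such as `assertEquals(5, padded.length())` fails; the mutation coverage figures reported above aggregate such kills across the generated test suites.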