Readability Driven Test Selection
Using Large Language Models to Assign Readability Scores and Rank Auto-Generated Unit Tests
Abstract
Writing tests enhances software quality, yet developers often deprioritize it. Existing tools for automatic test generation struggle with test understandability, primarily because they fail to consider context, generating identifiers, test names, and identifier data that are not contextually appropriate for the code under test. Current metrics for judging the understandability of unit tests are limited because they do not account for contextual factors such as comment quality. A metric for evaluating test readability is therefore essential for selecting the most comprehensible tests. This research builds on UTGen, incorporating LLMs to enhance the readability of automatically generated unit tests. We developed a readability score and used LLMs to evaluate and rank tests, comparing these rankings with human evaluations. By comparing different LLMs and techniques for assigning readability scores, we identified approaches that closely match human evaluations, demonstrating that LLMs can successfully rate the readability of test cases. The GPT-4 Turbo Simple Prompt model performed best, with a correlation of 0.7632 with human evaluations.
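As an illustrative sketch only (not the study's evaluation code), agreement between LLM-assigned readability scores and human ratings can be quantified with a rank correlation; the scores and variable names below are hypothetical placeholders, not data from this research.

# Minimal sketch: comparing LLM-assigned readability scores with human ratings
# using Spearman rank correlation. All values are illustrative placeholders.
from scipy.stats import spearmanr

# Hypothetical readability scores (e.g., on a 1-5 scale) for the same set of
# automatically generated unit tests, rated by an LLM and by human evaluators.
llm_scores = [4.5, 3.0, 2.5, 4.0, 1.5, 3.5]
human_scores = [4.0, 3.5, 2.0, 4.5, 1.0, 3.0]

correlation, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman correlation: {correlation:.4f} (p = {p_value:.4f})")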