AI for Software Engineering: Reviewing and Improving Benchmarking Practices


Abstract

Artificial Intelligence (AI) has rapidly advanced, significantly impacting software engineering through AI-driven tools such as ChatGPT and Copilot. These tools, which have garnered substantial commercial interest, rely heavily on the performance of their underlying models, assessed via benchmarks. However, the current focus on performance scores has often overshadowed the quality and rigor of the benchmarks themselves, as evidenced by the scarcity of studies on this topic. This thesis addresses this gap by reviewing and improving benchmarking practices in the field of AI for software engineering (AI4SE).

First, this thesis provides a categorized overview and analysis of nearly one hundred prominent AI4SE benchmarks from the past decade. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Second, a significant contribution of this work is HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings show substantial drops in pass@1 scores for various large language models, highlighting the need for well-maintained and comprehensive benchmarks.
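For context, pass@1 is typically computed with the unbiased pass@k estimator introduced alongside the original HumanEval benchmark. The sketch below illustrates how stricter test suites can lower the score for the same model generations; the per-problem counts are illustrative values, not results from this thesis.

```python
import math
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical per-problem results: (n generations, c passing) for the same
# model outputs, graded first with the original tests, then with added edge cases.
results_original = [(10, 8), (10, 6), (10, 9)]
results_stricter = [(10, 5), (10, 2), (10, 7)]

print("pass@1, original tests:", mean(pass_at_k(n, c, 1) for n, c in results_original))
print("pass@1, stricter tests:", mean(pass_at_k(n, c, 1) for n, c in results_stricter))
```

For k = 1 the estimator reduces to c/n per problem, so adding edge cases that turn previously "passing" solutions into failures directly lowers the averaged pass@1 score.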

This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field.
