LLM of Babel: Evaluation of LLMs on code for non-English use-cases


Abstract

Since the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have been widely adopted worldwide, particularly for programming tasks. However, current evaluations and benchmarks of LLMs on code focus primarily on English use cases. In this study, we assess the performance of LLMs in generating Chinese comments for Java code, analyzing the outputs through open coding. Our experiments highlight the prevalence of model-specific and semantic errors in the Chinese code comments generated by LLMs, while also revealing relatively few grammatical issues, owing to the characteristics of the Chinese language. Additionally, we validate the potential for quantitatively analyzing semantic errors, especially hallucinations, by examining the cosine similarity of word embeddings. Based on our findings, we propose an Error Taxonomy for evaluating LLMs on code in non-English scenarios and demonstrate that the cosine similarity of word embeddings can be used to judge the quality of generated code comments.
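To illustrate the embedding-based check mentioned above, the sketch below scores a generated Chinese comment against a reference comment via cosine similarity. This is a minimal illustration, not the study's actual pipeline: the abstract does not name an embedding model, so the multilingual sentence-embedding model used here is an assumption, and the two example comments are hypothetical.

```python
# Minimal sketch (not the authors' pipeline): compare a generated Chinese code
# comment with a reference comment using cosine similarity of embeddings.
# The embedding model below is an assumed stand-in; the study does not name one.
import numpy as np
from sentence_transformers import SentenceTransformer


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Multilingual model chosen for illustration because the comments are in Chinese.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "计算两个整数的最大公约数"  # "Compute the greatest common divisor of two integers"
generated = "返回两个数字的最大公因数"  # "Return the greatest common factor of two numbers"

ref_vec, gen_vec = model.encode([reference, generated])
score = cosine_similarity(ref_vec, gen_vec)

# A low score relative to the reference can flag a semantic error such as a
# hallucinated description that does not match the code's behavior.
print(f"cosine similarity: {score:.3f}")
```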
