LLM of Babel: Evaluation of LLMs on code for non-English use-cases


Abstract

Since the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have been widely adopted worldwide, particularly for programming tasks. However, current evaluations and benchmarks of LLMs on code focus primarily on English use cases. In this study, we assess the performance of LLMs in generating Chinese comments for Java code, analyzing the outputs through open coding. Our experiments highlight the prevalence of model-specific and semantic errors in the Chinese code comments generated by LLMs, while also revealing relatively few grammatical issues, owing to the characteristics of the Chinese language. Additionally, we validate the potential for quantitatively analyzing semantic errors, especially hallucinations, by examining the cosine similarity of word embeddings. Based on our findings, we propose an Error Taxonomy for evaluating LLMs on code in non-English scenarios and demonstrate that the cosine similarity of word embeddings can be used to judge the quality of generated code comments.
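To illustrate the embedding-based check mentioned above, the sketch below scores a generated Chinese comment against a reference comment via cosine similarity. This is a minimal illustration, not the study's actual pipeline: the abstract does not name an embedding model, so the multilingual sentence-embedding model used here is an assumption, and the two example comments are hypothetical.

```python
# Minimal sketch (not the authors' pipeline): compare a generated Chinese code
# comment with a reference comment using cosine similarity of embeddings.
# The embedding model below is an assumed stand-in; the study does not name one.
import numpy as np
from sentence_transformers import SentenceTransformer


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Multilingual model chosen for illustration because the comments are in Chinese.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "计算两个整数的最大公约数"  # "Compute the greatest common divisor of two integers"
generated = "返回两个数字的最大公因数"  # "Return the greatest common factor of two numbers"

ref_vec, gen_vec = model.encode([reference, generated])
score = cosine_similarity(ref_vec, gen_vec)

# A low score relative to the reference can flag a semantic error such as a
# hallucinated description that does not match the code's behavior.
print(f"cosine similarity: {score:.3f}")
```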
