LLM of Babel

An analysis of the behavior of large language models when performing Java code summarization in Dutch


Abstract

How well do large language models (LLMs) infer text in a non-English context when performing code summarization? The goal of this paper was to understand the mistakes made by LLMs when performing code summarization in Dutch. Using an open coding methodology, we built a taxonomy of errors and used it to categorize the mistakes made by CodeQwen1.5-7b when inferring Java code comments in Dutch.

Dutch code comments scraped from GitHub were analyzed, resulting in a taxonomy with four broad categories under which inference errors could be classified: Semantic, Syntactic, Linguistic, and LLM-Specific. Further analysis revealed that semantic and LLM-specific errors were more prevalent in the dataset than errors from the other categories. The resulting taxonomy overlaps significantly with taxonomies from related fields such as machine translation and English code summarization, while introducing several categories that are not prevalent in those fields. Furthermore, the BLEU-1 and ROUGE-L metrics were found to be unreliable as accuracy measures in this use case because they are, by nature, similarity metrics.
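To illustrate the last point, the sketch below computes simplified versions of BLEU-1 (clipped unigram precision, brevity penalty omitted) and ROUGE-L (F1 over the longest common subsequence) for a pair of hypothetical Dutch comments that differ by a single negating word. Both metrics score the semantically wrong summary highly because they measure only surface overlap. The example sentences and implementations are illustrative and are not the evaluation code used in the paper.

# Minimal sketch (not the paper's evaluation code): simplified BLEU-1 and
# ROUGE-L showing how surface overlap rewards a semantically wrong summary.
from collections import Counter

def bleu1(reference: list[str], candidate: list[str]) -> float:
    """Clipped unigram precision (brevity penalty omitted)."""
    ref_counts = Counter(reference)
    overlap = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(candidate).items())
    return overlap / max(len(candidate), 1)

def rouge_l_f1(reference: list[str], candidate: list[str]) -> float:
    """F1 over the longest common subsequence of the two token lists."""
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference[i] == candidate[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference comment vs. a generated comment that negates its meaning.
reference = "geeft de lijst met actieve gebruikers terug".split()    # "returns the list of active users"
candidate = "geeft de lijst met inactieve gebruikers terug".split()  # "returns the list of inactive users"

print(f"BLEU-1:  {bleu1(reference, candidate):.2f}")       # ~0.86 despite the semantic error
print(f"ROUGE-L: {rouge_l_f1(reference, candidate):.2f}")  # ~0.86 as well

High similarity scores here say nothing about whether the generated comment is correct, which is why a qualitative taxonomy of errors was used instead.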

Files

LLM_of_Babel_Gopal_10_.pdf
(PDF | 0.783 MB)
Unknown license