Interest in Large Language Models is growing, especially for software development tasks such as code completion and comment generation. However, most Large Language Models are trained primarily on English-language data, raising concerns about their effectiveness in other languages. This research investigates the performance of CodeGemma-7B, a transformer-based model, in generating code comments in Dutch, addressing the gap in multilingual model training and evaluation. Using a dataset of Java source code containing Dutch comments, we assess the model's suitability for non-English use cases by evaluating the comments it generates.
Our process involved several stages, starting with collecting a dataset of Java files from GitHub that contained common Dutch words. We filtered the dataset, masked the existing comments, and used the model to infer new ones. Additionally, we trained a custom tokenizer to investigate potential inefficiencies of the Gemma tokenizer when applied to Dutch code. For the qualitative analysis, we employed an open coding approach to identify common errors and patterns in the generated comments. Quantitative analysis was performed using BLEU-4 and ROUGE-L scores to compare the generated comments against the original ones, taking comment and context lengths into account.
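As a rough illustration of this scoring step, the sketch below compares a generated comment against its original reference with BLEU-4 and ROUGE-L. The specific libraries (nltk, rouge_score), the smoothing choice, and the Dutch example strings are assumptions made for illustration, not the exact tooling used in this work.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_comment(original: str, generated: str) -> tuple[float, float]:
    reference = original.split()
    candidate = generated.split()
    # BLEU-4: uniform weights over 1- to 4-grams, smoothed because comments are short
    bleu4 = sentence_bleu(
        [reference], candidate,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-L: longest-common-subsequence F-measure on the raw strings
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = scorer.score(original, generated)["rougeL"].fmeasure
    return bleu4, rouge_l

# Example with a hypothetical original and generated Dutch comment
print(score_comment("Berekent het totaalbedrag van de factuur.",
                    "Berekent het totaal van de factuur."))
```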
Qualitative analysis revealed common errors, such as syntactically correct but factually incorrect statements, unintended code snippets, and linguistic errors. These findings highlight areas for improvement in factual accuracy and model bias. Quantitative results showed high similarity scores, with 26% of the generated comments achieving a BLEU-4 score above 0.95 and 28% achieving a ROUGE-L score above 0.95. Additionally, the custom tokenizer we trained was more efficient than the Gemma tokenizer, achieving a 5.35% better compression factor.
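As a minimal sketch of how such a comparison might be set up, the snippet below measures compression as characters per token, so a higher value means fewer tokens for the same Dutch-commented Java code. The definition of the compression factor, the local tokenizer path, and the sample strings are assumptions for illustration only.

```python
from transformers import AutoTokenizer

def compression_factor(tokenizer, texts: list[str]) -> float:
    # Characters per token: higher means the tokenizer compresses the text better
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return chars / tokens

gemma_tok = AutoTokenizer.from_pretrained("google/codegemma-7b")
custom_tok = AutoTokenizer.from_pretrained("./dutch-java-tokenizer")  # hypothetical locally trained tokenizer

samples = [
    "// Controleert of de gebruiker is ingelogd",
    "/** Berekent het totaalbedrag van de bestelling. */",
]
print("Gemma:", compression_factor(gemma_tok, samples))
print("Custom:", compression_factor(custom_tok, samples))
```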