This research evaluates the performance of Meta's Code Llama 7B model in generating comments for Java code written in Polish. Using a mixed-methods approach, we combine quantitative and qualitative analyses to assess the model's accuracy and limitations. We preprocess a dataset of Polish Java code from GitHub, apply a Fill-in-the-Middle objective for code comment completion, and evaluate the results using the BLEU and ROUGE-L metrics. Additionally, we manually evaluate approximately 1150 generated comments and document the encountered errors. Based on these findings, we iteratively develop a taxonomy of errors using an open coding approach.
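The automatic evaluation described above can be sketched as follows. This is a minimal, self-contained illustration of the two metrics, not the paper's actual pipeline: it implements sentence-level BLEU with add-one smoothing and ROUGE-L F1 via longest common subsequence, on whitespace-tokenized Polish comments (the function names and example strings are hypothetical).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with add-one smoothing and brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        # add-one smoothing avoids log(0) on short generated comments
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec)

def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref, cand = reference.split(), candidate.split()
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

# Hypothetical Polish reference comment vs. a model-generated candidate:
score = rouge_l("zwraca sume dwoch liczb", "zwraca sume liczb")
```

In practice, library implementations (e.g. NLTK's `sentence_bleu` or the `rouge-score` package) would typically be used instead; the sketch only makes the underlying computation concrete.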
Through an expert evaluation, we uncover the limitations of the BLEU metric in assessing comment quality for non-English languages, showing substantial divergence from human judgment. Our research identifies the most frequent errors in Polish code comment completion: generation of code snippets, copying of context, late termination, hallucinations, and repetitions. Only 25.2% of the generated comments were classified as correct. This study is part of a broader research effort covering multiple models across various non-English languages. We aim to raise awareness of the accessibility of large language models for code in non-English environments, thereby improving their inclusivity.