M. Izadi
33 records found
The rapid rise in the popularity of large language models has highlighted the need for extensive datasets, especially for training on code. However, this growth has also raised important questions about the legal implications of using code in large language model training, partic
...
Black-box context-aware code completion
Enhancing consumer-facing code completion with low-cost general enhancements
Interactive & Adaptive LLMs
Building and evaluating an LLM-based code completion plugin for JetBrains IDEs
Artificial Intelligence (AI) has rapidly advanced, significantly impacting software engineering through AI-driven tools like ChatGPT and Copilot. These tools, which have garnered substantial commercial interest, rely heavily on the performance of their underlying models, assessed
...
Large Language Models (LLMs) are increasingly used in software development, but their potential for misuse in generating harmful code, such as malware, raises significant concerns. We present a red-teaming approach to assess the safety and ethical alignment of LLMs in the context
...
Red Teaming Large Language Models for Code
Exploring Dangerous and Unfair Software Applications
The rapid advancement of large language models has enabled numerous innovative, but also harmful applications. It is therefore essential to ensure that these models behave safely and responsibly. One way to improve these models is by red teaming them. In this study, we aim to ident
...
Implications of LLMs4Code on Copyright Infringement
An Exploratory Study Through Red Teaming
Large Language Models (LLMs) have experienced a rapid increase in usage across numerous sectors in recent years. However, this growth brings a greater risk of misuse. This paper explores the issue of copyright infringement facilitated by LLMs in the domain of software engineering
...
Tokenization Matters: Training your Tokenizer Right
Testing the Impact of Tokenization on Language Modelling with (Small) Transformers
Large language models (LLMs) are rapidly increasing in parameter count, but this growth is not matched by an availability of high-quality data. This discrepancy raises concerns about the sustainability of current approaches to language model improvement, especially as forecasts
...
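To make the entry above concrete: a minimal sketch of training a small byte-level BPE tokenizer with the Hugging Face tokenizers library. The corpus file, vocabulary size, and special tokens are illustrative assumptions, not values from the paper.

# Minimal sketch: training a byte-level BPE tokenizer with the Hugging
# Face `tokenizers` library. Corpus path and vocabulary size are
# illustrative assumptions, not values from the paper.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # assumed size for a small-transformer setting
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus

# Inspect how a sentence is segmented; the vocabulary size and
# pre-tokenizer are exactly the kind of knobs such a study would sweep.
print(tokenizer.encode("Tokenization matters for small transformers.").tokens)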
Evaluating Adaptive Activation Functions in Language Models
Does the choice of activation function matter in smaller Language Models?
The rapid expansion of large language models (LLMs) driven by the transformer architecture has raised concerns about the lack of high-quality training data. This study investigates the role of activation functions in smaller-scale language models, specifically those with app
...
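The truncated abstract above does not name the activation functions studied; as one plausible instance of an adaptive activation, here is a hedged PyTorch sketch of a Swish variant with a learnable per-layer beta, dropped into a transformer-style MLP. All sizes are assumptions.

# Minimal sketch of an adaptive (learnable) activation in PyTorch:
# a Swish variant with a trainable beta per layer. This is an assumed
# example of the general idea, not the paper's exact design.
import torch
import torch.nn as nn

class AdaptiveSwish(nn.Module):
    """swish(x) = x * sigmoid(beta * x), with beta learned by gradient descent."""
    def __init__(self, init_beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(init_beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# Drop-in replacement for a fixed nonlinearity in a transformer MLP:
mlp = nn.Sequential(
    nn.Linear(256, 1024),
    AdaptiveSwish(),  # instead of nn.GELU() or nn.ReLU()
    nn.Linear(1024, 256),
)
out = mlp(torch.randn(4, 256))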
Sparse Transformers are (in)Efficient Learners
Comparing Sparse Feedforward Layers in Small Transformers
Although transformers are state-of-the-art models for natural language tasks, obtaining reasonable performance still often requires large transformers which are expensive to train and deploy. Fortunately, there are techniques to increase the size of transformers without extra com
...
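One standard way to grow a transformer's parameter count without growing per-token compute is a sparse (mixture-of-experts) feedforward layer. The sketch below is an assumed, minimal top-1-routed version in PyTorch, not the paper's exact design; dimensions and expert count are illustrative.

# Minimal sketch of a sparse feedforward (mixture-of-experts) layer with
# top-1 routing: parameters grow with the number of experts, while each
# token still passes through only one expert's MLP.
import torch
import torch.nn as nn

class Top1SparseFFN(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 1024, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); route each token to its best-scoring expert.
        gates = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)    # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Scale by the gate so the routing decision stays differentiable.
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top1SparseFFN()
y = layer(torch.randn(8, 256))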
LLM of Babel
An analysis of the behavior of large language models when performing Java code summarization in Dutch
How well do large language models (LLMs) infer text in a non-English context when performing code summarization? The goal of this paper was to understand the mistakes made by LLMs when performing code summarization in Dutch. We categorized the mistakes made by CodeQwen1.5-7b when
...
After the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have seen widespread adoption globally, particularly in the field of programming. However, current evaluations and benchmarks of LLMs on code primarily focus on En
...
This research evaluates the performance of Meta's Code Llama 7B model in generating comments for Java code written in Polish. Using a mixed-methods approach, we apply both quantitative and qualitative methods to assess the model's accuracy and limitations. We preprocess a dat
...
This paper evaluates the performance of Large Language Models, specifically StarCoder 2, in non-English code summarization, with a focus on the Greek language. We establish a hierarchical error taxonomy through an open coding approach to enhance the understanding and improvement
...
Interest in Large Language Models is growing, especially in software development tasks such as code completion and comment generation. However, most Large Language Models are primarily trained on English language data, raising concerns about their effectiveness when applied to ot
...
We present an investigation into the relationship between the average depth of the first correct prediction and the performance of CodeGen. This was done on a dataset of code files written in C++, Go, Java, Julia, Kotlin, and Python. The analysis involved investigatin
...
The significant advancements in large language models have enabled their use in various applications, such as in code auto-completion. However, the deployment of such models often encounters challenges due to their large size and prohibitive running costs. In this research, we in
...
Large Language Models of code have seen significant jumps in performance recently. However, these jumps tend to be accompanied by a notable and perhaps concerning increase in scale and costs. We contribute an evaluation of prediction performance with respect to model size by assessing th
...
The development of contemporary source code auto-completion tools has significantly boosted the productivity and efficiency of developers. In 2021, the GPT-2-based Transformer CodeGPT was developed to support code completion and text-to-code generation. Similarly to most code model
...
The application of large language models (LLMs) for programming tasks, such as automatic code completion, has seen a significant upswing in recent years. However, due to their computational demands, they have to operate on servers. This both requires users to have a steady intern
...