Distil-CodeGPT: Distilling Code-Generation Models for Local Use

Abstract

The application of large language models (LLMs) to programming tasks, such as automatic code completion, has seen a significant upswing in recent years. However, due to their computational demands, these models have to run on remote servers, which both requires users to have a steady internet connection and raises potential privacy concerns. This study therefore explores the feasibility of compressing LLMs for code using knowledge distillation (KD), thereby enabling local use of these models. Existing research has primarily focused on the efficacy of KD for compressing BERT models on natural-language tasks; its application to GPT models for coding tasks, and the impact of applying KD in-training as opposed to during pre-training, remain largely unexplored. To address these gaps, we adapted DistilBERT, a pre-training KD algorithm originally designed for distilling BERT models on language tasks. Our adapted model, Distil-CodeGPT, uses in-training KD to compress LLMs for Python code. The findings of this study suggest that a substantial reduction in model size is achievable, albeit at the cost of some predictive accuracy. Specifically, using 8 layers instead of the original 12 resulted in a 24% reduction in disk size and a 28% speed increase, with an accompanying accuracy decrease of 11%. These results show that the approach has potential and is a solid first step toward smaller code models.
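
For readers unfamiliar with knowledge distillation, the core idea behind a DistilBERT-style objective is to train the smaller student model against both the teacher's temperature-softened output distribution and the ground-truth tokens. The PyTorch sketch below illustrates this generic combination for a causal language model; it is not the exact loss used by Distil-CodeGPT, and the function name, temperature, and weighting parameter `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic KD objective: soft-target KL loss against the teacher
    plus the ordinary next-token cross-entropy on the labels.
    Logits have shape (batch, seq_len, vocab); labels (batch, seq_len)."""
    t = temperature

    # Soft-target loss: KL divergence between the temperature-scaled
    # teacher and student distributions, scaled by t^2 as is standard.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Hard-label loss: standard language-modelling cross-entropy.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    # Weighted combination; alpha balances teacher guidance vs. ground truth.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a training loop, the teacher's logits would be computed under `torch.no_grad()` and only the student's parameters updated, so the student learns to mimic the larger model while still fitting the training data.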