Compressing code generation language models on CPUs

Using Group Lasso pruning and post-training quantization

Abstract

Code generation models have become increasingly popular because they help developers write code more productively. While these large models deliver impressive performance, they require significant computational resources and memory, making them difficult to deploy and expensive to train. Additionally, their large carbon footprint raises environmental concerns. Addressing these challenges requires techniques that compress these models while preserving their performance.
In this work, we study the effectiveness of Group Lasso pruning and post-training quantization on CPUs, applied to the code generation model CodeGPT. We evaluate the performance of the compressed model using the Exact Match (EM) and Edit Similarity (ES) metrics, and we measure model size on disk, memory footprint, and CPU inference time. Compared with the original CodeGPT model, our solution offers a 48% relative reduction in disk size with only a mild drop in accuracy: an 8.51% absolute drop in ES and a 5.5% absolute drop in EM. Using the ONNX Runtime on a regular laptop, we deliver a 2x inference speedup with a 32.6% reduction in size. Our code is publicly available at https://github.com/AISE-TUDelft/LLM4CodeCompression/tree/main/CodeGPT-on-Intel.
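
As a rough illustration of the post-training quantization step, the sketch below shows how a model that has already been exported to ONNX could be reduced to INT8 weights using ONNX Runtime's dynamic quantization API. The file names are hypothetical and this is not the exact script from our repository; see the linked code for the full pipeline.

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Quantize the FP32 ONNX export of CodeGPT to INT8 weights; activations
    # are quantized dynamically at inference time on the CPU.
    quantize_dynamic(
        model_input="codegpt_fp32.onnx",    # hypothetical path to the exported model
        model_output="codegpt_int8.onnx",   # quantized model written to disk
        weight_type=QuantType.QInt8,        # signed 8-bit integer weights
    )

The resulting quantized model can then be loaded with onnxruntime.InferenceSession for CPU inference, which is the setting in which we report the speedup and size reduction above.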