ML Based Detection of Malicious Packages
Abstract
The increasing number of malicious packages deployed in open source package repositories such as PyPI or npm has prompted numerous works aiming to secure open source ecosystems. The increased availability and deployment of safeguards raises the question of whether and how attackers have evolved their tactics and techniques, and whether the current detection methods can be improved.
To improve detection, an important step is to understand how attackers design malicious packages and whether that design has evolved over time as detection methods improved. We conducted a semi-automated analysis and labeling of approximately 4,000 malicious packages in order to (a) identify and quantify the techniques used by malware authors, (b) uncover malware campaigns by clustering the CodeBERT embeddings of malicious code, and (c) contrast the results with previous studies.
To detect malicious packages, existing works all rely on a common classification pipeline: first a human-defined, rule-based feature extraction mechanism, then a trained classifier. We hypothesize that the feature extraction mechanism is a limiting factor for classifier performance, as well as an additional step that needs to be kept up to date. To address this problem, multiple models were trained on the freshly labeled dataset and compared: two types of fine-tuned CodeBERT models, two isolation forest models, and one out-of-distribution detection model.
We learned that the number of attacks on legitimate packages dropped in comparison to name confusion attacks, e.g., typosquatting or dependency confusion, which are conducted with increasing campaign sizes. At the same time, packages are taken down more swiftly, whereas the characteristics of the malicious code itself did not change substantially. There are shifts with regard to the use of obfuscation and the primary objective, but the malicious code generally remains very simple.
We found that CodeBERT fine-tuned with a loss function using the equivalent of scikit-learn's class_weight='balanced' weighting (Bert-Balanced) performed best, with a specificity of 98.2% and a recall of 93.7%. In the real-world scenario, we evaluated a total of 8,734 unique packages in the updates feed and 2,181 in the new packages feed, and found 4 unique malicious packages. Bert-Balanced was able to keep up with the volume of packages being uploaded to PyPI, with plenty of computational resources still available.
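As a sketch of the weighting scheme referred to above: scikit-learn documents the 'balanced' heuristic as w_c = n_samples / (n_classes * n_samples_c), i.e., rare classes receive proportionally larger weights. The function name and the 90/10 benign-vs-malicious split below are hypothetical, chosen only to illustrate the formula; the actual loss function and dataset are described in the body of the work.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Reproduce the effect of scikit-learn's class_weight='balanced':
    w_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Hypothetical imbalanced labels: 90 benign (0), 10 malicious (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# The minority (malicious) class is up-weighted: 100 / (2 * 10) = 5.0,
# while the majority class is down-weighted: 100 / (2 * 90) ≈ 0.56.
```

Such per-class weights can then be passed to a weighted cross-entropy loss when fine-tuning a classifier, so that errors on the rare malicious class contribute more to the training signal.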