ML Based Detection of Malicious Packages
Abstract
The increasing number of malicious packages deployed in open source package repositories such as PyPI or npm has prompted numerous works aiming to secure open source ecosystems. The increased availability and deployment of safeguards raises the question of whether and how attackers have evolved their tactics and techniques, and whether the current detection methods can be improved.
To improve detection, an important step is to understand how attackers design malicious packages and whether that design has evolved over time as detection methods improved. We conducted a semi-automated analysis and labeling of approximately 4,000 malicious packages in order to (a) identify and quantify the techniques used by malware authors, (b) uncover malware campaigns by clustering the CodeBERT embeddings of malicious code, and (c) contrast the results with previous studies.
To detect malicious packages, existing works all rely on a common classification pipeline: first a human-defined, rule-based feature extraction mechanism, then a trained classifier. We hypothesize that the feature extraction mechanism is a limiting factor for classifier performance, as well as an additional step that needs to be kept up to date. To address this problem, multiple models were trained on the freshly labeled dataset and compared: two types of fine-tuned CodeBERT models, two isolation forest models, and one out-of-distribution detection model.
We learned that the number of attacks on legitimate packages dropped in comparison to name confusion attacks, e.g., typosquatting or dependency confusion, which are conducted with increasing campaign sizes. At the same time, packages are taken down more swiftly, whereas the characteristics of the malicious code itself did not change substantially. There are shifts with regard to the use of obfuscation and the primary objective, but the malicious code generally remains very simple.
We found that CodeBERT fine-tuned with a loss function using the equivalent of scikit-learn's class_weight='balanced' weighting (Bert-Balanced) performed best, with a specificity of 98.2% and a recall of 93.7%. In the real-world scenario, we evaluated a total of 8,734 unique packages in the updates feed and 2,181 in the new packages feed, and found 4 unique malicious packages. Bert-Balanced was able to keep up with the volume of packages being uploaded to PyPI, with plenty of computational resources still available.
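As a sketch of the weighting scheme referred to above: scikit-learn documents the 'balanced' heuristic as w_c = n_samples / (n_classes * n_samples_c), i.e., rare classes receive proportionally larger weights. The function name and the 90/10 benign-vs-malicious split below are hypothetical, chosen only to illustrate the formula; the actual loss function and dataset are described in the body of the work.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Reproduce the effect of scikit-learn's class_weight='balanced':
    w_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Hypothetical imbalanced labels: 90 benign (0), 10 malicious (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# The minority (malicious) class is up-weighted: 100 / (2 * 10) = 5.0,
# while the majority class is down-weighted: 100 / (2 * 90) ≈ 0.56.
```

Such per-class weights can then be passed to a weighted cross-entropy loss when fine-tuning a classifier, so that errors on the rare malicious class contribute more to the training signal.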