Improving GitHub Tag Recommender Systems Using Tag Hierarchies

Abstract

Programmers and software engineers often share code, and one of the largest platforms for doing so is GitHub, which holds an 87.58% market share in the Source Code Management category. An important part of sharing code is making sure that others who might be interested in it can find it. One way to do that is to add tags to a repository, a feature GitHub has offered only since 2017. However, many repositories have no tags assigned. This can be addressed by automatically applying tags to untagged repositories, which raises the question: what algorithm can complete such a task?

In this study, we attempt to solve this problem using a class of algorithms called Hierarchical Multilabel Classifiers (HMCs). As the name suggests, these classifiers can assign multiple labels to one data point (here, a repository), but they require the labels to be organized in a hierarchy. We present four different hierarchies and four different HMCs to see which combination yields the best results, and we compare each combination to a non-hierarchical baseline. We find that HMCN-F, one of the HMCs, marginally outperforms the baseline, with a difference in AUPRC (area under the precision-recall curve) of 0.024. While not a groundbreaking result, it is promising, as other methods of creating hierarchies may be able to beat the baseline by a larger margin.