Categorizing Stack Overflow Questions With A Tag Hierarchy

Abstract

Software Question & Answer platforms such as Stack Overflow allow users to annotate their posts with tags that help organize the posts and make them easier to discover. This work studies the machine learning techniques used to assign these tags automatically, and examines how, and to what extent, the assignments can be improved by organizing the tags into a hierarchy and using that hierarchy as a heuristic. Because a question can carry several tags at once, this is a multi-label classification problem. The tag hierarchy is built by clustering the tags by subject, connecting these clusters, and then fine-tuning the results. After gathering and preparing training data consisting of Stack Overflow question titles, bodies and tags, a DistilBERT-based multi-label classifier is trained to serve as the baseline. This baseline is then extended so that it incorporates the newly constructed hierarchy in its final predictions. Finally, the classifier is evaluated on the accuracy of its predictions and on its usefulness, the latter derived from a survey of expert users in the area of Computer Science. The resulting model achieves a label ranking average precision (LRAP) score of 54% and an F1 score of 65%, an improvement of 2% on each metric over the baseline.
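
To make the two ideas in the abstract more concrete, the following is a minimal sketch of how a tag hierarchy might adjust classifier scores, and of how LRAP and micro-averaged F1 can be computed. It is not the method used in this work: the toy hierarchy `PARENT`, the rule of raising each ancestor's score to that of its best descendant, the 0.5 threshold, the choice of micro averaging, and all scores are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT: Dict[str, Optional[str]] = {
    "programming": None,
    "python": "programming",
    "pandas": "python",
    "java": "programming",
}
TAGS: List[str] = list(PARENT)


def propagate_to_ancestors(scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed heuristic: an ancestor scores at least as high as any descendant."""
    adjusted = dict(scores)
    for tag, score in scores.items():
        parent = PARENT[tag]
        while parent is not None:
            adjusted[parent] = max(adjusted[parent], score)
            parent = PARENT[parent]
    return adjusted


def lrap(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Label ranking average precision; samples without true labels are skipped."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            continue
        precisions = []
        for j in relevant:
            # Rank of label j among all labels (ties count against it).
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of the labels ranked at or above j are true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precisions.append(hits / rank)
        per_sample.append(sum(precisions) / len(precisions))
    return sum(per_sample) / len(per_sample)


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1 over all (sample, label) pairs."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for t, p in zip(truth, pred):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def to_vector(scores: Dict[str, float]) -> List[float]:
    return [scores[tag] for tag in TAGS]


def threshold(vector: List[float], cutoff: float = 0.5) -> List[int]:
    return [int(s >= cutoff) for s in vector]


# One made-up question whose true tags are programming, python and pandas.
truth = [[1, 1, 1, 0]]
raw = {"programming": 0.2, "python": 0.4, "pandas": 0.9, "java": 0.3}
adjusted = propagate_to_ancestors(raw)

raw_vec, adj_vec = to_vector(raw), to_vector(adjusted)
print("LRAP raw/adjusted:", lrap(truth, [raw_vec]), lrap(truth, [adj_vec]))
print("F1   raw/adjusted:", micro_f1(truth, [threshold(raw_vec)]),
      micro_f1(truth, [threshold(adj_vec)]))
```

In this toy case the confident `pandas` prediction lifts its ancestors above the cutoff, so both metrics rise. Real data will not behave this cleanly, and a propagation rule like this one can just as easily introduce false positives when the fine-grained prediction is wrong.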