Detecting Duplicate Stack Overflow Questions Exploiting the Textual Information, and a Semantic-based Tag Hierarchy

Abstract

Users of Stack Overflow (SO), the most widely used forum dedicated to software engineering, frequently post duplicate questions and then spend time waiting for an answer. Currently, duplicates are identified manually, and only by moderators and high-reputation SO users. An automatic solution can therefore save substantial time and effort. We propose a system composed of three components.
First, the textual information component is an ML-based solution that decides whether a question pair is a duplicate by analyzing its encoded version. We use the Doc2Vec model for question embedding, taking the title and body as input. Second, we build a tag analyzer. Lastly, we introduce a novel element for improving the results: a semantic-based tag hierarchy. To assess the usefulness of such a hierarchy, we explore different hierarchies, built fully automatically or manually adjusted, varying their construction and the number of depth levels. As baselines, we use the Gaussian Naive Bayes, Decision Tree, and K-Nearest Neighbours classifiers, analyzing only the question pair's textual information. The Logistic Regression and SVM classifiers, combined with the tags and hierarchy, obtain better results than all the baselines. Our best configuration achieves 92.10% accuracy, 91.68% recall, and a 92.10% F1-score.
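
To make the role of the tag analyzer and the tag hierarchy more concrete, the following is a minimal sketch of how tag-based features for a question pair might be computed. It is not the method used in this work: the toy hierarchy `TOY_PARENT`, the function names, the Jaccard-overlap measure, and the `max_depth` parameter are all illustrative assumptions.

```python
# Illustrative sketch only: the hierarchy, names, and similarity measure
# below are assumptions, not the construction described in the paper.

# Toy child -> parent map standing in for a tag hierarchy.
TOY_PARENT = {
    "pandas": "python",
    "numpy": "python",
    "django": "python",
    "python": "programming-languages",
    "java": "programming-languages",
    "spring": "java",
}


def ancestors(tag, parent, max_depth):
    """Return up to max_depth ancestors of tag, nearest first."""
    chain = []
    current = tag
    while current in parent and len(chain) < max_depth:
        current = parent[current]
        chain.append(current)
    return chain


def expand(tags, parent, max_depth):
    """Add each tag's ancestors (up to max_depth levels) to the tag set."""
    expanded = set(tags)
    for tag in tags:
        expanded.update(ancestors(tag, parent, max_depth))
    return expanded


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tag_features(tags_a, tags_b, parent=TOY_PARENT, max_depth=2):
    """Return [flat tag overlap, hierarchy-aware tag overlap] for a pair."""
    a, b = set(tags_a), set(tags_b)
    flat = jaccard(a, b)
    hierarchical = jaccard(
        expand(a, parent, max_depth), expand(b, parent, max_depth)
    )
    return [flat, hierarchical]
```

In this sketch, two questions tagged `pandas` and `numpy` share no tags directly, yet the hierarchy-aware overlap is non-zero because both tags descend from `python`. Features of this kind could be concatenated with the encoded question pair before it is passed to a classifier, and `max_depth` gives one way to vary the number of hierarchy levels considered.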