Detecting Duplicate Stack Overflow Questions Exploiting the Textual Information, and a Semantic-based Tag Hierarchy

Bachelor thesis (2022)

Authors

C.A. Botocan Electrical Engineering, Mathematics and Computer Science

Contributors

M. Izadi, Software Engineering (mentor)

A. van Deursen, Software Technology (mentor)

G. Iosifidis, Embedded Systems (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:1f30c556-00b7-4da3-812d-65c902f2d2e2

Published Date

22-06-2022

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Users of Stack Overflow (SO), the most widely used forum dedicated to software engineering, frequently post duplicate questions and then spend time waiting for an answer. Currently, such posts are identified manually, and only by moderators and SO users with a high reputation. An automatic solution can therefore save substantial time and work. We propose a system split into three components.
First, the textual information component is an ML-based classifier that decides whether a question pair is a duplicate by analyzing the pair's encoded representation. We embed each question with the Doc2Vec model, which takes the title and body as input. Second, we build a tag analyzer. Third, we introduce a novel element for improving the results: a semantic-based tag hierarchy. To assess how useful such a hierarchy is, we explore several variants, built fully automatically or manually adjusted, varying their construction and the number of depth levels. As baselines, we use the Gaussian Naive Bayes, Decision Tree, and K-Nearest Neighbours classifiers, which analyze only the textual information of the question pair. The Logistic Regression and SVM classifiers, combined with the tags and the hierarchy, obtain better results than all the baselines. Our best configuration achieves 92.10% accuracy, 91.68% recall, and a 92.10% F1-score.
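
To make the role of the tag components more concrete, the following is a minimal sketch of how tag features for a question pair might be computed. The abstract does not specify how the tag analyzer or the hierarchy is used, so everything here is an assumption for illustration: the toy `PARENT` hierarchy, the function names, the choice of Jaccard overlap, and the `max_depth` parameter (standing in for the number of depth levels). The Doc2Vec text embedding and the classifier itself are not shown.

```python
# Hypothetical toy hierarchy mapping child tag -> parent tag.
# These entries are illustrative and are not taken from the thesis.
PARENT = {
    "pandas": "python",
    "numpy": "python",
    "python": "programming-language",
    "java": "programming-language",
    "spring": "java",
}


def ancestors(tag, max_depth):
    """Return the tag itself plus up to max_depth of its ancestors."""
    result = {tag}
    current = tag
    for _ in range(max_depth):
        current = PARENT.get(current)
        if current is None:
            break
        result.add(current)
    return result


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tag_features(tags_a, tags_b, max_depth=2):
    """Return (flat overlap, hierarchy-aware overlap) for two tag lists."""
    flat = jaccard(set(tags_a), set(tags_b))
    # Expand every tag with its ancestors before comparing, so that
    # sibling tags such as "pandas" and "numpy" share their parent.
    expanded_a = set().union(*(ancestors(t, max_depth) for t in tags_a))
    expanded_b = set().union(*(ancestors(t, max_depth) for t in tags_b))
    return flat, jaccard(expanded_a, expanded_b)


if __name__ == "__main__":
    print(tag_features(["pandas"], ["numpy"]))
    print(tag_features(["pandas"], ["spring"], max_depth=1))
```

In this sketch, two questions tagged `pandas` and `numpy` have no tags in common, yet the hierarchy-aware score is non-zero because both tags descend from `python`. Features of this kind could be appended to the encoded question pair before classification; whether the thesis does so in this way is not stated in the abstract.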
