Who said that? Comparing performance of TF-IDF and fastText to identify authorship of short sentences

van Tussenbroek, T.A.R.

Bachelor thesis (2020)

Authors

T.A.R. van Tussenbroek Electrical Engineering, Mathematics and Computer Science

Contributors

T.J. Viering Pattern Recognition and Bioinformatics - (graduation committee member)

S. Makrodimitris Pattern Recognition and Bioinformatics - (graduation committee member)

A. Naseri Jahfari Pattern Recognition and Bioinformatics - (graduation committee member)

David Tax Pattern Recognition and Bioinformatics - (mentor)

M. Loog Pattern Recognition and Bioinformatics - (mentor)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:93873bbf-2886-4023-b696-e11be2b99024

Published Date

22-06-2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Authorship identification is often applied to large documents, but less often to short, everyday sentences. The ability to identify who said a short line could help chatbots or personal assistants. This research compares the performance of TF-IDF and fastText in identifying the authorship of short sentences by applying these feature extraction techniques to transcripts of the television series Friends. TF-IDF outperforms fastText in every measurement, but its performance is only marginally better than randomly guessing the original character: it reaches an accuracy of 28 percent when distinguishing between 6 characters. As the minimum word count per sentence imposed on the test data increases, accuracy increases linearly at the same rate for both techniques. TF-IDF's confidence remains constant whether this limit is imposed on the test data or the training data, whereas fastText's confidence decreases when the limit is imposed on the test data and increases when it is imposed on the training data. Cross-entropy loss, however, remains constant for fastText and decreases for TF-IDF as the minimum word count imposed on the test data increases.
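
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the smoothed IDF formula, the whitespace tokenizer, the helper names (`filter_min_words`, `fit_idf`, `tfidf_vector`, `cross_entropy`), and the toy sentences are all assumptions made for illustration, and the classifier used in the thesis is not shown here.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Naive whitespace tokenizer (an assumption; the thesis may differ)."""
    return sentence.lower().split()


def filter_min_words(samples, min_words):
    """Keep only (speaker, sentence) pairs with at least min_words tokens."""
    return [(who, text) for who, text in samples
            if len(tokenize(text)) >= min_words]


def fit_idf(sentences):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(sentences)
    df = Counter()
    for text in sentences:
        df.update(set(tokenize(text)))  # count each term once per sentence
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf_vector(sentence, idf):
    """L2-normalised TF-IDF weights; terms unseen during fitting are dropped."""
    counts = Counter(t for t in tokenize(sentence) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cross_entropy(probabilities, true_label):
    """Negative log of the probability assigned to the true speaker."""
    return -math.log(probabilities[true_label])


# Hypothetical toy data, not taken from the thesis or the transcripts.
train = [
    ("A", "could this be any more obvious"),
    ("B", "how you doing"),
    ("A", "could you be any later"),
    ("B", "hey"),
]

kept = filter_min_words(train, min_words=3)    # drops the one-word line
idf = fit_idf([text for _, text in kept])
vector = tfidf_vector("could you be there", idf)

# Reference point for six speakers: a uniform guess is right one time in six.
uniform = {speaker: 1 / 6 for speaker in "ABCDEF"}
chance_accuracy = 1 / 6
chance_loss = cross_entropy(uniform, "A")
```

Raising `min_words` on the test or the training samples corresponds to the minimum word count limit described above. The uniform-guess values give a reference against which accuracy and cross-entropy loss for six speakers can be read.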

Files

Research_Paper_Thomas_van_Tuss... (pdf)

(pdf | 0.885 Mb)

Unknown license