Correspondence Between Perplexity Scores and Human Evaluation of Generated TV-Show Scripts

Abstract

In recent years, many new text generation models have been developed, yet evaluating generated text remains a considerable challenge. Currently, the only evaluation method that fully captures the quality of a generated text is human evaluation, which is expensive and time consuming. One of the most widely used intrinsic evaluation metrics is perplexity. This paper investigates the correspondence between perplexity scores and human evaluation of scripts for the TV show Friends generated with OpenAI's GPT-2 model. A survey was conducted with 226 participants, who rated selected scripts on creativity, realism, and coherence. The results show that generations with a perplexity close to that of an actual Friends script score best on creativity but low on realism and coherence. The most realistic and coherent generations were those with a lower perplexity, while the generations with the highest perplexity scored worst on all three criteria. These findings indicate that perplexity is not an adequate measure of the quality of generated TV-show scripts.
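For context, perplexity is the exponentiated average negative log-likelihood a language model assigns to a text: lower values mean the model finds the text more predictable. The sketch below is not the paper's code; it is a minimal example, assuming the Hugging Face Transformers library and an illustrative script snippet, of how such a score could be computed for a generated script with GPT-2.

```python
# Minimal sketch (not the authors' implementation) of scoring a script
# with GPT-2 perplexity using Hugging Face Transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Score the text with the model; the returned loss is the average
    # negative log-likelihood per token, so exp(loss) is the perplexity.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Hypothetical example line; any generated script text could be scored this way.
print(perplexity("Joey: How you doin'?\nChandler: Could I BE any more tired?"))
```

In this setup, a generated script that GPT-2 itself finds predictable yields a low perplexity, which is exactly the property whose link to human judgments of creativity, realism, and coherence the paper examines.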


